
NAIVE BAYES
OVERVIEW:
Naive Bayes is a probabilistic classification method that applies Bayes' Theorem together with the simplifying assumption that features are conditionally independent given the class. By estimating how likely each class is given the observed features, the classifier chooses the most probable class for an input. Although the independence assumption rarely holds in real applications, it supports surprisingly good performance in high-dimensional settings such as text mining.
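As a minimal sketch of that computation (using made-up word counts, not the project data), the classifier scores each class by combining a class prior with smoothed per-word likelihoods and picks the highest-scoring class:

import math

# Toy word counts per class (hypothetical, for illustration only)
class_word_counts = {
    "anti-regulation": {"ban": 3, "overreach": 5, "freedom": 4},
    "pro-regulation":  {"ban": 6, "safety": 7, "oversight": 5},
}
class_doc_counts = {"anti-regulation": 10, "pro-regulation": 12}
vocab = {w for counts in class_word_counts.values() for w in counts}

def log_posterior(words, cls):
    total_docs = sum(class_doc_counts.values())
    total_words = sum(class_word_counts[cls].values())
    # log prior + sum of Laplace-smoothed log likelihoods (alpha = 1)
    score = math.log(class_doc_counts[cls] / total_docs)
    for w in words:
        count = class_word_counts[cls].get(w, 0)
        score += math.log((count + 1) / (total_words + len(vocab)))
    return score

doc = ["ban", "overreach"]
prediction = max(class_doc_counts, key=lambda c: log_posterior(doc, c))
print(prediction)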
​
Implementation starts by projecting the news data onto numeric vectors with CountVectorizer, using each term's frequency as an input feature. This high-dimensional, sparse representation suits Naive Bayes, which treats word occurrences as independent contributors to each prediction. The approach is computationally efficient, scalable, and robust to the irrelevant features that are typical of textual data.
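A minimal sketch of how count features and the classifier fit together (the toy texts, labels, and variable names are placeholders, not the project's actual pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus and labels standing in for the news articles
texts = ["new rules protect consumers", "regulation stifles innovation"]
labels = ["pro-regulation", "anti-regulation"]

# CountVectorizer produces sparse term counts; MultinomialNB treats each
# count as an independent contribution to the class score
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["rules protect innovation"]))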
​
In this context, classification means assigning articles to opinion categories for regulation-related material. The output labels cover anti-regulation, neutral, and several levels of pro-regulation stances. Accurate classification supports deeper analysis of opinion trends across individual sources and serves the broader aim of assessing story framing and possible misinformation.
​
Naive Bayes also establishes a solid performance baseline, especially when compared against more computationally expensive but more flexible algorithms. Identifying where this approach succeeds and struggles on complex opinion classification can inform model choice in real use cases where speed, explainability, and baseline performance are paramount.
DATA PREP:
Supervised models require labeled data. In this setup, classification hinges on learning the relationship between term frequencies and opinion categories, so only documents pre-annotated with one of the target class labels can be used for training and evaluation. Unlabeled text provides no ground truth to drive learning and is therefore excluded from the supervised modeling process.
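As an illustration of that filtering step (assuming a pandas DataFrame with hypothetical "text" and "label" columns and file name; the real names may differ), unlabeled rows are simply dropped before modeling:

import pandas as pd

df = pd.read_csv("articles.csv")           # hypothetical file name
labeled = df.dropna(subset=["label"])      # keep only annotated articles
X_text = labeled["text"]
y = labeled["label"]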
​
Input features were extracted with CountVectorizer, which converts raw text into a sparse matrix whose rows correspond to documents and whose columns correspond to the unique terms in the corpus. Each cell records how often a term occurs in a given document. This gives the model a uniform numeric representation of the text and makes class-conditional word probability estimation possible.
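A small sketch of what that document-term matrix looks like (toy sentences, not the actual corpus):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["regulation protects consumers",
        "regulation hurts small business"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: rows = documents, columns = terms

print(vectorizer.get_feature_names_out())  # the vocabulary (column names)
print(X.toarray())                         # term counts per document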
​
The data were divided into two mutually exclusive sets: 80% for training and 20% for testing. The training data were used to estimate the probability distributions required by Naive Bayes, while the test data provided an unbiased measure of predictive accuracy. Stratified sampling ensured that all opinion classes were represented in comparable proportions in both subsets. This split prevents data leakage and measures generalization to unseen data.
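A sketch of the stratified 80/20 split (assuming X is the vectorized feature matrix and y the opinion labels for the full labeled corpus; the variable names and random seed are placeholders):

from sklearn.model_selection import train_test_split

# 80% train / 20% test, with class proportions preserved in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)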
​
Below is an example of the input dataset after vectorization, along with small samples from the training and test subsets. These snapshots illustrate the numeric data format and confirm that both subsets share the same structure.
​
Before Transformation:

After Transformation:

Training Data (Features):

Training Data (Label):

Testing Data (Features):

Testing Data (Label):

CODE:
Link to the Naive Bayes code: https://github.com/saketh-saridena/TextMining_Project
RESULTS:
The Naive Bayes classifier applied to the CountVectorizer features achieved an accuracy of 60.66% on the test set, meaning just over half of the articles were assigned to the correct predefined opinion class. Performance varied across classes, reflecting sensitivity to the language used to express different stances.
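A sketch of how that accuracy figure is obtained (continuing from the hypothetical split above):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(accuracy_score(y_test, y_pred))   # reported as roughly 0.6066 on the project data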

As the confusion matrix shows, the model performed most robustly on the anti-regulation class, classifying 88 of 106 articles correctly with relatively few false positives in the other classes. The strongly pro-regulation class followed with moderate accuracy, showing some confusion with the anti-regulation and weakly pro-regulation classes. The neutral class was the most difficult, with 11 of 12 articles misclassified. This is expected, since neutral articles share much of their vocabulary with both positive and negative opinions.
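The per-class breakdown comes from a confusion matrix; a minimal way to produce one is sketched below (the exact label strings and their ordering are assumptions):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

labels_order = ["anti-regulation", "neutral",
                "weakly pro-regulation", "strongly pro-regulation"]
cm = confusion_matrix(y_test, y_pred, labels=labels_order)
ConfusionMatrixDisplay(cm, display_labels=labels_order).plot()
plt.show()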

The precision and recall values in the classification report add more detail. The anti-regulation class had precision of 0.65 and recall of 0.83, giving an F1-score of 0.73, the highest of all classes. The strongly pro-regulation class reached a precision of 0.55, while the weakly pro-regulation class had recall of only 0.24, meaning many of its articles were misassigned to stronger or even opposing stances. The neutral class showed perfect precision only because its single positive prediction happened to be correct; its recall of 0.08 confirms that nearly all true neutral articles were missed.
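The table of precision, recall, and F1 values comes from scikit-learn's classification report, and the anti-regulation F1 can be checked directly from its definition:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

# F1 is the harmonic mean of precision and recall, e.g. for anti-regulation:
p, r = 0.65, 0.83
print(2 * p * r / (p + r))   # about 0.73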
​
Overall, the model performed most consistently on the more polarized classes, which carry more distinctive linguistic markers. Future refinement could include better handling of class imbalance or more sophisticated feature engineering to improve performance on the more ambiguous classes such as neutral and weakly pro-regulation.
CONCLUSION:
The Naive Bayes analysis gave an insightful first look at how opinions in news articles can be modeled from the language they use. The model picked up strong language patterns, particularly when opinions were stated explicitly, which made it easier to detect articles that were strongly in favor of or strongly against regulation; such clear stances come with characteristic words and expressions.
​
Conversely, the model had more difficulty with the less extreme categories, such as neutral and weakly pro-regulation. These articles use more mixed or balanced language, making it harder to place them unambiguously in a single group. This shows that even when an article presents a position openly, not all opinions are framed in ways that are easy to classify; some are expressed obliquely or buried in more complex sentence structures.
​
Despite its limitations, the model still revealed how language expresses attitudes toward regulation. It showed which kinds of articles were straightforward to identify and which required more subtle cues. For example, the frequent misclassification of neutral articles suggests that writers may use emotionally charged language even when trying to remain neutral, a pattern worth exploring in the context of misinformation. Overall, the Naive Bayes approach proved useful as an initial indicator of opinion trends in the dataset. Though far from perfect, it provided a baseline for comparison with more sophisticated models and yielded useful insight into how strongly opinions are expressed in language. This directly supports the project's overall goal of understanding how opinions about regulation are constructed and presented.
Github Repo (Code and Data): https://github.com/saketh-saridena/TextMining_Project