News Sentiment Analysis

FRAMEWORK: Python’s NLTK toolkit and its sentiment analyzer module.

Part of this project was training our Naive Bayes Classifier on a manually tagged set of articles about a particular political figure. In our case, we chose Trump because of the immense media attention given to him.

We collected around 2000 articles about Trump, published in the month before and the month after his inauguration, from the following news sites: Chicago Tribune, CNN, FOX, LA Times, New York Times, Slate, Washington Post and Washington Times. We randomly selected 20% of our corpus, and each of the three of us read those articles and manually tagged them as Positive, Negative or Neutral. The final tag assigned to each article in the training set was the majority sentiment among our tags. For example, if an article was tagged Positive, Positive and Negative by the three of us individually, its final tag was Positive. If we encountered a tie, we would sit down, reread the article together and come to a consensus on the final tag.
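The sampling and majority-vote steps described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the fixed random seed and the use of None to flag a tie are hypothetical and are not taken from the project's code.

```python
import random
from collections import Counter

def sample_for_tagging(article_ids, fraction=0.2, seed=0):
    """Randomly pick a fraction of the corpus for manual tagging."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sample is repeatable
    ids = list(article_ids)
    return rng.sample(ids, round(len(ids) * fraction))

def final_tag(tags):
    """Return the majority tag, or None when the annotators tie."""
    counts = Counter(tags).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: to be resolved by rereading the article together
    return counts[0][0]

print(len(sample_for_tagging(range(2000))))              # 400
print(final_tag(["Positive", "Positive", "Negative"]))   # Positive
print(final_tag(["Positive", "Negative", "Neutral"]))    # None
```

With three annotators and three possible tags, a tie can only happen when all three tags differ, which is the case the sketch hands back for discussion.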

After we had our tagged set, we ran a preliminary analysis on it to get an idea of what we were building our hypothesis on.

[Infographic: Tagged set]

In the tagged set, the number of negative articles appeared to increase after his inauguration, so we hypothesized that this would be true for our entire data set as well.
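A before-and-after comparison of this kind could be computed roughly as follows. The sketch assumes each tagged article carries a publication date, uses the inauguration date of January 20, 2017 as the cut-off, and runs on made-up toy rows rather than our data; the function name is hypothetical.

```python
from collections import Counter
from datetime import date

INAUGURATION = date(2017, 1, 20)

def negative_share_by_period(tagged):
    """tagged: iterable of (publication_date, tag) pairs."""
    counts = {"before": Counter(), "after": Counter()}
    for published, tag in tagged:
        period = "before" if published < INAUGURATION else "after"
        counts[period][tag] += 1
    shares = {}
    for period, tally in counts.items():
        total = sum(tally.values())
        # Share of Negative articles in this period (0.0 if the period is empty)
        shares[period] = tally["Negative"] / total if total else 0.0
    return shares

# Toy rows for illustration only; these are not the project's results.
toy = [
    (date(2017, 1, 5), "Negative"),
    (date(2017, 1, 10), "Positive"),
    (date(2017, 2, 1), "Negative"),
    (date(2017, 2, 3), "Negative"),
]
print(negative_share_by_period(toy))
```

Comparing shares rather than raw counts keeps the comparison fair when the two periods contain different numbers of articles.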

We created a few feature sets that we thought would help analyze sentiment in a news article. As mentioned above, sentiment analysis on news is very subjective and each model will be different from the next. For our model, we created the following feature sets:

Quotes: Does the number of quotes in an article determine its sentiment?

No Punctuation: Does punctuation carry any weight in determining sentiment?

Exclamations and Question Marks: Does having an exclamation mark or a question mark in the text affect the polarity of that text?

Word Polarity: Each word in the article is given a polarity score based on the MPQA lexicon and then the scores are added to determine the sentiment of the article

Adjective Polarity: Each adjective in the article is given a polarity score based on the MPQA lexicon and then the scores are added to determine the sentiment of the article

Stopwords: Do words like 'then', 'a', 'is', 'an' and so on have any say in the sentiment of an article?

Bigram: Takes each pair of consecutive words as a feature. For example, the word pair ‘mexico’, ‘border’ has a negative connotation in our data set.

Unigram: Counts every single word in the article. For example, if the word “wrong” appears in many articles tagged Negative, then the classifier will treat a new article containing “wrong” as more likely to be Negative.
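To make a few of these feature sets concrete, here is a minimal sketch of how they might be extracted. It is not our project code: the function names, the tiny stopword list, the regular expressions and the choice of word-presence features (rather than raw counts) are all assumptions made for illustration. Each function returns a dictionary, which is the shape NLTK's NaiveBayesClassifier.train expects when it is given a list of (features, label) pairs.

```python
import re
import string

# Tiny illustrative stopword list (assumption); NLTK ships a fuller one.
STOPWORDS = {"then", "a", "is", "an", "the", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and pull out word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def strip_punctuation(text):
    """'No Punctuation' variant: remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def unigram_features(text, drop_stopwords=False):
    """One presence feature per word, optionally ignoring stopwords."""
    tokens = tokenize(text)
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return {f"has({t})": True for t in tokens}

def bigram_features(text):
    """One presence feature per pair of consecutive words."""
    tokens = tokenize(text)
    return {f"has({a} {b})": True for a, b in zip(tokens, tokens[1:])}

def punctuation_features(text):
    """Quote count plus exclamation and question mark flags."""
    return {
        "quote_count": len(re.findall(r'"[^"]*"|“[^”]*”', text)),
        "has_exclamation": "!" in text,
        "has_question": "?" in text,
    }

print(bigram_features("Mexico border wall"))
print(punctuation_features('He said "no" and "never"!'))
```

Keeping each feature set in its own function makes it easy to train the classifier on one set at a time and compare their accuracy.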

After defining our feature sets, we went on to test the accuracy of our model. In the end, the Adjective Polarity feature set scored the highest in terms of accuracy. That is what our model is based on. It takes every adjective in the text and assigns it a polarity score based on the MPQA word lexicon. Then the Negative, Positive and Neutral scores are tallied for each article and the sentiment with the highest score is the final tag for the article. We ran the rest of the database through our program and the results are what you see on this site.
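The adjective tally described above can be sketched as follows. Several things here are assumptions for illustration: the small TOY_LEXICON dictionary stands in for the MPQA lexicon, the input is assumed to be already part-of-speech tagged with Penn Treebank tags (the kind nltk.pos_tag produces), and the handling of articles with no known adjectives, or with tied scores, is our guess since it is not specified above.

```python
from collections import Counter

# Hypothetical stand-in for the MPQA lexicon: word -> prior polarity.
TOY_LEXICON = {
    "great": "Positive",
    "strong": "Positive",
    "wrong": "Negative",
    "false": "Negative",
    "federal": "Neutral",
}

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags

def adjective_polarity_tag(tagged_tokens, lexicon=TOY_LEXICON, default="Neutral"):
    """Tally lexicon polarities over adjectives and return the top one."""
    tally = Counter()
    for word, pos in tagged_tokens:
        if pos in ADJECTIVE_TAGS:
            polarity = lexicon.get(word.lower())
            if polarity:
                tally[polarity] += 1
    if not tally:
        return default  # assumption: no known adjectives means Neutral
    # On a tie, the polarity seen first wins (assumption).
    return tally.most_common(1)[0][0]

def accuracy(predicted, gold):
    """Fraction of articles whose predicted tag matches the manual tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

tokens = [("The", "DT"), ("wrong", "JJ"), ("policy", "NN"), ("was", "VBD"),
          ("false", "JJ"), ("but", "CC"), ("strong", "JJ")]
print(adjective_polarity_tag(tokens))  # Negative
```

Running adjective_polarity_tag over the manually tagged articles and passing the predictions and the manual tags to accuracy gives the kind of score used to compare feature sets.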

While we were satisfied with our results, the sentiment analysis model itself is very subjective, and hence the results here should be taken as nothing more than the outcome of an educational quest.
