Data Science

Sentiment Analysis on Reddit Tech News with Python

A quick guide to sentiment analysis with NLTK on the subreddit r/technews.

Benedict Neo
bitgrit Data Science Publication
10 min read · Jul 15, 2021


Sentiment Analysis is the process of determining whether a piece of text is considered to be positive, negative, or neutral.

It’s an application of Natural Language Processing that has tons of use cases.

As stated in Wikipedia:

Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

Imagine you’re a business owner, and you have over 10,000 product reviews for your product. You want to know what your customers think about your product, but you don’t have the time to sift through them one by one.

With sentiment analysis, you can automate that process or even set up real-time monitoring to deal with feedback swiftly.

Below is an example of sentiment analysis in action on product reviews.

source: monkey learn sentiment analysis guide article

To showcase how you can perform sentiment analysis in Python, in this article I will use the PRAW library to interact with the Reddit API and grab posts from the subreddit r/technews.

Then, I’ll use the NLTK library, specifically the VADER sentiment analyzer, to perform sentiment analysis on the post titles.

As always, here’s where you can find the code for this article:

This post was inspired by the article “Sentiment Analysis on Reddit News Headlines with Python’s Natural Language Toolkit (NLTK)” on learndatasci.com.

Create a Reddit application

The first step is to create a Reddit app. To do so, you would first need a Reddit account. If you don’t have one, you can register one here.

After you’re logged in, head over to reddit.com/prefs/apps, and you will see this interface.

There are 3 essential things you need to do:

1. select the script option
2. name: your_reddit_username
3. redirect uri: http://localhost

After that, you can hit create app, and on the upper left corner, you will see something like this.

Note that you shouldn’t expose your credentials online; I’ve already deleted mine, so it’s fine.

From the above image, what you want to note down is the client_id and client_secret, which you’ll use to build a Reddit client.

Now that you have the credentials, we can move on to the code!

Load Libraries

First things first, we import all the necessary libraries for this project.

  • pprint — a data pretty printer that outputs data structures in a cleaner format.
  • itertools — iterators for efficient looping, one of which is chain, which I use to join multiple lists together into a single list.
  • NLTK — Natural Language Toolkit, an open-source Python library for NLP, containing a set of text processing libraries for classification, tokenization, stemming, and tagging.
  • PRAW — The Python Reddit API wrapper allows you to interact with Reddit API using Python.
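A minimal sketch of those imports could look like this (pandas is added here because we build a data frame later on):

```python
# Minimal sketch of the imports used throughout this article.
from pprint import pprint      # pretty-print nested data structures
from itertools import chain    # flatten nested lists of tokens later on

import nltk
import praw
import pandas as pd            # used later to build the data frame of titles
```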

Downloading NLTK’s databases

nltk.download() is used to download a particular dataset/model. For this article, there are three things to download.

  • vader_lexicon — the lexicon dataset mapping words to sentiment scores, which powers the VADER sentiment analyzer.
  • punkt — Pre-trained models that help us tokenize sentences.
  • stopwords — Dataset of common stopwords in English.
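Roughly, the downloads look like this:

```python
# Download the three NLTK resources this article relies on.
nltk.download("vader_lexicon")  # lexicon powering the VADER sentiment analyzer
nltk.download("punkt")          # pre-trained tokenizer models
nltk.download("stopwords")      # common English stop words
```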

With that, we can set up the client.

Setting up Reddit client

With the credentials you generated earlier, you can pass in your Reddit username as the user_agent and the rest of the credentials as follows. Note that check_for_async is set to False so that it won’t generate warnings later on.
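A sketch of the client setup, with placeholder credentials, might look like this:

```python
# Build the Reddit client; replace the placeholders with your own credentials.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="your_reddit_username",
    check_for_async=False,  # avoid warnings about the async client later on
)
```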

Selecting subreddit and sorting type

As mentioned in the subtitle of this article, we’ll be scraping the subreddit r/technews, but you can choose any subreddit you want to analyze; just replace 'technews' with the subreddit name of your choosing.

Here I’m getting the top posts of all time, and I set the limit to None to get the maximum number of posts possible (the limit is 1,000 posts).

You can find more options, such as sorting by new, hot, rising, etc., in PRAW’s quick start guide.
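Here's a rough sketch of that step (the variable names are my own):

```python
# Grab the top posts of all time from r/technews.
# subreddit.top() returns a generator, so the * star expression unpacks it into a list.
subreddit = reddit.subreddit("technews")
top_posts = [*subreddit.top(time_filter="all", limit=None)]

print(len(top_posts))
```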

Notice the * symbol; this is known as the star expression, and it unpacks iterables. In this case, it unpacks the generator returned by the function into a list.

Printing the length tells us we obtained a total of 967 posts.

Grabbing the first post we scraped by indexing 0, you can see the various kinds of information you can get from a post — the number of upvotes, the date and time, the number of comments, and the number of awards given.

You can run vars on the first post object to get all the information contained within a single post (warning: the output is huge).
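For example, something along these lines (the attributes are standard PRAW submission fields):

```python
# Inspect the first post we scraped.
post = top_posts[0]
print(post.title, post.score, post.num_comments, post.created_utc, post.total_awards_received)

# vars() dumps every attribute PRAW fetched for the post (the output is huge).
pprint(vars(post))
```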

For this article, we only need the title, so what we’ll do is extract the title for each post and dump it into a list.

With this list of headlines, we can now form a Pandas data frame.
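Something like this, where the list and column names are my own:

```python
# Extract the title of each post and build a data frame from the list.
headlines = [post.title for post in top_posts]
df = pd.DataFrame(headlines, columns=["title"])
df.head()
```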

Going to the subreddit on Reddit, you can see we grabbed the post titles!

screenshot of subreddit technews on reddit.com

With over 900 post titles in a data frame, it’s time for some sentiment analysis!

Sentiment Analysis with VADER

What is VADER?

According to their GitHub:

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

In other words, it’s a pre-trained model for text sentiment analysis. This model relies on the vader_lexicon dataset we downloaded earlier, which maps lexical features to sentiment scores.

When given a string of words, VADER returns a dictionary containing the four scores:

  • neg — negative
  • neu — neutral
  • pos — positive
  • compound — a normalized combination of the three scores above

Below you see examples of VADER in action.
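A rough illustration (the sentences here are my own, not the exact examples from the original notebook):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

pprint(sia.polarity_scores("The new update is awesome"))
pprint(sia.polarity_scores("The new update is AWESOME!"))
pprint(sia.polarity_scores("The new update is bad"))
```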

Notice that the words ‘awesome’ and ‘bad’ skew towards positive and negative polarity, respectively.

The intensity of emotion is also taken into account: capitalizing the word ‘awesome’ and adding an exclamation mark increases the positive score.

You can view more examples on their GitHub.

Now that you know a little bit about what VADER is and what it can do, let’s apply it to our data frame.

With the scores calculated in dictionaries, we create a data frame using from_records and then concatenate it to our data frame on an inner join.
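A sketch of that step, assuming the data frame and column names from earlier:

```python
# Score every title, build a data frame from the dictionaries, and join it on.
scores = [sia.polarity_scores(title) for title in df["title"]]
scores_df = pd.DataFrame.from_records(scores)

df = pd.concat([df, scores_df], axis=1, join="inner")
df.head()
```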

Now that we have the scores, the next step is to choose a threshold to label the text as positive, negative, or neutral.

Choosing the threshold

The VADER GitHub readme tells us that the typical threshold is 0.05. But following this article, which also did sentiment analysis on news headlines, I’ll use the value 0.2.
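A minimal sketch of the labeling step with a 0.2 threshold (the label names and column name are my own):

```python
THRESHOLD = 0.2

def label_sentiment(compound):
    """Label a title based on its compound score and the chosen threshold."""
    if compound > THRESHOLD:
        return "positive"
    if compound < -THRESHOLD:
        return "negative"
    return "neutral"

df["label"] = df["compound"].apply(label_sentiment)
```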

VADER on individual words

If you’re curious about how VADER ended up labeling the sentiment of the titles, here’s a broken-down version that shows which word it categorizes as positive, neutral, and negative.
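One way to sketch that breakdown is to score each word on its own (the helper name here is hypothetical, not the author's exact code):

```python
from nltk.tokenize import word_tokenize

def word_breakdown(sentence):
    """Bucket each word of a sentence by the sign of its own compound score."""
    buckets = {"positive": [], "negative": [], "neutral": []}
    for word in word_tokenize(sentence):
        score = sia.polarity_scores(word)["compound"]
        if score > 0:
            buckets["positive"].append(word)
        elif score < 0:
            buckets["negative"].append(word)
        else:
            buckets["neutral"].append(word)
    return buckets

pprint(word_breakdown(df["title"][0]))
```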

Notice there were no positive words in this sentence, and there were three negative words. Since there are more negatives than positives, it makes sense that this was labeled as negative.

If you want to go a step further and learn how the compound score is calculated, check out this StackOverflow post.

Now that we have our labels, we can do a quick value count on each label.
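For example:

```python
# Count how many titles fall under each label.
df["label"].value_counts()
```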

With our selected threshold, we have mostly neutral titles and more negative titles than positive titles.

Are the labels accurate?

Taking random samples of each label and using a custom function that outputs the news titles, we can get a sense of how well our threshold performs in categorizing news as positive, neutral, and negative.
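A sketch of such a helper (the function name is hypothetical):

```python
def print_samples(label, n=5, seed=42):
    """Print a few random titles for the given label."""
    sample = df[df["label"] == label]["title"].sample(n, random_state=seed)
    for title in sample:
        print("-", title)

for label in ("positive", "neutral", "negative"):
    print(f"\n{label.upper()}")
    print_samples(label)
```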

From the output, the labels seem to be pretty accurate.

A side tangent: Usually sentiment analysis makes more sense when applied to a “target subject”, such as reviews of a book or comments on a YouTube video. News headlines, on the other hand, are pretty descriptive and neutral, so sentiment analysis might be misleading.

Let’s now move on to tokenization.

Tokenization

What is it?

Tokenization is the process of breaking down a piece of text into smaller components known as tokens. A token can be a word, a part of a word, or any character like punctuation, symbol or even emojis 🤯.

Why do we do it?

Tokenization builds the foundation for any NLP tasks, as these tokens provide context and help computers interpret the meaning of the text. Different kinds of tokens can serve different purposes, but the main idea is to turn them into a usable form for computers.

You can use many different tools to tokenize strings, but NLTK already has a set of tokenizers we can utilize.

NLTK tokenizers

NLTK has many built-in tokenizers that you can use for specific purposes.

A few notable tokenizers are:

  • word_tokenize — Splits a string by punctuation other than periods
  • sent_tokenize — Splits a string into sentences
  • RegexpTokenizer — Splits a string based on a regular expression
  • more in their documentation
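Here's a sketch of the comparison (the example sentence is illustrative, not the one from the original notebook):

```python
from nltk.tokenize import word_tokenize, WhitespaceTokenizer, RegexpTokenizer

text = "Let's tokenize this sentence, shall we?"

print(word_tokenize(text))                     # splits out punctuation: "Let", "'s", ","
print(WhitespaceTokenizer().tokenize(text))    # splits by whitespace: keeps "Let's"
print(RegexpTokenizer(r"\w+").tokenize(text))  # splits by word: punctuation removed
```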

Above, you can see an example of a text being split by the tokenizers.

Notice how each of the tokenizers works differently based on how the text is split.

The first one splits by punctuation, which breaks the word “Let’s” into "Let" and "'s", whereas the second one, which splits by whitespace, keeps the word "Let's" intact. As for the last one, splitting by word results in the punctuation being removed.

One thing that comes up when you learn about tokenization is stop words. They’re basically the most common words in the English language, and we remove them so we can focus on more important features (words) instead.

Having downloaded the ‘stopwords’ dataset with NLTK earlier on, we have access to a total of 179 of them, which we will use to filter them out of our text.
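Loading them is a one-liner:

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print(len(stop_words))  # 179 at the time of writing
```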

Custom tokenize

In some cases, you would also do further preprocessing to get the result that you want.

In this function, I remove the single quote so that words like “Let’s” become “Lets”, and I also remove hyphens so that “covid-19” becomes “covid19” instead of being separated into “covid” and “19”.

Note: I removed the single quote because I’m only using the tokens for visualization. If you decide to use them to build a model, this would destroy the meaning behind the original words, e.g. turning “it’s” into “its”, which are two different things.

The text is also lowercased, and stop words are filtered out with a list comprehension.

Using Pandas’ nifty apply function, we can apply our custom function to each title in our data frame.
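A sketch of that custom function and the apply call, under the assumptions above (the exact replacements and column names in the original notebook may differ):

```python
def custom_tokenize(text):
    """Lowercase, strip quotes and hyphens, tokenize, and drop stop words."""
    text = text.replace("'", "").replace("’", "")  # "Let's" -> "Lets"
    text = text.replace("-", "")                   # "covid-19" -> "covid19"
    words = word_tokenize(text.lower())
    return [word for word in words if word not in stop_words]

df["tokens"] = df["title"].apply(custom_tokenize)
```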

The tokens object is a nested list (multiple lists within a list). Since we want all the words in a single list, the chain function from the itertools library helps us do exactly that.

The end result is two lists containing the words of titles that were labeled as positive and negative.
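Roughly:

```python
# Flatten the nested token lists into one flat word list per label.
pos_words = list(chain(*df[df["label"] == "positive"]["tokens"]))
neg_words = list(chain(*df[df["label"] == "negative"]["tokens"]))
```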

Visualize tokens

Top 20 words

With our list of words, we can utilize NLTK’s built-in function FreqDist as a counter for the words within our list, and most_common to return the top words based on the count.
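For the positive words, that could look like:

```python
from nltk import FreqDist

pos_freq = FreqDist(pos_words)
pprint(pos_freq.most_common(20))  # top 20 positive words by count
```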

From our list of positive words, we see that “apple” and “google” are the top words. Notice how the numbers 5 and 000 are also present in our list; they can be filtered out with more preprocessing if you want.

Usually, when visualizing tokens, a better option is to use word clouds, as the size of the words correlates with their count, so you have a better idea of which words are important.

Word clouds

Here is the word cloud generated for the positive and negative words list.
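One way to generate them is with the third-party wordcloud package (an assumption on my part; shown here for the positive words only):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(pos_words))

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```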

Words from post titles labeled as positive from the subreddit r/technews

We can imagine what positive news was related to these words from the positive word cloud.

The words “Apple” and “Google” could relate to good deeds that the big tech companies are doing.

We also see the words “Elon Musk”, “Tesla”, and “SpaceX” amongst the top positive words, which most likely refer to technological advancements or perhaps Elon’s philanthropic work.

To find out the exact news, I wrote up a function to extract the titles.
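A sketch of such a function (hypothetical name and signature):

```python
def titles_containing(words, label=None):
    """Return titles containing any of the given words, optionally filtered by label."""
    mask = df["title"].str.contains("|".join(words), case=False)
    if label is not None:
        mask &= df["label"] == label
    return df[mask]["title"].tolist()

pprint(titles_containing(["Elon", "Musk"], label="positive"))
```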

When given the words Elon Musk, these titles were extracted.

Now let’s have a look at the negative words.

Words from post titles labeled as negative from the subreddit r/technews

At first glance, we can tell the big tech companies are more prominent in the negative words, along with the words “ban”, “internet”, “data”, and “Trump”. This suggests it was the news about Donald Trump being banned by social media companies.

Negative words are also more evident in this word cloud, as words like “fake”, “misinformation”, “lawsuit”, “hacked”, “attack”, “blocking”, etc. pop up.

Extracting the titles for the word “Facebook”, and sure enough, the news was about Trump being banned.

Notice that the second title — which is actually positive news — is labeled as negative because of the words “banning” and “misinformation”, which shows you a limitation of VADER.

There you go! You scraped Reddit tech news headlines, performed sentiment analysis on them, tokenized the titles, and generated word clouds!

This was just a glimpse into what NLTK can achieve in terms of NLP, and there are definitely improvements you can make to the sentiment analysis to label the posts more accurately.

If you want to know more, I listed a few articles below for you to dive deeper into this topic!

That’s all for this article, and I hope you learned something new from it!

Thanks for reading 😉 !

Links

Further readings

Liked what you read? Here are some articles you may enjoy:

If you like these kinds of articles, be sure to follow the bitgrit Data Science Publication for more!

Follow bitgrit’s socials 📱 to stay updated on talks and upcoming competitions!
