
Subreddit Sentiment Analysis notebook

Table of Contents

Part I: Data Acquisition and Loading

This is a two-part data science project: the first part involved acquiring and loading the data, and the second covers the analytics. I will not go through the first part in this notebook, since the focus here is purely text analysis (if enough requests are made, I can upload it).

Part II: Analytics

  1. Produce interesting visualizations of the linguistic data.
    • Try to look for trends (within a subreddit) and variations of topics across subreddits
    • Some comparative plots across feeds
  2. Write a summary of your findings!

Part II: Analytics

Task: Produce interesting visualizations of the linguistic data.

## Your code in this cell
## ------------------------
import re
import spacy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%%sql

SELECT DISTINCT subreddit
FROM ydn3f.redditposts;
 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_student
4 rows affected.
subreddit
solotravel
dataengineering
LawSchool
COVID19positive
## Let's retrieve our data that was loaded in our PGSQL database
## ------------------------

credentials = "creds"  # placeholder for the database connection string (redacted)


dataeng = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'dataengineering'
            """, con = credentials)

lawschool = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'LawSchool'
            """, con = credentials)

covid19 = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'COVID19positive'
            """, con = credentials)

solotravel = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'solotravel'
            """, con = credentials)

dataeng.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 151xsis 567 Data Scientists -- Ok, now I get it. https://www.reddit.com/r/dataengineering/comme... tarzanboy76 dataengineering Discussion 2023-07-17 220 a few days ago, our data scientist gave me som... 0.035 0.861 0.104 0.9340 POS 'access':193 'actual':62,110,147 'admin':192 '... 'access':193 'actual':62,110,147 'admin':192 '...
1 10kl6lg 374 Finally got a job https://www.reddit.com/r/dataengineering/comme... 1000gratitudepunches dataengineering Career 2023-01-25 100 i did it! after 8 months of working as a budte... 0.000 0.950 0.050 0.5093 POS '12':24 '400':20 '8':5 'applic':22 'believ':42... '12':24 '400':20 '8':5 'applic':22 'believ':42...
2 yyh6l9 381 What are your favourite GitHub repos that show... https://www.reddit.com/r/dataengineering/comme... theoriginalmantooth dataengineering Discussion 2022-11-18 40 looking to level up my skills and want to know... 0.000 0.899 0.101 0.5775 POS 'accounts/repos':20 'alreadi':46 'data':17 'di... 'accounts/repos':20 'alreadi':46 'data':17 'di...
3 14663ur 294 r/dataengineering will be joining the blackout... https://www.reddit.com/r/dataengineering/comme... AutoModerator dataengineering Meta 2023-06-10 21 [see here for the original r/dataengineering t... 0.087 0.840 0.073 -0.8688 NEG '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... '/)*.':536 '/hc/en-us/requests/new):':352 '/r/...
4 10fg07o 286 just got laid off (FAANG) https://www.reddit.com/r/dataengineering/comme... Foodwithfloyd dataengineering Career 2023-01-18 84 hi all, its been a pretty awful day. two month... 0.032 0.808 0.160 0.9118 POS 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'...
lawschool.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 13c2x19 6172 I promised my mom on her death bed that I woul... https://www.reddit.com/r/LawSchool/comments/13... cinnamorolloing LawSchool None 2023-05-08 192 this one is for you, mom. 0.000 1.000 0.000 0.0000 NEU 'mom':6 'one':2 'mom':6 'one':2
1 14fhvdj 1590 Not in law school (Econ undergrad) but I am cu... https://www.reddit.com/r/LawSchool/comments/14... om-om LawSchool None 2023-06-21 74 0.000 0.000 0.000 0.0000 NEU
2 13dw7mo 1531 A Sigma Male Law School Schedule https://www.reddit.com/r/LawSchool/comments/13... Equivalent-Editor697 LawSchool None 2023-05-10 110 2:00 am- wake up2.05am-cold shower2.15am-break... 0.026 0.974 0.000 -0.2960 NEG '-2':123 '00':2,124 '00am':42,64 '00am-arrive'... '-2':123 '00':2,124 '00am':42,64 '00am-arrive'...
3 151geb6 1458 Sex during the bar? https://www.reddit.com/r/LawSchool/comments/15... Decent_Situation_952 LawSchool None 2023-07-16 219 i’m sitting for the bar this month. during the... 0.051 0.890 0.059 -0.4836 NEG '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr... '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr...
4 12k0vjz 1282 I passed the bar exam! https://www.reddit.com/r/LawSchool/comments/12... Organic-Ad-86 LawSchool None 2023-04-12 74 ....and i'm stoked. that's all. 0.000 1.000 0.000 0.0000 NEU 'm':3 'stoke':4 'm':3 'stoke':4
covid19.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 yjrg0a 907 Up vote if you're currently positive with your... https://www.reddit.com/r/COVID19positive/comme... Hailabigail COVID19positive Tested Positive - Breakthrough 2022-11-02 311 i'm seeing an overwhelming amount of posts wit... 0.106 0.727 0.167 0.7293 POS '10':48 'amount':6,29 'breakthrough':31 'covid... '10':48 'amount':6,29 'breakthrough':31 'covid...
1 13p6qrm 597 Why is everyone pretending the pandemic disapp... https://www.reddit.com/r/COVID19positive/comme... marconas1_ COVID19positive Rant 2023-05-22 271 i work in a tech company, and it has become co... 0.154 0.776 0.069 -0.8343 NEG 'accept':66 'affect':33 'back':40 'becom':10 '... 'accept':66 'affect':33 'back':40 'becom':10 '...
2 12lw075 461 What is….happening here? https://www.reddit.com/r/COVID19positive/comme... brutallyhonestkitten COVID19positive Rant 2023-04-14 201 like the title says, i feel like i am living i... 0.027 0.853 0.119 0.9247 POS 'absolut':37 'alien':126 'altern':13 'anymor':... 'absolut':37 'alien':126 'altern':13 'anymor':...
3 zw72uc 418 This new variant was one of the worst experien... https://www.reddit.com/r/COVID19positive/comme... Throwawayacount5093 COVID19positive Tested Positive - Me 2022-12-27 145 i’m in my early twenties, fully vaxed and boos... 0.121 0.790 0.089 -0.8072 NEG '104':120 '4':233 '60':206 '60mg':201 'abl':21... '104':120 '4':233 '60':206 '60mg':201 'abl':21...
4 zji350 396 The pandemic's over they said. You don't need ... https://www.reddit.com/r/COVID19positive/comme... None COVID19positive Tested Positive - Me 2022-12-12 216 i haven't slept in nearly 40 hours, was in the... 0.069 0.839 0.091 0.2023 POS '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong...
solotravel.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 16c1of1 5356 The number of old sex tourists in Bangkok is i... https://www.reddit.com/r/solotravel/comments/1... Weekly-Patience4087 solotravel None 2023-09-07 629 i am currently in bangkok and the number of se... 0.075 0.857 0.068 0.2326 POS '2':211 'advantag':143 'also':61 'area':20,185... '2':211 'advantag':143 'also':61 'area':20,185...
1 11zkusj 2967 I encountered my first begpacker today https://www.reddit.com/r/solotravel/comments/1... Northerner6 solotravel None 2023-03-23 332 i encountered my first begpacker today. i was ... 0.074 0.876 0.050 -0.6554 NEG 'accent':38 'afford':176 'american':37 'approa... 'accent':38 'afford':176 'american':37 'approa...
2 13uw9tn 2205 REMINDER: Unwanted sexual attention is NEVER O... https://www.reddit.com/r/solotravel/comments/1... unsuspectingmuggle solotravel Accommodation 2023-05-29 301 report people who make you feel unsafe!i've be... 0.128 0.826 0.047 -0.9464 NEG '11':34 '25':254 '99.99':256 'alon':124,248 'a... '11':34 '25':254 '99.99':256 'alon':124,248 'a...
3 11ccux4 2124 I have been in India for a month and so far I ... https://www.reddit.com/r/solotravel/comments/1... Big-Assist-5 solotravel None 2023-02-26 476 one of the times i was staying at a guest hous... 0.165 0.828 0.007 -0.9831 NEG '30':20 'answer':63 'appar':29 'away':22 'basi... '30':20 'answer':63 'appar':29 'away':22 'basi...
4 146y8eu 1900 The first time I have ever felt unsafe in SE A... https://www.reddit.com/r/solotravel/comments/1... ihatemycohort solotravel Asia 2023-06-11 232 i just had a complete scare. im still shaking ... 0.102 0.811 0.087 -0.9189 NEG '1':554,1018 '10':71 '15':199 '2':64,562,986 '... '1':554,1018 '10':71 '15':199 '2':64,562,986 '...

Sentiment counts of each subreddit

dataeng['sentiment'].value_counts()
POS    78
NEG    17
NEU     5
Name: sentiment, dtype: int64
lawschool['sentiment'].value_counts()
POS    49
NEG    38
NEU    13
Name: sentiment, dtype: int64
covid19['sentiment'].value_counts()
NEG    58
POS    36
NEU     6
Name: sentiment, dtype: int64
solotravel['sentiment'].value_counts()
POS    71
NEG    29
Name: sentiment, dtype: int64

Exploratory Visualizations

Polarity distributions

# Create a figure and axis
plt.figure(figsize=(10, 6))


# Plot the distribution of the 'compound' score for the 'dataeng' subreddit
sns.distplot(dataeng['compound'], color='green', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/dataengineering')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/dataengineering]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'lawschool' subreddit
sns.distplot(lawschool['compound'], color='orange', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/lawschool')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/LawSchool]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'covid19' subreddit
sns.distplot(covid19['compound'], color='brown', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/covid19')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/COVID19positive]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'solotravel' subreddit
sns.distplot(solotravel['compound'], color='purple', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/solotravel')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/solotravel]
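Side note: the four cells above are near-identical, and sns.distplot has since been deprecated in seaborn. A minimal sketch of the same four histograms in one loop with histplot, assuming the four dataframes loaded above:

# One histogram of compound scores per subreddit, using the non-deprecated histplot
import matplotlib.pyplot as plt
import seaborn as sns

frames = {
    'r/dataengineering': (dataeng, 'green'),
    'r/LawSchool': (lawschool, 'orange'),
    'r/COVID19positive': (covid19, 'brown'),
    'r/solotravel': (solotravel, 'purple'),
}

for name, (df, color) in frames.items():
    plt.figure(figsize=(10, 6))
    sns.histplot(df['compound'], color=color)
    plt.xlabel('Compound Score')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of Compound Scores in {name}')
    plt.show()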

Lawschool subreddit

sns.set(style='whitegrid')
import nltk
from nltk import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords

lawschool_list = []

for row in lawschool["content"]:
    lawschool_list.append(row)

lawschool_content = ' '.join(lawschool_list)

lawschool_tokens = word_tokenize(lawschool_content)
total_word_count = len(lawschool_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in lawschool_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()

[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

[Plot: top 10 words by percentage share in r/LawSchool]

from nltk.util import ngrams

def plot_ngram_percentage_share(tokens, num, total_word_count, num_results=25):
    ngram = list(ngrams(tokens, num))
    ngram_dist = nltk.FreqDist(ngram)
    
    # Calculate the percentage share of the top n-grams
    ngram_freq = ngram_dist.most_common(num_results)
    percentage_share = [(bigram, freq / total_word_count * 100) for bigram, freq in ngram_freq]

    # Create the plot
    x, y = zip(*percentage_share)
    plt.figure(figsize=(10, 6))
    plt.bar([" ".join(bigram) for bigram in x], y)
    plt.xlabel(f"Top {num_results} {num}-grams")
    plt.ylabel("Percentage Share")
    plt.xticks(fontsize=15, rotation=75)
    plt.show()

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/LawSchool]

The first chart doesn’t really reveal anything beyond the obvious, so I decided to dig a little deeper by exploring the frequent tri-grams in the lawschool subreddit’s dataset. I was immediately surprised to see that the top 3 are mirror reflections of one another. I’m not a law student, so I don’t really know why those three words are mentioned together so often, but I’m curious to hear why. The rest of the tri-grams in the top 10 are pretty self-explanatory.
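To look at those mirrored entries directly rather than reading them off the bar chart, the trigram counts can be printed; a quick sketch using the same lawschool tokens as above:

# Print the ten most common trigrams with their raw counts to inspect the mirrored entries
from nltk.util import ngrams
import nltk

trigram_dist = nltk.FreqDist(ngrams(tokens_wo_stopwords, 3))
for trigram, count in trigram_dist.most_common(10):
    print(count, ' '.join(trigram))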

#Topic Modelling

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(lawschool_list)
feature_names = count_vectorizer.get_feature_names()

print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")

#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)  
lda.fit(term_frequency)  

def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        
        sorted_indx = term_weights.argsort()

        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words :]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))


display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
need;ranking;student;people;students;just;law;tax;corporate;did
Topic 1:
doesn;westlaw;foster;partner;know;care;getting;school;bar;summer
Topic 2:
probably;school;say;did;people;law;like;just;student;gen
Topic 3:
students;firm;school;class;know;time;just;like;people;law
Topic 4:
got;like;people;firm;going;know;bar;just;school;law
#TFIDF VEC

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lawschool_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")

# 5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)  # note: fit on the count matrix, not the TF-IDF matrix

#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
feeling;don;measure;like;paperwork;adhd;just;tax;corporate;did
Topic 1:
hard;students;going;just;make;exam;prep;bar;school;law
Topic 2:
let;really;just;tax;law;real;firm;people;know;like
Topic 3:
probably;youre;law;okay;like;student;say;people;just;gen
Topic 4:
got;heard;event;offer;minecraft;firm;just;getting;summer;partner


/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
  "'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)

The NMF decomposition seemed to surface more distinct topics in the lawschool subreddit than LDA did. I can easily infer Topic 1 to be about exam preparation, like the bar and all the difficult long hours of studying for it. Topic 3 sounds like users are posting submissions about being a student, maybe with words of support or encouragement. Topic 4 can easily be described as discussion related to working at a firm over the summer, or a job offer of sorts.
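Note that the NMF above was fit on the count matrix (term_frequency) and only displayed with the TF-IDF feature names. A minimal sketch of fitting NMF on the TF-IDF matrix itself, if a true TF-IDF factorization is wanted:

# Fit NMF on the TF-IDF matrix itself, then reuse the helper to print the top words per topic
from sklearn.decomposition import NMF

nmf_tfidf = NMF(n_components=5, random_state=0)
nmf_tfidf.fit(tfidf)

display_topics(nmf_tfidf, tfidf_feature_names, 10)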

# Lawschool Sentiment Over Time
lawschool['published'] = pd.to_datetime(lawschool['published'])

lawschool_monthly = lawschool.copy()
lawschool_monthly.set_index('published', inplace=True)
lawschool_monthly = lawschool_monthly.resample('W').agg({'neg': 'mean', 'neu': 'mean', 'pos': 'mean', 'compound': 'mean'})  # weekly mean sentiment scores
# Plot the sentiment scores over time
plt.figure(figsize=(16, 5))
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neg'], label='Negative', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neu'], label='Neutral', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['pos'], label='Positive', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['compound'], label='Compound', linewidth=3, color='gray', alpha=0.8, style=True, dashes=[(1,1)], legend=False)
plt.xlabel('Time')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Analysis Over Time for r/lawschool (past 12 months)')
plt.legend()
plt.grid(True)
plt.show()

[Plot: weekly sentiment scores over time for r/LawSchool]

For visualizing sentiment over time, I chose a line chart for interpretability. The most notable stretch of the time series is the month of March, especially at its beginning, when a substantial cluster of negatively scored posts was submitted. Could it be the period before bar exams? 🤔
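To check that hunch numerically, the weekly averages can simply be sorted; a quick sketch on the resampled frame built above:

# Show the weeks with the lowest average compound score
most_negative_weeks = lawschool_monthly.dropna().sort_values('compound').head(5)
print(most_negative_weeks[['neg', 'compound']])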

Solotravel subreddit

solotravel.head(3)
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 16c1of1 5356 The number of old sex tourists in Bangkok is i... https://www.reddit.com/r/solotravel/comments/1... Weekly-Patience4087 solotravel None 2023-09-07 629 i am currently in bangkok and the number of se... 0.075 0.857 0.068 0.2326 POS '2':211 'advantag':143 'also':61 'area':20,185... '2':211 'advantag':143 'also':61 'area':20,185...
1 11zkusj 2967 I encountered my first begpacker today https://www.reddit.com/r/solotravel/comments/1... Northerner6 solotravel None 2023-03-23 332 i encountered my first begpacker today. i was ... 0.074 0.876 0.050 -0.6554 NEG 'accent':38 'afford':176 'american':37 'approa... 'accent':38 'afford':176 'american':37 'approa...
2 13uw9tn 2205 REMINDER: Unwanted sexual attention is NEVER O... https://www.reddit.com/r/solotravel/comments/1... unsuspectingmuggle solotravel Accommodation 2023-05-29 301 report people who make you feel unsafe!i've be... 0.128 0.826 0.047 -0.9464 NEG '11':34 '25':254 '99.99':256 'alon':124,248 'a... '11':34 '25':254 '99.99':256 'alon':124,248 'a...
solotravel_list = []

for row in solotravel["content"]:
    solotravel_list.append(row)

solotravel_content = ' '.join(solotravel_list)

solotravel_tokens = word_tokenize(solotravel_content)
total_word_count = len(solotravel_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in solotravel_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/solotravel]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/solotravel]

{solotravel} Sentiment analysis by detected geographic location

from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
# Function to extract location mentions
def extract_locations(text):
    doc = nlp(text)
    locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return locations

travel_posts = solotravel[["content", "neg", "neu", "pos", "compound"]]

# Define a dictionary to store sentiment scores by location
location_sentiment = defaultdict(list)

# Iterate through each post and extract location mentions
for index, row in travel_posts.iterrows():
    content = row["content"]
    sentiment = {
        "neg": row["neg"],
        "neu": row["neu"],
        "pos": row["pos"],
        "compound": row["compound"],
    }
    locations = extract_locations(content)
    for location in locations:
        location_sentiment[location].append(sentiment)

# Calculate summary statistics for sentiment by location
location_summary = {}
for location, sentiment_scores in location_sentiment.items():
    num_posts = len(sentiment_scores)
    if num_posts > 0:
        summary = {
            "num_posts": num_posts,
            "avg_neg": sum(score["neg"] for score in sentiment_scores) / num_posts,
            "avg_neu": sum(score["neu"] for score in sentiment_scores) / num_posts,
            "avg_pos": sum(score["pos"] for score in sentiment_scores) / num_posts,
            "avg_compound": sum(score["compound"] for score in sentiment_scores) / num_posts,
        }
        location_summary[location] = summary
geo_sentiments = pd.DataFrame.from_dict(location_summary)
geo_sentiments = geo_sentiments.transpose().rename_axis('geo-entity').reset_index()
# Sort the DataFrame and select the top and bottom rows
sorted_geo_sentiments = geo_sentiments.query("num_posts > 5").sort_values(by='avg_compound')  # keep locations mentioned in more than 5 posts
bottom_geo_sentiments = sorted_geo_sentiments.head(5)  # five most negative locations
top_geo_sentiments = sorted_geo_sentiments.tail(5)     # five most positive locations
tb_geo_sentiments = pd.concat([bottom_geo_sentiments, top_geo_sentiments])

colors = ['#2E8BC0' if avg_compound > 0 else '#AE0000' for avg_compound in tb_geo_sentiments['avg_compound']]

# Create the bar chart
plt.figure(figsize=(12, 6))  # Adjust the figure size as needed
sns.barplot(y='avg_compound', x='geo-entity', data=tb_geo_sentiments, palette=colors)

# Add the abline for y=0 (neutral sentiment)
plt.axhline(0, color='black', linewidth=2, linestyle='-')

# Customize the labels and titles
plt.xlabel('Location')
plt.ylabel('Average Compound Sentiment Score')
plt.title('Average Compound Sentiment by Geo-entity')

# Display the chart
plt.show()

[Plot: average compound sentiment by geo-entity]

A deeper look at the disparity of scores across the East vs. West divide

# Create an empty list to store the records
location_data = []

# Iterate through location_sentiment and convert it into records
for location, sentiment_scores in location_sentiment.items():
    for sentiment_score in sentiment_scores:
        record = {
            'location': location,
            'neg': sentiment_score['neg'],
            'neu': sentiment_score['neu'],
            'pos': sentiment_score['pos'],
            'compound': sentiment_score['compound']
        }
        location_data.append(record)

# Create a DataFrame from the list of records
location_df = pd.DataFrame(location_data)
# List of selected locations (countries and cities)
specified_countries = ['india', 'japan', 'thailand', 'vietnam', 'paris', 'romania', 'berlin', 'venice', 'madrid', 'rome']

# Filter the DataFrame to include only the specified locations
popular_countries = location_df[location_df['location'].isin(specified_countries)]
# List of sentiment columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=popular_countries, x='location', y=sentiment_column, palette='Set3')
    plt.xlabel('Location')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Geo-entity')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by location]

Covid19 subreddit

covid19.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 yjrg0a 907 Up vote if you're currently positive with your... https://www.reddit.com/r/COVID19positive/comme... Hailabigail COVID19positive Tested Positive - Breakthrough 2022-11-02 311 i'm seeing an overwhelming amount of posts wit... 0.106 0.727 0.167 0.7293 POS '10':48 'amount':6,29 'breakthrough':31 'covid... '10':48 'amount':6,29 'breakthrough':31 'covid...
1 13p6qrm 597 Why is everyone pretending the pandemic disapp... https://www.reddit.com/r/COVID19positive/comme... marconas1_ COVID19positive Rant 2023-05-22 271 i work in a tech company, and it has become co... 0.154 0.776 0.069 -0.8343 NEG 'accept':66 'affect':33 'back':40 'becom':10 '... 'accept':66 'affect':33 'back':40 'becom':10 '...
2 12lw075 461 What is….happening here? https://www.reddit.com/r/COVID19positive/comme... brutallyhonestkitten COVID19positive Rant 2023-04-14 201 like the title says, i feel like i am living i... 0.027 0.853 0.119 0.9247 POS 'absolut':37 'alien':126 'altern':13 'anymor':... 'absolut':37 'alien':126 'altern':13 'anymor':...
3 zw72uc 418 This new variant was one of the worst experien... https://www.reddit.com/r/COVID19positive/comme... Throwawayacount5093 COVID19positive Tested Positive - Me 2022-12-27 145 i’m in my early twenties, fully vaxed and boos... 0.121 0.790 0.089 -0.8072 NEG '104':120 '4':233 '60':206 '60mg':201 'abl':21... '104':120 '4':233 '60':206 '60mg':201 'abl':21...
4 zji350 396 The pandemic's over they said. You don't need ... https://www.reddit.com/r/COVID19positive/comme... None COVID19positive Tested Positive - Me 2022-12-12 216 i haven't slept in nearly 40 hours, was in the... 0.069 0.839 0.091 0.2023 POS '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong...
covid19_list = []

for row in covid19["content"]:
    covid19_list.append(row)

covid19_content = ' '.join(covid19_list)

covid19_tokens = word_tokenize(covid19_content)
total_word_count = len(covid19_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/COVID19positive]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/COVID19positive]

covid19['published'] = pd.to_datetime(covid19['published'])
from nltk import bigrams, trigrams

#list(bigrams(tokens_wo_stopwords))
nltk.FreqDist(list(bigrams(tokens_wo_stopwords)))
FreqDist({('feel', 'like'): 27, ('tested', 'positive'): 20, ('felt', 'like'): 16, ('first', 'time'): 14, ('got', 'covid'): 14, ('taste', 'smell'): 13, ('feels', 'like'): 12, ('wearing', 'mask'): 11, ('sense', 'smell'): 11, ('last', 'week'): 10, ...})
# Convert the 'published' column to datetime
covid19['published'] = pd.to_datetime(covid19['published'])

# Define a time period for grouping: the ISO calendar week of each post
covid19['time_period'] = covid19['published'].dt.isocalendar().week

# Initialize a list to store tokenized text for each time period
tokenized_by_time = []

# Iterate over time periods and tokenize the text for each period
for time_period, group in covid19.groupby('time_period'):
    covid19_content = ' '.join(group['content'])
    covid19_tokens = word_tokenize(covid19_content)
    tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
    tokenized_by_time.append(tokens_wo_stopwords)
top_bigrams_by_time = []

# Iterate over time periods and calculate the top 3 bigrams for each period
for tokens in tokenized_by_time:
    bigram_fd = FreqDist(ngrams(tokens, 2))
    top_bigrams = bigram_fd.most_common(3)
    top_bigrams_by_time.append(top_bigrams)
df = pd.DataFrame({'time_period': sorted(covid19['time_period'].unique()), 'Top Bigrams': top_bigrams_by_time})  # sorted to line up with the groupby order above
top_bigrams_by_time[0]
[(('give', 'space'), 2), (('work', 'get'), 2), (('time', 'covid'), 2)]
# Initialize lists to store data for the DataFrame
time_periods = []
bigram_strings = []
counters = []

# Iterate over time periods (sorted, to match the groupby order above) and unpack the top bigrams
for i, time_period in enumerate(sorted(covid19['time_period'].unique())):
    for bigram, counter in top_bigrams_by_time[i]:
        time_periods.append(time_period)
        bigram_strings.append(' '.join(bigram))
        counters.append(counter)

# Create a DataFrame with the extracted data
df = pd.DataFrame({'time_period': time_periods, 'Bigram': bigram_strings, 'Counter': counters})
df.sort_values('Counter', ascending=False).head(10)
time_period Bigram Counter
96 7 dry cough 7
97 7 taste smell 6
98 7 night sweats 5
81 38 high fever 5
82 38 pretty much 4
25 42 wearing mask 4
24 42 still one 4
105 24 feels like 4
60 27 nasal spray 4
63 14 feel like 4
display(df.shape)

# Filter the DataFrame to bigrams containing 'covid' or one of the selected symptom terms
covid_bigrams = df[df['Bigram'].str.contains('covid|dry cough|taste smell|night sweats|nasal spray|high fever')]

    
# Define a color mapping for each bigram
def map_color(bigram):
    if 'covid' in bigram:
        return 'teal'
    elif 'dry cough' in bigram:
        return 'red'
    elif 'taste smell' in bigram:
        return 'green'
    elif 'night sweats' in bigram:
        return 'orange'
    elif 'nasal spray' in bigram:
        return 'hotpink'
    elif 'high fever' in bigram:
        return 'maroon'
    else:
        return 'gray'  # Default color for unmatched bigrams
(114, 3)
covid_bigrams = covid_bigrams.assign(Color=covid_bigrams['Bigram'].apply(map_color))
import matplotlib.patches as mpatches

# Create the scatter plot with jitter and alpha for the filtered data
plt.figure(figsize=(14, 6))

for bigram in covid_bigrams['Bigram'].unique():
    subset = covid_bigrams[covid_bigrams['Bigram'] == bigram]
    color = map_color(bigram)
    jitter = np.random.normal(0, 0.9, len(subset))  # Add jitter for each unique bigram
    plt.scatter(
        subset['time_period'] + jitter,  # Match the length of jitter to the subset
        subset['Counter'],
        s=400,  # Adjust the size here (e.g., 100)
        c=color,  # Set the color based on the mapping
        alpha=0.6,
        edgecolor='black',  # Add a black outline
        linewidth=1.5,  # Control the thickness of the outline
        label=bigram
    )

plt.xlabel('Week #')
plt.ylabel('Bigram Counter')
plt.title('Top Bigrams Over Time (Week 0 = Oct 2022)')
plt.xticks()
plt.grid(True)

# Create custom legend patches for each color category
legend_labels = [
    mpatches.Patch(color='teal', label='Covid-Related Bigrams'),
    mpatches.Patch(color='red', label='Dry Cough Bigrams'),
    mpatches.Patch(color='green', label='Taste and Smell Bigrams'),
    mpatches.Patch(color='orange', label='Night Sweats Bigrams'),
    mpatches.Patch(color='hotpink', label='Nasal Spray Bigrams'),
    mpatches.Patch(color='maroon', label='High Fever Bigrams')
]

plt.legend(handles=legend_labels, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

[Plot: top bigrams over time in r/COVID19positive]

Data engineering subreddit

dataeng.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 151xsis 567 Data Scientists -- Ok, now I get it. https://www.reddit.com/r/dataengineering/comme... tarzanboy76 dataengineering Discussion 2023-07-17 220 a few days ago, our data scientist gave me som... 0.035 0.861 0.104 0.9340 POS 'access':193 'actual':62,110,147 'admin':192 '... 'access':193 'actual':62,110,147 'admin':192 '...
1 10kl6lg 374 Finally got a job https://www.reddit.com/r/dataengineering/comme... 1000gratitudepunches dataengineering Career 2023-01-25 100 i did it! after 8 months of working as a budte... 0.000 0.950 0.050 0.5093 POS '12':24 '400':20 '8':5 'applic':22 'believ':42... '12':24 '400':20 '8':5 'applic':22 'believ':42...
2 yyh6l9 381 What are your favourite GitHub repos that show... https://www.reddit.com/r/dataengineering/comme... theoriginalmantooth dataengineering Discussion 2022-11-18 40 looking to level up my skills and want to know... 0.000 0.899 0.101 0.5775 POS 'accounts/repos':20 'alreadi':46 'data':17 'di... 'accounts/repos':20 'alreadi':46 'data':17 'di...
3 14663ur 294 r/dataengineering will be joining the blackout... https://www.reddit.com/r/dataengineering/comme... AutoModerator dataengineering Meta 2023-06-10 21 [see here for the original r/dataengineering t... 0.087 0.840 0.073 -0.8688 NEG '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... '/)*.':536 '/hc/en-us/requests/new):':352 '/r/...
4 10fg07o 286 just got laid off (FAANG) https://www.reddit.com/r/dataengineering/comme... Foodwithfloyd dataengineering Career 2023-01-18 84 hi all, its been a pretty awful day. two month... 0.032 0.808 0.160 0.9118 POS 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'...
# Word frequencies and n-grams

dataeng_list = []

for row in dataeng["content"]:
    dataeng_list.append(row)

dataeng_content = ' '.join(dataeng_list)

dataeng_tokens = word_tokenize(dataeng_content)
total_word_count = len(dataeng_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in dataeng_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/dataengineering]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/dataengineering]

#Topic Modelling

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(dataeng_list)
feature_names = count_vectorizer.get_feature_names()

print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")

#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)  
lda.fit(term_frequency)  

def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        
        sorted_indx = term_weights.argsort()

        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words :]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))


display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2888)
Num of features identified = 2888
Topic 0:
pipeline;sql;like;use;just;api;need;https;cloud;data
Topic 1:
isn;app;databricks;data;make;comments;www;https;com;reddit
Topic 2:
years;engineering;people;job;just;time;sql;like;company;data
Topic 3:
years;team;ve;learn;know;really;just;like;databricks;data
Topic 4:
team;blog;dbt;snowflake;databricks;spark;data;instacart;com;https
#TFIDF VEC

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(dataeng_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")

# 5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)  # note: fit on the count matrix, not the TF-IDF matrix

#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2888)
Num of features identified = 2888


/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
  "'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)


Topic 0:
support;isn;official;app;make;comments;www;https;com;reddit
Topic 1:
business;work;files;engineering;years;ve;sql;learn;just;data
Topic 2:
time;data;excel;just;extremely;team;people;job;like;company
Topic 3:
spark;etl;understand;really;platform;cloud;lot;data;snowflake;databricks
Topic 4:
blog;data;course;spark;snowflake;www;instacart;databricks;com;https
# Sample function to assign topics based on keywords
def assign_topic(content):
    if "career" in content.lower():
        return "Career"
    elif "projects" in content.lower():
        return "Projects"
    elif "personal" in content.lower():
        return "Personal"
    elif "people" in content.lower():
        return "People"
    elif "company" in content.lower():
        return "Company"  
    else:
        return "Other"

def assign_topic_data(content):
    if "sql" in content.lower():
        return "SQL"
    elif "snowflake" in content.lower():
        return "Snowflake"
    elif "databricks" in content.lower():
        return "Databricks"
    elif "apache" in content.lower():
        return "Apache"
    elif "Spark" in content.lower():
        return "Spark"    
    else:
        return "Other"
topics = ["Career", "Projects", "Personal"]
data_topics = ['sql', 'databricks', 'snowflake', 'people', 'company']
    # Apply the function to the DataFrame
dataeng['topic'] = dataeng['content'].apply(assign_topic)
dataeng['data_topic'] = dataeng['content'].apply(assign_topic_data)
# Group by topic and calculate the average sentiment score
role_sentiments = dataeng.groupby('topic')['compound'].mean().reset_index()
# Group by topic and calculate summary statistics
topic_summary = dataeng.groupby('topic').agg({
    'compound': ['mean', 'min', 'max', 'median', 'std'],
    'neg': 'mean',
    'neu': 'mean',
    'pos': 'mean'
}).reset_index()

# Flatten the multi-index columns
topic_summary.columns = ['_'.join(col).strip() for col in topic_summary.columns.values]

# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=dataeng, x='topic', y=sentiment_column, palette='Set2')
    plt.xlabel('Topics')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Topic')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by topic bucket]

tools_sentiments = dataeng.groupby('data_topic')['compound'].mean().reset_index()
# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=dataeng, x='data_topic', y=sentiment_column, palette='Set3')
    plt.xlabel('Data Tools')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Data Tool')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by data-tool bucket]

Concluding task: Write a summary of your findings!

Write your summary in this cell

——————————–

–Distribution Plots– Looking at the distributions of compound sentiment, the shape of each histogram immediately shows how balanced (or skewed) each dataset is. I was a little disappointed to see so many positively scored posts in the r/solotravel dataset; I was hoping for a slightly more even mix of positives and negatives. The same goes for the r/dataengineering subreddit.
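As a quick comparative view of those shapes, the POS/NEU/NEG shares can also be stacked into a single normalized bar chart; a minimal sketch using the four dataframes loaded earlier:

# Compare the sentiment-label mix of the four subreddits in one normalized, stacked bar chart
import pandas as pd
import matplotlib.pyplot as plt

frames = {'dataengineering': dataeng, 'LawSchool': lawschool,
          'COVID19positive': covid19, 'solotravel': solotravel}

# Share of POS/NEU/NEG labels per subreddit (missing labels become 0, e.g. NEU in solotravel)
shares = pd.DataFrame({name: df['sentiment'].value_counts(normalize=True)
                       for name, df in frames.items()}).T.fillna(0)

shares[['POS', 'NEU', 'NEG']].plot(kind='bar', stacked=True, figsize=(10, 5),
                                   color=['#2E8BC0', 'gray', '#AE0000'])
plt.ylabel('Share of posts')
plt.title('Sentiment label mix by subreddit')
plt.show()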

–{solotravel} sentiment analysis by location– It was interesting to see which geographic locations mentioned by users in the solotravel subreddit ranked highest and lowest by average compound score. I set a minimum post-count threshold (more than 5 mentions) to ensure there is a sufficient sample for a rough consensus; it would not be fair for a single negative post about a place to represent the whole country in this analysis. It was also interesting to see the bias the subreddit seemingly has toward European travel destinations as opposed to non-European ones, namely in Asia. I examined this disparity more closely by expanding from the average scores to the full range of values via box plots. What we then see is which locations have strong versus weak consensus (pos, neg, neu, and compound), judged by the box length. For example, the box plots confirm that the European cities of Berlin, Venice, and Madrid are squarely associated with a positive experience among users in the subreddit.
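As a rough numeric check on that box-length reading, the spread of compound scores per location can be computed directly; a minimal sketch using the location_df frame built earlier:

# Rough consensus check: a small IQR of the compound score means tightly clustered opinions
import pandas as pd

grouped = location_df.groupby('location')['compound']
consensus = pd.DataFrame({
    'mentions': grouped.count(),
    'median': grouped.median(),
    'iqr': grouped.quantile(0.75) - grouped.quantile(0.25),
})

# Keep locations with more than 5 mentions, tightest consensus first
print(consensus.query('mentions > 5').sort_values('iqr').head(10))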

—{data engineering} sentiments by topic buckets— Similar to what was done for the solotravel dataset, I grouped the content of each individual post into selected topic buckets. From there, we can see how posts relating to or talking about each topic are generally ranked by sentiment polarity scores. For the most part, posts relating to “projects”, “career”, and “company” are fairly positive. An interesting point about the “people” bucket is that while it is positive on average, its box is longer than the rest, which means the sentiments are more mixed. Is this representative of the actual data? I don’t think so, because “people” is a very general term that can mean many different things depending on context. So without context, I would argue it doesn’t convey as much as the other topic buckets.
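That “longer box” reading can be backed up with the numbers behind the plot; a quick sketch over the dataeng frame and the topic column assigned earlier:

# Spread of compound scores per topic bucket; a larger std backs up the "more mixed" reading
topic_spread = dataeng.groupby('topic')['compound'].agg(['count', 'mean', 'std', 'median'])
print(topic_spread.sort_values('std', ascending=False))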

Then I did the same for data-tool topics. It was really cool to see that Snowflake is the most well received of the bunch when people mention it in their posts, judging by the median of the compound box plot. Conversely, Apache appears to be less well received, though still relatively positive. Databricks, on the other hand, is fairly positive but runs into the familiar problem of having a lot of mixed reviews.
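And the same quick summary for the tool buckets, using the data_topic column assigned earlier (a sketch, just surfacing the medians and spread behind the box plots):

# Median and spread of compound scores per data-tool bucket
tool_spread = dataeng.groupby('data_topic')['compound'].agg(['count', 'median', 'std'])
print(tool_spread.sort_values('median', ascending=False))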

—{covid19} top bi-grams over time— For the covid19 subreddit, I knew I wanted time to be an important element of my analysis, so I built a time-series plot, using weeks rather than months, to analyze how bi-gram frequencies (mentions) shift over time. From the scatter plot, we can see covid-related bigrams showing up in recent weeks, on the right side of the plot, around where we are in the year (October 2023). I wouldn’t be alarmed, though: these are mere bi-grams, not covid tests, so they are just an indicator of how often covid-related discussion is being brought up. What’s also interesting about this scatter plot is the apparent gap in discussion around the week-20 mark, which is around March and April of last year. So around springtime last year, there was a lull in covid-related discussion among users and submissions in the covid19 subreddit - interesting!
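One caveat on that gap (a quick sketch, using the covid19 frame and its time_period column from earlier): it could simply reflect weeks with few or no sampled submissions, which is easy to check.

# Number of sampled posts per ISO week; sparse weeks could explain the apparent gap
posts_per_week = covid19.groupby('time_period').size().sort_index()
print(posts_per_week.loc[15:25])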

Altogether, I think this was a fantastic project for performing some NLP and text analysis given the time constraint. I feel like I have only just scratched the surface with my insights. Thank you for reading.

the end

