
Subreddit Sentiment Analysis notebook

Table of Contents

Part I: Data Acquisition and Loading

This is a two-part data science project: the first part involved acquiring and loading the data, and the second covers the analytics. I will not go through the first part in this notebook, since the focus here is purely text analysis (if enough requests are made, I can upload it).

Part II: Analytics

  1. Produce interesting visualizations of the linguistic data.
    • Try to look for trends (within a subreddit) and variations of topics across subreddits
    • Some comparative plots across feeds
  2. Write a summary of your findings!

Part II: Analytics

Task: Produce interesting visualizations of the linguistic data.

## Your code in this cell
## ------------------------
import re
import spacy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%%sql

SELECT DISTINCT subreddit
FROM ydn3f.redditposts;
 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_student
4 rows affected.
subreddit
solotravel
dataengineering
LawSchool
COVID19positive
## Let's retrieve our data that was loaded in our PGSQL database
## ------------------------

credentials = "creds"  # placeholder for the database connection string (redacted)


dataeng = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'dataengineering'
            """, con = credentials)

lawschool = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'LawSchool'
            """, con = credentials)

covid19 = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'COVID19positive'
            """, con = credentials)

solotravel = pd.read_sql("""
            SELECT *
            FROM ydn3f.redditposts
            WHERE subreddit = 'solotravel'
            """, con = credentials)

dataeng.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 151xsis 567 Data Scientists -- Ok, now I get it. https://www.reddit.com/r/dataengineering/comme... tarzanboy76 dataengineering Discussion 2023-07-17 220 a few days ago, our data scientist gave me som... 0.035 0.861 0.104 0.9340 POS 'access':193 'actual':62,110,147 'admin':192 '... 'access':193 'actual':62,110,147 'admin':192 '...
1 10kl6lg 374 Finally got a job https://www.reddit.com/r/dataengineering/comme... 1000gratitudepunches dataengineering Career 2023-01-25 100 i did it! after 8 months of working as a budte... 0.000 0.950 0.050 0.5093 POS '12':24 '400':20 '8':5 'applic':22 'believ':42... '12':24 '400':20 '8':5 'applic':22 'believ':42...
2 yyh6l9 381 What are your favourite GitHub repos that show... https://www.reddit.com/r/dataengineering/comme... theoriginalmantooth dataengineering Discussion 2022-11-18 40 looking to level up my skills and want to know... 0.000 0.899 0.101 0.5775 POS 'accounts/repos':20 'alreadi':46 'data':17 'di... 'accounts/repos':20 'alreadi':46 'data':17 'di...
3 14663ur 294 r/dataengineering will be joining the blackout... https://www.reddit.com/r/dataengineering/comme... AutoModerator dataengineering Meta 2023-06-10 21 [see here for the original r/dataengineering t... 0.087 0.840 0.073 -0.8688 NEG '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... '/)*.':536 '/hc/en-us/requests/new):':352 '/r/...
4 10fg07o 286 just got laid off (FAANG) https://www.reddit.com/r/dataengineering/comme... Foodwithfloyd dataengineering Career 2023-01-18 84 hi all, its been a pretty awful day. two month... 0.032 0.808 0.160 0.9118 POS 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'...
lawschool.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 13c2x19 6172 I promised my mom on her death bed that I woul... https://www.reddit.com/r/LawSchool/comments/13... cinnamorolloing LawSchool None 2023-05-08 192 this one is for you, mom. 0.000 1.000 0.000 0.0000 NEU 'mom':6 'one':2 'mom':6 'one':2
1 14fhvdj 1590 Not in law school (Econ undergrad) but I am cu... https://www.reddit.com/r/LawSchool/comments/14... om-om LawSchool None 2023-06-21 74 0.000 0.000 0.000 0.0000 NEU
2 13dw7mo 1531 A Sigma Male Law School Schedule https://www.reddit.com/r/LawSchool/comments/13... Equivalent-Editor697 LawSchool None 2023-05-10 110 2:00 am- wake up2.05am-cold shower2.15am-break... 0.026 0.974 0.000 -0.2960 NEG '-2':123 '00':2,124 '00am':42,64 '00am-arrive'... '-2':123 '00':2,124 '00am':42,64 '00am-arrive'...
3 151geb6 1458 Sex during the bar? https://www.reddit.com/r/LawSchool/comments/15... Decent_Situation_952 LawSchool None 2023-07-16 219 i’m sitting for the bar this month. during the... 0.051 0.890 0.059 -0.4836 NEG '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr... '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr...
4 12k0vjz 1282 I passed the bar exam! https://www.reddit.com/r/LawSchool/comments/12... Organic-Ad-86 LawSchool None 2023-04-12 74 ....and i'm stoked. that's all. 0.000 1.000 0.000 0.0000 NEU 'm':3 'stoke':4 'm':3 'stoke':4
covid19.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 yjrg0a 907 Up vote if you're currently positive with your... https://www.reddit.com/r/COVID19positive/comme... Hailabigail COVID19positive Tested Positive - Breakthrough 2022-11-02 311 i'm seeing an overwhelming amount of posts wit... 0.106 0.727 0.167 0.7293 POS '10':48 'amount':6,29 'breakthrough':31 'covid... '10':48 'amount':6,29 'breakthrough':31 'covid...
1 13p6qrm 597 Why is everyone pretending the pandemic disapp... https://www.reddit.com/r/COVID19positive/comme... marconas1_ COVID19positive Rant 2023-05-22 271 i work in a tech company, and it has become co... 0.154 0.776 0.069 -0.8343 NEG 'accept':66 'affect':33 'back':40 'becom':10 '... 'accept':66 'affect':33 'back':40 'becom':10 '...
2 12lw075 461 What is….happening here? https://www.reddit.com/r/COVID19positive/comme... brutallyhonestkitten COVID19positive Rant 2023-04-14 201 like the title says, i feel like i am living i... 0.027 0.853 0.119 0.9247 POS 'absolut':37 'alien':126 'altern':13 'anymor':... 'absolut':37 'alien':126 'altern':13 'anymor':...
3 zw72uc 418 This new variant was one of the worst experien... https://www.reddit.com/r/COVID19positive/comme... Throwawayacount5093 COVID19positive Tested Positive - Me 2022-12-27 145 i’m in my early twenties, fully vaxed and boos... 0.121 0.790 0.089 -0.8072 NEG '104':120 '4':233 '60':206 '60mg':201 'abl':21... '104':120 '4':233 '60':206 '60mg':201 'abl':21...
4 zji350 396 The pandemic's over they said. You don't need ... https://www.reddit.com/r/COVID19positive/comme... None COVID19positive Tested Positive - Me 2022-12-12 216 i haven't slept in nearly 40 hours, was in the... 0.069 0.839 0.091 0.2023 POS '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong...
solotravel.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 16c1of1 5356 The number of old sex tourists in Bangkok is i... https://www.reddit.com/r/solotravel/comments/1... Weekly-Patience4087 solotravel None 2023-09-07 629 i am currently in bangkok and the number of se... 0.075 0.857 0.068 0.2326 POS '2':211 'advantag':143 'also':61 'area':20,185... '2':211 'advantag':143 'also':61 'area':20,185...
1 11zkusj 2967 I encountered my first begpacker today https://www.reddit.com/r/solotravel/comments/1... Northerner6 solotravel None 2023-03-23 332 i encountered my first begpacker today. i was ... 0.074 0.876 0.050 -0.6554 NEG 'accent':38 'afford':176 'american':37 'approa... 'accent':38 'afford':176 'american':37 'approa...
2 13uw9tn 2205 REMINDER: Unwanted sexual attention is NEVER O... https://www.reddit.com/r/solotravel/comments/1... unsuspectingmuggle solotravel Accommodation 2023-05-29 301 report people who make you feel unsafe!i've be... 0.128 0.826 0.047 -0.9464 NEG '11':34 '25':254 '99.99':256 'alon':124,248 'a... '11':34 '25':254 '99.99':256 'alon':124,248 'a...
3 11ccux4 2124 I have been in India for a month and so far I ... https://www.reddit.com/r/solotravel/comments/1... Big-Assist-5 solotravel None 2023-02-26 476 one of the times i was staying at a guest hous... 0.165 0.828 0.007 -0.9831 NEG '30':20 'answer':63 'appar':29 'away':22 'basi... '30':20 'answer':63 'appar':29 'away':22 'basi...
4 146y8eu 1900 The first time I have ever felt unsafe in SE A... https://www.reddit.com/r/solotravel/comments/1... ihatemycohort solotravel Asia 2023-06-11 232 i just had a complete scare. im still shaking ... 0.102 0.811 0.087 -0.9189 NEG '1':554,1018 '10':71 '15':199 '2':64,562,986 '... '1':554,1018 '10':71 '15':199 '2':64,562,986 '...

Sentiment counts of each subreddit

dataeng['sentiment'].value_counts()
POS    78
NEG    17
NEU     5
Name: sentiment, dtype: int64
lawschool['sentiment'].value_counts()
POS    49
NEG    38
NEU    13
Name: sentiment, dtype: int64
covid19['sentiment'].value_counts()
NEG    58
POS    36
NEU     6
Name: sentiment, dtype: int64
solotravel['sentiment'].value_counts()
POS    71
NEG    29
Name: sentiment, dtype: int64

Exploratory Visualizations

Polarity distributions

# Create a figure and axis
plt.figure(figsize=(10, 6))


# Plot the distribution of the 'compound' score for the 'dataeng' subreddit
sns.distplot(dataeng['compound'], color='green', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/dataengineering')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/dataengineering]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'lawschool' subreddit
sns.distplot(lawschool['compound'], color='orange', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/lawschool')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/LawSchool]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'covid19' subreddit
sns.distplot(covid19['compound'], color='brown', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/covid19')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/COVID19positive]

# Create a figure and axis
plt.figure(figsize=(10, 6))

# Plot the distribution of the 'compound' score for the 'solotravel' subreddit
sns.distplot(solotravel['compound'], color='purple', kde=False)

# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/solotravel')

# Show the plot
plt.show()

[Plot: distribution of compound scores in r/solotravel]
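Side note: the four cells above are near-identical, and sns.distplot has since been deprecated in seaborn. A minimal sketch of the same four histograms in one loop with histplot, assuming the four dataframes loaded above:

# One histogram of compound scores per subreddit, using the non-deprecated histplot
import matplotlib.pyplot as plt
import seaborn as sns

frames = {
    'r/dataengineering': (dataeng, 'green'),
    'r/LawSchool': (lawschool, 'orange'),
    'r/COVID19positive': (covid19, 'brown'),
    'r/solotravel': (solotravel, 'purple'),
}

for name, (df, color) in frames.items():
    plt.figure(figsize=(10, 6))
    sns.histplot(df['compound'], color=color)
    plt.xlabel('Compound Score')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of Compound Scores in {name}')
    plt.show()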

Lawschool subreddit

sns.set(style='whitegrid')
import nltk
from nltk import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords

lawschool_list = []

for row in lawschool["content"]:
    lawschool_list.append(row)

lawschool_content = ' '.join(lawschool_list)

lawschool_tokens = word_tokenize(lawschool_content)
total_word_count = len(lawschool_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in lawschool_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()

[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

[Plot: top 10 words by percentage share in r/LawSchool]

from nltk.util import ngrams

def plot_ngram_percentage_share(tokens, num, total_word_count, num_results=25):
    ngram = list(ngrams(tokens, num))
    ngram_dist = nltk.FreqDist(ngram)
    
    # Calculate the percentage share of the top n-grams
    ngram_freq = ngram_dist.most_common(num_results)
    percentage_share = [(bigram, freq / total_word_count * 100) for bigram, freq in ngram_freq]

    # Create the plot
    x, y = zip(*percentage_share)
    plt.figure(figsize=(10, 6))
    plt.bar([" ".join(bigram) for bigram in x], y)
    plt.xlabel(f"Top {num_results} {num}-grams")
    plt.ylabel("Percentage Share")
    plt.xticks(fontsize=15, rotation=75)
    plt.show()

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/LawSchool]

The first chart doesn’t really reveal anything beyond the obvious, so I decided to dig a little deeper by exploring the frequent tri-grams in the lawschool subreddit’s dataset. I was immediately surprised to see that the top 3 are mirror reflections of one another. I’m not a law student, so I don’t really know why those three words are mentioned together so often, but I’m curious to hear why. The rest of the tri-grams in the top 10 are pretty self-explanatory.
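To look at those mirrored entries directly rather than reading them off the bar chart, the trigram counts can be printed; a quick sketch using the same lawschool tokens as above:

# Print the ten most common trigrams with their raw counts to inspect the mirrored entries
from nltk.util import ngrams
import nltk

trigram_dist = nltk.FreqDist(ngrams(tokens_wo_stopwords, 3))
for trigram, count in trigram_dist.most_common(10):
    print(count, ' '.join(trigram))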

#Topic Modelling

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(lawschool_list)
feature_names = count_vectorizer.get_feature_names()

print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")

#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)  
lda.fit(term_frequency)  

def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        
        sorted_indx = term_weights.argsort()

        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words :]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))


display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
need;ranking;student;people;students;just;law;tax;corporate;did
Topic 1:
doesn;westlaw;foster;partner;know;care;getting;school;bar;summer
Topic 2:
probably;school;say;did;people;law;like;just;student;gen
Topic 3:
students;firm;school;class;know;time;just;like;people;law
Topic 4:
got;like;people;firm;going;know;bar;just;school;law
#TFIDF VEC

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lawschool_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")

# 5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)  # note: fit on the count matrix, not the TF-IDF matrix

#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
feeling;don;measure;like;paperwork;adhd;just;tax;corporate;did
Topic 1:
hard;students;going;just;make;exam;prep;bar;school;law
Topic 2:
let;really;just;tax;law;real;firm;people;know;like
Topic 3:
probably;youre;law;okay;like;student;say;people;just;gen
Topic 4:
got;heard;event;offer;minecraft;firm;just;getting;summer;partner


/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
  "'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)

The NMF decomposition seemed to surface more distinct topics in the lawschool subreddit than LDA did. I can easily infer Topic 1 to be about exam preparation, like the bar and all the difficult long hours of studying for it. Topic 3 sounds like users are posting submissions about being a student, maybe with words of support or encouragement. Topic 4 can easily be described as discussion related to working at a firm over the summer, or a job offer of sorts.
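Note that the NMF above was fit on the count matrix (term_frequency) and only displayed with the TF-IDF feature names. A minimal sketch of fitting NMF on the TF-IDF matrix itself, if a true TF-IDF factorization is wanted:

# Fit NMF on the TF-IDF matrix itself, then reuse the helper to print the top words per topic
from sklearn.decomposition import NMF

nmf_tfidf = NMF(n_components=5, random_state=0)
nmf_tfidf.fit(tfidf)

display_topics(nmf_tfidf, tfidf_feature_names, 10)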

# Lawschool Sentiment Over Time
lawschool['published'] = pd.to_datetime(lawschool['published'])

lawschool_monthly = lawschool.copy()
lawschool_monthly.set_index('published', inplace=True)
lawschool_monthly = lawschool_monthly.resample('W').agg({'neg': 'mean', 'neu': 'mean', 'pos': 'mean', 'compound': 'mean'})  # weekly mean sentiment scores
# Plot the sentiment scores over time
plt.figure(figsize=(16, 5))
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neg'], label='Negative', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neu'], label='Neutral', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['pos'], label='Positive', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['compound'], label='Compound', linewidth=3, color='gray', alpha=0.8, style=True, dashes=[(1,1)], legend=False)
plt.xlabel('Time')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Analysis Over Time for r/lawschool (past 12 months)')
plt.legend()
plt.grid(True)
plt.show()

[Plot: weekly sentiment scores over time for r/LawSchool]

For visualizing sentiment over time, I chose a line chart for interpretability. The most notable stretch of the time series is the month of March, especially at its beginning, when a substantial cluster of negatively scored posts was submitted. Could it be the period before bar exams? 🤔
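To check that hunch numerically, the weekly averages can simply be sorted; a quick sketch on the resampled frame built above:

# Show the weeks with the lowest average compound score
most_negative_weeks = lawschool_monthly.dropna().sort_values('compound').head(5)
print(most_negative_weeks[['neg', 'compound']])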

Solotravel subreddit

solotravel.head(3)
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 16c1of1 5356 The number of old sex tourists in Bangkok is i... https://www.reddit.com/r/solotravel/comments/1... Weekly-Patience4087 solotravel None 2023-09-07 629 i am currently in bangkok and the number of se... 0.075 0.857 0.068 0.2326 POS '2':211 'advantag':143 'also':61 'area':20,185... '2':211 'advantag':143 'also':61 'area':20,185...
1 11zkusj 2967 I encountered my first begpacker today https://www.reddit.com/r/solotravel/comments/1... Northerner6 solotravel None 2023-03-23 332 i encountered my first begpacker today. i was ... 0.074 0.876 0.050 -0.6554 NEG 'accent':38 'afford':176 'american':37 'approa... 'accent':38 'afford':176 'american':37 'approa...
2 13uw9tn 2205 REMINDER: Unwanted sexual attention is NEVER O... https://www.reddit.com/r/solotravel/comments/1... unsuspectingmuggle solotravel Accommodation 2023-05-29 301 report people who make you feel unsafe!i've be... 0.128 0.826 0.047 -0.9464 NEG '11':34 '25':254 '99.99':256 'alon':124,248 'a... '11':34 '25':254 '99.99':256 'alon':124,248 'a...
solotravel_list = []

for row in solotravel["content"]:
    solotravel_list.append(row)

solotravel_content = ' '.join(solotravel_list)

solotravel_tokens = word_tokenize(solotravel_content)
total_word_count = len(solotravel_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in solotravel_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/solotravel]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/solotravel]

{solotravel} Sentiment analysis by detected geographic location

from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
# Function to extract location mentions
def extract_locations(text):
    doc = nlp(text)
    locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return locations

travel_posts = solotravel[["content", "neg", "neu", "pos", "compound"]]

# Define a dictionary to store sentiment scores by location
location_sentiment = defaultdict(list)

# Iterate through each post and extract location mentions
for index, row in travel_posts.iterrows():
    content = row["content"]
    sentiment = {
        "neg": row["neg"],
        "neu": row["neu"],
        "pos": row["pos"],
        "compound": row["compound"],
    }
    locations = extract_locations(content)
    for location in locations:
        location_sentiment[location].append(sentiment)

# Calculate summary statistics for sentiment by location
location_summary = {}
for location, sentiment_scores in location_sentiment.items():
    num_posts = len(sentiment_scores)
    if num_posts > 0:
        summary = {
            "num_posts": num_posts,
            "avg_neg": sum(score["neg"] for score in sentiment_scores) / num_posts,
            "avg_neu": sum(score["neu"] for score in sentiment_scores) / num_posts,
            "avg_pos": sum(score["pos"] for score in sentiment_scores) / num_posts,
            "avg_compound": sum(score["compound"] for score in sentiment_scores) / num_posts,
        }
        location_summary[location] = summary
geo_sentiments = pd.DataFrame.from_dict(location_summary)
geo_sentiments = geo_sentiments.transpose().rename_axis('geo-entity').reset_index()
# Sort the DataFrame and select the top and bottom rows
sorted_geo_sentiments = geo_sentiments.query("num_posts > 5").sort_values(by='avg_compound')  # keep locations mentioned in more than 5 posts
bottom_geo_sentiments = sorted_geo_sentiments.head(5)  # five most negative locations
top_geo_sentiments = sorted_geo_sentiments.tail(5)     # five most positive locations
tb_geo_sentiments = pd.concat([bottom_geo_sentiments, top_geo_sentiments])

colors = ['#2E8BC0' if avg_compound > 0 else '#AE0000' for avg_compound in tb_geo_sentiments['avg_compound']]

# Create the bar chart
plt.figure(figsize=(12, 6))  # Adjust the figure size as needed
sns.barplot(y='avg_compound', x='geo-entity', data=tb_geo_sentiments, palette=colors)

# Add the abline for y=0 (neutral sentiment)
plt.axhline(0, color='black', linewidth=2, linestyle='-')

# Customize the labels and titles
plt.xlabel('Location')
plt.ylabel('Average Compound Sentiment Score')
plt.title('Average Compound Sentiment by Geo-entity')

# Display the chart
plt.show()

[Plot: average compound sentiment by geo-entity]

A deeper look at the disparity of scores across the East vs. West divide

# Create an empty list to store the records
location_data = []

# Iterate through location_sentiment and convert it into records
for location, sentiment_scores in location_sentiment.items():
    for sentiment_score in sentiment_scores:
        record = {
            'location': location,
            'neg': sentiment_score['neg'],
            'neu': sentiment_score['neu'],
            'pos': sentiment_score['pos'],
            'compound': sentiment_score['compound']
        }
        location_data.append(record)

# Create a DataFrame from the list of records
location_df = pd.DataFrame(location_data)
# List of selected locations (countries and cities)
specified_countries = ['india', 'japan', 'thailand', 'vietnam', 'paris', 'romania', 'berlin', 'venice', 'madrid', 'rome']

# Filter the DataFrame to include only the specified locations
popular_countries = location_df[location_df['location'].isin(specified_countries)]
# List of sentiment columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=popular_countries, x='location', y=sentiment_column, palette='Set3')
    plt.xlabel('Location')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Geo-entity')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by location]

Covid19 subreddit

covid19.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 yjrg0a 907 Up vote if you're currently positive with your... https://www.reddit.com/r/COVID19positive/comme... Hailabigail COVID19positive Tested Positive - Breakthrough 2022-11-02 311 i'm seeing an overwhelming amount of posts wit... 0.106 0.727 0.167 0.7293 POS '10':48 'amount':6,29 'breakthrough':31 'covid... '10':48 'amount':6,29 'breakthrough':31 'covid...
1 13p6qrm 597 Why is everyone pretending the pandemic disapp... https://www.reddit.com/r/COVID19positive/comme... marconas1_ COVID19positive Rant 2023-05-22 271 i work in a tech company, and it has become co... 0.154 0.776 0.069 -0.8343 NEG 'accept':66 'affect':33 'back':40 'becom':10 '... 'accept':66 'affect':33 'back':40 'becom':10 '...
2 12lw075 461 What is….happening here? https://www.reddit.com/r/COVID19positive/comme... brutallyhonestkitten COVID19positive Rant 2023-04-14 201 like the title says, i feel like i am living i... 0.027 0.853 0.119 0.9247 POS 'absolut':37 'alien':126 'altern':13 'anymor':... 'absolut':37 'alien':126 'altern':13 'anymor':...
3 zw72uc 418 This new variant was one of the worst experien... https://www.reddit.com/r/COVID19positive/comme... Throwawayacount5093 COVID19positive Tested Positive - Me 2022-12-27 145 i’m in my early twenties, fully vaxed and boos... 0.121 0.790 0.089 -0.8072 NEG '104':120 '4':233 '60':206 '60mg':201 'abl':21... '104':120 '4':233 '60':206 '60mg':201 'abl':21...
4 zji350 396 The pandemic's over they said. You don't need ... https://www.reddit.com/r/COVID19positive/comme... None COVID19positive Tested Positive - Me 2022-12-12 216 i haven't slept in nearly 40 hours, was in the... 0.069 0.839 0.091 0.2023 POS '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong...
covid19_list = []

for row in covid19["content"]:
    covid19_list.append(row)

covid19_content = ' '.join(covid19_list)

covid19_tokens = word_tokenize(covid19_content)
total_word_count = len(covid19_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/COVID19positive]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/COVID19positive]

covid19['published'] = pd.to_datetime(covid19['published'])
from nltk import bigrams, trigrams

#list(bigrams(tokens_wo_stopwords))
nltk.FreqDist(list(bigrams(tokens_wo_stopwords)))
FreqDist({('feel', 'like'): 27, ('tested', 'positive'): 20, ('felt', 'like'): 16, ('first', 'time'): 14, ('got', 'covid'): 14, ('taste', 'smell'): 13, ('feels', 'like'): 12, ('wearing', 'mask'): 11, ('sense', 'smell'): 11, ('last', 'week'): 10, ...})
# Convert the 'published' column to datetime
covid19['published'] = pd.to_datetime(covid19['published'])

# Define a time period for grouping: the ISO calendar week of each post
covid19['time_period'] = covid19['published'].dt.isocalendar().week

# Initialize a list to store tokenized text for each time period
tokenized_by_time = []

# Iterate over time periods and tokenize the text for each period
for time_period, group in covid19.groupby('time_period'):
    covid19_content = ' '.join(group['content'])
    covid19_tokens = word_tokenize(covid19_content)
    tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
    tokenized_by_time.append(tokens_wo_stopwords)
top_bigrams_by_time = []

# Iterate over time periods and calculate the top 3 bigrams for each period
for tokens in tokenized_by_time:
    bigram_fd = FreqDist(ngrams(tokens, 2))
    top_bigrams = bigram_fd.most_common(3)
    top_bigrams_by_time.append(top_bigrams)
df = pd.DataFrame({'time_period': sorted(covid19['time_period'].unique()), 'Top Bigrams': top_bigrams_by_time})  # sorted to line up with the groupby order above
top_bigrams_by_time[0]
[(('give', 'space'), 2), (('work', 'get'), 2), (('time', 'covid'), 2)]
# Initialize lists to store data for the DataFrame
time_periods = []
bigram_strings = []
counters = []

# Iterate over time periods (sorted, to match the groupby order above) and unpack the top bigrams
for i, time_period in enumerate(sorted(covid19['time_period'].unique())):
    for bigram, counter in top_bigrams_by_time[i]:
        time_periods.append(time_period)
        bigram_strings.append(' '.join(bigram))
        counters.append(counter)

# Create a DataFrame with the extracted data
df = pd.DataFrame({'time_period': time_periods, 'Bigram': bigram_strings, 'Counter': counters})
df.sort_values('Counter', ascending=False).head(10)
time_period Bigram Counter
96 7 dry cough 7
97 7 taste smell 6
98 7 night sweats 5
81 38 high fever 5
82 38 pretty much 4
25 42 wearing mask 4
24 42 still one 4
105 24 feels like 4
60 27 nasal spray 4
63 14 feel like 4
display(df.shape)

# Filter the DataFrame to bigrams containing 'covid' or one of the selected symptom terms
covid_bigrams = df[df['Bigram'].str.contains('covid|dry cough|taste smell|night sweats|nasal spray|high fever')]

    
# Define a color mapping for each bigram
def map_color(bigram):
    if 'covid' in bigram:
        return 'teal'
    elif 'dry cough' in bigram:
        return 'red'
    elif 'taste smell' in bigram:
        return 'green'
    elif 'night sweats' in bigram:
        return 'orange'
    elif 'nasal spray' in bigram:
        return 'hotpink'
    elif 'high fever' in bigram:
        return 'maroon'
    else:
        return 'gray'  # Default color for unmatched bigrams
(114, 3)
covid_bigrams = covid_bigrams.assign(Color=covid_bigrams['Bigram'].apply(map_color))
import matplotlib.patches as mpatches

# Create the scatter plot with jitter and alpha for the filtered data
plt.figure(figsize=(14, 6))

for bigram in covid_bigrams['Bigram'].unique():
    subset = covid_bigrams[covid_bigrams['Bigram'] == bigram]
    color = map_color(bigram)
    jitter = np.random.normal(0, 0.9, len(subset))  # Add jitter for each unique bigram
    plt.scatter(
        subset['time_period'] + jitter,  # Match the length of jitter to the subset
        subset['Counter'],
        s=400,  # Adjust the size here (e.g., 100)
        c=color,  # Set the color based on the mapping
        alpha=0.6,
        edgecolor='black',  # Add a black outline
        linewidth=1.5,  # Control the thickness of the outline
        label=bigram
    )

plt.xlabel('Week #')
plt.ylabel('Bigram Counter')
plt.title('Top Bigrams Over Time (Week 0 = Oct 2022)')
plt.xticks()
plt.grid(True)

# Create custom legend patches for each color category
legend_labels = [
    mpatches.Patch(color='teal', label='Covid-Related Bigrams'),
    mpatches.Patch(color='red', label='Dry Cough Bigrams'),
    mpatches.Patch(color='green', label='Taste and Smell Bigrams'),
    mpatches.Patch(color='orange', label='Night Sweats Bigrams'),
    mpatches.Patch(color='hotpink', label='Nasal Spray Bigrams'),
    mpatches.Patch(color='maroon', label='High Fever Bigrams')
]

plt.legend(handles=legend_labels, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

[Plot: top bigrams over time in r/COVID19positive]

Data engineering subreddit

dataeng.head()
id score title link author subreddit flair published comments content neg neu pos compound sentiment content_tsv_gin content_tsv_gist
0 151xsis 567 Data Scientists -- Ok, now I get it. https://www.reddit.com/r/dataengineering/comme... tarzanboy76 dataengineering Discussion 2023-07-17 220 a few days ago, our data scientist gave me som... 0.035 0.861 0.104 0.9340 POS 'access':193 'actual':62,110,147 'admin':192 '... 'access':193 'actual':62,110,147 'admin':192 '...
1 10kl6lg 374 Finally got a job https://www.reddit.com/r/dataengineering/comme... 1000gratitudepunches dataengineering Career 2023-01-25 100 i did it! after 8 months of working as a budte... 0.000 0.950 0.050 0.5093 POS '12':24 '400':20 '8':5 'applic':22 'believ':42... '12':24 '400':20 '8':5 'applic':22 'believ':42...
2 yyh6l9 381 What are your favourite GitHub repos that show... https://www.reddit.com/r/dataengineering/comme... theoriginalmantooth dataengineering Discussion 2022-11-18 40 looking to level up my skills and want to know... 0.000 0.899 0.101 0.5775 POS 'accounts/repos':20 'alreadi':46 'data':17 'di... 'accounts/repos':20 'alreadi':46 'data':17 'di...
3 14663ur 294 r/dataengineering will be joining the blackout... https://www.reddit.com/r/dataengineering/comme... AutoModerator dataengineering Meta 2023-06-10 21 [see here for the original r/dataengineering t... 0.087 0.840 0.073 -0.8688 NEG '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... '/)*.':536 '/hc/en-us/requests/new):':352 '/r/...
4 10fg07o 286 just got laid off (FAANG) https://www.reddit.com/r/dataengineering/comme... Foodwithfloyd dataengineering Career 2023-01-18 84 hi all, its been a pretty awful day. two month... 0.032 0.808 0.160 0.9118 POS 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'...
# Word frequencies and n-grams

dataeng_list = []

for row in dataeng["content"]:
    dataeng_list.append(row)

dataeng_content = ' '.join(dataeng_list)

dataeng_tokens = word_tokenize(dataeng_content)
total_word_count = len(dataeng_tokens)

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)

# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in dataeng_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)

# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]

# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

[Plot: top 10 words by percentage share in r/dataengineering]

plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)

[Plot: top 10 trigrams by percentage share in r/dataengineering]

#Topic Modelling

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(dataeng_list)
feature_names = count_vectorizer.get_feature_names()

print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")

#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)  
lda.fit(term_frequency)  

def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        
        sorted_indx = term_weights.argsort()

        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words :]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))


display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2888)
Num of features identified = 2888
Topic 0:
pipeline;sql;like;use;just;api;need;https;cloud;data
Topic 1:
isn;app;databricks;data;make;comments;www;https;com;reddit
Topic 2:
years;engineering;people;job;just;time;sql;like;company;data
Topic 3:
years;team;ve;learn;know;really;just;like;databricks;data
Topic 4:
team;blog;dbt;snowflake;databricks;spark;data;instacart;com;https
#TFIDF VEC

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(dataeng_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")

# 5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)  # note: fit on the count matrix, not the TF-IDF matrix

#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2888)
Num of features identified = 2888


/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
  "'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)


Topic 0:
support;isn;official;app;make;comments;www;https;com;reddit
Topic 1:
business;work;files;engineering;years;ve;sql;learn;just;data
Topic 2:
time;data;excel;just;extremely;team;people;job;like;company
Topic 3:
spark;etl;understand;really;platform;cloud;lot;data;snowflake;databricks
Topic 4:
blog;data;course;spark;snowflake;www;instacart;databricks;com;https
# Sample function to assign topics based on keywords
def assign_topic(content):
    if "career" in content.lower():
        return "Career"
    elif "projects" in content.lower():
        return "Projects"
    elif "personal" in content.lower():
        return "Personal"
    elif "people" in content.lower():
        return "People"
    elif "company" in content.lower():
        return "Company"  
    else:
        return "Other"

def assign_topic_data(content):
    if "sql" in content.lower():
        return "SQL"
    elif "snowflake" in content.lower():
        return "Snowflake"
    elif "databricks" in content.lower():
        return "Databricks"
    elif "apache" in content.lower():
        return "Apache"
    elif "Spark" in content.lower():
        return "Spark"    
    else:
        return "Other"
topics = ["Career", "Projects", "Personal"]
data_topics = ['sql', 'databricks', 'snowflake', 'people', 'company']
    # Apply the function to the DataFrame
dataeng['topic'] = dataeng['content'].apply(assign_topic)
dataeng['data_topic'] = dataeng['content'].apply(assign_topic_data)
# Group by topic and calculate the average sentiment score
role_sentiments = dataeng.groupby('topic')['compound'].mean().reset_index()
# Group by topic and calculate summary statistics
topic_summary = dataeng.groupby('topic').agg({
    'compound': ['mean', 'min', 'max', 'median', 'std'],
    'neg': 'mean',
    'neu': 'mean',
    'pos': 'mean'
}).reset_index()

# Flatten the multi-index columns
topic_summary.columns = ['_'.join(col).strip() for col in topic_summary.columns.values]

# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=dataeng, x='topic', y=sentiment_column, palette='Set2')
    plt.xlabel('Topics')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Topic')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by topic bucket]

tools_sentiments = dataeng.groupby('data_topic')['compound'].mean().reset_index()
# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']

# Create one box plot per sentiment column (a new figure each time, so the figure size applies to all four)
for sentiment_column in sentiment_columns:
    plt.figure(figsize=(12, 8))
    sns.boxplot(data=dataeng, x='data_topic', y=sentiment_column, palette='Set3')
    plt.xlabel('Data Tools')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Data Tool')
    plt.xticks(rotation=45)
    plt.show()

[Plots: box plots of compound, neg, neu, and pos scores by data-tool bucket]

Concluding task: Write a summary of your findings!

Write your summary in this cell

——————————–

–Distribution Plots– Looking at the distributions of compound sentiment, the shape of each histogram immediately shows how balanced (or skewed) each dataset is. I was a little disappointed to see so many positively scored posts in the r/solotravel dataset; I was hoping for a slightly more even mix of positives and negatives. The same goes for the r/dataengineering subreddit.
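As a quick comparative view of those shapes, the POS/NEU/NEG shares can also be stacked into a single normalized bar chart; a minimal sketch using the four dataframes loaded earlier:

# Compare the sentiment-label mix of the four subreddits in one normalized, stacked bar chart
import pandas as pd
import matplotlib.pyplot as plt

frames = {'dataengineering': dataeng, 'LawSchool': lawschool,
          'COVID19positive': covid19, 'solotravel': solotravel}

# Share of POS/NEU/NEG labels per subreddit (missing labels become 0, e.g. NEU in solotravel)
shares = pd.DataFrame({name: df['sentiment'].value_counts(normalize=True)
                       for name, df in frames.items()}).T.fillna(0)

shares[['POS', 'NEU', 'NEG']].plot(kind='bar', stacked=True, figsize=(10, 5),
                                   color=['#2E8BC0', 'gray', '#AE0000'])
plt.ylabel('Share of posts')
plt.title('Sentiment label mix by subreddit')
plt.show()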

–{solotravel} sentiment analysis by location– It was interesting to see which geographic locations mentioned by users in the solotravel subreddit ranked highest and lowest by average compound score. I set a minimum post-count threshold (more than 5 mentions) to ensure there is a sufficient sample for a rough consensus; it would not be fair for a single negative post about a place to represent the whole country in this analysis. It was also interesting to see the bias the subreddit seemingly has toward European travel destinations as opposed to non-European ones, namely in Asia. I examined this disparity more closely by expanding from the average scores to the full range of values via box plots. What we then see is which locations have strong versus weak consensus (pos, neg, neu, and compound), judged by the box length. For example, the box plots confirm that the European cities of Berlin, Venice, and Madrid are squarely associated with a positive experience among users in the subreddit.
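As a rough numeric check on that box-length reading, the spread of compound scores per location can be computed directly; a minimal sketch using the location_df frame built earlier:

# Rough consensus check: a small IQR of the compound score means tightly clustered opinions
import pandas as pd

grouped = location_df.groupby('location')['compound']
consensus = pd.DataFrame({
    'mentions': grouped.count(),
    'median': grouped.median(),
    'iqr': grouped.quantile(0.75) - grouped.quantile(0.25),
})

# Keep locations with more than 5 mentions, tightest consensus first
print(consensus.query('mentions > 5').sort_values('iqr').head(10))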

—{data engineering} sentiments by topic buckets— Similar to what was done for the solotravel dataset, I grouped the content of each individual post into selected topic buckets. From there, we can see how posts relating to or talking about each topic are generally ranked by sentiment polarity scores. For the most part, posts relating to “projects”, “career”, and “company” are fairly positive. An interesting point about the “people” bucket is that while it is positive on average, its box is longer than the rest, which means the sentiments are more mixed. Is this representative of the actual data? I don’t think so, because “people” is a very general term that can mean many different things depending on context. So without context, I would argue it doesn’t convey as much as the other topic buckets.
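That “longer box” reading can be backed up with the numbers behind the plot; a quick sketch over the dataeng frame and the topic column assigned earlier:

# Spread of compound scores per topic bucket; a larger std backs up the "more mixed" reading
topic_spread = dataeng.groupby('topic')['compound'].agg(['count', 'mean', 'std', 'median'])
print(topic_spread.sort_values('std', ascending=False))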

Then I did the same for data-tool topics. It was really cool to see that Snowflake is the most well received of the bunch when people mention it in their posts, judging by the median of the compound box plot. Conversely, Apache appears to be less well received, though still relatively positive. Databricks, on the other hand, is fairly positive but runs into the familiar problem of having a lot of mixed reviews.
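And the same quick summary for the tool buckets, using the data_topic column assigned earlier (a sketch, just surfacing the medians and spread behind the box plots):

# Median and spread of compound scores per data-tool bucket
tool_spread = dataeng.groupby('data_topic')['compound'].agg(['count', 'median', 'std'])
print(tool_spread.sort_values('median', ascending=False))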

—{covid19} top bi-grams over time— For the covid19 subreddit, I knew I wanted time to be an important element of my analysis, so I built a time-series plot, using weeks rather than months, to analyze how bi-gram frequencies (mentions) shift over time. From the scatter plot, we can see covid-related bigrams showing up in recent weeks, on the right side of the plot, around where we are in the year (October 2023). I wouldn’t be alarmed, though: these are mere bi-grams, not covid tests, so they are just an indicator of how often covid-related discussion is being brought up. What’s also interesting about this scatter plot is the apparent gap in discussion around the week-20 mark, which is around March and April of last year. So around springtime last year, there was a lull in covid-related discussion among users and submissions in the covid19 subreddit - interesting!
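One caveat on that gap (a quick sketch, using the covid19 frame and its time_period column from earlier): it could simply reflect weeks with few or no sampled submissions, which is easy to check.

# Number of sampled posts per ISO week; sparse weeks could explain the apparent gap
posts_per_week = covid19.groupby('time_period').size().sort_index()
print(posts_per_week.loc[15:25])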

Altogether, I think this was a fantastic project for performing some NLP and text analysis given the time constraint. I feel like I have only just scratched the surface with my insights. Thank you for reading.

the end

