8410 Computational Text Analysis: Jupyter Notebook
This is a two-part data science project: the first part covered acquiring and loading the data, and this second part covers the analytics. I won't go through the first part in this notebook, since the focus here is purely text analysis (if enough requests are made, I can upload it).
## Imports
## ------------------------
import re
import spacy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%%sql
SELECT DISTINCT subreddit
FROM ydn3f.redditposts;
* postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_student
4 rows affected.
subreddit |
---|
solotravel |
dataengineering |
LawSchool |
COVID19positive |
## Let's retrieve the data that was loaded into our PGSQL database
## ------------------------
credentials = "creds"  # placeholder; the real connection string/engine is redacted
dataeng = pd.read_sql("""
SELECT *
FROM ydn3f.redditposts
WHERE subreddit = 'dataengineering'
""", con = credentials)
lawschool = pd.read_sql("""
SELECT *
FROM ydn3f.redditposts
WHERE subreddit = 'LawSchool'
""", con = credentials)
covid19 = pd.read_sql("""
SELECT *
FROM ydn3f.redditposts
WHERE subreddit = 'COVID19positive'
""", con = credentials)
solotravel = pd.read_sql("""
SELECT *
FROM ydn3f.redditposts
WHERE subreddit = 'solotravel'
""", con = credentials)
dataeng.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 151xsis | 567 | Data Scientists -- Ok, now I get it. | https://www.reddit.com/r/dataengineering/comme... | tarzanboy76 | dataengineering | Discussion | 2023-07-17 | 220 | a few days ago, our data scientist gave me som... | 0.035 | 0.861 | 0.104 | 0.9340 | POS | 'access':193 'actual':62,110,147 'admin':192 '... | 'access':193 'actual':62,110,147 'admin':192 '... |
1 | 10kl6lg | 374 | Finally got a job | https://www.reddit.com/r/dataengineering/comme... | 1000gratitudepunches | dataengineering | Career | 2023-01-25 | 100 | i did it! after 8 months of working as a budte... | 0.000 | 0.950 | 0.050 | 0.5093 | POS | '12':24 '400':20 '8':5 'applic':22 'believ':42... | '12':24 '400':20 '8':5 'applic':22 'believ':42... |
2 | yyh6l9 | 381 | What are your favourite GitHub repos that show... | https://www.reddit.com/r/dataengineering/comme... | theoriginalmantooth | dataengineering | Discussion | 2022-11-18 | 40 | looking to level up my skills and want to know... | 0.000 | 0.899 | 0.101 | 0.5775 | POS | 'accounts/repos':20 'alreadi':46 'data':17 'di... | 'accounts/repos':20 'alreadi':46 'data':17 'di... |
3 | 14663ur | 294 | r/dataengineering will be joining the blackout... | https://www.reddit.com/r/dataengineering/comme... | AutoModerator | dataengineering | Meta | 2023-06-10 | 21 | [see here for the original r/dataengineering t... | 0.087 | 0.840 | 0.073 | -0.8688 | NEG | '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... | '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... |
4 | 10fg07o | 286 | just got laid off (FAANG) | https://www.reddit.com/r/dataengineering/comme... | Foodwithfloyd | dataengineering | Career | 2023-01-18 | 84 | hi all, its been a pretty awful day. two month... | 0.032 | 0.808 | 0.160 | 0.9118 | POS | 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... | 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... |
lawschool.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13c2x19 | 6172 | I promised my mom on her death bed that I woul... | https://www.reddit.com/r/LawSchool/comments/13... | cinnamorolloing | LawSchool | None | 2023-05-08 | 192 | this one is for you, mom. | 0.000 | 1.000 | 0.000 | 0.0000 | NEU | 'mom':6 'one':2 | 'mom':6 'one':2 |
1 | 14fhvdj | 1590 | Not in law school (Econ undergrad) but I am cu... | https://www.reddit.com/r/LawSchool/comments/14... | om-om | LawSchool | None | 2023-06-21 | 74 | 0.000 | 0.000 | 0.000 | 0.0000 | NEU | |||
2 | 13dw7mo | 1531 | A Sigma Male Law School Schedule | https://www.reddit.com/r/LawSchool/comments/13... | Equivalent-Editor697 | LawSchool | None | 2023-05-10 | 110 | 2:00 am- wake up2.05am-cold shower2.15am-break... | 0.026 | 0.974 | 0.000 | -0.2960 | NEG | '-2':123 '00':2,124 '00am':42,64 '00am-arrive'... | '-2':123 '00':2,124 '00am':42,64 '00am-arrive'... |
3 | 151geb6 | 1458 | Sex during the bar? | https://www.reddit.com/r/LawSchool/comments/15... | Decent_Situation_952 | LawSchool | None | 2023-07-16 | 219 | i’m sitting for the bar this month. during the... | 0.051 | 0.890 | 0.059 | -0.4836 | NEG | '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr... | '3l':22 'alreadi':76 'anoth':21 'bar':6 'bathr... |
4 | 12k0vjz | 1282 | I passed the bar exam! | https://www.reddit.com/r/LawSchool/comments/12... | Organic-Ad-86 | LawSchool | None | 2023-04-12 | 74 | ....and i'm stoked. that's all. | 0.000 | 1.000 | 0.000 | 0.0000 | NEU | 'm':3 'stoke':4 | 'm':3 'stoke':4 |
covid19.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yjrg0a | 907 | Up vote if you're currently positive with your... | https://www.reddit.com/r/COVID19positive/comme... | Hailabigail | COVID19positive | Tested Positive - Breakthrough | 2022-11-02 | 311 | i'm seeing an overwhelming amount of posts wit... | 0.106 | 0.727 | 0.167 | 0.7293 | POS | '10':48 'amount':6,29 'breakthrough':31 'covid... | '10':48 'amount':6,29 'breakthrough':31 'covid... |
1 | 13p6qrm | 597 | Why is everyone pretending the pandemic disapp... | https://www.reddit.com/r/COVID19positive/comme... | marconas1_ | COVID19positive | Rant | 2023-05-22 | 271 | i work in a tech company, and it has become co... | 0.154 | 0.776 | 0.069 | -0.8343 | NEG | 'accept':66 'affect':33 'back':40 'becom':10 '... | 'accept':66 'affect':33 'back':40 'becom':10 '... |
2 | 12lw075 | 461 | What is….happening here? | https://www.reddit.com/r/COVID19positive/comme... | brutallyhonestkitten | COVID19positive | Rant | 2023-04-14 | 201 | like the title says, i feel like i am living i... | 0.027 | 0.853 | 0.119 | 0.9247 | POS | 'absolut':37 'alien':126 'altern':13 'anymor':... | 'absolut':37 'alien':126 'altern':13 'anymor':... |
3 | zw72uc | 418 | This new variant was one of the worst experien... | https://www.reddit.com/r/COVID19positive/comme... | Throwawayacount5093 | COVID19positive | Tested Positive - Me | 2022-12-27 | 145 | i’m in my early twenties, fully vaxed and boos... | 0.121 | 0.790 | 0.089 | -0.8072 | NEG | '104':120 '4':233 '60':206 '60mg':201 'abl':21... | '104':120 '4':233 '60':206 '60mg':201 'abl':21... |
4 | zji350 | 396 | The pandemic's over they said. You don't need ... | https://www.reddit.com/r/COVID19positive/comme... | None | COVID19positive | Tested Positive - Me | 2022-12-12 | 216 | i haven't slept in nearly 40 hours, was in the... | 0.069 | 0.839 | 0.091 | 0.2023 | POS | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... |
solotravel.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16c1of1 | 5356 | The number of old sex tourists in Bangkok is i... | https://www.reddit.com/r/solotravel/comments/1... | Weekly-Patience4087 | solotravel | None | 2023-09-07 | 629 | i am currently in bangkok and the number of se... | 0.075 | 0.857 | 0.068 | 0.2326 | POS | '2':211 'advantag':143 'also':61 'area':20,185... | '2':211 'advantag':143 'also':61 'area':20,185... |
1 | 11zkusj | 2967 | I encountered my first begpacker today | https://www.reddit.com/r/solotravel/comments/1... | Northerner6 | solotravel | None | 2023-03-23 | 332 | i encountered my first begpacker today. i was ... | 0.074 | 0.876 | 0.050 | -0.6554 | NEG | 'accent':38 'afford':176 'american':37 'approa... | 'accent':38 'afford':176 'american':37 'approa... |
2 | 13uw9tn | 2205 | REMINDER: Unwanted sexual attention is NEVER O... | https://www.reddit.com/r/solotravel/comments/1... | unsuspectingmuggle | solotravel | Accommodation | 2023-05-29 | 301 | report people who make you feel unsafe!i've be... | 0.128 | 0.826 | 0.047 | -0.9464 | NEG | '11':34 '25':254 '99.99':256 'alon':124,248 'a... | '11':34 '25':254 '99.99':256 'alon':124,248 'a... |
3 | 11ccux4 | 2124 | I have been in India for a month and so far I ... | https://www.reddit.com/r/solotravel/comments/1... | Big-Assist-5 | solotravel | None | 2023-02-26 | 476 | one of the times i was staying at a guest hous... | 0.165 | 0.828 | 0.007 | -0.9831 | NEG | '30':20 'answer':63 'appar':29 'away':22 'basi... | '30':20 'answer':63 'appar':29 'away':22 'basi... |
4 | 146y8eu | 1900 | The first time I have ever felt unsafe in SE A... | https://www.reddit.com/r/solotravel/comments/1... | ihatemycohort | solotravel | Asia | 2023-06-11 | 232 | i just had a complete scare. im still shaking ... | 0.102 | 0.811 | 0.087 | -0.9189 | NEG | '1':554,1018 '10':71 '15':199 '2':64,562,986 '... | '1':554,1018 '10':71 '15':199 '2':64,562,986 '... |
dataeng['sentiment'].value_counts()
POS 78
NEG 17
NEU 5
Name: sentiment, dtype: int64
lawschool['sentiment'].value_counts()
POS 49
NEG 38
NEU 13
Name: sentiment, dtype: int64
covid19['sentiment'].value_counts()
NEG 58
POS 36
NEU 6
Name: sentiment, dtype: int64
solotravel['sentiment'].value_counts()
POS 71
NEG 29
Name: sentiment, dtype: int64
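The `neg`, `neu`, `pos`, `compound`, and `sentiment` columns were computed during the loading step in part one, which isn't shown here. As an assumption, they look like NLTK VADER polarity scores; a minimal sketch of how such columns could be produced:
# Assumed reconstruction of the part-one scoring step (not the original code)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
# polarity_scores returns a dict of neg/neu/pos/compound per document
scores = dataeng["content"].apply(sia.polarity_scores).apply(pd.Series)
# Label each post using the usual +/-0.05 threshold on the compound score
scores["sentiment"] = np.where(scores["compound"] >= 0.05, "POS",
                               np.where(scores["compound"] <= -0.05, "NEG", "NEU"))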
# Create a figure and axis
plt.figure(figsize=(10, 6))
# Plot the distribution of the 'compound' score for the 'dataeng' subreddit
sns.distplot(dataeng['compound'], color='green', kde=False)
# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/dataengineering')
# Show the plot
plt.show()
# Create a figure and axis
plt.figure(figsize=(10, 6))
# Plot the distribution of the 'compound' score for the 'lawschool' subreddit
sns.distplot(lawschool['compound'], color='orange', kde=False)
# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/lawschool')
# Show the plot
plt.show()
# Create a figure and axis
plt.figure(figsize=(10, 6))
# Plot the distribution of the 'compound' score for the 'covid19' subreddit
sns.distplot(covid19['compound'], color='brown', kde=False)
# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/covid19')
# Show the plot
plt.show()
# Create a figure and axis
plt.figure(figsize=(10, 6))
# Plot the distribution of the 'compound' score for the 'solotravel' subreddit
sns.distplot(solotravel['compound'], color='purple', kde=False)
# Set plot labels and title
plt.xlabel('Compound Score')
plt.ylabel('Frequency')
plt.title('Distribution of Compound Scores in r/solotravel')
# Show the plot
plt.show()
sns.set(style='whitegrid')
import nltk
from nltk import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords
lawschool_list = []
for row in lawschool["content"]:
    lawschool_list.append(row)
lawschool_content = ' '.join(lawschool_list)
lawschool_tokens = word_tokenize(lawschool_content)
total_word_count = len(lawschool_tokens)
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)
# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in lawschool_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)
# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]
# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
from nltk.util import ngrams
def plot_ngram_percentage_share(tokens, num, total_word_count, num_results=25):
    ngram = list(ngrams(tokens, num))
    ngram_dist = nltk.FreqDist(ngram)
    # Calculate the percentage share of n-grams
    ngram_freq = ngram_dist.most_common(num_results)
    percentage_share = [(bigram, freq / total_word_count * 100) for bigram, freq in ngram_freq]
    # Create the plot
    x, y = zip(*percentage_share)
    plt.figure(figsize=(10, 6))
    plt.bar([" ".join(bigram) for bigram in x], y)
    plt.xlabel(f"Top {num_results} {num}-grams")
    plt.ylabel("Percentage Share")
    plt.xticks(fontsize=15, rotation=75)
    plt.show()
plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)
The first chart doesn't really reveal anything beyond the obvious, so I decided to dig a little deeper by exploring the frequent trigrams in the LawSchool subreddit. I was immediately surprised to see that the top three are mirror reflections of one another. I'm not a law student, so I don't really know why those three words are so often mentioned as a group, but I'm curious to hear why. The rest of the trigrams in the top 10 are pretty self-explanatory.
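One quick way to follow up on that question would be to pull the posts that actually contain a trigram of interest; the phrase below is a hypothetical stand-in, not the real chart-topping trigram:
# Hypothetical follow-up: find the posts containing a given trigram
phrase = "some example phrase"  # placeholder, not the actual top trigram
matches = lawschool[lawschool["content"].str.contains(phrase, case=False, na=False)]
matches[["title", "content"]].head()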
#Topic Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(lawschool_list)
feature_names = count_vectorizer.get_feature_names()
print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")
#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(term_frequency)
def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        sorted_indx = term_weights.argsort()
        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words:]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))
display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
need;ranking;student;people;students;just;law;tax;corporate;did
Topic 1:
doesn;westlaw;foster;partner;know;care;getting;school;bar;summer
Topic 2:
probably;school;say;did;people;law;like;just;student;gen
Topic 3:
students;firm;school;class;know;time;just;like;people;law
Topic 4:
got;like;people;firm;going;know;bar;just;school;law
#TFIDF VEC
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lawschool_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")
#5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)
#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2340)
Num of features identified = 2340
Topic 0:
feeling;don;measure;like;paperwork;adhd;just;tax;corporate;did
Topic 1:
hard;students;going;just;make;exam;prep;bar;school;law
Topic 2:
let;really;just;tax;law;real;firm;people;know;like
Topic 3:
probably;youre;law;okay;like;student;say;people;just;gen
Topic 4:
got;heard;event;offer;minecraft;firm;just;getting;summer;partner
/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
"'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)
The NMF model (displayed here with the TF-IDF vocabulary) seemed to surface more distinct topics for the LawSchool subreddit. I can easily infer Topic 1 to be about exam preparation, like the bar and all the difficult, long hours of studying for it. Topic 3 sounds like users posting about being a student, maybe with words of support or encouragement. Topic 4 can easily be described as discussion about working at a firm over the summer, or a job offer of sorts.
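One caveat: the NMF above was fit on the CountVectorizer matrix (`term_frequency`); only the feature names came from the TF-IDF vectorizer. A sketch of refitting on the TF-IDF weights themselves, which may yield somewhat different topics than those shown:
# Sketch: fit NMF on the TF-IDF matrix instead of the raw counts
nmf_tfidf = NMF(n_components=5, random_state=0)
nmf_tfidf.fit(tfidf)
display_topics(nmf_tfidf, tfidf_feature_names, 10)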
# Lawschool Sentiment Over Time
lawschool['published'] = pd.to_datetime(lawschool['published'])
lawschool_monthly = lawschool.copy()
lawschool_monthly.set_index('published', inplace=True)
lawschool_monthly = lawschool_monthly.resample('W').agg({'neg': 'mean', 'neu': 'mean', 'pos': 'mean', 'compound': 'mean'})
# Plot the sentiment scores over time
plt.figure(figsize=(16, 5))
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neg'], label='Negative', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['neu'], label='Neutral', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['pos'], label='Positive', linewidth=4)
sns.lineplot(x=lawschool_monthly.index, y=lawschool_monthly['compound'], label='Compound', linewidth=3, color='gray', alpha=0.8, style=True, dashes=[(1,1)], legend=False)
plt.xlabel('Time')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Analysis Over Time for r/lawschool (past 12 months)')
plt.legend()
plt.grid(True)
plt.show()
For visualizing sentiment over time, I chose a line chart for interpretability. The most significant portion of the time series is the month of March, especially at the beginning, where a substantial batch of negatively scored posts was submitted. Could it be the period before bar exams? 🤔
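A quick, hedged way to sanity-check that hunch would be to pull the most negative March 2023 posts and skim their titles:
# Hypothetical follow-up: the most negative r/LawSchool posts from March 2023
march_posts = lawschool[(lawschool['published'].dt.year == 2023) & (lawschool['published'].dt.month == 3)]
march_posts.sort_values('compound')[['published', 'title', 'compound']].head()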
solotravel.head(3)
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16c1of1 | 5356 | The number of old sex tourists in Bangkok is i... | https://www.reddit.com/r/solotravel/comments/1... | Weekly-Patience4087 | solotravel | None | 2023-09-07 | 629 | i am currently in bangkok and the number of se... | 0.075 | 0.857 | 0.068 | 0.2326 | POS | '2':211 'advantag':143 'also':61 'area':20,185... | '2':211 'advantag':143 'also':61 'area':20,185... |
1 | 11zkusj | 2967 | I encountered my first begpacker today | https://www.reddit.com/r/solotravel/comments/1... | Northerner6 | solotravel | None | 2023-03-23 | 332 | i encountered my first begpacker today. i was ... | 0.074 | 0.876 | 0.050 | -0.6554 | NEG | 'accent':38 'afford':176 'american':37 'approa... | 'accent':38 'afford':176 'american':37 'approa... |
2 | 13uw9tn | 2205 | REMINDER: Unwanted sexual attention is NEVER O... | https://www.reddit.com/r/solotravel/comments/1... | unsuspectingmuggle | solotravel | Accommodation | 2023-05-29 | 301 | report people who make you feel unsafe!i've be... | 0.128 | 0.826 | 0.047 | -0.9464 | NEG | '11':34 '25':254 '99.99':256 'alon':124,248 'a... | '11':34 '25':254 '99.99':256 'alon':124,248 'a... |
solotravel_list = []
for row in solotravel["content"]:
    solotravel_list.append(row)
solotravel_content = ' '.join(solotravel_list)
solotravel_tokens = word_tokenize(solotravel_content)
total_word_count = len(solotravel_tokens)
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)
# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in solotravel_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)
# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]
# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)
from collections import defaultdict
nlp = spacy.load("en_core_web_sm")
# Function to extract location mentions
def extract_locations(text):
    doc = nlp(text)
    locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    return locations
travel_posts = solotravel[["content", "neg", "neu", "pos", "compound"]]
# Define a dictionary to store sentiment scores by location
location_sentiment = defaultdict(list)
# Iterate through each post and extract location mentions
for index, row in travel_posts.iterrows():
    content = row["content"]
    sentiment = {
        "neg": row["neg"],
        "neu": row["neu"],
        "pos": row["pos"],
        "compound": row["compound"],
    }
    locations = extract_locations(content)
    for location in locations:
        location_sentiment[location].append(sentiment)
# Calculate summary statistics for sentiment by location
location_summary = {}
for location, sentiment_scores in location_sentiment.items():
    num_posts = len(sentiment_scores)
    if num_posts > 0:
        summary = {
            "num_posts": num_posts,
            "avg_neg": sum(score["neg"] for score in sentiment_scores) / num_posts,
            "avg_neu": sum(score["neu"] for score in sentiment_scores) / num_posts,
            "avg_pos": sum(score["pos"] for score in sentiment_scores) / num_posts,
            "avg_compound": sum(score["compound"] for score in sentiment_scores) / num_posts,
        }
        location_summary[location] = summary
geo_sentiments = pd.DataFrame.from_dict(location_summary)
geo_sentiments = geo_sentiments.transpose().rename_axis('geo-entity').reset_index()
# Sort by average compound score and take the 5 most negative and 5 most positive geo-entities
sorted_geo_sentiments = geo_sentiments.query("num_posts > 5").sort_values(by='avg_compound')  # require more than 5 self-posts
top_geo_sentiments = sorted_geo_sentiments.head(5)       # most negative
bottom_geo_sentiments = sorted_geo_sentiments.tail(5)    # most positive
tb_geo_sentiments = pd.concat([top_geo_sentiments, bottom_geo_sentiments])
colors = ['#2E8BC0' if avg_compound > 0 else '#AE0000' for avg_compound in tb_geo_sentiments['avg_compound']]
# Create the bar chart
plt.figure(figsize=(12, 6)) # Adjust the figure size as needed
sns.barplot(y='avg_compound', x='geo-entity', data=tb_geo_sentiments, palette=colors)
# Add the abline for y=0 (neutral sentiment)
plt.axhline(0, color='black', linewidth=2, linestyle='-')
# Customize the labels and titles
plt.xlabel('Location')
plt.ylabel('Average Compound Sentiment Score')
plt.title('Average Compound Sentiment by Geo-entity')
# Display the chart
plt.show()
# Create an empty list to store the records
location_data = []
# Iterate through location_sentiment and convert it into records
for location, sentiment_scores in location_sentiment.items():
    for sentiment_score in sentiment_scores:
        record = {
            'location': location,
            'neg': sentiment_score['neg'],
            'neu': sentiment_score['neu'],
            'pos': sentiment_score['pos'],
            'compound': sentiment_score['compound']
        }
        location_data.append(record)
# Create a DataFrame from the list of records
location_df = pd.DataFrame(location_data)
# List of specified countries and cities to compare
specified_countries = ['india', 'japan', 'thailand', 'vietnam', 'paris', 'romania', 'berlin', 'venice', 'madrid', 'rome']
# Filter the DataFrame to include only the specified countries
popular_countries = location_df[location_df['location'].isin(specified_countries)]
# List of sentiment columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']
# Create box plots
plt.figure(figsize=(12, 8))
for sentiment_column in sentiment_columns:
    sns.boxplot(data=popular_countries, x='location', y=sentiment_column, palette='Set3')
    plt.xlabel('Countries')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Geo-entity')
    plt.xticks(rotation=45)
    plt.show()
covid19.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yjrg0a | 907 | Up vote if you're currently positive with your... | https://www.reddit.com/r/COVID19positive/comme... | Hailabigail | COVID19positive | Tested Positive - Breakthrough | 2022-11-02 | 311 | i'm seeing an overwhelming amount of posts wit... | 0.106 | 0.727 | 0.167 | 0.7293 | POS | '10':48 'amount':6,29 'breakthrough':31 'covid... | '10':48 'amount':6,29 'breakthrough':31 'covid... |
1 | 13p6qrm | 597 | Why is everyone pretending the pandemic disapp... | https://www.reddit.com/r/COVID19positive/comme... | marconas1_ | COVID19positive | Rant | 2023-05-22 | 271 | i work in a tech company, and it has become co... | 0.154 | 0.776 | 0.069 | -0.8343 | NEG | 'accept':66 'affect':33 'back':40 'becom':10 '... | 'accept':66 'affect':33 'back':40 'becom':10 '... |
2 | 12lw075 | 461 | What is….happening here? | https://www.reddit.com/r/COVID19positive/comme... | brutallyhonestkitten | COVID19positive | Rant | 2023-04-14 | 201 | like the title says, i feel like i am living i... | 0.027 | 0.853 | 0.119 | 0.9247 | POS | 'absolut':37 'alien':126 'altern':13 'anymor':... | 'absolut':37 'alien':126 'altern':13 'anymor':... |
3 | zw72uc | 418 | This new variant was one of the worst experien... | https://www.reddit.com/r/COVID19positive/comme... | Throwawayacount5093 | COVID19positive | Tested Positive - Me | 2022-12-27 | 145 | i’m in my early twenties, fully vaxed and boos... | 0.121 | 0.790 | 0.089 | -0.8072 | NEG | '104':120 '4':233 '60':206 '60mg':201 'abl':21... | '104':120 '4':233 '60':206 '60mg':201 'abl':21... |
4 | zji350 | 396 | The pandemic's over they said. You don't need ... | https://www.reddit.com/r/COVID19positive/comme... | None | COVID19positive | Tested Positive - Me | 2022-12-12 | 216 | i haven't slept in nearly 40 hours, was in the... | 0.069 | 0.839 | 0.091 | 0.2023 | POS | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... |
covid19_list = []
for row in covid19["content"]:
    covid19_list.append(row)
covid19_content = ' '.join(covid19_list)
covid19_tokens = word_tokenize(covid19_content)
total_word_count = len(covid19_tokens)
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)
# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)
# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]
# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)
covid19.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yjrg0a | 907 | Up vote if you're currently positive with your... | https://www.reddit.com/r/COVID19positive/comme... | Hailabigail | COVID19positive | Tested Positive - Breakthrough | 2022-11-02 | 311 | i'm seeing an overwhelming amount of posts wit... | 0.106 | 0.727 | 0.167 | 0.7293 | POS | '10':48 'amount':6,29 'breakthrough':31 'covid... | '10':48 'amount':6,29 'breakthrough':31 'covid... |
1 | 13p6qrm | 597 | Why is everyone pretending the pandemic disapp... | https://www.reddit.com/r/COVID19positive/comme... | marconas1_ | COVID19positive | Rant | 2023-05-22 | 271 | i work in a tech company, and it has become co... | 0.154 | 0.776 | 0.069 | -0.8343 | NEG | 'accept':66 'affect':33 'back':40 'becom':10 '... | 'accept':66 'affect':33 'back':40 'becom':10 '... |
2 | 12lw075 | 461 | What is….happening here? | https://www.reddit.com/r/COVID19positive/comme... | brutallyhonestkitten | COVID19positive | Rant | 2023-04-14 | 201 | like the title says, i feel like i am living i... | 0.027 | 0.853 | 0.119 | 0.9247 | POS | 'absolut':37 'alien':126 'altern':13 'anymor':... | 'absolut':37 'alien':126 'altern':13 'anymor':... |
3 | zw72uc | 418 | This new variant was one of the worst experien... | https://www.reddit.com/r/COVID19positive/comme... | Throwawayacount5093 | COVID19positive | Tested Positive - Me | 2022-12-27 | 145 | i’m in my early twenties, fully vaxed and boos... | 0.121 | 0.790 | 0.089 | -0.8072 | NEG | '104':120 '4':233 '60':206 '60mg':201 'abl':21... | '104':120 '4':233 '60':206 '60mg':201 'abl':21... |
4 | zji350 | 396 | The pandemic's over they said. You don't need ... | https://www.reddit.com/r/COVID19positive/comme... | None | COVID19positive | Tested Positive - Me | 2022-12-12 | 216 | i haven't slept in nearly 40 hours, was in the... | 0.069 | 0.839 | 0.091 | 0.2023 | POS | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... | '15':14 '40':7 '4x':77 'ach':54 'ash':34 'cong... |
covid19['published'] = pd.to_datetime(covid19['published'])
from nltk import bigrams, trigrams
#list(bigrams(tokens_wo_stopwords))
nltk.FreqDist(list(bigrams(tokens_wo_stopwords)))
FreqDist({('feel', 'like'): 27, ('tested', 'positive'): 20, ('felt', 'like'): 16, ('first', 'time'): 14, ('got', 'covid'): 14, ('taste', 'smell'): 13, ('feels', 'like'): 12, ('wearing', 'mask'): 11, ('sense', 'smell'): 11, ('last', 'week'): 10, ...})
# Convert the 'published' column to datetime
covid19['published'] = pd.to_datetime(covid19['published'])
# Define a time period for grouping: here, the ISO calendar week number
covid19['time_period'] = covid19['published'].dt.isocalendar().week
# Initialize a list to store tokenized text for each time period
tokenized_by_time = []
# Iterate over time periods and tokenize the text for each period
for time_period, group in covid19.groupby('time_period'):
    covid19_content = ' '.join(group['content'])
    covid19_tokens = word_tokenize(covid19_content)
    tokens_wo_stopwords = [word.lower() for word in covid19_tokens if word.lower() not in stop_words_updated and len(word) > 2]
    tokenized_by_time.append(tokens_wo_stopwords)
top_bigrams_by_time = []
# Iterate over time periods and calculate the top 3 bigrams for each period
for tokens in tokenized_by_time:
    bigram_fd = FreqDist(ngrams(tokens, 2))
    top_bigrams = bigram_fd.most_common(3)
    top_bigrams_by_time.append(top_bigrams)
df = pd.DataFrame({'time_period': covid19['time_period'].unique(), 'Top Bigrams': top_bigrams_by_time})
top_bigrams_by_time[0]
[(('give', 'space'), 2), (('work', 'get'), 2), (('time', 'covid'), 2)]
# Initialize lists to store data for the DataFrame
time_periods = []
bigrams = []
counters = []
# Iterate over time periods (sorted to match the groupby order above) and bigrams to extract data for the DataFrame
for i, time_period in enumerate(sorted(covid19['time_period'].unique())):
    for bigram, counter in top_bigrams_by_time[i]:
        time_periods.append(time_period)
        bigrams.append(' '.join(bigram))
        counters.append(counter)
# Create a DataFrame with the extracted data
df = pd.DataFrame({'time_period': time_periods, 'Bigram': bigrams, 'Counter': counters})
df.sort_values('Counter', ascending=False).head(10)
time_period | Bigram | Counter | |
---|---|---|---|
96 | 7 | dry cough | 7 |
97 | 7 | taste smell | 6 |
98 | 7 | night sweats | 5 |
81 | 38 | high fever | 5 |
82 | 38 | pretty much | 4 |
25 | 42 | wearing mask | 4 |
24 | 42 | still one | 4 |
105 | 24 | feels like | 4 |
60 | 27 | nasal spray | 4 |
63 | 14 | feel like | 4 |
display(df.shape)
# Filter the DataFrame to the bigrams of interest ('covid' plus the top symptom bigrams)
covid_bigrams = df[df['Bigram'].str.contains('covid|dry cough|taste smell|night sweats|nasal spray|high fever')]
# Define a color mapping for each bigram
def map_color(bigram):
    if 'covid' in bigram:
        return 'teal'
    elif 'dry cough' in bigram:
        return 'red'
    elif 'taste smell' in bigram:
        return 'green'
    elif 'night sweats' in bigram:
        return 'orange'
    elif 'nasal spray' in bigram:
        return 'hotpink'
    elif 'high fever' in bigram:
        return 'maroon'
    else:
        return 'gray'  # Default color for unmatched bigrams
(114, 3)
covid_bigrams = covid_bigrams.assign(Color=covid_bigrams['Bigram'].apply(map_color))
import matplotlib.patches as mpatches
# Create the scatter plot with jitter and alpha for the filtered data
plt.figure(figsize=(14, 6))
for bigram in covid_bigrams['Bigram'].unique():
    subset = covid_bigrams[covid_bigrams['Bigram'] == bigram]
    color = map_color(bigram)
    jitter = np.random.normal(0, 0.9, len(subset))  # Add jitter for each unique bigram
    plt.scatter(
        subset['time_period'] + jitter,  # Match the length of jitter to the subset
        subset['Counter'],
        s=400,              # Marker size
        c=color,            # Set the color based on the mapping
        alpha=0.6,
        edgecolor='black',  # Add a black outline
        linewidth=1.5,      # Control the thickness of the outline
        label=bigram
    )
plt.xlabel('Week #')
plt.ylabel('Bigram Counter')
plt.title('Top Bigrams Over Time (Week 0 = Oct 2022)')
plt.xticks()
plt.grid(True)
# Create custom legend patches for each color category
legend_labels = [
mpatches.Patch(color='teal', label='Covid-Related Bigrams'),
mpatches.Patch(color='red', label='Dry Cough Bigrams'),
mpatches.Patch(color='green', label='Taste and Smell Bigrams'),
mpatches.Patch(color='orange', label='Night Sweats Bigrams'),
mpatches.Patch(color='hotpink', label='Nasal Spray Bigrams'),
mpatches.Patch(color='maroon', label='High Fever Bigrams')
]
plt.legend(handles=legend_labels, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
dataeng.head()
id | score | title | link | author | subreddit | flair | published | comments | content | neg | neu | pos | compound | sentiment | content_tsv_gin | content_tsv_gist | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 151xsis | 567 | Data Scientists -- Ok, now I get it. | https://www.reddit.com/r/dataengineering/comme... | tarzanboy76 | dataengineering | Discussion | 2023-07-17 | 220 | a few days ago, our data scientist gave me som... | 0.035 | 0.861 | 0.104 | 0.9340 | POS | 'access':193 'actual':62,110,147 'admin':192 '... | 'access':193 'actual':62,110,147 'admin':192 '... |
1 | 10kl6lg | 374 | Finally got a job | https://www.reddit.com/r/dataengineering/comme... | 1000gratitudepunches | dataengineering | Career | 2023-01-25 | 100 | i did it! after 8 months of working as a budte... | 0.000 | 0.950 | 0.050 | 0.5093 | POS | '12':24 '400':20 '8':5 'applic':22 'believ':42... | '12':24 '400':20 '8':5 'applic':22 'believ':42... |
2 | yyh6l9 | 381 | What are your favourite GitHub repos that show... | https://www.reddit.com/r/dataengineering/comme... | theoriginalmantooth | dataengineering | Discussion | 2022-11-18 | 40 | looking to level up my skills and want to know... | 0.000 | 0.899 | 0.101 | 0.5775 | POS | 'accounts/repos':20 'alreadi':46 'data':17 'di... | 'accounts/repos':20 'alreadi':46 'data':17 'di... |
3 | 14663ur | 294 | r/dataengineering will be joining the blackout... | https://www.reddit.com/r/dataengineering/comme... | AutoModerator | dataengineering | Meta | 2023-06-10 | 21 | [see here for the original r/dataengineering t... | 0.087 | 0.840 | 0.073 | -0.8688 | NEG | '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... | '/)*.':536 '/hc/en-us/requests/new):':352 '/r/... |
4 | 10fg07o | 286 | just got laid off (FAANG) | https://www.reddit.com/r/dataengineering/comme... | Foodwithfloyd | dataengineering | Career | 2023-01-18 | 84 | hi all, its been a pretty awful day. two month... | 0.032 | 0.808 | 0.160 | 0.9118 | POS | 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... | 'ago':11 'anoth':26 'anyon':98 'aw':7 'beyond'... |
#Topic modelling
dataeng_list = []
for row in dataeng["content"]:
    dataeng_list.append(row)
dataeng_content = ' '.join(dataeng_list)
dataeng_tokens = word_tokenize(dataeng_content)
total_word_count = len(dataeng_tokens)
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
# Also need to remove some other random punctuations
other_removals = ["'", ";", "?", "(", ")", "'m", "'s", "n't", ":", "!", "`", '''"''', "'ve", "*", "`", ",", ""]
stop_words_updated = stop_words.union(other_removals)
# Filter out stopwords and short words
tokens_wo_stopwords = [word.lower() for word in dataeng_tokens if word.lower() not in stop_words_updated and len(word) > 2]
freq_dist = nltk.FreqDist(tokens_wo_stopwords)
# Calculate the percentage share of words
word_freq = freq_dist.most_common(10)
percentage_share = [(word, freq / total_word_count * 100) for word, freq in word_freq]
# Create the plot
plt.figure(figsize=(12, 6))
x, y = zip(*percentage_share)
plt.bar(x, y)
plt.xlabel("Words")
plt.ylabel("Percentage Share")
plt.xticks(size=15, rotation=75)
plt.show()
[nltk_data] Downloading package stopwords to /home/ydn3f/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
plot_ngram_percentage_share(tokens_wo_stopwords, 3, total_word_count, num_results=10)
#Topic Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
count_vectorizer = CountVectorizer(stop_words='english')
term_frequency = count_vectorizer.fit_transform(dataeng_list)
feature_names = count_vectorizer.get_feature_names()
print(f"Shape of term freq matrix = {term_frequency.shape}")
print(f"Num of features identified = {len(feature_names)}")
#LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(term_frequency)
def display_topics(model, feature_names, no_top_words):
    for topic_idx, term_weights in enumerate(model.components_):
        sorted_indx = term_weights.argsort()
        topk_words = [feature_names[i] for i in sorted_indx[-no_top_words:]]
        print(f"Topic {topic_idx}:", end=None)
        print(";".join(topk_words))
display_topics(lda, feature_names, 10)
Shape of term freq matrix = (100, 2888)
Num of features identified = 2888
Topic 0:
pipeline;sql;like;use;just;api;need;https;cloud;data
Topic 1:
isn;app;databricks;data;make;comments;www;https;com;reddit
Topic 2:
years;engineering;people;job;just;time;sql;like;company;data
Topic 3:
years;team;ve;learn;know;really;just;like;databricks;data
Topic 4:
team;blog;dbt;snowflake;databricks;spark;data;instacart;com;https
#TFIDF VEC
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(dataeng_list)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print(f"Shape of tfidf matrix = {tfidf.shape}")
print(f"Num of features identified = {len(tfidf_feature_names)}")
#5 topics
nmf = NMF(n_components=5, random_state=0)
nmf.fit(term_frequency)  # as before, fit on the count matrix; the TF-IDF matrix is only used for feature names
#Top 10 words per topic
display_topics(nmf, tfidf_feature_names, 10)
Shape of tfidf matrix = (100, 2888)
Num of features identified = 2888
/opt/conda/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py:315: FutureWarning: The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).
"'nndsvda' in 1.1 (renaming of 0.26)."), FutureWarning)
Topic 0:
support;isn;official;app;make;comments;www;https;com;reddit
Topic 1:
business;work;files;engineering;years;ve;sql;learn;just;data
Topic 2:
time;data;excel;just;extremely;team;people;job;like;company
Topic 3:
spark;etl;understand;really;platform;cloud;lot;data;snowflake;databricks
Topic 4:
blog;data;course;spark;snowflake;www;instacart;databricks;com;https
# Sample function to assign topics based on keywords
def assign_topic(content):
    if "career" in content.lower():
        return "Career"
    elif "projects" in content.lower():
        return "Projects"
    elif "personal" in content.lower():
        return "Personal"
    elif "people" in content.lower():
        return "People"
    elif "company" in content.lower():
        return "Company"
    else:
        return "Other"

def assign_topic_data(content):
    if "sql" in content.lower():
        return "SQL"
    elif "snowflake" in content.lower():
        return "Snowflake"
    elif "databricks" in content.lower():
        return "Databricks"
    elif "apache" in content.lower():
        return "Apache"
    elif "spark" in content.lower():  # lowercase match, since the content is lowercased
        return "Spark"
    else:
        return "Other"
topics = ["Career", "Projects", "Personal"]
data_topics = ['sql', 'databricks', 'snowflake', 'people', 'company']
# Apply the function to the DataFrame
dataeng['topic'] = dataeng['content'].apply(assign_topic)
dataeng['data_topic'] = dataeng['content'].apply(assign_topic_data)
# Group by topic and calculate the average sentiment score
role_sentiments = dataeng.groupby('topic')['compound'].mean().reset_index()
# Group by topic and calculate summary statistics
topic_summary = dataeng.groupby('topic').agg({
'compound': ['mean', 'min', 'max', 'median', 'std'],
'neg': 'mean',
'neu': 'mean',
'pos': 'mean'
}).reset_index()
# Flatten the multi-index columns
topic_summary.columns = ['_'.join(col).strip() for col in topic_summary.columns.values]
# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']
# Create box plots
plt.figure(figsize=(12, 8))
for sentiment_column in sentiment_columns:
    sns.boxplot(data=dataeng, x='topic', y=sentiment_column, palette='Set2')
    plt.xlabel('Topics')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Topic')
    plt.xticks(rotation=45)
    plt.show()
tools_sentiments = dataeng.groupby('data_topic')['compound'].mean().reset_index()
# Define the sentiment score columns
sentiment_columns = ['compound', 'neg', 'neu', 'pos']
# Create box plots
plt.figure(figsize=(12, 8))
for sentiment_column in sentiment_columns:
    sns.boxplot(data=dataeng, x='data_topic', y=sentiment_column, palette='Set3')
    plt.xlabel('Data Tool Topics')
    plt.ylabel(sentiment_column.capitalize() + ' Score')
    plt.title('Sentiment Analysis by Data Tool Topic')
    plt.xticks(rotation=45)
    plt.show()
–Distribution Plots– Looking at the distributions of compound sentiment, we can see immediately from their shapes how balanced (or skewed) our datasets are. I was a little disappointed to see so many positively scored posts in the r/solotravel dataset; I was hoping for a slightly better mix of positives and negatives. The same goes for the r/dataengineering subreddit.
–{solotravel} sentiment analysis by location– It was interesting to see which geographic areas were ranked highest and lowest by users in the solotravel subreddit based on their compound scores. I made sure to require more than five posts per location to ensure a sufficient sample for a somewhat general consensus; it would not be fair to have one negative review of a country or place represent the whole country if there was only one post about it! It was also interesting to see the bias the subreddit seemingly has toward European travel destinations as opposed to non-European ones, namely in Asia. I examined this disparity more closely by expanding from the absolute average scores to the full range of values via boxplots. What we see then is which geographic locations have a strong versus weak consensus (pos, neg, neu, and compound) based on the box length. For example, from the boxplots we can confirm that the European cities of Berlin, Venice, and Madrid are squarely considered by users in the subreddit to be associated with a positive experience.
–{data engineering} sentiments by topic buckets– Similar to what was done for the solotravel dataset, I grouped the contents of each individual post into selected topics. From there, we can see how posts relating to those topics are generally ranked by sentiment polarity scores. For the most part, posts relating to “projects”, “career”, and “company” are fairly positive. An interesting point about the “people” topic bucket is that while it is positive on average, it has a longer box than the rest, which means it also carries mixed sentiments. Is this representative of the actual data? I don't think so, because “people” is a very general term that can mean many different things depending on context. Without that context, I would argue it doesn't convey as much as the other topic buckets.
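A quick check that supports (or undercuts) that concern is to look at how many posts actually land in each bucket; a small sample in a bucket would make its box shape much less trustworthy:
# How many posts fall into each keyword bucket?
dataeng['topic'].value_counts()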
Then I did the same for data tools topics. It was really cool to see that Snowflake is the most well received of the bunch when people mention it in their posts, judging by the median of the compound boxplot. Conversely, Apache appears to be less well received, though still relatively positive. Databricks, on the other hand, is fairly positive but runs into the familiar problem of having a lot of mixed reviews.
–{covid19} top bi-grams over time– For the covid19 subreddit, I knew I wanted time to be an important element of my analysis, so I built a time-series plot, using weeks rather than months, to analyze the shift in bigram frequencies (mentions) over time. From the scatter plot, we can see that covid-related bigrams are showing up in recent weeks, on the right side of the plot, around where we are in the year (October 2023). But I wouldn't be alarmed: these are merely bigrams, not covid tests, so this is just an indicator of how often covid-related discussion is being brought up. What's also interesting about this scatter plot is the apparent gap in discussion around the week-20 mark, which is around March and April of last year. So around springtime of last year, there was a lull in covid-related discussion among users and submissions in the covid19 subreddit. Interesting!
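To check whether that week-20 gap reflects a real lull in discussion rather than simply fewer posts being captured in those weeks, one could count posts per ISO week (a quick sketch):
# Posts captured per ISO week; sparse weeks would explain gaps in the bigram scatter
covid19.groupby('time_period')['id'].count().sort_index()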
Altogether, I think this was a fantastic project for performing some NLP and text analysis given the time constraint. I feel like I have only just scratched the surface with my insights. Thank you for reading.
The end.