Loading package and dependencies:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyr)
library(forcats)
library(stringr)
The survey data is structured in a way where the headers are the question numbers; and then the first row of the data is the corresponding text body of the question. Since I’m really only after the values of each column, I will go ahead and remove the first row. However, there is another use for the first row…
I will take the first row of data and transform it to use as an index
for referencing the column headers. For example, I don’t know what
column Q3
or Q7_6
stands for, so I’ll keep to
an index for referral.
Here’s what I mean about transforming first to get an index table of the questions.
df_index <- read.csv('./kagglesurvey/kaggle_survey_2022_responses.csv', header=0)
df_index <- transpose(df_index[1:2,])
head(df_index)
Loading in the survey data
df1 <- read.csv('./kagglesurvey/kaggle_survey_2022_responses.csv')
df1 <- df1[-1,]
head(df1$Q5)
## [1] "No" "No" "Yes" "No" "Yes" "Yes"
First things first, let us look at the gender demographics of our survey data, not including non-binaries or N/As. Almost immediately, I find that the gender demographics is overwhelmingly male. Data note #1: This tells me that:
The survey was distributed to a population that is mostly male: If your survey was sent to a population or group that is known to be predominantly male, such as a group of software developers, then it is expected that the survey data would be dominated by male respondents.
The survey was not designed to be inclusive: It is possible that the survey was not designed to be inclusive or welcoming to people of all genders. This could result in fewer responses from women and non-binary individuals, and skew the gender demographics.
There is a societal or cultural bias towards men in the subject area of the survey: In some fields, such as technology or finance, there may be a cultural or societal bias that makes it more difficult for women and non-binary individuals to participate or feel welcome. This could result in fewer responses from these groups.
The sad reality is that this does describe the demographics of the programming field: male-dominated with very few representation from female and non-binary programmers. Therefore, these analyses into gender dynamics will be heavily male-skewed.
gender_ratio <- df1 %>% filter(Q3 == "Man" | Q3 == "Woman")
gender_ratio %>% group_by(Q3) %>% summarise(percent = n()/nrow(.))
gender_demog <- gender_ratio %>% group_by(Q3) %>% summarise(percentage = round(n()/nrow(.),4)*100, lab.pos = cumsum(percentage)-.5*percentage)
ggplot(gender_demog, aes(x=1, y=percentage, fill=Q3)) +
geom_bar(stat="identity") +
coord_polar("y", start = 0) +
geom_text(aes(y = lab.pos, label = paste(percentage,"%", sep = "")), col = "white") + theme_void() +
scale_fill_manual(values=c("lightblue", "lightpink")) +
xlim(-1, 2.5)
#ggsave("gender_demo_white.png")
Moving on…
To explore the gender dynamics of the dataset, I focus on two columns
of interest: Q3 and Q29. Q3 is the reported gender of the survey taker,
while Q29 is the range in which their reported salary lies. From this, a
new df is created: gender_pay
. The goal of this section is
to extract value from Q29 in order to make any meaningful inferences
about gender dynamics.
How do we extract value from Q29?
Since each row is essentially a person, we can group
gender_pay
by gender and aggregate the salary range
data.
In order to do that, we must first perform some data carpentry on Q29
for it to be remotely useful for data analysis. First, I remove any
strings contained in the values (comma’s and dollar signs). However, I
will keep the dash -
in between the numbers for the time
being because I will need it to split the column into two in just a
moment.
Right from the jump we run into the first problem with the data:
outliers and null values. After running a quick tally of the values
available in Q29, I made the decision to remove outliers
(those reporting a salary greater than $1 million) because they will
skew the statistic. I also went ahead and removed the null values.
Since the non-binary group do not bear any reliable sample size in our data, it would be misleading to include them in the statistics.
After strictly filtering out gender_paygap
to no
millionaires, and just Man/Woman, I finally removed the dollar signs and
commas from column Q29. Then I split the column using the
separate()
function.
# big revamp
genderp <- df1 %>% select(c("Q3","Q29"))
genderp <- genderp %>% mutate_all(na_if, "")
genderp <- genderp %>% filter(Q3 == "Man" | Q3 == "Woman")
genderp <- genderp %>% filter(Q29 != ">1000000")
genderp <- genderp %>% mutate(across(starts_with("Q29"), ~gsub("\\$", "", .))) %>% mutate(across(starts_with("Q29"), ~gsub("\\,", "", .)))
genderp <- genderp %>% separate(Q29, c("MINsal", "MAXsal"))
genderp$MINsal <- as.numeric(genderp$MINsal)
genderp$MAXsal <- as.numeric(genderp$MAXsal)
After splitting Q29, what I am essentially left with are a “MIN” and “MAX” salary columns. But I can’t really use this piece of data or glean any meaningful information from it by it/themself, even when grouping by gender. So I put the newly created column into one again, but this time designating the newly-joined column as a factor. This way, the salary ranges are no longer just strings, but strings that can be ordered. Ordinal data types!
The following code block re-categorizes the newly-merged salary ranges by giving them a factor level, from high to low. This hierarchy will give the salary range data a numeric-like attribute (merely for ranking from high to low).
genderp <- genderp %>%
mutate(
salary_range = paste0 (
format(MINsal, trim = TRUE),
"-",
MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
)
#drop the NAs, then rename gender column
genderp <- genderp %>% drop_na()
genderp <- genderp %>% rename("gender" = "Q3")
After factoral-izing the newly created column “salary_range”, we can
aggregate gender_paygap
by gender and begin sorting. For
example, I can now run arrange()
on salary_range. Here, I
create the aggregated table aggr_gpg
. Let’s visualize the
data using percentages to see the gender ratios (percentage) by salary
range.
aggr_gpg <- genderp %>% group_by(salary_range, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))
## `summarise()` has grouped output by 'salary_range'. You can override using the
## `.groups` argument.
ggplot(aggr_gpg, aes(x=salary_range, y=freq, fill=gender)) +
geom_bar(data=subset(aggr_gpg, gender == "Man"), stat="identity") +
geom_bar(data=subset(aggr_gpg, gender == "Woman"), stat="identity", aes(y=-freq)) +
geom_hline(yintercept = 0, linetype="dotted", alpha=0.6) +
# Accuracy of y-axis
scale_y_continuous( labels=c("30%","0%","30%","60%","90%")) +
coord_flip() + scale_fill_manual(values=c("lightblue", "pink")) + theme_minimal() +
labs(x = "salary range (USD)", y = "percent share")
#ggsave("gendersalaryratio.png")
In this section, I will look at job titles (Q23) through the lens of
gender (Q3). I will name this new dataframe gender_job
.
df1 %>% select(Q23) %>% count(Q23)
Similar to the analysis above, I will perform the same data carpentry and filtering. I will also simplify some of the job titles into fewer words.
gender_job <- df1 %>% select(c("Q3", "Q23"))
gender_job <- gender_job %>% mutate_all(na_if, "")
gender_job <- gender_job %>% filter(Q3 == "Man" | Q3 == "Woman")
gender_job <- gender_job %>% filter(Q23 != "Currently not employed" & Q23 != "Other")
gender_job <- gender_job %>% drop_na()
#house cleaning code chunk
gender_job <- gender_job %>% mutate(Q23 = replace(Q23, Q23 == "Data Analyst (Business, Marketing, Financial, Quantitative, etc)", "Data Analyst")) %>% mutate(Q23 = replace(Q23, Q23 == "Manager (Program, Project, Operations, Executive-level, etc)", "Manager"))
#simplifying strings of some job roles
gender_job <- gender_job %>% rename("gender" = "Q3", "job_title" = "Q23")
After doing a bit of carpentry, I group by job title then gender to see the gender breakdown of each role.
gender_job_ratio <- gender_job %>% group_by(job_title, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))
## `summarise()` has grouped output by 'job_title'. You can override using the
## `.groups` argument.
gender_job %>% group_by(job_title, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))
## `summarise()` has grouped output by 'job_title'. You can override using the
## `.groups` argument.
Then we only need to visualize the newly-grouped table to gather our findings.
ggplot(gender_job_ratio, aes(x=job_title, y=freq, fill=gender)) +
geom_bar(data=subset(gender_job_ratio, gender == "Man"), stat="identity") +
geom_bar(data=subset(gender_job_ratio, gender == "Woman"), stat="identity", aes(y=-freq)) +
geom_hline(yintercept = 0, linetype="dotted", alpha=0.6) +
# Accuracy of y-axis
scale_y_continuous(breaks=c(-0.25,0,0.25,0.5,0.75),labels=c("25%", "0%", "25%","50%", "75%")) +
coord_flip() + scale_fill_manual(values=c("lightblue", "pink")) + theme_minimal() +
labs(x="job title", y="percent share")
#ggsave("genderjobratio.png")
Interestingly, teacher and statistician are less male-dominated than the rest of the group - still male-dominated overall but less than compared to the rest. The fact that teachers/educators are culturally female-dominated fields may have some influence on this data point.
For this next section, I am going to compare separate the reported
salaries by countries, specifically USA vs. the rest of the world. For
this analysis, we will focus on column Q4, which includes the country of
origin. Similar to what I did above with salary ranges, I am going to
strip out commas and dollar signs and exclude millionaires. I will first
make a general, global dataframe global_sal_range
from
which I will filter a US-only dataframe usa_sal_range
as
well as a dataframe excluding USA not_usa_salary
.
The end goal in mind is a histogram separating USA salaries from the rest of the world.
global_sal_range <- df1 %>% select(c("Q4","Q29")) %>% mutate(across(starts_with("Q29"), ~gsub("\\$", "", .))) %>% mutate(across(starts_with("Q29"), ~gsub("\\,", "", .)))
#strip out $ and commas
global_sal_range <- global_sal_range %>% filter(Q29 != ">1000000")
#remove millionaires
global_sal_range<- global_sal_range %>% separate(Q29, c("MINsal", "MAXsal"))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15861 rows [1,
## 2, 3, 5, 6, 7, 10, 11, 12, 13, 15, 16, 22, 23, 24, 25, 27, 29, 31, 32, ...].
#string split by "-"
Next I filter down to USA.
usa_sal_range <- global_sal_range %>% filter(Q4 == "United States of America")
usa_sal_range <- usa_sal_range %>% drop_na()
#drop na, then convert columns to numeric
usa_sal_range$MINsal <- as.numeric(usa_sal_range$MINsal)
usa_sal_range$MAXsal <- as.numeric(usa_sal_range$MAXsal)
#after, create factor column corresponding to original salary range columns
usa_sal_range <- usa_sal_range %>%
mutate(
salary_range = paste0 (
format(MINsal, trim = TRUE),
"-",
MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
)
usa_sal_range
usa_salary <- usa_sal_range %>% select(salary_range) %>% count(salary_range)
ggplot(usa_salary, aes(x=salary_range, y=n)) +
geom_col(fill="skyblue") +
coord_flip() + theme_minimal() + labs(x="salary range, annual (USD)", y="count")
#filter for non-USA countries, then drop_na
not_usa_salary <- global_sal_range %>% filter(Q4 != "United States of America")
not_usa_salary <- not_usa_salary %>% drop_na()
#convert to numeric dtype
not_usa_salary$MINsal <- as.numeric(not_usa_salary$MINsal)
not_usa_salary$MAXsal <- as.numeric(not_usa_salary$MAXsal)
#after, create factor column corresponding to original salary range columns
not_usa_salary <- not_usa_salary %>%
mutate(
salary_range = paste0 (
format(MINsal, trim = TRUE),
"-",
MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
)
#convert values to "rest of world"
not_usa_salary['Q4'] <- "Rest of World"
Next I rowbind the usa_sal_range
and
not_usa_salary
into one dataframe:
usa_v_row_sal
.
#combine the two dataframes by row (not by column)
usa_v_row_sal <- rbind(usa_sal_range, not_usa_salary)
usa_v_row_sal <- usa_v_row_sal %>% rename("country" = "Q4")
Finally, we visualize the result.
ggplot(usa_v_row_sal, aes(x=salary_range, fill=country, color=country)) +
geom_bar(color="white", width=1, linewidth=0) + coord_flip() + theme_minimal() +
scale_fill_manual(values=c("grey90","skyblue")) +
labs(x="salary range, annual (USD)", y="count") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = c(0.87, 0.25))
#ggsave("salary_distribution.png")
The end.