[R Notebook] Kaggle Survey 2022

Section A.
- I. Structure of the dataset
Section A.
- II. Data Prep
Section B.
I. Exploration into gender dynamics
- Gender ratio of survey
- Gender ratio by salary range
Section B
Section B
II. Exploration into global salary ranges
- USA salary vs. Rest of World

Loading package and dependencies:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(tidyr)
library(forcats)
library(stringr)

Section A.

I. Structure of the dataset

The survey data is structured in a way where the headers are the question numbers; and then the first row of the data is the corresponding text body of the question. Since I’m really only after the values of each column, I will go ahead and remove the first row. However, there is another use for the first row…

I will take the first row of data and transform it to use as an index for referencing the column headers. For example, I don’t know what column Q3 or Q7_6 stands for, so I’ll keep to an index for referral.

Here’s what I mean about transforming first to get an index table of the questions.

df_index <- read.csv('./kagglesurvey/kaggle_survey_2022_responses.csv', header=0)
df_index <- transpose(df_index[1:2,])
head(df_index)

Section A.

II. Data Prep

Loading in the survey data

df1 <- read.csv('./kagglesurvey/kaggle_survey_2022_responses.csv')

 df1 <- df1[-1,]
 head(df1$Q5)

## [1] "No"  "No"  "Yes" "No"  "Yes" "Yes"

Section B.

I. Exploration into gender dynamics

Gender ratio of survey

First things first, let us look at the gender demographics of our survey data, not including non-binaries or N/As. Almost immediately, I find that the gender demographics is overwhelmingly male. Data note #1: This tells me that:

The survey was distributed to a population that is mostly male: If your survey was sent to a population or group that is known to be predominantly male, such as a group of software developers, then it is expected that the survey data would be dominated by male respondents.
The survey was not designed to be inclusive: It is possible that the survey was not designed to be inclusive or welcoming to people of all genders. This could result in fewer responses from women and non-binary individuals, and skew the gender demographics.
There is a societal or cultural bias towards men in the subject area of the survey: In some fields, such as technology or finance, there may be a cultural or societal bias that makes it more difficult for women and non-binary individuals to participate or feel welcome. This could result in fewer responses from these groups.

The sad reality is that this does describe the demographics of the programming field: male-dominated with very few representation from female and non-binary programmers. Therefore, these analyses into gender dynamics will be heavily male-skewed.

gender_ratio <- df1 %>% filter(Q3 == "Man" | Q3 == "Woman")

gender_ratio %>% group_by(Q3) %>% summarise(percent = n()/nrow(.))

gender_demog <- gender_ratio %>% group_by(Q3) %>% summarise(percentage = round(n()/nrow(.),4)*100, lab.pos = cumsum(percentage)-.5*percentage)

ggplot(gender_demog, aes(x=1, y=percentage, fill=Q3)) + 
  geom_bar(stat="identity") +
  coord_polar("y", start = 0) +
  geom_text(aes(y = lab.pos, label = paste(percentage,"%", sep = "")), col = "white") + theme_void() +
  scale_fill_manual(values=c("lightblue", "lightpink")) +
  xlim(-1, 2.5)

#ggsave("gender_demo_white.png")

Gender ratio by salary range

Moving on…

To explore the gender dynamics of the dataset, I focus on two columns of interest: Q3 and Q29. Q3 is the reported gender of the survey taker, while Q29 is the range in which their reported salary lies. From this, a new df is created: gender_pay. The goal of this section is to extract value from Q29 in order to make any meaningful inferences about gender dynamics.

How do we extract value from Q29?

Since each row is essentially a person, we can group gender_pay by gender and aggregate the salary range data.

In order to do that, we must first perform some data carpentry on Q29 for it to be remotely useful for data analysis. First, I remove any strings contained in the values (comma’s and dollar signs). However, I will keep the dash - in between the numbers for the time being because I will need it to split the column into two in just a moment.

Right from the jump we run into the first problem with the data: outliers and null values. After running a quick tally of the values available in Q29, I made the decision to remove outliers (those reporting a salary greater than $1 million) because they will skew the statistic. I also went ahead and removed the null values.

Since the non-binary group do not bear any reliable sample size in our data, it would be misleading to include them in the statistics.

After strictly filtering out gender_paygap to no millionaires, and just Man/Woman, I finally removed the dollar signs and commas from column Q29. Then I split the column using the separate() function.

# big revamp
genderp <- df1 %>% select(c("Q3","Q29"))
genderp <- genderp %>% mutate_all(na_if, "")
genderp <- genderp %>% filter(Q3 == "Man" | Q3 == "Woman")
genderp <- genderp %>% filter(Q29 != ">1000000")
genderp <- genderp %>% mutate(across(starts_with("Q29"), ~gsub("\\$", "", .))) %>% mutate(across(starts_with("Q29"), ~gsub("\\,", "", .)))
genderp <- genderp %>% separate(Q29, c("MINsal", "MAXsal"))
genderp$MINsal <- as.numeric(genderp$MINsal)
genderp$MAXsal <- as.numeric(genderp$MAXsal)

After splitting Q29, what I am essentially left with are a “MIN” and “MAX” salary columns. But I can’t really use this piece of data or glean any meaningful information from it by it/themself, even when grouping by gender. So I put the newly created column into one again, but this time designating the newly-joined column as a factor. This way, the salary ranges are no longer just strings, but strings that can be ordered. Ordinal data types!

The following code block re-categorizes the newly-merged salary ranges by giving them a factor level, from high to low. This hierarchy will give the salary range data a numeric-like attribute (merely for ranking from high to low).

genderp <- genderp %>% 
  mutate(
    salary_range = paste0 (
  format(MINsal, trim = TRUE),
  "-", 
  MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
  )

#drop the NAs, then rename gender column
genderp <- genderp %>% drop_na()
genderp <- genderp %>% rename("gender" =  "Q3")

After factoral-izing the newly created column “salary_range”, we can aggregate gender_paygap by gender and begin sorting. For example, I can now run arrange() on salary_range. Here, I create the aggregated table aggr_gpg . Let’s visualize the data using percentages to see the gender ratios (percentage) by salary range.

aggr_gpg <- genderp %>% group_by(salary_range, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))

## `summarise()` has grouped output by 'salary_range'. You can override using the
## `.groups` argument.

ggplot(aggr_gpg, aes(x=salary_range, y=freq, fill=gender)) +
  
geom_bar(data=subset(aggr_gpg, gender == "Man"), stat="identity") +

geom_bar(data=subset(aggr_gpg, gender == "Woman"), stat="identity", aes(y=-freq)) +

geom_hline(yintercept = 0, linetype="dotted", alpha=0.6) +

# Accuracy of y-axis 
scale_y_continuous( labels=c("30%","0%","30%","60%","90%"))  +
  
coord_flip() + scale_fill_manual(values=c("lightblue", "pink")) + theme_minimal() +
  
labs(x = "salary range (USD)", y = "percent share")

#ggsave("gendersalaryratio.png")

Section B

Gender ratio by job title.

In this section, I will look at job titles (Q23) through the lens of gender (Q3). I will name this new dataframe gender_job.

df1 %>% select(Q23) %>% count(Q23)

Similar to the analysis above, I will perform the same data carpentry and filtering. I will also simplify some of the job titles into fewer words.

gender_job <- df1 %>% select(c("Q3", "Q23"))
gender_job <- gender_job %>% mutate_all(na_if, "")
gender_job <- gender_job %>% filter(Q3 == "Man" | Q3 == "Woman")
gender_job <- gender_job %>% filter(Q23 != "Currently not employed" & Q23 != "Other")
gender_job <- gender_job %>% drop_na()
#house cleaning code chunk

gender_job <- gender_job %>% mutate(Q23 = replace(Q23, Q23 == "Data Analyst (Business, Marketing, Financial, Quantitative, etc)", "Data Analyst")) %>% mutate(Q23 = replace(Q23, Q23 == "Manager (Program, Project, Operations, Executive-level, etc)", "Manager")) 
#simplifying strings of some job roles

gender_job <- gender_job %>% rename("gender" = "Q3", "job_title" = "Q23")

After doing a bit of carpentry, I group by job title then gender to see the gender breakdown of each role.

gender_job_ratio <- gender_job %>% group_by(job_title, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))

## `summarise()` has grouped output by 'job_title'. You can override using the
## `.groups` argument.

gender_job %>% group_by(job_title, gender) %>% summarise(n = n()) %>% mutate(freq = n /sum(n))

## `summarise()` has grouped output by 'job_title'. You can override using the
## `.groups` argument.

Then we only need to visualize the newly-grouped table to gather our findings.

Job title by gender ratio

ggplot(gender_job_ratio, aes(x=job_title, y=freq, fill=gender)) +
  
geom_bar(data=subset(gender_job_ratio, gender == "Man"), stat="identity") +

geom_bar(data=subset(gender_job_ratio, gender == "Woman"), stat="identity", aes(y=-freq)) +

geom_hline(yintercept = 0, linetype="dotted", alpha=0.6) +
# Accuracy of y-axis
scale_y_continuous(breaks=c(-0.25,0,0.25,0.5,0.75),labels=c("25%", "0%", "25%","50%", "75%")) +

coord_flip() + scale_fill_manual(values=c("lightblue", "pink")) + theme_minimal() +
  
labs(x="job title", y="percent share")

#ggsave("genderjobratio.png")

Interestingly, teacher and statistician are less male-dominated than the rest of the group - still male-dominated overall but less than compared to the rest. The fact that teachers/educators are culturally female-dominated fields may have some influence on this data point.

Section B

II. Exploration into global salary ranges

USA salary vs. Rest of World

For this next section, I am going to compare separate the reported salaries by countries, specifically USA vs. the rest of the world. For this analysis, we will focus on column Q4, which includes the country of origin. Similar to what I did above with salary ranges, I am going to strip out commas and dollar signs and exclude millionaires. I will first make a general, global dataframe global_sal_range from which I will filter a US-only dataframe usa_sal_range as well as a dataframe excluding USA not_usa_salary.

The end goal in mind is a histogram separating USA salaries from the rest of the world.

global_sal_range <- df1 %>% select(c("Q4","Q29")) %>% mutate(across(starts_with("Q29"), ~gsub("\\$", "", .))) %>% mutate(across(starts_with("Q29"), ~gsub("\\,", "", .)))
#strip out $ and commas

global_sal_range <- global_sal_range  %>% filter(Q29 != ">1000000")
#remove millionaires

global_sal_range<- global_sal_range %>% separate(Q29, c("MINsal", "MAXsal"))

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 15861 rows [1,
## 2, 3, 5, 6, 7, 10, 11, 12, 13, 15, 16, 22, 23, 24, 25, 27, 29, 31, 32, ...].

#string split by "-"

Next I filter down to USA.

usa_sal_range <- global_sal_range %>% filter(Q4 == "United States of America")
usa_sal_range <- usa_sal_range %>% drop_na()
#drop na, then convert columns to numeric
usa_sal_range$MINsal <- as.numeric(usa_sal_range$MINsal)
usa_sal_range$MAXsal <- as.numeric(usa_sal_range$MAXsal)
#after, create factor column corresponding to original salary range columns
usa_sal_range <- usa_sal_range %>% 
  mutate(
    salary_range = paste0 (
  format(MINsal, trim = TRUE),
  "-", 
  MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
  )

usa_sal_range

usa_salary <- usa_sal_range %>% select(salary_range) %>% count(salary_range)

ggplot(usa_salary, aes(x=salary_range, y=n)) +

  geom_col(fill="skyblue") +
  
  coord_flip() + theme_minimal() + labs(x="salary range, annual (USD)", y="count")

#filter for non-USA countries, then drop_na
not_usa_salary <- global_sal_range %>% filter(Q4 != "United States of America")
not_usa_salary <- not_usa_salary %>% drop_na()

#convert to numeric dtype
not_usa_salary$MINsal <- as.numeric(not_usa_salary$MINsal)
not_usa_salary$MAXsal <- as.numeric(not_usa_salary$MAXsal)
#after, create factor column corresponding to original salary range columns
not_usa_salary <- not_usa_salary %>% 
  mutate(
    salary_range = paste0 (
  format(MINsal, trim = TRUE),
  "-", 
  MAXsal
),
salary_range = fct_reorder(salary_range, MINsal)
  )

#convert values to "rest of world"
not_usa_salary['Q4'] <- "Rest of World"

Next I rowbind the usa_sal_range and not_usa_salary into one dataframe: usa_v_row_sal .

#combine the two dataframes by row (not by column)
usa_v_row_sal <- rbind(usa_sal_range, not_usa_salary)
usa_v_row_sal <- usa_v_row_sal %>% rename("country" = "Q4")

Finally, we visualize the result.

ggplot(usa_v_row_sal, aes(x=salary_range, fill=country, color=country)) +
  
  geom_bar(color="white", width=1, linewidth=0) + coord_flip() + theme_minimal() +
  
  scale_fill_manual(values=c("grey90","skyblue")) + 
  
  labs(x="salary range, annual (USD)", y="count") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = c(0.87, 0.25))

#ggsave("salary_distribution.png")

The end.