Goal: I will perform customer segmentation based on retail data. The aim of this project is to use unsupervised learning and clustering to summarize the customer segments and population. More attribute information and the data context can be found here: Kaggle source
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.1     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
my_data <- read.csv("../data/marketingcampaign.csv")
names(my_data) <- tolower(names(my_data)) #convert columns to lowercase
dim(my_data)
str(my_data)
'data.frame': 2240 obs. of 29 variables:
$ id : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
$ year_birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
$ marital_status : chr "Single" "Single" "Together" "Together" ...
$ income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
$ kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ dt_customer : chr "4/9/12" "8/3/14" "21-08-2013" "10/2/14" ...
$ recency : int 58 38 26 26 94 16 34 32 19 68 ...
$ mntwines : int 635 11 426 11 173 520 235 76 14 28 ...
$ mntfruits : int 88 1 49 4 43 42 65 10 0 0 ...
$ mntmeatproducts : int 546 6 127 20 118 98 164 56 24 6 ...
$ mntfishproducts : int 172 2 111 10 46 0 50 3 3 1 ...
$ mntsweetproducts : int 88 1 21 3 27 42 49 1 3 1 ...
$ mntgoldprods : int 88 6 42 5 15 14 27 23 2 13 ...
$ numdealspurchases : int 3 2 1 2 5 2 4 2 1 1 ...
$ numwebpurchases : int 8 1 8 2 5 6 7 4 3 1 ...
$ numcatalogpurchases: int 10 1 2 0 3 4 3 0 0 0 ...
$ numstorepurchases : int 4 2 10 4 6 10 7 4 2 0 ...
$ numwebvisitsmonth : int 7 5 4 6 5 6 6 8 9 20 ...
$ acceptedcmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
$ acceptedcmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ complain : int 0 0 0 0 0 0 0 0 0 0 ...
$ z_costcontact : int 3 3 3 3 3 3 3 3 3 3 ...
$ z_revenue : int 11 11 11 11 11 11 11 11 11 11 ...
$ response : int 1 0 0 0 0 0 0 0 1 0 ...
head(my_data)
 | id | year_birth | education | marital_status | income | kidhome | teenhome | dt_customer | recency | mntwines | ⋯ | numwebvisitsmonth | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | acceptedcmp1 | acceptedcmp2 | complain | z_costcontact | z_revenue | response |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | <int> | <int> | <chr> | <chr> | <int> | <int> | <int> | <chr> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
1 | 5524 | 1957 | Graduation | Single | 58138 | 0 | 0 | 4/9/12 | 58 | 635 | ⋯ | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
2 | 2174 | 1954 | Graduation | Single | 46344 | 1 | 1 | 8/3/14 | 38 | 11 | ⋯ | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
3 | 4141 | 1965 | Graduation | Together | 71613 | 0 | 0 | 21-08-2013 | 26 | 426 | ⋯ | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
4 | 6182 | 1984 | Graduation | Together | 26646 | 1 | 0 | 10/2/14 | 26 | 11 | ⋯ | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 | 5324 | 1981 | PhD | Married | 58293 | 1 | 0 | 19-01-2014 | 94 | 173 | ⋯ | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
6 | 7446 | 1967 | Master | Together | 62513 | 0 | 1 | 9/9/13 | 16 | 520 | ⋯ | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
num_vars <- sum(sapply(my_data, is.numeric))
non_num_vars <- sum(sapply(my_data, function(x) !is.numeric(x)))
# Print the results
cat("There are", num_vars, "numeric variables and", non_num_vars, "non-numeric variables in the dataset.")
There are 26 numeric variables and 3 non-numeric variables in the dataset.
I would say this is a pretty good dataset for correlation analysis since it is numeric-heavy. Next, I will A) preprocess the data by removing any null values detected, and B) engineer new features and drop any irrelevant ones.
# Get count of null values in each column, arrange by count
sort(colSums(is.na(my_data)), decreasing = TRUE)
The “income” column has 24 null values. I need to decide what to do with them: do I keep or remove those rows? If I keep them, how do I fill the nulls (with the average, for example)?
Since only 24 rows out of the 2,240 observations are affected by null values, I'm simply going to drop them from the dataframe.
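For reference, a minimal sketch of the imputation route I decided against, assuming simple mean imputation; the name my_data_imputed is hypothetical and is not used below.
# Alternative (not used here): fill missing incomes with the column mean instead of dropping rows
my_data_imputed <- my_data %>%
  mutate(income = replace(income, is.na(income), round(mean(income, na.rm = TRUE))))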
ante <- nrow(my_data)
my_data <- my_data %>% na.omit()
poste <- nrow(my_data)
cat("Number of rows before null removal: ", ante, "\n")
cat("Number of rows after null removal: ", poste, "\n")
Number of rows before null removal: 2240
Number of rows after null removal: 2216
Let’s look at the categorical variables next in our data and see if we can revise them further into simplified segments.
Why segment categorical variables?
In this case, segmenting education levels into three groups helps to simplify and standardize the analysis. By reducing the number of categories in the variable, it becomes easier to identify patterns or trends in the data, especially when the original column has a large number of distinct values. Additionally, segmenting the values of a categorical column into a few levels makes it easier to perform statistical tests or create visualizations that require discrete categories.
my_data %>% select_if(is.character) %>% head
 | education | marital_status | dt_customer |
---|---|---|---|
 | <chr> | <chr> | <chr> |
1 | Graduation | Single | 4/9/12 |
2 | Graduation | Single | 8/3/14 |
3 | Graduation | Together | 21-08-2013 |
4 | Graduation | Together | 10/2/14 |
5 | PhD | Married | 19-01-2014 |
6 | Master | Together | 9/9/13 |
categ_data <- my_data %>% select_if(is.character) %>% select(-dt_customer)
for (col in names(categ_data)) {
table <- categ_data %>% group_by(!!sym(col)) %>% summarize(count = n())
print(table)
}
# A tibble: 5 x 2
  education  count
  <chr>      <int>
1 2n Cycle     200
2 Basic         54
3 Graduation  1116
4 Master       365
5 PhD          481
# A tibble: 8 x 2
  marital_status count
  <chr>          <int>
1 Absurd             2
2 Alone              3
3 Divorced         232
4 Married          857
5 Single           471
6 Together         573
7 Widow             76
8 YOLO               2
The sheer number of distinct values is unnecessarily messy; however, I can segment the “education” and “marital_status” columns into simpler bins. I will use case_when() to map existing values in the education column to new values. For “marital_status”, I will use the ifelse() function to categorize the values based on a specified condition. (Data collectors: when surveying for categorical variables in your data, always require data validation by allowing only certain values from a drop-down list.)
#Segmenting education levels into three bins
my_data <- my_data %>% mutate(education = case_when(
education == "Basic" ~ "Undergraduate",
education == "2n Cycle" ~ "Undergraduate",
education == "Graduation" ~ "Graduate",
education == "Master" ~ "Postgraduate",
education == "PhD" ~ "Postgraduate",
TRUE ~ education
))
my_data %>% count(education)
#Segmenting marital status into two bins
my_data$living_with <- ifelse(my_data$marital_status %in% c("Single", "Divorced", "Widow", "Alone", "Absurd", "YOLO"), "Alone", "Partnered")
my_data %>% count(living_with) %>% arrange(desc(n))
education | n |
---|---|
<chr> | <int> |
Graduate | 1116 |
Postgraduate | 846 |
Undergraduate | 254 |
living_with | n |
---|---|
<chr> | <int> |
Partnered | 1430 |
Alone | 786 |
In the next part, I will engineer some new features, in a similar fashion to the segmentation above:
library(lubridate)
# Convert year_birth to date format
my_data$date_birth <- ymd(paste(my_data$year_birth, "-01-01", sep = ""))
# Calculate age
my_data$age <- interval(my_data$date_birth, Sys.Date()) %/% years(1)
# Create tspent feature
my_data <- my_data %>%
mutate(tspent = mntwines + mntfruits + mntmeatproducts + mntfishproducts + mntsweetproducts + mntgoldprods)
# Create dependents feature
my_data <- my_data %>%
mutate(dependents = kidhome + teenhome)
# Create household_size feature
my_data <- my_data %>%
mutate(living_with_num = ifelse(living_with == "Alone", 1, 2), # recode "Alone" to 1 and "Partnered" to 2
household_size = dependents + living_with_num) # add the two columns to create "household_size"
# Create is_parent feature
my_data <- my_data %>%
mutate(is_parent = if_else(dependents > 0, 1, 0))
# Rename some columns for readability
names(my_data) <- sub("mntwines", "wines", names(my_data))
names(my_data) <- sub("mntfruits", "fruits", names(my_data))
names(my_data) <- sub("mntmeatproducts", "meat", names(my_data))
names(my_data) <- sub("mntfishproducts", "fish", names(my_data))
names(my_data) <- sub("mntsweetproducts", "sweets", names(my_data))
names(my_data) <- sub("mntgoldprods", "gold", names(my_data))
# Drop redundant features
exclude <- c("marital_status", "dt_customer", "z_costcontact", "z_revenue", "date_birth", "id", "living_with_num")
my_data <- my_data[, !(names(my_data) %in% exclude)]
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
str(my_data)
'data.frame': 2216 obs. of 30 variables:
$ year_birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ education : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
$ income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
$ kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ recency : int 58 38 26 26 94 16 34 32 19 68 ...
$ wines : int 635 11 426 11 173 520 235 76 14 28 ...
$ fruits : int 88 1 49 4 43 42 65 10 0 0 ...
$ meat : int 546 6 127 20 118 98 164 56 24 6 ...
$ fish : int 172 2 111 10 46 0 50 3 3 1 ...
$ sweets : int 88 1 21 3 27 42 49 1 3 1 ...
$ gold : int 88 6 42 5 15 14 27 23 2 13 ...
$ numdealspurchases : int 3 2 1 2 5 2 4 2 1 1 ...
$ numwebpurchases : int 8 1 8 2 5 6 7 4 3 1 ...
$ numcatalogpurchases: int 10 1 2 0 3 4 3 0 0 0 ...
$ numstorepurchases : int 4 2 10 4 6 10 7 4 2 0 ...
$ numwebvisitsmonth : int 7 5 4 6 5 6 6 8 9 20 ...
$ acceptedcmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
$ acceptedcmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ complain : int 0 0 0 0 0 0 0 0 0 0 ...
$ response : int 1 0 0 0 0 0 0 0 1 0 ...
$ living_with : chr "Alone" "Alone" "Partnered" "Partnered" ...
$ age : num 66 69 58 39 42 56 52 38 49 73 ...
$ tspent : int 1617 27 776 53 422 716 590 169 46 49 ...
$ dependents : int 0 2 0 1 1 1 1 1 1 2 ...
$ household_size : num 1 3 2 3 3 3 2 3 3 4 ...
$ is_parent : num 0 1 0 1 1 1 1 1 1 1 ...
Lastly, only two categorical variables (“living_with” and “education”) remain in the dataframe. I'm going to convert them to factors, which are better suited for statistical analysis and visualization than character types.
#convert to factor type
my_data$education <- factor(my_data$education, levels = c("Undergraduate", "Graduate", "Postgraduate"))
my_data$living_with <- factor(my_data$living_with, levels = c("Alone", "Partnered"))
num_vars <- sum(sapply(my_data, is.numeric))
non_num_vars <- sum(sapply(my_data, function(x) !is.numeric(x)))
# Print the results
cat("There are", num_vars, "numeric variables and", non_num_vars, "non-numeric variables in the dataset.")
There are 28 numeric variables and 2 non-numeric variables in the dataset.
As I begin the next stage in my data science project life cycle, I will split my variables by data type into numeric and categorical subsets for easier exploration and visualization.
num_data <- my_data %>%
select_if(is.numeric)
cat_data <- my_data %>%
select_if(is.factor)
summary(my_data)
year_birth education income kidhome
Min. :1893 Undergraduate: 254 Min. : 1730 Min. :0.0000
1st Qu.:1959 Graduate :1116 1st Qu.: 35303 1st Qu.:0.0000
Median :1970 Postgraduate : 846 Median : 51382 Median :0.0000
Mean :1969 Mean : 52247 Mean :0.4418
3rd Qu.:1977 3rd Qu.: 68522 3rd Qu.:1.0000
Max. :1996 Max. :666666 Max. :2.0000
teenhome recency wines fruits
Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. : 0.00
1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00
Median :0.0000 Median :49.00 Median : 174.5 Median : 8.00
Mean :0.5054 Mean :49.01 Mean : 305.1 Mean : 26.36
3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.00
Max. :2.0000 Max. :99.00 Max. :1493.0 Max. :199.00
meat fish sweets gold
Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00
Median : 68.0 Median : 12.00 Median : 8.00 Median : 24.50
Mean : 167.0 Mean : 37.64 Mean : 27.03 Mean : 43.97
3rd Qu.: 232.2 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00
Max. :1725.0 Max. :259.00 Max. :262.00 Max. :321.00
numdealspurchases numwebpurchases numcatalogpurchases numstorepurchases
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000
Median : 2.000 Median : 4.000 Median : 2.000 Median : 5.000
Mean : 2.324 Mean : 4.085 Mean : 2.671 Mean : 5.801
3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000
Max. :15.000 Max. :27.000 Max. :28.000 Max. :13.000
numwebvisitsmonth acceptedcmp3 acceptedcmp4 acceptedcmp5
Min. : 0.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
Median : 6.000 Median :0.00000 Median :0.00000 Median :0.0000
Mean : 5.319 Mean :0.07356 Mean :0.07401 Mean :0.0731
3rd Qu.: 7.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :20.000 Max. :1.00000 Max. :1.00000 Max. :1.0000
acceptedcmp1 acceptedcmp2 complain response
Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
Median :0.00000 Median :0.00000 Median :0.000000 Median :0.0000
Mean :0.06408 Mean :0.01354 Mean :0.009477 Mean :0.1503
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.0000
living_with age tspent dependents
Alone : 786 Min. : 27.00 Min. : 5.0 Min. :0.0000
Partnered:1430 1st Qu.: 46.00 1st Qu.: 69.0 1st Qu.:0.0000
Median : 53.00 Median : 396.5 Median :1.0000
Mean : 54.18 Mean : 607.1 Mean :0.9472
3rd Qu.: 64.00 3rd Qu.:1048.0 3rd Qu.:1.0000
Max. :130.00 Max. :2525.0 Max. :3.0000
household_size is_parent
Min. :1.000 Min. :0.0000
1st Qu.:2.000 1st Qu.:0.0000
Median :3.000 Median :1.0000
Mean :2.593 Mean :0.7144
3rd Qu.:3.000 3rd Qu.:1.0000
Max. :5.000 Max. :1.0000
library(purrr)
options(repr.plot.width=14, repr.plot.height=8)
# Identify the numeric variables
num_vars <- my_data %>%
select_if(is.numeric) %>%
names()
# Create the facet grid of density plots
my_data %>%
select(all_of(num_vars)) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~ variable, scales = "free")
library(GGally)
ggpairs(num_data)
For this code block, I called the ggpairs() function to generate a pairs plot (scatter plot matrix) of the variables in my dataset. The result is a computationally intensive visualization that I will forego, for the simple reason that with this many variables the pairs plot is not meaningful for our analysis.
In plain terms, the (would-be) pairs plot is simply too cumbersome and large to detect any meaningful relationship or insight from. I will try a correlation matrix heatmap and see if that is any better…
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cor(num_data))
options(repr.plot.width=16, repr.plot.height=16)
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Generate correlation matrix heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 13, hjust = 1), axis.text.y = element_text(size = 13))+
coord_fixed() +
geom_text(aes(label = round(value, 2)), size = 4, color = "black")
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
This is simply too many variables to account for. The correlation matrix heatmap is slightly better, and it does surface strongly and inversely correlated variables, but I'm still running into the volume problem.
I am going to employ PCA (Principal Component Analysis) to reduce the dimensions of the data. This is to help me simplify the dataset to a dimension that selects only variables with the most variance and relationships to another while keeping visualizations manageable.
Before employing PCA, I need to convert the categorical variables (education and living_with) to numeric using one-hot encoding. dummy_cols
function from the fastDummies library allows me to do just that. This process allows me to use these categorical variables as features in machine learning models and most importantly PCA.
I will store the newly-encoded features in a new dataframe “encoded_data”, where I will also remove the original categorical features so that I just have numerical variables.
Another careful consideration before deploying PCA is the presence of outliers in the data, though it is also important to consider the reason for their presence and whether they are genuine observations or errors. Based on the following boxplots, the age column has values well over 100, and there is a sole extreme outlier in the income variable. I think it is safe to assume that these outliers are errors rather than genuine observations (at least for the age outliers).
options(repr.plot.width=16, repr.plot.height=10)
par(mfrow = c(2,4))
par(mar=c(2,2,2,2))
invisible(lapply(1:ncol(num_data), function(i) boxplot(num_data[, i], main=colnames(num_data)[i],
ylab=colnames(num_data)[i])))
## filtering the outliers from age and income column
my_data <- my_data %>% filter(age < 100, income < 150000, year_birth > 1920)
install.packages("fastDummies")
library(fastDummies)
# Convert categorical variables into numerical variables using one-hot encoding
encoded_data <- dummy_cols(my_data, select_columns = c("education", "living_with"), remove_selected_columns = TRUE)
# lowercase new columns, which have been encoded
names(encoded_data) <- tolower(names(encoded_data))
Updating HTML index of packages in '.Library'
Making 'packages.html' ...
done
str(encoded_data)
'data.frame': 2205 obs. of 33 variables:
$ year_birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
$ kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ recency : int 58 38 26 26 94 16 34 32 19 68 ...
$ wines : int 635 11 426 11 173 520 235 76 14 28 ...
$ fruits : int 88 1 49 4 43 42 65 10 0 0 ...
$ meat : int 546 6 127 20 118 98 164 56 24 6 ...
$ fish : int 172 2 111 10 46 0 50 3 3 1 ...
$ sweets : int 88 1 21 3 27 42 49 1 3 1 ...
$ gold : int 88 6 42 5 15 14 27 23 2 13 ...
$ numdealspurchases : int 3 2 1 2 5 2 4 2 1 1 ...
$ numwebpurchases : int 8 1 8 2 5 6 7 4 3 1 ...
$ numcatalogpurchases : int 10 1 2 0 3 4 3 0 0 0 ...
$ numstorepurchases : int 4 2 10 4 6 10 7 4 2 0 ...
$ numwebvisitsmonth : int 7 5 4 6 5 6 6 8 9 20 ...
$ acceptedcmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
$ acceptedcmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ acceptedcmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ complain : int 0 0 0 0 0 0 0 0 0 0 ...
$ response : int 1 0 0 0 0 0 0 0 1 0 ...
$ age : num 66 69 58 39 42 56 52 38 49 73 ...
$ tspent : int 1617 27 776 53 422 716 590 169 46 49 ...
$ dependents : int 0 2 0 1 1 1 1 1 1 2 ...
$ household_size : num 1 3 2 3 3 3 2 3 3 4 ...
$ is_parent : num 0 1 0 1 1 1 1 1 1 1 ...
$ education_undergraduate: int 0 0 0 0 0 0 0 0 0 0 ...
$ education_graduate : int 1 1 1 1 0 0 1 0 0 0 ...
$ education_postgraduate : int 0 0 0 0 1 1 0 1 1 1 ...
$ living_with_alone : int 1 1 0 0 0 0 1 0 0 0 ...
$ living_with_partnered : int 0 0 1 1 1 1 0 1 1 1 ...
Now the stage is set to employ PCA.
#scale data
scaled_encoded <- scale(encoded_data)
#PCA
pca_result <- prcomp(scaled_encoded)
summary(pca_result)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.9871 1.8235 1.51779 1.42073 1.39579 1.25566 1.12552
Proportion of Variance 0.2704 0.1008 0.06981 0.06117 0.05904 0.04778 0.03839
Cumulative Proportion 0.2704 0.3711 0.44096 0.50212 0.56116 0.60894 0.64733
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 1.08725 1.04687 1.00433 0.99746 0.93483 0.86835 0.79858
Proportion of Variance 0.03582 0.03321 0.03057 0.03015 0.02648 0.02285 0.01933
Cumulative Proportion 0.68315 0.71636 0.74692 0.77707 0.80356 0.82640 0.84573
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 0.78072 0.75936 0.74098 0.73421 0.67504 0.65138 0.64311
Proportion of Variance 0.01847 0.01747 0.01664 0.01634 0.01381 0.01286 0.01253
Cumulative Proportion 0.86420 0.88167 0.89831 0.91465 0.92846 0.94131 0.95385
PC22 PC23 PC24 PC25 PC26 PC27
Standard deviation 0.60392 0.57685 0.52460 0.45855 0.44931 0.37180
Proportion of Variance 0.01105 0.01008 0.00834 0.00637 0.00612 0.00419
Cumulative Proportion 0.96490 0.97498 0.98332 0.98969 0.99581 1.00000
PC28 PC29 PC30 PC31 PC32
Standard deviation 6.136e-16 5.416e-16 4.723e-16 2.965e-16 2.109e-16
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
PC33
Standard deviation 1.765e-16
Proportion of Variance 0.000e+00
Cumulative Proportion 1.000e+00
plot(pca_result, type = "l")
From the results of the PCA, I can identify that the first two principal components explain about 37% of the variance (cumulative proportion 0.3711), meaning they capture a substantial share of the total variability in the data. Additionally, the scree plot, read with the elbow method, confirms that a small number of PCs is sufficient.
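As a sanity check, the variance shares reported by summary(pca_result) can be recomputed directly from the component standard deviations; this is just a sketch of the arithmetic behind the figure above.
# Proportion of variance explained by each component, from the PCA standard deviations
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Cumulative share captured by PC1 and PC2 (roughly 0.37 here)
cumsum(var_explained)[1:2]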
To identify the most important features based on the PCA, we can look at the absolute values of the loadings of each variable on PC1 and PC2 and select the variables with the highest absolute loadings. These variables contribute the most to the variability captured by PC1 and PC2, and can be used as input variables for the next stage of my analysis: clustering. I will set a threshold level for each PC to narrow the list of features down further; 33 features are too many for cluster analysis.
#selecting only the most important features from PC1 in the form of a list
pc1_features <- pca_result$rotation[, 1] %>%
abs() %>%
sort(decreasing = TRUE) %>%
head(9) %>% names()
#table view
pca_result$rotation[, 1] %>%
abs() %>%
sort(decreasing = TRUE) %>%
as.data.frame() %>%
filter(. >= 0.23) %>%
rownames_to_column(var = "feature")
feature | . |
---|---|
<chr> | <dbl> |
tspent | 0.3112970 |
income | 0.2794197 |
meat | 0.2781833 |
numcatalogpurchases | 0.2739654 |
wines | 0.2558732 |
fish | 0.2344989 |
dependents | 0.2339325 |
is_parent | 0.2324959 |
kidhome | 0.2321033 |
numstorepurchases | 0.2309661 |
#selecting only the most important features from PC2 in the form of a list
pc2_features <- pca_result$rotation[, 2] %>%
abs() %>%
sort(decreasing = TRUE) %>%
head(8) %>% names()
#table view
pca_result$rotation[, 2] %>%
abs() %>%
sort(decreasing = TRUE) %>%
as.data.frame() %>%
filter(. >= 0.23) %>%
rownames_to_column(var = "feature")
feature | . |
---|---|
<chr> | <dbl> |
teenhome | 0.4207707 |
age | 0.3320624 |
year_birth | 0.3320624 |
household_size | 0.3256689 |
dependents | 0.2837317 |
numdealspurchases | 0.2713271 |
numwebpurchases | 0.2384993 |
is_parent | 0.2332432 |
I now arrive at the stage of feature selection, where I subset my original (scaled) dataframe to only those features appearing in either PC1 or PC2. The resulting subset will be used for cluster analysis.
scaled_encoded_small <- scaled_encoded[, unique(c(pc1_features, pc2_features))]
head(scaled_encoded_small)
tspent | income | meat | numcatalogpurchases | wines | fish | dependents | is_parent | kidhome | teenhome | age | year_birth | household_size | numdealspurchases | numwebpurchases |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.6789425 | 0.3145795 | 1.7480030 | 2.6279300 | 0.9743448 | 2.4485988 | -1.26630212 | -1.5843004 | -0.8232184 | -0.9305557 | 1.0169580 | -1.0169580 | -1.7586132 | 0.3613967 | 1.4244487 |
-0.9636789 | -0.2548196 | -0.7315122 | -0.5879096 | -0.8745778 | -0.6521970 | 1.40310149 | 0.6309072 | 1.0385217 | 0.9063962 | 1.2732412 | -1.2732412 | 0.4484113 | -0.1687960 | -1.1327002 |
0.2811786 | 0.9651351 | -0.1759171 | -0.2305941 | 0.3550743 | 1.3359603 | -1.26630212 | -1.5843004 | -0.8232184 | -0.9305557 | 0.3335362 | -0.3335362 | -0.6551009 | -0.6989887 | 1.4244487 |
-0.9204662 | -1.2058136 | -0.6672284 | -0.9452251 | -0.8745778 | -0.5062772 | 0.06839968 | 0.6309072 | 1.0385217 | -0.9305557 | -1.2895907 | 1.2895907 | 0.4484113 | -0.1687960 | -0.7673932 |
-0.3071786 | 0.3220627 | -0.2172424 | 0.1267214 | -0.3945691 | 0.1503619 | 0.06839968 | 0.6309072 | 1.0385217 | -0.9305557 | -1.0333075 | 1.0333075 | 0.4484113 | 1.4217821 | 0.3285278 |
0.1814571 | 0.5257989 | -0.3090762 | 0.4840369 | 0.6335979 | -0.6886770 | 0.06839968 | 0.6309072 | -0.8232184 | 0.9063962 | 0.1626807 | -0.1626807 | 0.4484113 | -0.1687960 | 0.6938348 |
I have finally selected my most important features using PCA: I went from 33 features to 15, a little less than half. I have named this new dataframe “scaled_encoded_small”. I will now perform cluster analysis and modelling, after which I will compare the clusters to gain insights and use hypothesis testing to examine differences among the features.
Recall that this is an unsupervised learning approach: we do not have a tagged feature or label to evaluate or score our model against. The purpose of this section is to study the clusters that form and characterize the patterns they reveal.
To determine the ideal number of clusters to use, I employ the silhouette method using fviz_nbclust()
from the “factoextra” package.
library(factoextra)
library(cluster)
library(fpc)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determine optimal number of clusters using silhouette method
fviz_nbclust(scaled_encoded, kmeans, method = "silhouette")
# Select the number of clusters based on findings of silhouette method
k <- 2
# Run the k-means algorithm
kclust <- kmeans(scaled_encoded_small, centers = k, nstart = 20)
# View the cluster centers
kclust$centers
# View the cluster assignments
kclust$cluster
# Plot the clusters
clusplot(scaled_encoded_small, kclust$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
#save clusterplot
# png("clusplot.png", width = 30, height = 20, units = "cm", res = 150)
# # Plot the clusters
# clusplot(scaled_encoded_small, kclust$cluster, color = TRUE, shade = TRUE, labels = FALSE, lines = 0, main = "cluster plot (k-means)")
# dev.off()
table(kclust$cluster)
#this table indicates that there are 793 observations in cluster 1 and 1419 observations in cluster 2.
Append the newly generated cluster designations to our original dataframe (another copy for the encoded version of df)
clustered_data <- cbind(encoded_data, cluster = kclust$cluster)
clustered_ogdata <- cbind(my_data, cluster = kclust$cluster)
clustered_data <- as.data.frame(clustered_data)
clustered_ogdata <- as.data.frame(clustered_ogdata)
# Compute the correlation matrix for each cluster
library(corrplot)
#Important to exclude the cluster variable from the correlation matrix
cor_mat_cluster1 <- cor(clustered_data[clustered_data$cluster == 1, -which(names(clustered_data) == "cluster")])
cor_mat_cluster2 <- cor(clustered_data[clustered_data$cluster == 2, -which(names(clustered_data) == "cluster")])
# Remove upper triangle from each matrix
cor_mat_cluster1[upper.tri(cor_mat_cluster1)] <- NA
cor_mat_cluster2[upper.tri(cor_mat_cluster2)] <- NA
# Plot the correlation matrix for each cluster
corrplot(cor_mat_cluster1, method = "color")
corrplot(cor_mat_cluster2, method = "color")
Based on the correlation matrix for each cluster, the relationships that stand out between clusters 1 and 2 involve spending habits.
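To make that contrast concrete, a quick sketch that reuses the matrices computed above is to rank variable pairs by how much their within-cluster correlations differ; the cor_diff name is mine.
# Rank variable pairs by the gap between their correlations in cluster 1 vs. cluster 2
cor_diff <- abs(cor_mat_cluster1 - cor_mat_cluster2)  # upper triangles are NA, so only lower-triangle pairs remain
melt(cor_diff, na.rm = TRUE) %>%
  filter(as.character(Var1) != as.character(Var2)) %>%  # drop the diagonal
  arrange(desc(value)) %>%
  head(10)                                              # ten pairs with the largest difference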
#Examine the characteristic of each cluster
table(clustered_ogdata$education, clustered_ogdata$living_with)
# Conduct t-test to compare mean value of "totalspent" between the two clusters and test significance of difference
t.test(clustered_ogdata$tspent[clustered_ogdata$cluster == 1],
clustered_ogdata$tspent[clustered_ogdata$cluster == 2])
cat("mean total spending of cluster 1: ", mean(clustered_ogdata$tspent[clustered_ogdata$cluster == 1]))
cat("\nmean total spending of cluster 2: ", mean(clustered_ogdata$tspent[clustered_ogdata$cluster == 2]))
summary(clustered_ogdata$tspent[clustered_ogdata$cluster == 1])
summary(clustered_ogdata$tspent[clustered_ogdata$cluster == 2])
H₀: There is no difference in mean total spending (tspent) between the two clusters.
H₁: There is a significant difference in mean total spending (tspent) between the two clusters.
Based on the results of the t-test, the p-value is less than the alpha level (e.g., 0.05), so we can reject the null hypothesis and conclude that there is a significant difference in the mean value of “tspent” between the two clusters. In other words, the statistically significant difference (in means) suggests that two clusters are characterized by different spending behaviors.
I am going to use a chi-square test to compare the proportion of observations who are non-parents and parents (0 and 1 respectively) between the two clusters. The null hypothesis for this test is that there is no difference in the proportion of customers who are parents between the two clusters; the alternative hypothesis is that there is a significant difference in the proportions.
table(clustered_ogdata$cluster, clustered_ogdata$is_parent)
chisq_test <- chisq.test(table(clustered_ogdata$cluster, clustered_ogdata$is_parent))
print(chisq_test)
Since the p-value is less than the alpha, we can reject the null hypothesis that the proportion of parents is the same in both clusters. The results of the chi-square test indicate a statistically significant difference in the proportion of customers who are parents between the two clusters. Overall, this suggests that the presence or absence of children in a household may be an important characteristic in distinguishing between the two customer clusters.
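As a quick supporting sketch, row-wise proportions make the difference behind the chi-square result easy to read:
# Share of non-parents (0) and parents (1) within each cluster
prop.table(table(clustered_ogdata$cluster, clustered_ogdata$is_parent), margin = 1)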
Testing for statistically significant differences in means among groups: for this three-way ANOVA, I am comparing mean total spending (tspent) across different factors (living situation, parental status, education) to test for statistical significance. The null hypothesis is that there is no significant difference in mean total spending across the groups.
anovad <- clustered_ogdata %>% mutate(is_parent = if_else(is_parent == 1, "yes", "no")) %>%
mutate(is_parent = as.factor(is_parent)) # convert to factor
# Conduct a two-way ANOVA
anova <- aov(tspent ~ living_with*is_parent*education, data = anovad)
# Check the ANOVA table
summary(anova)
# Check the assumption of normality of residuals
qqnorm(resid(anova))
qqline(resid(anova))
# Check the assumption of homogeneity of variances
plot(anova)
# Check the interaction plot
interaction.plot(clustered_ogdata$income, clustered_ogdata$education, clustered_ogdata$tspent)
# Conduct post-hoc test
TukeyHSD(anova)
summary(anova)
ANOVA findings:
The extremely small p-value for the education variable indicates that there is a significant difference in mean total spending across education groups. The same can be said for the parental status of the customer (i.e. parent vs. non-parent). In other words, there is a significant difference in average total spending between parents and non-parents, as well as among education groups.
In contrast, living situation (“alone” or “partnered”) has no real bearing on total spending.
Interactions
Furthermore, the extremely small p-value for interaction between is_parent and education allows us to reject the null hypothesis which states there is no significant interaction effect between parent status and education on total spending.
In simpler terms, total spending is dependent on education level and whether the customer in question is a parent.
Overall, the results suggest that education and is_parent are significant factors affecting total spending, and the interaction indicates that the effect of education on spending habits depends on whether the customer is a parent (and vice versa).
Tukey test findings
The Tukey test goes one layer deeper, detecting which pairwise comparisons are statistically significant within each independent variable and within each interaction.
Key concept I learned from this whole process:
So in a way, what you (talking to myself) are saying is that once I've deployed the k-means algorithm on my data, I can add the newly generated cluster designations as a variable to my dataset and begin grouping the dataframe by those cluster designations (however many 'k' I settled on). From this grouping, I can slice, manipulate, and transform the data by the available features, revealing insights into the characteristics of each group. Essentially, the cluster assignments retrieved from the k-means algorithm act as the "labels" to segment or group the dataset by whatever feature(s) I decide to explore, and then explore deeper patterns based on the groupings.
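For example, a minimal sketch of that grouping idea (the summary columns chosen here are just illustrative):
# Profile each k-means cluster with a few summary statistics
clustered_ogdata %>%
  group_by(cluster) %>%
  summarise(n            = n(),
            avg_income   = mean(income),
            avg_tspent   = mean(tspent),
            share_parent = mean(is_parent))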
In this final section, I will visualize the groupings using features as characteristics to highlight the clear distinctions between the clusters. The end product is a separate report visualizing the findings of the analysis. (Find it in the navigation sidebar!)