Since my first post, I’ve been reading notebooks shared by folks who ranked high in the challenge, and added two features that they used. In the end, these new predictors did not help (I must be doing something wrong). I also explored some other ML algorithms. Lastly, I tuned the parameters more efficiently with a clever grid-search algorithm. All in all, I slightly improved my score, but most importantly, I now have a clean template for further use.
I would like to familiarize myself with machine learning (ML) techniques in R. So I have been reading and learning by doing. I thought I’d share my experience for others who’d like to give it a try. All material is available on GitHub at https://github.com/oliviergimenez/learning-machine-learning.
The two great books I’m using are:
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Tidy Modeling with R by Max Kuhn and Julia Silge
I also recommend checking out the material (code, screencasts) shared by David Robinson and Julia Silge, from whom I picked up some useful tricks that I put to use below.
To try things, I’ve joined the Kaggle online community which gathers folks with lots of experience in ML from whom you can learn. Kaggle also hosts public datasets that can be used for playing around.
I use the tidymodels
metapackage that contains a suite of packages for modeling and machine learning using tidyverse
principles. Check out all possibilities here, and parsnip models in particular there.
Let’s start with the famous Titanic dataset. We need to predict whether a passenger survived the sinking of the Titanic (1) or not (0). A dataset is provided for training our models (train.csv). Another dataset is provided (test.csv) for which we do not know the answer. We will predict survival for each passenger, submit our answer to Kaggle and see how well we did compared to other folks. The metric for comparison is the percentage of passengers we correctly predict – aka accuracy.
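Just to be concrete, accuracy is simply the proportion of correct predictions. A minimal sketch with made-up vectors (my toy example, not challenge data) using yardstick, which tidymodels loads:

```r
library(yardstick)
# toy truth and predictions: 3 out of 4 match, so accuracy = 0.75
truth <- factor(c("yes", "no", "no", "yes"), levels = c("yes", "no"))
pred  <- factor(c("yes", "no", "yes", "yes"), levels = c("yes", "no"))
accuracy_vec(truth, pred)
```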
First things first, let’s load some packages to get us started.
library(tidymodels) # metapackage for ML
library(tidyverse) # metapackage for data manipulation and visualisation
library(stacks) # stack ML models for better performance
theme_set(theme_light())
doParallel::registerDoParallel(cores = 4) # parallel computations
Read in training data.
rawdata <- read_csv("dat/titanic/train.csv")
glimpse(rawdata)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
naniar::miss_var_summary(rawdata)
After some data exploration (not shown), I decided to take care of missing values, combine the two family variables into a single variable, and create a title variable.
# Get most frequent port of embarkation
uniqx <- unique(na.omit(rawdata$Embarked))
mode_embarked <- as.character(fct_drop(uniqx[which.max(tabulate(match(rawdata$Embarked, uniqx)))]))

# Build function for data cleaning and handling NAs
process_data <- function(tbl){

  tbl %>%
    mutate(class = case_when(Pclass == 1 ~ "first",
                             Pclass == 2 ~ "second",
                             Pclass == 3 ~ "third"),
           class = as_factor(class),
           gender = factor(Sex),
           fare = Fare,
           age = Age,
           ticket = Ticket,
           alone = if_else(SibSp + Parch == 0, "yes", "no"), # alone variable
           alone = as_factor(alone),
           port = factor(Embarked), # rename embarked as port
           title = str_extract(Name, "[A-Za-z]+\\."), # title variable
           title = fct_lump(title, 4)) %>% # keep only most frequent levels of title
    mutate(port = ifelse(is.na(port), mode_embarked, port), # deal w/ NAs in port (replace by mode); note that ifelse() coerces the factor to its integer codes, which is why port shows up as 1/2/3 below
           port = as_factor(port)) %>%
    group_by(title) %>%
    mutate(median_age_title = median(age, na.rm = TRUE)) %>%
    ungroup() %>%
    mutate(age = if_else(is.na(age), median_age_title, age)) %>% # deal w/ NAs in age (replace by median within title)
    mutate(ticketfreq = ave(1:nrow(.), FUN = length), # note: without a grouping variable, ave() just returns the total number of rows; grouping by ticket may have been the intent
           fareadjusted = fare / ticketfreq) %>%
    mutate(familyage = SibSp + Parch + 1 + age/70)
}
# Process the data
dataset <- rawdata %>%
  process_data() %>%
  mutate(survived = as_factor(if_else(Survived == 1, "yes", "no"))) %>%
  mutate(survived = relevel(survived, ref = "yes")) %>% # first event is survived = yes
  select(survived, class, gender, age, alone, port, title, fareadjusted, familyage)

# Have a look again
glimpse(dataset)
## Rows: 891
## Columns: 9
## $ survived <fct> no, yes, yes, yes, no, no, no, no, yes, yes, yes, yes, no…
## $ class <fct> third, first, third, first, third, third, first, third, t…
## $ gender <fct> male, female, female, female, male, male, male, male, fem…
## $ age <dbl> 22, 38, 26, 35, 35, 30, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ alone <fct> no, no, yes, no, yes, yes, yes, no, no, no, no, yes, yes,…
## $ port <fct> 3, 1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, …
## $ title <fct> Mr., Mrs., Miss., Mrs., Mr., Mr., Mr., Master., Mrs., Mrs…
## $ fareadjusted <dbl> 0.008136925, 0.080003704, 0.008894501, 0.059595960, 0.009…
## $ familyage <dbl> 2.314286, 2.542857, 1.371429, 2.500000, 1.500000, 1.42857…
naniar::miss_var_summary(dataset)
Let’s apply the same treatment to the test dataset.
rawdata <- read_csv("dat/titanic/test.csv")
holdout <- rawdata %>%
  process_data() %>%
  select(PassengerId, class, gender, age, alone, port, title, fareadjusted, familyage)
glimpse(holdout)
## Rows: 418
## Columns: 9
## $ PassengerId <dbl> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 90…
## $ class <fct> third, third, second, third, third, third, third, second,…
## $ gender <fct> male, female, male, male, female, male, female, male, fem…
## $ age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.…
## $ alone <fct> yes, no, yes, yes, no, yes, yes, no, yes, no, yes, yes, n…
## $ port <fct> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1, 3, …
## $ title <fct> Mr., Mrs., Mr., Mr., Mrs., Mr., Miss., Mr., Mrs., Mr., Mr…
## $ fareadjusted <dbl> 0.018730144, 0.016746411, 0.023175837, 0.020723684, 0.029…
## $ familyage <dbl> 1.492857, 2.671429, 1.885714, 1.385714, 3.314286, 1.20000…
naniar::miss_var_summary(holdout)
skimr::skim(dataset)
Name | dataset |
Number of rows | 891 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
factor | 6 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
survived | 0 | 1 | FALSE | 2 | no: 549, yes: 342 |
class | 0 | 1 | FALSE | 3 | thi: 491, fir: 216, sec: 184 |
gender | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |
alone | 0 | 1 | FALSE | 2 | yes: 537, no: 354 |
port | 0 | 1 | FALSE | 4 | 3: 644, 1: 168, 2: 77, S: 2 |
title | 0 | 1 | FALSE | 5 | Mr.: 517, Mis: 182, Mrs: 125, Mas: 40 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 29.39 | 13.26 | 0.42 | 21.00 | 30.00 | 35.00 | 80.00 | ▂▇▃▁▁ |
fareadjusted | 0 | 1 | 0.04 | 0.06 | 0.00 | 0.01 | 0.02 | 0.03 | 0.58 | ▇▁▁▁▁ |
familyage | 0 | 1 | 2.32 | 1.57 | 1.07 | 1.41 | 1.57 | 2.62 | 11.43 | ▇▁▁▁▁ |
Let’s explore the data.
dataset %>%
  count(survived)

dataset %>%
  group_by(gender) %>%
  summarize(n = n(),
            n_surv = sum(survived == "yes"),
            pct_surv = n_surv / n)

dataset %>%
  group_by(title) %>%
  summarize(n = n(),
            n_surv = sum(survived == "yes"),
            pct_surv = n_surv / n) %>%
  arrange(desc(pct_surv))

dataset %>%
  group_by(class, gender) %>%
  summarize(n = n(),
            n_surv = sum(survived == "yes"),
            pct_surv = n_surv / n) %>%
  arrange(desc(pct_surv))
Some informative graphs.
dataset %>%
  group_by(class, gender) %>%
  summarize(n = n(),
            n_surv = sum(survived == "yes"),
            pct_surv = n_surv / n) %>%
  mutate(class = fct_reorder(class, pct_surv)) %>%
  ggplot(aes(pct_surv, class, fill = class, color = class)) +
  geom_col(position = position_dodge()) +
  scale_x_continuous(labels = percent) +
  labs(x = "% in category that survived", fill = NULL, color = NULL, y = NULL) +
  facet_wrap(~gender)

dataset %>%
  mutate(age = cut(age, breaks = c(0, 20, 40, 60, 80))) %>%
  group_by(age, gender) %>%
  summarize(n = n(),
            n_surv = sum(survived == "yes"),
            pct_surv = n_surv / n) %>%
  mutate(age = fct_reorder(age, pct_surv)) %>%
  ggplot(aes(pct_surv, age, fill = age, color = age)) +
  geom_col(position = position_dodge()) +
  scale_x_continuous(labels = percent) +
  labs(x = "% in category that survived", fill = NULL, color = NULL, y = NULL) +
  facet_wrap(~gender)

dataset %>%
  ggplot(aes(fareadjusted, group = survived, color = survived, fill = survived)) +
  geom_histogram(alpha = .4, position = position_dodge()) +
  labs(x = "fare", y = NULL, color = "survived?", fill = "survived?")

dataset %>%
  ggplot(aes(familyage, group = survived, color = survived, fill = survived)) +
  geom_histogram(alpha = .4, position = position_dodge()) +
  labs(x = "family age", y = NULL, color = "survived?", fill = "survived?")
Split our dataset in two: one dataset for training and the other one for testing. We will use an additional splitting step for cross-validation.
set.seed(2021)
spl <- initial_split(dataset, strata = "survived")
train <- training(spl)
test <- testing(spl)

train_5fold <- train %>%
  vfold_cv(5)
Let’s start with gradient boosting methods which are very popular in the ML community.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model) # grid for tuning
First a recipe.
xg_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a gradient boosting model.
xg_model <- boost_tree(mode = "classification", # binary response
                       trees = tune(),
                       mtry = tune(),
                       tree_depth = tune(),
                       learn_rate = tune(),
                       loss_reduction = tune(),
                       min_n = tune()) # parameters to be tuned
Now set our workflow.
xg_wf <-
  workflow() %>%
  add_model(xg_model) %>%
  add_recipe(xg_rec)
Use cross-validation to evaluate our model with different param config.
xg_tune <- xg_wf %>%
  tune_grid(train_5fold,
            metrics = mset,
            control = control,
            grid = crossing(trees = 1000,
                            mtry = c(3, 5, 8), # finalize(mtry(), train)
                            tree_depth = c(5, 10, 15),
                            learn_rate = c(0.01, 0.005),
                            loss_reduction = c(0.01, 0.1, 1),
                            min_n = c(2, 10, 25)))
Visualize the results.
autoplot(xg_tune) + theme_light()
Collect metrics.
xg_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
The tuning takes some time. There are more efficient ways to explore the parameter space. For example, we will use dials::grid_latin_hypercube() in the last section on ensemble modelling. Here, I will use finetune::tune_race_anova().
library(finetune)
xg_tune <-
  xg_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 50,
    param_info = xg_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(xg_tune)
Collect metrics.
xg_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
xg_fit <- xg_wf %>%
  finalize_workflow(select_best(xg_tune)) %>%
  fit(train)
## [20:15:43] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
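This warning is harmless; it only tells us that XGBoost changed its default evaluation metric. If you want to silence it, one option (a sketch, assuming the xgboost engine passes eval_metric on to xgb.train, which is how parsnip engine arguments usually work) is to set the metric explicitly when specifying the model:

```r
# hypothetical variant of the model spec with eval_metric set explicitly
xg_model_quiet <- boost_tree(mode = "classification",
                             trees = tune(),
                             mtry = tune(),
                             tree_depth = tune(),
                             learn_rate = tune(),
                             loss_reduction = tune(),
                             min_n = tune()) %>%
  set_engine("xgboost", eval_metric = "logloss")
```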
Check out accuracy on testing dataset to see if we overfitted.
xg_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Check out important features (aka predictors).
importances <- xgboost::xgb.importance(model = extract_fit_engine(xg_fit))
importances %>%
  mutate(Feature = fct_reorder(Feature, Gain)) %>%
  ggplot(aes(Gain, Feature)) +
  geom_col()
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle. Note that I use the whole dataset, not just the training dataset.
xg_wf %>%
  finalize_workflow(select_best(xg_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/xgboost.csv")
## [20:15:45] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
I got an accuracy of 0.74162. Cool. Let’s train a random forest model now.
Let’s continue with random forest methods.
First a recipe.
rf_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a random forest model.
rf_model <- rand_forest(mode = "classification", # binary response
                        engine = "ranger", # by default
                        mtry = tune(),
                        trees = tune(),
                        min_n = tune()) # parameters to be tuned
Now set our workflow.
rf_wf <-
  workflow() %>%
  add_model(rf_model) %>%
  add_recipe(rf_rec)
Use cross-validation to evaluate our model with different param config.
rf_tune <-
  rf_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 50,
    param_info = rf_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(rf_tune)
Collect metrics.
rf_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
rf_fit <- rf_wf %>%
  finalize_workflow(select_best(rf_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
rf_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Check out important features (aka predictors).
library(vip)
finalize_model(
x = rf_model,
parameters = select_best(rf_tune)) %>%
set_engine("ranger", importance = "permutation") %>%
fit(survived ~ ., data = juice(prep(rf_rec))) %>%
vip(geom = "point")
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
rf_wf %>%
  finalize_workflow(select_best(rf_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/randomforest.csv")
I got an accuracy of 0.77990, a bit better than gradient boosting.
Let’s continue with CatBoost.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model) # grid for tuning
First a recipe.
cb_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a CatBoost model.
library(treesnip)
cb_model <- boost_tree(mode = "classification",
                       engine = "catboost",
                       mtry = tune(),
                       trees = tune(),
                       min_n = tune(),
                       tree_depth = tune(),
                       learn_rate = tune()) # parameters to be tuned
Now set our workflow.
cb_wf <-
  workflow() %>%
  add_model(cb_model) %>%
  add_recipe(cb_rec)
Use cross-validation to evaluate our model with different param config.
cb_tune <- cb_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 30,
    param_info = cb_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(cb_tune)
Collect metrics.
cb_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
cb_fit <- cb_wf %>%
  finalize_workflow(select_best(cb_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
cb_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
cb_wf %>%
  finalize_workflow(select_best(cb_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/catboost.csv")
I got an accuracy of 0.76076. Cool.
Let’s continue with elastic net regularization.
First a recipe.
en_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_normalize(all_numeric_predictors()) %>% # normalize
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a regularization model. We tune the mixture parameter, with ridge regression for mixture = 0 and lasso for mixture = 1.
en_model <- logistic_reg(penalty = tune(),
                         mixture = tune()) %>% # param to be tuned
  set_engine("glmnet") %>% # elastic net
  set_mode("classification") # binary response
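As a side note on what these two parameters mean: the elastic net penalty used by glmnet blends the ridge and lasso penalties. A toy computation with made-up numbers (my sketch, not part of the workflow):

```r
# elastic net penalty: penalty * ((1 - mixture)/2 * sum(beta^2) + mixture * sum(abs(beta)))
beta    <- c(0.5, -1.2, 0.3) # hypothetical coefficients
penalty <- 0.1               # overall amount of regularization (lambda)
mixture <- 0.5               # 0 = pure ridge, 1 = pure lasso
penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
```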
Now set our workflow.
en_wf <-
  workflow() %>%
  add_model(en_model) %>%
  add_recipe(en_rec)
Use cross-validation to evaluate our model with different param config.
en_tune <- en_wf %>%
  tune_grid(train_5fold,
            metrics = mset,
            control = control,
            grid = crossing(penalty = 10 ^ seq(-8, -.5, .5),
                            mixture = seq(0, 1, length.out = 10)))
Visualize the results.
autoplot(en_tune)
Collect metrics.
en_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
en_fit <- en_wf %>%
  finalize_workflow(select_best(en_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
en_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Check out important features (aka predictors).
library(broom)
en_fit$fit$fit$fit %>%
  tidy() %>%
  filter(lambda >= select_best(en_tune)$penalty) %>%
  filter(lambda == min(lambda),
         term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term, fill = estimate > 0)) +
  geom_col() +
  theme(legend.position = "none")
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
en_wf %>%
  finalize_workflow(select_best(en_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/regularization.csv")
I got an accuracy of 0.76315.
And what about a good old-fashioned logistic regression (not an ML algorithm)?
First a recipe.
logistic_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_normalize(all_numeric_predictors()) %>% # normalize
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a logistic regression.
logistic_model <- logistic_reg() %>% # no param to be tuned
  set_engine("glm") %>% # standard logistic regression
  set_mode("classification") # binary response
Now set our workflow.
logistic_wf <-
  workflow() %>%
  add_model(logistic_model) %>%
  add_recipe(logistic_rec)
Fit model.
logistic_fit <- logistic_wf %>%
  fit(train)
Inspect significant features (aka predictors).
tidy(logistic_fit, exponentiate = TRUE) %>%
filter(p.value < 0.05)
Same thing, but graphically.
library(broom)
logistic_fit %>%
  tidy() %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term, fill = estimate > 0)) +
  geom_col() +
  theme(legend.position = "none")
Check out accuracy on testing dataset to see if we overfitted.
logistic_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Confusion matrix.
logistic_fit %>%
  augment(test, type.predict = "response") %>%
  conf_mat(survived, .pred_class)
## Truth
## Prediction yes no
## yes 59 13
## no 27 125
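If you want more than the raw counts, yardstick can derive a whole panel of metrics (sensitivity, specificity, precision, etc.) from the confusion matrix. A short sketch (my addition):

```r
# summary() on a conf_mat object returns a tibble of classification metrics
logistic_fit %>%
  augment(test, type.predict = "response") %>%
  conf_mat(survived, .pred_class) %>%
  summary()
```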
ROC curve.
logistic_fit %>%
  augment(test, type.predict = "response") %>%
  roc_curve(truth = survived, estimate = .pred_yes) %>%
  autoplot()
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
logistic_wf %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/logistic.csv")
I got an accuracy of 0.76076. Oldies but goodies!
We go on with neural networks.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model) # grid for tuning
First a recipe.
nn_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a neural network.
nn_model <- mlp(epochs = tune(),
                hidden_units = tune(),
                dropout = tune()) %>% # param to be tuned
  set_mode("classification") %>% # binary response var
  set_engine("keras", verbose = 0)
Now set our workflow.
nn_wf <-
  workflow() %>%
  add_model(nn_model) %>%
  add_recipe(nn_rec)
Use cross-validation to evaluate our model with different param config.
nn_tune <- nn_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 30,
    param_info = nn_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(nn_tune)
Collect metrics.
nn_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
nn_fit <- nn_wf %>%
  finalize_workflow(select_best(nn_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
nn_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
nn_wf %>%
  finalize_workflow(select_best(nn_tune)) %>%
  fit(train) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/nn.csv")
I got an accuracy of 0.78708. My best score so far.
We go on with support vector machines.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model) # grid for tuning
First a recipe.
svm_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  # remove any zero variance predictors
  step_zv(all_predictors()) %>%
  # remove any linear combinations
  step_lincomb(all_numeric()) %>%
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify an SVM.
svm_model <- svm_rbf(cost = tune(),
                     rbf_sigma = tune()) %>% # param to be tuned
  set_mode("classification") %>% # binary response var
  set_engine("kernlab")
Now set our workflow.
svm_wf <-
  workflow() %>%
  add_model(svm_model) %>%
  add_recipe(svm_rec)
Use cross-validation to evaluate our model with different param config.
svm_tune <- svm_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 30,
    param_info = svm_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(svm_tune)
Collect metrics.
svm_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
svm_fit <- svm_wf %>%
  finalize_workflow(select_best(svm_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
svm_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
svm_wf %>%
  finalize_workflow(select_best(svm_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/svm.csv")
I got an accuracy of 0.77511.
We go on with decision trees.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model) # grid for tuning
First a recipe.
dt_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)
Then specify a bagged decision tree model.
library(baguette)
dt_model <- bag_tree(cost_complexity = tune(),
                     tree_depth = tune(),
                     min_n = tune()) %>% # param to be tuned
  set_engine("rpart", times = 25) %>% # nb bootstraps
  set_mode("classification") # binary response var
Now set our workflow.
dt_wf <-
  workflow() %>%
  add_model(dt_model) %>%
  add_recipe(dt_rec)
Use cross-validation to evaluate our model with different param config.
dt_tune <- dt_wf %>%
  tune_race_anova(
    train_5fold,
    grid = 30,
    param_info = dt_model %>% parameters(),
    metrics = metric_set(accuracy),
    control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(dt_tune)
Collect metrics.
dt_tune %>%
  collect_metrics() %>%
  arrange(desc(mean))
Use best config to fit model to training data.
dt_fit <- dt_wf %>%
  finalize_workflow(select_best(dt_tune)) %>%
  fit(train)
Check out accuracy on testing dataset to see if we overfitted.
dt_fit %>%
  augment(test, type.predict = "response") %>%
  accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
dt_wf %>%
  finalize_workflow(select_best(dt_tune)) %>%
  fit(dataset) %>%
  augment(holdout) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/dt.csv")
I got an accuracy of 0.76794.
Let’s do some ensemble modelling with all algorithms except logistic regression and CatBoost. We tune again, this time with a probability-based metric (roc_auc). Start with XGBoost.
library(finetune)
library(stacks)
# xgboost
xg_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_dummy(all_nominal_predictors()) # all factors var are split into binary terms (factor disj coding)

xg_model <- boost_tree(mode = "classification", # binary response
                       trees = tune(),
                       mtry = tune(),
                       tree_depth = tune(),
                       learn_rate = tune(),
                       loss_reduction = tune(),
                       min_n = tune()) # parameters to be tuned

xg_wf <-
  workflow() %>%
  add_model(xg_model) %>%
  add_recipe(xg_rec)

xg_grid <- grid_latin_hypercube(
  trees(),
  finalize(mtry(), train),
  tree_depth(),
  learn_rate(),
  loss_reduction(),
  min_n(),
  size = 30)

xg_tune <- xg_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = xg_grid,
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
Then random forests.
# random forest
rf_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

rf_model <- rand_forest(mode = "classification", # binary response
                        engine = "ranger", # by default
                        mtry = tune(),
                        trees = tune(),
                        min_n = tune()) # parameters to be tuned

rf_wf <-
  workflow() %>%
  add_model(rf_model) %>%
  add_recipe(rf_rec)

rf_grid <- grid_latin_hypercube(
  finalize(mtry(), train),
  trees(),
  min_n(),
  size = 30)

rf_tune <- rf_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = rf_grid,
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
Regularisation methods (between ridge and lasso).
# regularization methods
en_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_normalize(all_numeric_predictors()) %>% # normalize
  step_dummy(all_nominal_predictors())

en_model <- logistic_reg(penalty = tune(),
                         mixture = tune()) %>% # param to be tuned
  set_engine("glmnet") %>% # elastic net
  set_mode("classification") # binary response

en_wf <-
  workflow() %>%
  add_model(en_model) %>%
  add_recipe(en_rec)

en_grid <- grid_latin_hypercube(
  penalty(),
  mixture(),
  size = 30)

en_tune <- en_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = en_grid,
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
Neural networks (this takes time, so I pick only a few values for illustration purposes).
# neural networks
nn_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>% # replace missing value by median
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

nn_model <- mlp(epochs = tune(),
                hidden_units = 2,
                dropout = tune()) %>% # param to be tuned
  set_mode("classification") %>% # binary response var
  set_engine("keras", verbose = 0)

nn_wf <-
  workflow() %>%
  add_model(nn_model) %>%
  add_recipe(nn_rec)

# nn_grid <- grid_latin_hypercube(
#   epochs(),
#   hidden_units(),
#   dropout(),
#   size = 10)

nn_tune <- nn_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = crossing(dropout = c(0.1, 0.2), epochs = c(250, 500, 1000)), # nn_grid
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
#autoplot(nn_tune)
Support vector machines.
# support vector machines
svm_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

svm_model <- svm_rbf(cost = tune(),
                     rbf_sigma = tune()) %>% # param to be tuned
  set_mode("classification") %>% # binary response var
  set_engine("kernlab")

svm_wf <-
  workflow() %>%
  add_model(svm_model) %>%
  add_recipe(svm_rec)

svm_grid <- grid_latin_hypercube(
  cost(),
  rbf_sigma(),
  size = 30)

svm_tune <- svm_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = svm_grid,
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
Last, decision trees.
# decision trees
dt_rec <- recipe(survived ~ ., data = train) %>%
  step_impute_median(all_numeric()) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

library(baguette)
dt_model <- bag_tree(cost_complexity = tune(),
                     tree_depth = tune(),
                     min_n = tune()) %>% # param to be tuned
  set_engine("rpart", times = 25) %>% # nb bootstraps
  set_mode("classification") # binary response var

dt_wf <-
  workflow() %>%
  add_model(dt_model) %>%
  add_recipe(dt_rec)

dt_grid <- grid_latin_hypercube(
  cost_complexity(),
  tree_depth(),
  min_n(),
  size = 30)

dt_tune <- dt_wf %>%
  tune_grid(
    resamples = train_5fold,
    grid = dt_grid,
    metrics = metric_set(roc_auc),
    control = control_stack_grid())
Get the best configuration for each model.
xg_best <- xg_tune %>% filter_parameters(parameters = select_best(xg_tune))
rf_best <- rf_tune %>% filter_parameters(parameters = select_best(rf_tune))
en_best <- en_tune %>% filter_parameters(parameters = select_best(en_tune))
nn_best <- nn_tune %>% filter_parameters(parameters = select_best(nn_tune))
svm_best <- svm_tune %>% filter_parameters(parameters = select_best(svm_tune))
dt_best <- dt_tune %>% filter_parameters(parameters = select_best(dt_tune))
Do the stacked ensemble modelling.
Pile all models together.
blended <- stacks() %>% # initialize
  add_candidates(en_best) %>% # add regularization model
  add_candidates(xg_best) %>% # add gradient boosting
  add_candidates(rf_best) %>% # add random forest
  add_candidates(nn_best) %>% # add neural network
  add_candidates(svm_best) %>% # add svm
  add_candidates(dt_best) # add decision trees
blended
Fit regularized model.
blended_fit <- blended %>%
  blend_predictions() # fit regularized model
Visualise the penalized model. Note that the neural networks are dropped, despite achieving the best score when used in isolation. I’ll have to dig into that.
autoplot(blended_fit)
autoplot(blended_fit, type = "members")
autoplot(blended_fit, type = "weights")
Fit the candidate members with non-zero stacking coefficients on the full training dataset.
blended_regularized <- blended_fit %>%
  fit_members()
blended_regularized
## # A tibble: 3 × 3
## member type weight
## <chr> <chr> <dbl>
## 1 .pred_no_en_best_1_12 logistic_reg 2.67
## 2 .pred_no_dt_best_1_11 bag_tree 1.95
## 3 .pred_no_rf_best_1_05 rand_forest 0.922
Performance on the testing dataset?
test %>%
  bind_cols(predict(blended_regularized, .)) %>%
  accuracy(survived, .pred_class)
Now predict.
holdout %>%
  bind_cols(predict(blended_regularized, .)) %>%
  select(PassengerId, Survived = .pred_class) %>%
  mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
  write_csv("output/titanic/stacked.csv")
I got an accuracy of 0.76076.
I covered several ML algorithms and logistic regression with the awesome tidymodels metapackage in R. My scores at predicting Titanic survivors were OK, I guess. Some folks on Kaggle got a perfect accuracy, so there is always room for improvement. Maybe better tuning, better features (or predictors) or other algorithms would increase accuracy. Of course, I forgot to use set.seed() in places, so results are not exactly reproducible.