Since my first post, I’ve been reading notebooks shared by folks who ranked high in the challenge, and added two features that they used. In the end, these new predictors did not help (I must be doing something wrong). I also explored some other ML algorithms. Lastly, I tuned the parameters more efficiently with a clever grid-search algorithm. All in all, I slightly improved my score, but most importantly, I now have a clean template for further use.
I would like to familiarize myself with machine learning (ML) techniques in R, so I have been reading and learning by doing. I thought I’d share my experience for others who’d like to give it a try. All material is available on GitHub at https://github.com/oliviergimenez/learning-machine-learning.
The two great books I’m using are:
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Tidy Modeling with R by Max Kuhn and Julia Silge
I also recommend checking out the material (code, screencasts) shared by David Robinson and Julia Silge, from whom I picked some useful tricks that I put to use below.
To try things out, I joined the Kaggle online community, which gathers folks with lots of experience in ML from whom you can learn. Kaggle also hosts public datasets that can be used for playing around.
I use the tidymodels metapackage that contains a suite of packages for modeling and machine learning using tidyverse principles. Check out all possibilities here, and parsnip models in particular there.
Let’s start with the famous Titanic dataset. We need to predict whether a passenger survived the sinking of the Titanic (1) or not (0). A dataset is provided for training our models (train.csv). Another dataset is provided (test.csv) for which we do not know the answer. We will predict survival for each passenger, submit our answers to Kaggle and see how well we did compared to other folks. The metric for comparison is the percentage of passengers we correctly predict, aka accuracy.
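To fix ideas, accuracy is simply the proportion of correct predictions; a minimal sketch with made-up vectors:
truth <- c(1, 0, 1, 1, 0) # made-up true labels
pred <- c(1, 0, 0, 1, 0) # made-up predictions
mean(pred == truth) # 0.8, i.e. 80% accuracy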
First things first, let’s load some packages to get us started.
library(tidymodels) # metapackage for ML
library(tidyverse) # metapackage for data manipulation and visualisation
library(stacks) # stack ML models for better performance
theme_set(theme_light())
doParallel::registerDoParallel(cores = 4) # parallel computations
Read in training data.
rawdata <- read_csv("dat/titanic/train.csv")
glimpse(rawdata)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
naniar::miss_var_summary(rawdata)
After some data exploration (not shown), I decided to take care of missing values, combine the two family variables (SibSp and Parch) into a single variable, and create a title variable.
# Get most frequent port of embarkation
uniqx <- unique(na.omit(rawdata$Embarked))
mode_embarked <- as.character(fct_drop(uniqx[which.max(tabulate(match(rawdata$Embarked, uniqx)))]))
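This one-liner is a bit cryptic; an equivalent, more readable tidyverse sketch:
rawdata %>%
count(Embarked, sort = TRUE) %>% # tally ports, most frequent first
drop_na() %>%
slice(1) %>%
pull(Embarked)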
# Build function for data cleaning and handling NAs
process_data <- function(tbl){
tbl %>%
mutate(class = case_when(Pclass == 1 ~ "first",
Pclass == 2 ~ "second",
Pclass == 3 ~ "third"),
class = as_factor(class),
gender = factor(Sex),
fare = Fare,
age = Age,
ticket = Ticket,
alone = if_else(SibSp + Parch == 0, "yes", "no"), # alone variable
alone = as_factor(alone),
port = factor(Embarked), # rename embarked as port
title = str_extract(Name, "[A-Za-z]+\\."), # title variable
title = fct_lump(title, 4)) %>% # keep only most frequent levels of title
mutate(port = ifelse(is.na(port), mode_embarked, port), # deal w/ NAs in port (replace by mode)
port = as_factor(port)) %>%
group_by(title) %>%
mutate(median_age_title = median(age, na.rm = TRUE)) %>%
ungroup() %>%
mutate(age = if_else(is.na(age), median_age_title, age)) %>% # deal w/ NAs in age (replace by median in title)
mutate(ticketfreq = ave(1:nrow(.), FUN = length), # NB: no grouping variable, so this is the total number of rows
fareadjusted = fare / ticketfreq) %>% # rescale fare
mutate(familyage = SibSp + Parch + 1 + age/70) # family size combined with scaled age
}
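A caveat I noticed afterwards: ifelse() applied to a factor returns the underlying integer codes, which is why port shows up as 1/2/3 in the outputs below. A safer variant of the port handling would be something like this sketch (not what produced the outputs below):
process_data_safer <- function(tbl){
tbl %>%
mutate(port = factor(if_else(is.na(Embarked), mode_embarked, Embarked))) # if_else() keeps the labels
}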
# Process the data
dataset <- rawdata %>%
process_data() %>%
mutate(survived = as_factor(if_else(Survived == 1, "yes", "no"))) %>%
mutate(survived = relevel(survived, ref = "yes")) %>% # first event is survived = yes
select(survived, class, gender, age, alone, port, title, fareadjusted, familyage)
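The releveling matters because yardstick functions treat the first factor level as the event of interest by default. A quick check:
levels(dataset$survived) # "yes" comes first, so survived = yes is the event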
# Have a look again
glimpse(dataset)
## Rows: 891
## Columns: 9
## $ survived <fct> no, yes, yes, yes, no, no, no, no, yes, yes, yes, yes, no…
## $ class <fct> third, first, third, first, third, third, first, third, t…
## $ gender <fct> male, female, female, female, male, male, male, male, fem…
## $ age <dbl> 22, 38, 26, 35, 35, 30, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ alone <fct> no, no, yes, no, yes, yes, yes, no, no, no, no, yes, yes,…
## $ port <fct> 3, 1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, …
## $ title <fct> Mr., Mrs., Miss., Mrs., Mr., Mr., Mr., Master., Mrs., Mrs…
## $ fareadjusted <dbl> 0.008136925, 0.080003704, 0.008894501, 0.059595960, 0.009…
## $ familyage <dbl> 2.314286, 2.542857, 1.371429, 2.500000, 1.500000, 1.42857…
naniar::miss_var_summary(dataset)
Let’s apply the same treatment to the test dataset.
rawdata <- read_csv("dat/titanic/test.csv")
holdout <- rawdata %>%
process_data() %>%
select(PassengerId, class, gender, age, alone, port, title, fareadjusted, familyage)
glimpse(holdout)
## Rows: 418
## Columns: 9
## $ PassengerId <dbl> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 90…
## $ class <fct> third, third, second, third, third, third, third, second,…
## $ gender <fct> male, female, male, male, female, male, female, male, fem…
## $ age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.…
## $ alone <fct> yes, no, yes, yes, no, yes, yes, no, yes, no, yes, yes, n…
## $ port <fct> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1, 3, …
## $ title <fct> Mr., Mrs., Mr., Mr., Mrs., Mr., Miss., Mr., Mrs., Mr., Mr…
## $ fareadjusted <dbl> 0.018730144, 0.016746411, 0.023175837, 0.020723684, 0.029…
## $ familyage <dbl> 1.492857, 2.671429, 1.885714, 1.385714, 3.314286, 1.20000…
naniar::miss_var_summary(holdout)
skimr::skim(dataset)
| Name | dataset |
| Number of rows | 891 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| survived | 0 | 1 | FALSE | 2 | no: 549, yes: 342 |
| class | 0 | 1 | FALSE | 3 | thi: 491, fir: 216, sec: 184 |
| gender | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |
| alone | 0 | 1 | FALSE | 2 | yes: 537, no: 354 |
| port | 0 | 1 | FALSE | 4 | 3: 644, 1: 168, 2: 77, S: 2 |
| title | 0 | 1 | FALSE | 5 | Mr.: 517, Mis: 182, Mrs: 125, Mas: 40 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 29.39 | 13.26 | 0.42 | 21.00 | 30.00 | 35.00 | 80.00 | ▂▇▃▁▁ |
| fareadjusted | 0 | 1 | 0.04 | 0.06 | 0.00 | 0.01 | 0.02 | 0.03 | 0.58 | ▇▁▁▁▁ |
| familyage | 0 | 1 | 2.32 | 1.57 | 1.07 | 1.41 | 1.57 | 2.62 | 11.43 | ▇▁▁▁▁ |
Let’s explore the data.
dataset %>%
count(survived)
dataset %>%
group_by(gender) %>%
summarize(n = n(),
n_surv = sum(survived == "yes"),
pct_surv = n_surv / n)
dataset %>%
group_by(title) %>%
summarize(n = n(),
n_surv = sum(survived == "yes"),
pct_surv = n_surv / n) %>%
arrange(desc(pct_surv))
dataset %>%
group_by(class, gender) %>%
summarize(n = n(),
n_surv = sum(survived == "yes"),
pct_surv = n_surv / n) %>%
arrange(desc(pct_surv))
Some informative graphs.
dataset %>%
group_by(class, gender) %>%
summarize(n = n(),
n_surv = sum(survived == "yes"),
pct_surv = n_surv / n) %>%
mutate(class = fct_reorder(class, pct_surv)) %>%
ggplot(aes(pct_surv, class, fill = class, color = class)) +
geom_col(position = position_dodge()) +
scale_x_continuous(labels = percent) +
labs(x = "% in category that survived", fill = NULL, color = NULL, y = NULL) +
facet_wrap(~gender)
dataset %>%
mutate(age = cut(age, breaks = c(0, 20, 40, 60, 80))) %>%
group_by(age, gender) %>%
summarize(n = n(),
n_surv = sum(survived == "yes"),
pct_surv = n_surv / n) %>%
mutate(age = fct_reorder(age, pct_surv)) %>%
ggplot(aes(pct_surv, age, fill = age, color = age)) +
geom_col(position = position_dodge()) +
scale_x_continuous(labels = percent) +
labs(x = "% in category that survived", fill = NULL, color = NULL, y = NULL) +
facet_wrap(~gender)
dataset %>%
ggplot(aes(fareadjusted, group = survived, color = survived, fill = survived)) +
geom_histogram(alpha = .4, position = position_dodge()) +
labs(x = "fare", y = NULL, color = "survived?", fill = "survived?")dataset %>%
ggplot(aes(familyage, group = survived, color = survived, fill = survived)) +
geom_histogram(alpha = .4, position = position_dodge()) +
labs(x = "family aged", y = NULL, color = "survived?", fill = "survived?")Split our dataset in two, one dataset for training and the other one for testing. We will use an additionnal splitting step for cross-validation.
set.seed(2021)
spl <- initial_split(dataset, strata = "survived")
train <- training(spl)
test <- testing(spl)
train_5fold <- train %>%
vfold_cv(5)
Let’s start with gradient boosting methods, which are very popular in the ML community.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
save_pred = TRUE,
extract = extract_model) # grid for tuning
First a recipe.
xg_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a gradient boosting model.
xg_model <- boost_tree(mode = "classification", # binary response
trees = tune(),
mtry = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
min_n = tune()) # parameters to be tuned
Now set our workflow.
xg_wf <-
workflow() %>%
add_model(xg_model) %>%
add_recipe(xg_rec)
Use cross-validation to evaluate our model with different parameter configurations.
xg_tune <- xg_wf %>%
tune_grid(train_5fold,
metrics = mset,
control = control,
grid = crossing(trees = 1000,
mtry = c(3, 5, 8), # finalize(mtry(), train)
tree_depth = c(5, 10, 15),
learn_rate = c(0.01, 0.005),
loss_reduction = c(0.01, 0.1, 1),
min_n = c(2, 10, 25)))
Note that this full factorial grid contains 3 × 3 × 2 × 3 × 3 = 162 parameter combinations, each evaluated on the 5 folds.
Visualize the results.
autoplot(xg_tune) + theme_light()
Collect metrics.
xg_tune %>%
collect_metrics() %>%
arrange(desc(mean))
The tuning takes some time. There are other ways to explore the parameter space more efficiently. For example, we will use the function dials::grid_max_entropy() in the last section about ensemble modelling. Here, I will use finetune::tune_race_anova().
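For reference, here is a minimal sketch of what a maximum-entropy grid would look like for our boosting parameters (the ranges below are mine, purely illustrative):
grid_max_entropy(mtry(range = c(3L, 8L)), # space-filling design from dials
tree_depth(range = c(5L, 15L)),
learn_rate(range = c(-3, -1)), # on the log10 scale
size = 20) # 20 candidate points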
library(finetune)
xg_tune <-
xg_wf %>%
tune_race_anova(
train_5fold,
grid = 50,
param_info = xg_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(xg_tune)
Collect metrics.
xg_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
xg_fit <- xg_wf %>%
finalize_workflow(select_best(xg_tune)) %>%
fit(train)
## [20:15:43] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
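As an aside, this warning is harmless. If you want to silence it, I believe you can set the evaluation metric explicitly when specifying the engine, along these lines:
boost_tree(mode = "classification") %>%
set_engine("xgboost", eval_metric = "logloss") # engine argument forwarded to xgboost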
Check out the accuracy on the testing dataset to see whether we overfitted.
xg_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Check out important features (aka predictors).
importances <- xgboost::xgb.importance(model = extract_fit_engine(xg_fit))
importances %>%
mutate(Feature = fct_reorder(Feature, Gain)) %>%
ggplot(aes(Gain, Feature)) +
geom_col()
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle. Note that I use the whole dataset, not just the training dataset.
xg_wf %>%
finalize_workflow(select_best(xg_tune)) %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/xgboost.csv")## [20:15:45] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
I got an accuracy of 0.74162. Cool.
Let’s continue with random forest methods.
First a recipe.
rf_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a random forest model.
rf_model <- rand_forest(mode = "classification", # binary response
engine = "ranger", # by default
mtry = tune(),
trees = tune(),
min_n = tune()) # parameters to be tuned
Now set our workflow.
rf_wf <-
workflow() %>%
add_model(rf_model) %>%
add_recipe(rf_rec)
Use cross-validation to evaluate our model with different parameter configurations.
rf_tune <-
rf_wf %>%
tune_race_anova(
train_5fold,
grid = 50,
param_info = rf_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(rf_tune)
Collect metrics.
rf_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
rf_fit <- rf_wf %>%
finalize_workflow(select_best(rf_tune)) %>%
fit(train)
Check out the accuracy on the testing dataset to see whether we overfitted.
rf_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Check out important features (aka predictors).
library(vip)
finalize_model(
x = rf_model,
parameters = select_best(rf_tune)) %>%
set_engine("ranger", importance = "permutation") %>%
fit(survived ~ ., data = juice(prep(rf_rec))) %>%
vip(geom = "point")Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
rf_wf %>%
finalize_workflow(select_best(rf_tune)) %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/randomforest.csv")I got and accuracy of 0.77990, a bit better than gradient boosting.
Let’s continue with CatBoost.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
save_pred = TRUE,
extract = extract_model) # grid for tuning
First a recipe.
cb_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a CatBoost model.
library(treesnip)
cb_model <- boost_tree(mode = "classification",
engine = "catboost",
mtry = tune(),
trees = tune(),
min_n = tune(),
tree_depth = tune(),
learn_rate = tune()) # parameters to be tuned
Now set our workflow.
cb_wf <-
workflow() %>%
add_model(cb_model) %>%
add_recipe(cb_rec)
Use cross-validation to evaluate our model with different parameter configurations.
cb_tune <- cb_wf %>%
tune_race_anova(
train_5fold,
grid = 30,
param_info = cb_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(cb_tune)
Collect metrics.
cb_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
cb_fit <- cb_wf %>%
finalize_workflow(select_best(cb_tune)) %>%
fit(train)
Check out the accuracy on the testing dataset to see whether we overfitted.
cb_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
cb_wf %>%
finalize_workflow(select_best(cb_tune)) %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/catboost.csv")I got and accuracy of 0.76076. Cool.
Let’s continue with elastic net regularization.
First a recipe.
en_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_normalize(all_numeric_predictors()) %>% # normalize
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a regularization model. We tune the mixture parameter, with ridge regression for mixture = 0 and lasso for mixture = 1; in glmnet’s notation, the penalty is lambda * ((1 - mixture)/2 * sum(beta^2) + mixture * sum(abs(beta))).
en_model <- logistic_reg(penalty = tune(),
mixture = tune()) %>% # param to be tuned
set_engine("glmnet") %>% # elastic net
set_mode("classification") # binary responseNow set our workflow.
en_wf <-
workflow() %>%
add_model(en_model) %>%
add_recipe(en_rec)
Use cross-validation to evaluate our model with different parameter configurations.
en_tune <- en_wf %>%
tune_grid(train_5fold,
metrics = mset,
control = control,
grid = crossing(penalty = 10 ^ seq(-8, -.5, .5),
mixture = seq(0, 1, length.out = 10)))
Visualize the results.
autoplot(en_tune)
Collect metrics.
en_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
en_fit <- en_wf %>%
finalize_workflow(select_best(en_tune)) %>%
fit(train)
Check out the accuracy on the testing dataset to see whether we overfitted.
en_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Check out important features (aka predictors).
library(broom)
en_fit$fit$fit$fit %>% # dig out the underlying glmnet fit (extract_fit_engine(en_fit) also works)
tidy() %>%
filter(lambda >= select_best(en_tune)$penalty) %>%
filter(lambda == min(lambda),
term != "(Intercept)") %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term, fill = estimate > 0)) +
geom_col() +
theme(legend.position = "none")Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
en_wf %>%
finalize_workflow(select_best(en_tune)) %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/regularization.csv")I got and accuracy of 0.76315.
And what about a good old-fashioned logistic regression (not an ML algorithm)?
First a recipe.
logistic_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_normalize(all_numeric_predictors()) %>% # normalize
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a logistic regression.
logistic_model <- logistic_reg() %>% # no parameters to be tuned
set_engine("glm") %>% # plain glm, not elastic net this time
set_mode("classification") # binary response
Now set our workflow.
logistic_wf <-
workflow() %>%
add_model(logistic_model) %>%
add_recipe(logistic_rec)
Fit model.
logistic_fit <- logistic_wf %>%
fit(train)
Inspect significant features (aka predictors).
tidy(logistic_fit, exponentiate = TRUE) %>%
filter(p.value < 0.05)
Same thing, but graphically.
library(broom)
logistic_fit %>%
tidy() %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term, fill = estimate > 0)) +
geom_col() +
theme(legend.position = "none")Check out accuracy on testing dataset to see if we overfitted.
logistic_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Confusion matrix.
logistic_fit %>%
augment(test, type.predict = "response") %>%
conf_mat(survived, .pred_class)
## Truth
## Prediction yes no
## yes 59 13
## no 27 125
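As a sanity check, the accuracy can be recovered by hand from this matrix, since correct predictions sit on the diagonal:
(59 + 125) / (59 + 13 + 27 + 125) # 0.82, which should match the accuracy computed above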
ROC curve.
logistic_fit %>%
augment(test, type.predict = "response") %>%
roc_curve(truth = survived, estimate = .pred_yes) %>%
autoplot()
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
logistic_wf %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/logistic.csv")I got and accuracy of 0.76076. Oldies but goodies!
We go on with neural networks.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
save_pred = TRUE,
extract = extract_model) # grid for tuning
First a recipe.
nn_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a neural network.
nn_model <- mlp(epochs = tune(),
hidden_units = tune(),
dropout = tune()) %>% # param to be tuned
set_mode("classification") %>% # binary response var
set_engine("keras", verbose = 0)Now set our workflow.
nn_wf <-
workflow() %>%
add_model(nn_model) %>%
add_recipe(nn_rec)
Use cross-validation to evaluate our model with different parameter configurations.
nn_tune <- nn_wf %>%
tune_race_anova(
train_5fold,
grid = 30,
param_info = nn_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(nn_tune)
Collect metrics.
nn_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
nn_fit <- nn_wf %>%
finalize_workflow(select_best(nn_tune)) %>%
fit(train)
Check out the accuracy on the testing dataset to see whether we overfitted.
nn_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
nn_wf %>%
finalize_workflow(select_best(nn_tune)) %>%
fit(train) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/nn.csv")I got and accuracy of 0.78708. My best score so far.
We go on with support vector machines.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
save_pred = TRUE,
extract = extract_model) # grid for tuning
First a recipe.
svm_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
# remove any zero variance predictors
step_zv(all_predictors()) %>%
# remove any linear combinations
step_lincomb(all_numeric()) %>%
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify an SVM.
svm_model <- svm_rbf(cost = tune(),
rbf_sigma = tune()) %>% # param to be tuned
set_mode("classification") %>% # binary response var
set_engine("kernlab")Now set our workflow.
svm_wf <-
workflow() %>%
add_model(svm_model) %>%
add_recipe(svm_rec)
Use cross-validation to evaluate our model with different parameter configurations.
svm_tune <- svm_wf %>%
tune_race_anova(
train_5fold,
grid = 30,
param_info = svm_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(svm_tune)
Collect metrics.
svm_tune %>%
collect_metrics() %>%
arrange(desc(mean))
Use the best configuration to fit the model to the training data.
svm_fit <- svm_wf %>%
finalize_workflow(select_best(svm_tune)) %>%
fit(train)
Check out the accuracy on the testing dataset to see whether we overfitted.
svm_fit %>%
augment(test, type.predict = "response") %>%
accuracy(survived, .pred_class)
Now we’re ready to predict survival for the holdout dataset and submit to Kaggle.
svm_wf %>%
finalize_workflow(select_best(svm_tune)) %>%
fit(dataset) %>%
augment(holdout) %>%
select(PassengerId, Survived = .pred_class) %>%
mutate(Survived = if_else(Survived == "yes", 1, 0)) %>%
write_csv("output/titanic/svm.csv")I got and accuracy of 0.77511.
We go on with bagged decision trees.
Set up defaults.
mset <- metric_set(accuracy) # metric is accuracy
control <- control_grid(save_workflow = TRUE,
save_pred = TRUE,
extract = extract_model) # grid for tuning
First a recipe.
dt_rec <- recipe(survived ~ ., data = train) %>%
step_impute_median(all_numeric()) %>% # replace missing value by median
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) # all factor variables are turned into binary dummy variables (disjunctive coding)
Then specify a bagged decision tree model.
library(baguette)
dt_model <- bag_tree(cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()) %>% # param to be tuned
set_engine("rpart", times = 25) %>% # nb bootstraps
set_mode("classification") # binary response varNow set our workflow.
dt_wf <-
workflow() %>%
add_model(dt_model) %>%
add_recipe(dt_rec)
Use cross-validation to evaluate our model with different parameter configurations.
dt_tune <- dt_wf %>%
tune_race_anova(
train_5fold,
grid = 30,
param_info = dt_model %>% parameters(),
metrics = metric_set(accuracy),
control = control_race(verbose_elim = TRUE))
Visualize the results.
autoplot(dt_tune)