Binary image classification using Keras in R: Using CT scans to predict patients with Covid

Here I illustrate how to train a convolutional neural network with Keras in R to perform binary image classification, using patients’ CT scans to tell apart those with Covid from those without.

Motivation

Michael Blum tweeted about the STOIC2021 - COVID-19 AI challenge. The main goal of this challenge is to predict from the patients’ CT scans who will develop severe illness from Covid.

Given my recent interest in machine learning, this challenge piqued my interest. Although Python is the machine learning lingua franca, it is possible to train a convolutional neural network (CNN) in R and perform (binary) image classification.

Here, I will use an R interface to Keras that allows training neural networks. Note that the dataset shared for the challenge is big, like 280 GB big, and it took me a day to download it. For the sake of illustration, I will use a similar but much lighter dataset from a Kaggle repository: https://www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset.
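If you have never used Keras from R, a minimal one-time setup looks like this (a sketch, assuming you let install_keras() create a default Python/TensorFlow environment for you):

install.packages("keras") # R interface to Keras
library(keras)
install_keras() # installs TensorFlow and its Python dependencies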

The code is available on GitHub as usual https://github.com/oliviergimenez/bin-image-classif.

First things first, load the packages we will need.

library(tidyverse)
theme_set(theme_light())
library(keras)

Read in and process data

We will need a function to process images. I’m stealing the one written by Spencer Palladino.

process_pix <- function(lsf) {
  img <- lapply(lsf, image_load, grayscale = TRUE) # load images in grayscale
  arr <- lapply(img, image_to_array) # turn each image into an array
  arr_resized <- lapply(arr, image_array_resize, 
                        height = 100, 
                        width = 100) # resize to 100 x 100 pixels
  arr_normalized <- normalize(arr_resized, axis = 1) # normalize pixel values
  return(arr_normalized)
}

Now let’s process the images for patients with Covid, and do some reshaping. We then do the same with the images for patients without Covid.

# with covid
lsf <- list.files("dat/COVID/", full.names = TRUE) 
covid <- process_pix(lsf)
covid <- covid[,,,1] # get rid of last dim
covid_reshaped <- array_reshape(covid, c(nrow(covid), 100*100))
# without covid
lsf <- list.files("dat/non-COVID/", full.names = TRUE) 
ncovid <- process_pix(lsf)
ncovid <- ncovid[,,,1] # get rid of last dim
ncovid_reshaped <- array_reshape(ncovid, c(nrow(ncovid), 100*100))

We have 1252 CT scans of patients with Covid, and 1229 without.
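You can double-check these numbers from the reshaped matrices, where each row is one scan and each column one pixel:

dim(covid_reshaped) # 1252 x 10000
dim(ncovid_reshaped) # 1229 x 10000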

Let’s visualise these scans, picking one patient with Covid and another one without.

scancovid <- reshape2::melt(covid[10,,])
plotcovid <- scancovid %>%
  ggplot() +
  aes(x = Var1, y = Var2, fill = value) + 
  geom_raster() +
  labs(x = NULL, y = NULL, title = "CT scan of a patient with covid") + 
  scale_fill_viridis_c() + 
  theme(legend.position = "none")

scanncovid <- reshape2::melt(ncovid[10,,])
plotncovid <- scanncovid %>%
  ggplot() +
  aes(x = Var1, y = Var2, fill = value) + 
  geom_raster() +
  labs(x = NULL, y = NULL, title = "CT scan of a patient without covid") + 
  scale_fill_viridis_c() + 
  theme(legend.position = "none")

library(patchwork)
plotcovid + plotncovid

Put everything together and shuffle.

df <- rbind(cbind(covid_reshaped, 1), # 1 = covid
            cbind(ncovid_reshaped, 0)) # 0 = no covid
set.seed(1234)
shuffle <- sample(nrow(df), replace = FALSE)
df <- df[shuffle, ]
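As a sanity check, the label column (column 10001, the one we appended) should still contain both classes in roughly equal numbers after shuffling:

table(df[, 10001]) # counts of non-covid (0) and covid (1) scans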

Sounds great. We have everything we need to start training a convolutional neural network model.

Convolutional neural network (CNN)

Let’s build our training and testing datasets using an 80/20 split.

set.seed(2022)
split <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[split == 1, 1:10000] # pixels only, leave out the label column
test <- df[split == 2, 1:10000]
train_target <- df[split == 1, 10001] # label in training dataset
test_target <- df[split == 2, 10001] # label in testing dataset

Now build our model. I use three dense layers (layer_dense() function) that I put one after the other with piping, plus a final two-unit softmax layer for the two classes. I also use dropout regularization (layer_dropout() function) to avoid overfitting.

model <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu") %>% 
  layer_dropout(0.4) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(0.2) %>%
  layer_dense(units = 2, activation = 'softmax')
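Since we did not specify an input_shape in the first layer, the model is only built the first time it sees data; once it has been fitted, you can inspect it:

summary(model) # layer output shapes and parameter counts (run after fitting)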

Compile the model with choices suited to binary classification.

model %>%
  compile(optimizer = 'adam',
          loss = 'binary_crossentropy', 
          metrics = c('accuracy'))

We use one-hot encoding (to_categorical() function), aka dummy coding in statistics, to turn the 0/1 labels into two indicator columns.

train_label <- to_categorical(train_target)
test_label <- to_categorical(test_target)
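To see what to_categorical() does, here is a toy example on a short vector of labels:

to_categorical(c(0, 1, 1, 0))
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## [3,]    0    1
## [4,]    1    0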

Now let’s fit our model to the training dataset.

fit_covid <- model %>%
  fit(x = train,
      y = train_label, 
      epochs = 25,
      batch_size = 512, # try also 128 and 256
      verbose = 2,
      validation_split = 0.2)

A quick visualization of the performance shows that the algorithm is not doing too badly: there is no obvious over- or under-fitting, and accuracy and loss are fine.

plot(fit_covid)
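If over-fitting did show up, one option would be to stop training as soon as the validation loss stops improving. A minimal sketch using Keras’ built-in early-stopping callback (the patience value is an arbitrary choice here):

fit_covid <- model %>%
  fit(x = train,
      y = train_label,
      epochs = 25,
      batch_size = 512,
      validation_split = 0.2,
      callbacks = list(callback_early_stopping(monitor = "val_loss",
                                               patience = 3, # stop after 3 epochs without improvement
                                               restore_best_weights = TRUE)))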

What about the performance on the testing dataset?

model %>%
  evaluate(test, test_label)
##       loss   accuracy 
## 0.02048795 0.99795920

Let’s do some predictions on the testing dataset, and compare with ground truth.

predictedclasses <- model %>%
  predict_classes(test) # deprecated in recent keras; alternatively: apply(predict(model, test), 1, which.max) - 1
table(Prediction = predictedclasses, 
      Actual = test_target)
##           Actual
## Prediction   0   1
##          0 243   0
##          1   1 246
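Equivalently, the overall accuracy can be computed directly by comparing predictions with the ground truth:

mean(predictedclasses == test_target) # proportion of correct predictions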

Pretty cool. Only one healthy patient is misclassified as being sick. Let’s save our model for further use.

save_model_tf(model, "model/covidmodel") # save the model
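The saved model can be reloaded later with load_model_tf(), for instance to make predictions on new scans:

reloaded <- load_model_tf("model/covidmodel") # reload the model
reloaded %>% evaluate(test, test_label) # same performance as before saving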

I’m happy with these results. In general, however, we would need to find ways to improve the performance; plenty of tips and worked examples are available for Keras in R.
