Introduction to the Tidyverse

class: center, middle, inverse, title-slide

# Introduction to the Tidyverse
## How to be a tidy data scientist
### Olivier Gimenez
### November 2020

---

# **Tidyverse**

- **Ordocosme** in French 🇫🇷, with _tidy_ for "bien rangé" (neatly arranged) and _verse_ for "univers" (universe)

- A collection of R 📦 developed by H. Wickham and others at RStudio

---

# **Tidyverse**

* "A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier" (Julien Barnier).

* Main characteristics of a tidy dataset:
    - each variable is a column
    - each observation is a row
    - each value is in a different cell
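
As a minimal sketch of these three rules, here is a small hand-made tibble. The object name `papers` and its values are illustrative only; they are not the case-study data.

```r
library(tibble)

# Hypothetical toy data: each row is one paper (observation),
# each column is one variable, each cell holds a single value.
papers <- tribble(
  ~journal,          ~pubyear, ~nbtweets,
  "Ecology Letters",     2014,        18,
  "Ecology",             2013,         6,
  "Evolution",           2012,         0
)
papers
```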

---

# **Tidyverse** is a collection of R 📦

* `ggplot2` - visualising stuff

* `dplyr`, `tidyr` - data manipulation

* `purrr` - advanced programming

* `readr` - import data

* `tibble` - improved data.frame format

* `forcats` - working w/ factors

* `stringr` - working w/ character strings

---

# **Tidyverse** is a collection of R 📦

* [`ggplot2` - visualising stuff](https://ggplot2.tidyverse.org/)

* [`dplyr`, `tidyr` - data manipulation](https://dplyr.tidyverse.org/)

* `purrr` - advanced programming

* [`readr` - import data](https://readr.tidyverse.org/)

* [`tibble` - improved data.frame format](https://tibble.tidyverse.org/)

* [`forcats` - working w/ factors](https://forcats.tidyverse.org/)

* [`stringr` - working w/ character strings](https://stringr.tidyverse.org/)

---
class: middle

# Workflow in data science

---
class: middle

# Workflow in data science, with **Tidyverse**

---
background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true)
background-size: 100px
background-position: 90% 3%

# Load [tidyverse](https://www.tidyverse.org) 📦

```r
#install.packages("tidyverse")
library(tidyverse)
```

---
class: middle

## Case study:
# [Using Twitter to predict citation rates of ecological research](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166570)

---
class: inverse, center, middle

# Import

---

# Import data

**readr::read_csv** function:

* ~~keeps input types as is (no conversion to factor)~~ (base `R` does the same since version 4.0.0)

* creates `tibbles` instead of `data.frames`
     - no row names
     - allows column names with special characters (see next slide)
     - smarter on-screen display than with data.frames (see next slide)
     - [no partial matching on column names](https://stackoverflow.com/questions/58513997/how-to-make-r-stop-accepting-partial-matches-for-column-names)
     - warns when you try to access a nonexistent column

* is daaaaaamn fast 🏎
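
A small sketch of how tibbles and data.frames differ on column access, using made-up objects `df` and `tb` rather than the case-study data:

```r
library(tibble)

# Hypothetical toy objects with the same numeric column
df <- data.frame(impactfactor = c(16.7, 6.16))
tb <- tibble(impactfactor = c(16.7, 6.16), `5-year IF` = c(16.7, 6.16))

# data.frame: `$` does partial matching and silently returns impactfactor
df$imp

# tibble: no partial matching, returns NULL and warns about the unknown column
# (the warning is suppressed here only to keep the output short)
tb_result <- suppressWarnings(tb$imp)
tb_result

# tibble: column names with spaces or leading digits are kept as is
names(tb)
```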

---

# Import data

```r
citations_raw <- read_csv('https://raw.githubusercontent.com/oliviergimenez/intro_tidyverse/master/journal.pone.0166570.s001.CSV')
citations_raw
```

```
## # A tibble: 1,599 x 12
## `Journal identi… `5-year journal… `Year published` Volume Issue Authors
## <chr> <dbl> <dbl> <dbl> <chr> <chr> 
## 1 Ecology Letters 16.7 2014 17 12 Morin …
## 2 Ecology Letters 16.7 2014 17 12 Jucker…
## 3 Ecology Letters 16.7 2014 17 12 Calcag…
## 4 Ecology Letters 16.7 2014 17 11 Segre …
## 5 Ecology Letters 16.7 2014 17 11 Kaufma…
## 6 Ecology Letters 16.7 2014 17 10 Nasto …
## 7 Ecology Letters 16.7 2014 17 10 Tschir…
## 8 Ecology Letters 16.7 2014 17 9 Barnec…
## 9 Ecology Letters 16.7 2014 17 9 Pinto-…
## 10 Ecology Letters 16.7 2014 17 9 Clough…
## # … with 1,589 more rows, and 6 more variables: `Collection date` <chr>,
## # `Publication date` <chr>, `Number of tweets` <dbl>, `Number of
## # users` <dbl>, `Twitter reach` <dbl>, `Number of Web of Science
## # citations` <dbl>
```

---
class: inverse, center, middle

# Tidy, transform

---

# Rename columns

```r
citations_temp <- rename(citations_raw,
 journal = 'Journal identity',
 impactfactor = '5-year journal impact factor',
 pubyear = 'Year published',
 colldate = 'Collection date',
 pubdate = 'Publication date',
 nbtweets = 'Number of tweets',
 woscitations = 'Number of Web of Science citations')
citations_temp
```

```
## # A tibble: 1,599 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>,
## # `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Create (or modify) columns

```r
citations <- mutate(citations_temp, journal = as.factor(journal))
citations
```

```
## # A tibble: 1,599 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>,
## # `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Create (or modify) columns

```r
levels(citations$journal)
```

```
##  [1] "Animal Conservation"              "Conservation Letters"            
##  [3] "Diversity and Distributions"      "Ecological Applications"         
##  [5] "Ecology"                          "Ecology Letters"                 
##  [7] "Evolution"                        "Evolutionary Applications"       
##  [9] "Fish and Fisheries"               "Functional Ecology"              
## [11] "Global Change Biology"            "Global Ecology and Biogeography" 
## [13] "Journal of Animal Ecology"        "Journal of Applied Ecology"      
## [15] "Journal of Biogeography"          "Limnology and Oceanography"      
## [17] "Mammal Review"                    "Methods in Ecology and Evolution"
## [19] "Molecular Ecology Resources"      "New Phytologist"
```

---
class: inverse, center, middle

# Give your code some air

---

# Cleaner code with "pipe" operator `%>%`

```r
citations_raw %>%
  rename(journal = 'Journal identity',
       impactfactor = '5-year journal impact factor',
       pubyear = 'Year published',
       colldate = 'Collection date',
       pubdate = 'Publication date',
       nbtweets = 'Number of tweets',
       woscitations = 'Number of Web of Science citations') %>%
  mutate(journal = as.factor(journal))
```

---

# Name object

```r
*citations <- citations_raw %>%
 rename(journal = 'Journal identity',
 impactfactor = '5-year journal impact factor',
 pubyear = 'Year published',
 colldate = 'Collection date',
 pubdate = 'Publication date',
 nbtweets = 'Number of tweets',
 woscitations = 'Number of Web of Science citations') %>%
 mutate(journal = as.factor(journal))
```

---

# Syntax with pipe

* `Verb(Subject, Complement)` replaced by `Subject %>% Verb(Complement)`

* No need to name unimportant intermediate variables

* Clear syntax (readability)
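
A minimal sketch of this equivalence, assuming a made-up vector `tweets`. Both forms compute the same value; the piped version reads left to right.

```r
library(dplyr)  # dplyr re-exports the %>% pipe

tweets <- c(18, 15, 5, 9, 3)  # hypothetical tweet counts

# Verb(Subject, Complement): read from the inside out
nested <- round(mean(tweets), 1)

# Subject %>% Verb(Complement): read from top to bottom
piped <- tweets %>%
  mean() %>%
  round(1)

identical(nested, piped)
```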

---

# Base R from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/)

```r
white_and_yolk <- crack(egg, add_seasoning)
omelette_batter <- beat(white_and_yolk)
omelette_with_chives <- cook(omelette_batter,add_chives)
```

---

# Piping from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/)

```r
egg %>%
  crack(add_seasoning) %>%
  beat() %>%
  cook(add_chives) -> omelette_with_chives
```

---
class: inverse, center, middle

# Tidy, transform

---

# Select columns

```r
citations %>%
  select(journal, impactfactor, nbtweets)
```

```
## # A tibble: 1,599 x 3
## journal impactfactor nbtweets
## <fct> <dbl> <dbl>
## 1 Ecology Letters 16.7 18
## 2 Ecology Letters 16.7 15
## 3 Ecology Letters 16.7 5
## 4 Ecology Letters 16.7 9
## 5 Ecology Letters 16.7 3
## 6 Ecology Letters 16.7 27
## 7 Ecology Letters 16.7 6
## 8 Ecology Letters 16.7 19
## 9 Ecology Letters 16.7 26
## 10 Ecology Letters 16.7 44
## # … with 1,589 more rows
```

---

# Drop columns

```r
citations %>%
  select(-Volume, -Issue, -Authors)
```

```
## # A tibble: 1,599 x 9
## journal impactfactor pubyear colldate pubdate nbtweets `Number of user…
## <fct> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Ecolog… 16.7 2014 2/1/2016 9/16/2… 18 16
## 2 Ecolog… 16.7 2014 2/1/2016 10/13/… 15 12
## 3 Ecolog… 16.7 2014 2/1/2016 10/21/… 5 4
## 4 Ecolog… 16.7 2014 2/1/2016 8/28/2… 9 8
## 5 Ecolog… 16.7 2014 2/1/2016 8/28/2… 3 3
## 6 Ecolog… 16.7 2014 2/2/2016 7/28/2… 27 23
## 7 Ecolog… 16.7 2014 2/2/2016 8/6/20… 6 6
## 8 Ecolog… 16.7 2014 2/2/2016 6/17/2… 19 18
## 9 Ecolog… 16.7 2014 2/2/2016 6/12/2… 26 23
## 10 Ecolog… 16.7 2014 2/2/2016 7/17/2… 44 42
## # … with 1,589 more rows, and 2 more variables: `Twitter reach` <dbl>,
## # woscitations <dbl>
```

---

# Split a column into several columns

```r
citations %>%
  separate(pubdate,c('month','day','year'),'/')
```

```
## # A tibble: 1,599 x 14
## journal impactfactor pubyear Volume Issue Authors colldate month day year 
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9 16 2014 
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10 13 2014 
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10 21 2014 
## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8 28 2014 
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8 28 2014 
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7 28 2014 
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8 6 2014 
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6 17 2014 
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6 12 2014 
## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7 17 2014 
## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, `Number of
## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Convert to Date format...

```r
library(lubridate)
citations %>%
  mutate(pubdate = mdy(pubdate),
         colldate = mdy(colldate))
```

```
## # A tibble: 1,599 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate 
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> 
## 1 Ecolog… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21
## 4 Ecolog… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12
## 10 Ecolog… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17
## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, `Number of
## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl>
```

---

# ...for easy manipulation of dates

```r
library(lubridate)
citations %>%
  mutate(pubdate = mdy(pubdate),
         colldate = mdy(colldate),
*        pubyear2 = year(pubdate))
```

```
## # A tibble: 1,599 x 13
## journal impactfactor pubyear Volume Issue Authors colldate pubdate 
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> 
## 1 Ecolog… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21
## 4 Ecolog… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12
## 10 Ecolog… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17
## # … with 1,589 more rows, and 5 more variables: nbtweets <dbl>, `Number of
## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl>, pubyear2 <dbl>
```

* Check out `?lubridate::lubridate` for more functions
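
As a quick sketch of what the Date format buys us, here are a few lubridate helpers applied to the first publication and collection dates shown in the output above. The object names are illustrative.

```r
library(lubridate)

pubdate  <- mdy("9/16/2014")  # first publication date in the table
colldate <- mdy("2/1/2016")   # first collection date in the table

year(pubdate)                   # extract the year
month(pubdate)                  # extract the month as a number
as.numeric(colldate - pubdate)  # days elapsed between publication and collection
```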

---

# How to join tables together?

<blockquote class="twitter-tweet" data-lang="fr">More <a href="https://twitter.com/hashtag/dplyr?src=hash&amp;ref_src=twsrc%5Etfw">#dplyr</a> 🔧 gifs! It took me a hella long time to wrap my head around the different types of joins when I first started learning them, so here&#39;s a few examples with some excellent mini datasets from <a href="https://twitter.com/hashtag/dplyr?src=hash&amp;ref_src=twsrc%5Etfw">#dplyr</a> designed specifically for this purpose! <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&amp;ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://t.co/G56fWmIZSq">pic.twitter.com/G56fWmIZSq</a>&mdash; Nic Crane (@nic_crane) <a href="https://twitter.com/nic_crane/status/1064237554910806016?ref_src=twsrc%5Etfw">18 novembre 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

[![Watch the video](assets/mp4/dplyr_join.mp4)](assets/mp4/dplyr_join.mp4)
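
A minimal sketch of three common joins, using `band_members` and `band_instruments`, the mini datasets shipped with dplyr that the tweet refers to. The result names `inner`, `left` and `full` are illustrative.

```r
library(dplyr)

# Keep only musicians present in both tables
inner <- inner_join(band_members, band_instruments, by = "name")

# Keep every row of band_members, filling in NA when no instrument is known
left <- left_join(band_members, band_instruments, by = "name")

# Keep every musician found in either table
full <- full_join(band_members, band_instruments, by = "name")

nrow(inner)
nrow(left)
nrow(full)
```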

---

## <https://www.garrickadenbuie.com/project/tidyexplain/>

---
class: inverse, center, middle

# Easy character manipulation

---

# Select rows corresponding to papers with 3 or more authors

```r
citations %>%
* filter(str_detect(Authors,'et al'))
```

```
## # A tibble: 1,280 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # … with 1,270 more rows, and 3 more variables: `Number of users` <dbl>,
## # `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Get column with rows corresponding to papers with 3 or more authors

```r
citations %>%
* filter(str_detect(Authors,'et al')) %>%
* select(Authors)
```

```
## # A tibble: 1,280 x 1
## Authors 
## <chr> 
## 1 Morin et al 
## 2 Jucker et al 
## 3 Calcagno et al 
## 4 Segre et al 
## 5 Kaufman et al 
## 6 Nasto et al 
## 7 Tschirren et al 
## 8 Barnechi et al 
## 9 Pinto-Sanchez et al
## 10 Clough et al 
## # … with 1,270 more rows
```

---

# Select rows corresponding to papers with fewer than 3 authors

```r
citations %>%
* filter(!str_detect(Authors,'et al'))
```

```
## # A tibble: 319 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecolog… 16.7 2014 17 6 Neutle… 2/15/20… 3/17/2… 8
## 2 Ecolog… 16.7 2014 17 5 Kellne… 2/15/20… 2/20/2… 18
## 3 Ecolog… 16.7 2014 17 4 Griffi… 2/15/20… 1/16/2… 4
## 4 Ecolog… 16.7 2014 17 3 Gremer… 2/15/20… 1/17/2… 4
## 5 Ecolog… 16.7 2014 17 2 Cavier… 2/15/20… 10/17/… 16
## 6 Ecolog… 16.7 2014 17 2 Haegma… 2/15/20… 12/5/2… 9
## 7 Ecolog… 16.7 2013 16 12 Kearney 2/15/20… 10/1/2… 13
## 8 Ecolog… 16.7 2013 16 9 Locey … 2/15/20… 7/15/2… 28
## 9 Ecolog… 16.7 2013 16 8 Quinte… 2/15/20… 6/26/2… 120
## 10 Ecolog… 16.7 2013 16 3 Lesser… 2/15/20… 12/22/… 9
## # … with 309 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter
## # reach` <dbl>, woscitations <dbl>
```

---

# Get column with rows corresponding to papers with fewer than 3 authors

```r
citations %>%
* filter(!str_detect(Authors,'et al')) %>%
* select(Authors)
```

```
## # A tibble: 319 x 1
## Authors 
## <chr> 
## 1 Neutle and Thorne 
## 2 Kellner and Asner 
## 3 Griffin and Willi 
## 4 Gremer and Venable
## 5 Cavieres 
## 6 Haegman and Loreau
## 7 Kearney 
## 8 Locey and White 
## 9 Quintero and Weins
## 10 Lesser and Jackson
## # … with 309 more rows
```

---

# Get column with rows corresponding to papers with fewer than 3 authors

```r
citations %>%
  filter(!str_detect(Authors,'et al')) %>%
* pull(Authors) %>%
  head(10)
```

```
##  [1] "Neutle and Thorne"  "Kellner and Asner"  "Griffin and Willi" 
##  [4] "Gremer and Venable" "Cavieres"           "Haegman and Loreau"
##  [7] "Kearney"            "Locey and White"    "Quintero and Weins"
## [10] "Lesser and Jackson"
```

---

# Select rows corresponding to papers with fewer than 3 authors in journals with IF < 5

```r
citations %>%
* filter(!str_detect(Authors,'et al'), impactfactor < 5)
```

```
## # A tibble: 77 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Molecu… 4.9 2014 14 6 Gautier 2/27/20… 5/14/2… 2
## 2 Molecu… 4.9 2014 14 5 Gambel… 2/27/20… 3/7/20… 7
## 3 Molecu… 4.9 2014 14 4 Kekkon… 2/27/20… 3/10/2… 4
## 4 Molecu… 4.9 2014 14 3 Bhatta… 2/27/20… 12/8/2… 0
## 5 Molecu… 4.9 2014 14 1 Christ… 2/28/20… 10/25/… 0
## 6 Molecu… 4.9 2013 13 4 Villar… 2/28/20… 5/2/20… 0
## 7 Molecu… 4.9 2013 13 4 Wang 2/28/20… 4/25/2… 0
## 8 Molecu… 4.9 2012 12 1 Joly 2/28/20… 9/7/20… 3
## 9 Animal… 3.21 2014 17 6 Plavsic 2/9/2016 4/17/2… 9
## 10 Animal… 3.21 2014 17 Supp… Knox a… 2/11/20… 11/13/… 1
## # … with 67 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter
## # reach` <dbl>, woscitations <dbl>
```

---

# Convert words to lowercase

```r
citations %>%
* mutate(authors_lowercase = str_to_lower(Authors)) %>%
  select(authors_lowercase)
```

```
## # A tibble: 1,599 x 1
## authors_lowercase 
## <chr> 
## 1 morin et al 
## 2 jucker et al 
## 3 calcagno et al 
## 4 segre et al 
## 5 kaufman et al 
## 6 nasto et al 
## 7 tschirren et al 
## 8 barnechi et al 
## 9 pinto-sanchez et al
## 10 clough et al 
## # … with 1,589 more rows
```

---

# Remove all spaces in journal names

```r
citations %>%
* mutate(journal = str_remove_all(journal," ")) %>%
  select(journal) %>%
  unique() %>%
  head(5)
```

```
## # A tibble: 5 x 1
## journal 
## <chr> 
## 1 EcologyLetters 
## 2 GlobalChangeBiology 
## 3 GlobalEcologyandBiogeography
## 4 MolecularEcologyResources 
## 5 DiversityandDistributions
```

---

# Explore 📦 stringr and regular expressions

* Check out the [vignette on stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) for more examples on character manipulation and pattern matching functions.

* Check out the [vignette on regular expressions](https://stringr.tidyverse.org/articles/regular-expressions.html) which are a concise and flexible tool for describing patterns in strings.
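
A short sketch of regular expressions with stringr, applied to three author strings taken from the outputs above. The vector name `authors` and the patterns are illustrative choices, not part of the original analysis.

```r
library(stringr)

authors <- c("Morin et al", "Kellner and Asner", "Kearney")

# "$" anchors the pattern at the end of the string
str_detect(authors, "et al$")

# "|" means "or": strip either " et al" or " and <second author>"
first_author <- str_remove(authors, " (et al|and .*)$")
first_author
```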

---
class: inverse, center, middle

# Basic exploratory data analysis

---

# Count

```r
citations %>% count(journal, sort = TRUE)
```

```
## # A tibble: 20 x 2
## journal n
## <fct> <int>
## 1 New Phytologist 144
## 2 Ecology 108
## 3 Evolution 108
## 4 Global Change Biology 108
## 5 Global Ecology and Biogeography 108
## 6 Journal of Biogeography 108
## 7 Ecology Letters 106
## 8 Diversity and Distributions 105
## 9 Animal Conservation 102
## 10 Methods in Ecology and Evolution 90
## 11 Evolutionary Applications 74
## 12 Functional Ecology 54
## 13 Journal of Animal Ecology 54
## 14 Journal of Applied Ecology 54
## 15 Limnology and Oceanography 54
## 16 Molecular Ecology Resources 54
## 17 Conservation Letters 53
## 18 Ecological Applications 48
## 19 Fish and Fisheries 36
## 20 Mammal Review 31
```

---

# Count

```r
citations %>%
  count(journal, pubyear) %>%
  head()
```

```
## # A tibble: 6 x 3
## journal pubyear n
## <fct> <dbl> <int>
## 1 Animal Conservation 2012 18
## 2 Animal Conservation 2013 18
## 3 Animal Conservation 2014 66
## 4 Conservation Letters 2012 17
## 5 Conservation Letters 2013 18
## 6 Conservation Letters 2014 18
```

---

# Count sum of tweets per journal

```r
citations %>%
  count(journal, wt = nbtweets, sort = TRUE)
```

```
## # A tibble: 20 x 2
## journal n
## <fct> <dbl>
## 1 Ecology Letters 1538
## 2 Animal Conservation 1268
## 3 Journal of Applied Ecology 1012
## 4 Methods in Ecology and Evolution 699
## 5 Global Change Biology 613
## 6 Conservation Letters 542
## 7 New Phytologist 509
## 8 Global Ecology and Biogeography 379
## 9 Ecology 335
## 10 Evolution 335
## 11 Journal of Animal Ecology 323
## 12 Fish and Fisheries 261
## 13 Evolutionary Applications 238
## 14 Journal of Biogeography 209
## 15 Diversity and Distributions 200
## 16 Mammal Review 166
## 17 Functional Ecology 155
## 18 Molecular Ecology Resources 139
## 19 Ecological Applications 125
## 20 Limnology and Oceanography 0
```

---

# Group by variable to calculate stats

```r
citations %>%
* group_by(journal) %>%
* summarise(avg_tweets = mean(nbtweets)) %>%
  head(10)
```

```
## # A tibble: 10 x 2
## journal avg_tweets
## <fct> <dbl>
## 1 Animal Conservation 12.4 
## 2 Conservation Letters 10.2 
## 3 Diversity and Distributions 1.90
## 4 Ecological Applications 2.60
## 5 Ecology 3.10
## 6 Ecology Letters 14.5 
## 7 Evolution 3.10
## 8 Evolutionary Applications 3.22
## 9 Fish and Fisheries 7.25
## 10 Functional Ecology 2.87
```

---

# Order stuff

```r
citations %>%
  group_by(journal) %>%
  summarise(avg_tweets = mean(nbtweets)) %>%
* arrange(desc(avg_tweets)) %>% # decreasing order (drop desc() for increasing order)
  head(10)
```

```
## # A tibble: 10 x 2
## journal avg_tweets
## <fct> <dbl>
## 1 Journal of Applied Ecology 18.7 
## 2 Ecology Letters 14.5 
## 3 Animal Conservation 12.4 
## 4 Conservation Letters 10.2 
## 5 Methods in Ecology and Evolution 7.77
## 6 Fish and Fisheries 7.25
## 7 Journal of Animal Ecology 5.98
## 8 Global Change Biology 5.68
## 9 Mammal Review 5.35
## 10 New Phytologist 3.53
```

---

# What if we want to work on several columns?

---

# Compute mean across all numeric columns for each journal

```r
citations %>%
* group_by(journal) %>%
* summarize(across(where(is.numeric), mean)) %>%
  head()
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```
## # A tibble: 6 x 8
## journal impactfactor pubyear Volume nbtweets `Number of user… `Twitter reach`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Animal… 3.21 2013. 16.5 12.4 9.71 28345.
## 2 Conser… 6.4 2013. 6.02 10.2 8.85 23234.
## 3 Divers… 5.4 2013 19 1.90 1.77 2350.
## 4 Ecolog… 5.06 2013 23 2.60 2.5 5727.
## 5 Ecology 6.16 2013 94 3.10 2.87 6176.
## 6 Ecolog… 16.7 2013. 16.0 14.5 14.0 44748.
## # … with 1 more variable: woscitations <dbl>
```

---

## <https://github.com/courtiol/Rguides>

---

# Tidying tibbles

---

## Going from **long** to **wide** format and vice-versa
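
A minimal sketch with tidyr's `pivot_longer()` and `pivot_wider()`. The small `wide` table is typed by hand from the per-year counts shown earlier for two journals; the object names are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Wide format: one column per publication year
wide <- tribble(
  ~journal,               ~`2012`, ~`2013`, ~`2014`,
  "Animal Conservation",       18,      18,      66,
  "Conservation Letters",      17,      18,      18
)

# Wide to long: one row per journal-year combination
long <- wide %>%
  pivot_longer(cols = -journal, names_to = "pubyear", values_to = "n")
long

# Long back to wide
back <- long %>%
  pivot_wider(names_from = pubyear, values_from = n)
back
```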

---
class: inverse, center, middle

# Visualize

---

# Visualization with ggplot2

* The package ggplot2 implements a **g**rammar of **g**raphics

* Operates on data.frames or tibbles, not vectors like base R

* Explicitly differentiates between the data and its representation

---

# The ggplot2 grammar

---

# Scatterplots

```r
*citations %>%
* ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point()
```
* Pass in the data frame as your first argument

---

# Scatterplots

```r
citations %>%
  ggplot() +
* aes(x = nbtweets, y = woscitations) +
  geom_point()
```
* Pass in the data frame as your first argument
* Aesthetics map the data onto plot characteristics, here x and y axes

---

# Scatterplots

```r
citations %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
* geom_point()
```
* Pass in the data frame as your first argument
* Aesthetics map the data onto plot characteristics, here x and y axes
* Display the data geometrically as points

---

# Scatterplots

```r
citations %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point()
```

---

# Scatterplots, with colors

```r
citations %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
* geom_point(color = "red")
```

---

# Scatterplots, with journal-specific colors

```r
citations %>%
  ggplot() +
* aes(x = nbtweets, y = woscitations, color = journal) +
  geom_point()
```

* Placing color inside aesthetic maps it to the data

---

# Pick a few journals

```r
citations_ecology <- citations %>%
 mutate(journal = str_to_lower(journal)) %>% # all journal names in lowercase
 filter(journal %in%
 c('journal of animal ecology','journal of applied ecology','ecology')) # filter
citations_ecology
```

```
## # A tibble: 216 x 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 ecology 6.16 2014 95 12 Maglia… 3/19/20… 12/1/2… 1
## 2 ecology 6.16 2014 95 12 Soinen 3/19/20… 12/1/2… 6
## 3 ecology 6.16 2014 95 12 Graham… 3/19/20… 12/1/2… 1
## 4 ecology 6.16 2014 95 11 White … 3/19/20… 11/1/2… 9
## 5 ecology 6.16 2014 95 11 Einars… 3/19/20… 11/1/2… 15
## 6 ecology 6.16 2014 95 11 Haav a… 3/19/20… 11/1/2… 2
## 7 ecology 6.16 2014 95 10 Dodds … 3/19/20… 10/1/2… 1
## 8 ecology 6.16 2014 95 10 Brown … 3/19/20… 10/1/2… 1
## 9 ecology 6.16 2014 95 10 Wright… 3/19/20… 10/1/2… 0
## 10 ecology 6.16 2014 95 9 Ramahl… 3/19/20… 9/1/20… 27
## # … with 206 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter
## # reach` <dbl>, woscitations <dbl>
```

---

# Scatterplots, with journal-specific shapes

```r
citations_ecology %>%
  ggplot() +
* aes(x = nbtweets, y = woscitations, shape = journal) +
  geom_point(size=2)
```

---

# Scatterplots, lines instead of points

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
* geom_line() +
  scale_x_log10()
```

---

# Scatterplots, lines with sorting beforehand

```r
citations_ecology %>%
* arrange(woscitations) %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_line() +
  scale_x_log10()
```

---

# Scatterplots, add points

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_line() +
* geom_point() +
  scale_x_log10()
```

---

# Scatterplots, add linear trend

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point() +
* geom_smooth(method = "lm") +
  scale_x_log10()
```

---

# Scatterplots, add smoother

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point() +
* geom_smooth() +
  scale_x_log10()
```

---

# aes or not aes?

* If we want to link the values of a variable to a graphical feature, i.e. create a mapping, then we need an `aes()`.

* Otherwise, the graphical feature is set to a fixed value irrespective of the data, and we do not need an `aes()`.
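
A small sketch contrasting the two cases, with a made-up data frame `toy`. In `p_mapped` the color depends on a variable, so it goes inside `aes()`. In `p_set` every point is red, so the color is passed directly to the geom.

```r
library(ggplot2)

# Hypothetical toy data with a grouping variable
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 3, 5),
                  group = c("a", "a", "b", "b"))

# Mapping: color varies with the data, hence aes()
p_mapped <- ggplot(toy) +
  aes(x = x, y = y, color = group) +
  geom_point()

# Setting: color is fixed, hence no aes()
p_set <- ggplot(toy) +
  aes(x = x, y = y) +
  geom_point(color = "red")
```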

---

# Histograms

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram()
```

---

# Histograms, with colors

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram(fill = "orange")
```

---

# Histograms, with colors

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram(fill = "orange", color = "brown")
```

---

# Histograms, with labels and title

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets) +
  geom_histogram(fill = "orange", color = "brown") +
* labs(x = "Number of tweets",
*      y = "Count",
*      title = "Histogram of the number of tweets")
```

---

# Histograms, by journal

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets) +
  geom_histogram(fill = "orange", color = "brown") +
  labs(x = "Number of tweets",
       y = "Count",
       title = "Histogram of the number of tweets") + 
* facet_wrap(vars(journal))
```

---

# Boxplots

```r
citations_ecology %>%
  ggplot() +
  aes(x = "", y = nbtweets) +
* geom_boxplot() +
  scale_y_log10()
```

---

# Boxplots with colors

```r
citations_ecology %>%
  ggplot() +
  aes(x = "", y = nbtweets) +
* geom_boxplot(fill = "green") +
  scale_y_log10()
```

---

# Boxplots with colors by journal

```r
citations_ecology %>%
  ggplot() +
* aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10()
```

---

# Get rid of the tick labels on the x axis

```r
citations_ecology %>%
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() + 
* theme(axis.text.x = element_blank()) +
* labs(x = "")
```

---

# Boxplots, user-specified colors by journal

```r
citations_ecology %>%
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() +
* scale_fill_manual(
*   values = c("red", "blue", "purple")) +
  theme(axis.text.x = element_blank()) +
  labs(x = "")
```

---

# Boxplots, change legend settings

```r
citations_ecology %>%
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() +
* scale_fill_manual(
    values = c("red", "blue", "purple"),
*   name = "Journal name",
*   labels = c("Ecology", "J Animal Ecology", "J Applied Ecology")) +
  theme(axis.text.x = element_blank()) +
  labs(x = "")
```

---

# Ugly bar plots

```r
citations %>%
  count(journal) %>%
  ggplot() +
  aes(x = journal, y = n) +
* geom_col()
```

---

# Same plot, with flipped axes

```r
citations %>%
  count(journal) %>%
  ggplot() +
* aes(x = n, y = journal) +
  geom_col()
```

---

# Same plot, with reordered factor levels and flipped axes

```r
citations %>%
  count(journal) %>%
  ggplot() +
* aes(x = n, y = fct_reorder(journal, n)) +
  geom_col()
```

---

# Further cleaning

```r
citations %>%
  count(journal) %>%
  ggplot() +
  aes(x = n, y = fct_reorder(journal, n)) +
  geom_col() + 
  labs(x = "counts", y = "")
```

---

# More about how to (tidy) work with factors

* [Be the boss of your factors](https://stat545.com/block029_factors.html) and 
* [forcats, forcats, vous avez dit forcats ?](https://thinkr.fr/forcats-forcats-vous-avez-dit-forcats/).
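
A quick sketch of what `fct_reorder()` does to factor levels, using three journals and their counts from the `count()` output shown earlier. The vectors `journal` and `n` are typed by hand for illustration.

```r
library(forcats)

journal <- c("Mammal Review", "New Phytologist", "Ecology")
n       <- c(31, 144, 108)  # number of papers per journal

# Default factor levels are alphabetical
default_levels <- levels(factor(journal))
default_levels

# fct_reorder() sorts levels by increasing values of n
reordered_levels <- levels(fct_reorder(journal, n))
reordered_levels
```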

---

# Density plots

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, fill = journal) +
* geom_density() +
  scale_x_log10()
```

---

# Density plots, control transparency

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, fill = journal) +
* geom_density(alpha = 0.5) +
  scale_x_log10()
```

---

# Change default background theme `B & W theme`

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_bw()
```

---

# Change default background theme `classic theme`

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_classic()
```

---

# Change default background theme `dark theme`

```r
citations_ecology %>%
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_dark()
```

---

# More on data visualisation with ggplot2

* [Portfolio](https://www.r-graph-gallery.com/portfolio/ggplot2-package/) of ggplot2 plots

* [Cédric Scherer's portfolio](https://cedricscherer.netlify.app/top/dataviz/) of data visualisations

* [Top](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html) ggplot2 visualizations

* [Interactive](https://dreamrs.github.io/esquisse/) ggplot2 visualizations

---

background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true)
background-size: 550px
background-position: 50% 50%

---

# To dive even deeper into the tidyverse

* [Learn the tidyverse](https://www.tidyverse.org/learn/): books, workshops and online courses

* My selection of books:
   - [R for Data Science](https://r4ds.had.co.nz/) and [Advanced R](http://adv-r.had.co.nz/)
   - [Introduction à R et au tidyverse](https://juba.github.io/tidyverse/)
   - [Fundamentals of Data visualization](https://clauswilke.com/dataviz/)
   - [Data Visualization: A practical introduction](http://socviz.co/)

* [Tidy Tuesdays videos](https://www.youtube.com/user/safe4democracy/videos) by D. Robinson, chief data scientist at DataCamp

* Material of the [2-day workshop Data Science in the tidyverse](https://github.com/cwickham/data-science-in-tidyverse) held at the RStudio 2019 conference

* Material of the stat545 course on [Data wrangling, exploration, and analysis with R](https://stat545.com/) at the University of British Columbia

* List of best R packages (with their description) on [data import, wrangling and visualization](https://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html)

---

# [How to switch from base R to tidyverse?](https://www.significantdigits.org/2017/10/switching-from-base-r-to-tidyverse/)
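
As a rough illustration of the switch (not taken from the linked post), here is the same per-group mean computed in base R and with dplyr on a made-up data frame `toy`:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(journal = c("A", "A", "B"),
                  nbtweets = c(2, 4, 10))

# Base R: formula interface of aggregate()
base_res <- aggregate(nbtweets ~ journal, data = toy, FUN = mean)

# Tidyverse: group, then summarise
tidy_res <- toy %>%
  group_by(journal) %>%
  summarise(nbtweets = mean(nbtweets))

base_res
tidy_res
```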

---

# The [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/)

---
class: title-slide-final, middle
background-size: 55px
background-position: 9% 15%

# Thanks!

### I created these slides with [xaringan](https://github.com/yihui/xaringan) and [RMarkdown](https://rmarkdown.rstudio.com/) using the [rutgers css](https://github.com/jvcasillas/ru_xaringan) that I slightly modified.

### Credit: I used material from [Cécile Sauder](https://github.com/cecilesauder/RLadiesTidyverse), [Stephanie J. Spielman](http://sjspielman.org/bio5312_fall2017/) and [Julien Barnier](https://juba.github.io/tidyverse/).

| | |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| | **olivier.gimenez@cefe.cnrs.fr** |
| | [**https://oliviergimenez.github.io/**](https://oliviergimenez.github.io/) |
| | [**@oaggimenez**](https://twitter.com/oaggimenez) |
| | [**@oliviergimenez**](https://github.com/oliviergimenez) |