class: center, middle, inverse, title-slide # Introduction to the Tidyverse ## How to be a tidy data scientist ### Olivier Gimenez ### Novembre 2020 --- # **Tidyverse** - **Ordocosme** in 🇫🇷 with _Tidy_ for "bien rangĂ©" and _verse_ for "univers" - A collection of R 📦 developed by H. Wickham and others at Rstudio <img src="assets/img/wickham_president.jpg" width="50%" style="display: block; margin: auto;" /> --- # **Tidyverse** * "A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier" (Julien Barnier). * Main characteristics of a tidy dataset: - each variable is a column - each observation is a raw - each value is in a different cell <img src="assets/img/tidydata.png" width="80%" style="display: block; margin: auto;" /> --- # **Tidyverse** is a collection of R 📦 * `ggplot2` - visualising stuff * `dplyr`, `tidyr` - data manipulation * `purrr` - advanced programming * `readr` - import data * `tibble` - improved data.frame format * `forcats` - working w/ factors * `stringr` - working w/ chain of characters --- # **Tidyverse** is a collection of R 📦 * [`ggplot2` - visualising stuff](https://ggplot2.tidyverse.org/) * [`dplyr`, `tidyr` - data manipulation](https://dplyr.tidyverse.org/) * `purrr` - advanced programming * [`readr` - import data](https://readr.tidyverse.org/) * [`tibble` - improved data.frame format](https://tibble.tidyverse.org/) * [`forcats` - working w/ factors](https://forcats.tidyverse.org/) * [`stringr` - working w/ chain of characters](https://stringr.tidyverse.org/) --- class: middle # Workflow in data science <img src="assets/img/data-science-workflow.png" width="100%" style="display: block; margin: auto;" /> --- class: middle # Workflow in data science, with **Tidyverse** <img src="assets/img/01_tidyverse_data_science.png" width="90%" style="display: block; margin: auto;" /> --- background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true) background-size: 100px background-position: 90% 3% # Load [tidyverse](www.tidyverse.org) 📦 ```r #install.packages("tidyverse") library(tidyverse) ``` --- class: middle ## Case study: # [Using Twitter to predict citation rates of ecological research](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166570) <img src="assets/img/paper_workflow.png" width="85%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Import --- # Import data **readr::read_csv** function: * ~~keeps input types as is (no conversion to factor)~~ (since `R` 4.0.0) * creates `tibbles` instead of `data.frame` - no names to rows - allows column names with special characters (see next slide) - more clever on screen display than w/ data.frames (see next slide) - [no partial matching on column names](https://stackoverflow.com/questions/58513997/how-to-make-r-stop-accepting-partial-matches-for-column-names) - warning if attempt to access unexisting column * is daaaaaamn fast 🏎 --- # Import data ```r citations_raw <- read_csv('https://raw.githubusercontent.com/oliviergimenez/intro_tidyverse/master/journal.pone.0166570.s001.CSV') citations_raw ``` ``` ## # A tibble: 1,599 x 12 ## `Journal identi… `5-year journal… `Year published` Volume Issue Authors ## <chr> <dbl> <dbl> <dbl> <chr> <chr> ## 1 Ecology Letters 16.7 2014 17 12 Morin … ## 2 Ecology Letters 16.7 2014 17 12 Jucker… ## 3 Ecology Letters 16.7 2014 17 12 Calcag… ## 4 Ecology Letters 16.7 2014 17 11 Segre … ## 5 Ecology Letters 16.7 2014 17 11 Kaufma… ## 6 Ecology Letters 16.7 2014 17 10 Nasto … ## 7 Ecology Letters 16.7 2014 17 10 Tschir… ## 8 Ecology Letters 16.7 2014 17 9 Barnec… ## 9 Ecology Letters 16.7 2014 17 9 Pinto-… ## 10 Ecology Letters 16.7 2014 17 9 Clough… ## # … with 1,589 more rows, and 6 more variables: `Collection date` <chr>, ## # `Publication date` <chr>, `Number of tweets` <dbl>, `Number of ## # users` <dbl>, `Twitter reach` <dbl>, `Number of Web of Science ## # citations` <dbl> ``` --- class: inverse, center, middle # Tidy, transform --- # Rename columns ```r citations_temp <- rename(citations_raw, journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') citations_temp ``` ``` ## # A tibble: 1,599 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Create (or modify) columns ```r citations <- mutate(citations_temp, journal = as.factor(journal)) citations ``` ``` ## # A tibble: 1,599 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Create (or modify) columns ```r levels(citations$journal) ``` ``` ## [1] "Animal Conservation" "Conservation Letters" ## [3] "Diversity and Distributions" "Ecological Applications" ## [5] "Ecology" "Ecology Letters" ## [7] "Evolution" "Evolutionary Applications" ## [9] "Fish and Fisheries" "Functional Ecology" ## [11] "Global Change Biology" "Global Ecology and Biogeography" ## [13] "Journal of Animal Ecology" "Journal of Applied Ecology" ## [15] "Journal of Biogeography" "Limnology and Oceanography" ## [17] "Mammal Review" "Methods in Ecology and Evolution" ## [19] "Molecular Ecology Resources" "New Phytologist" ``` --- class: inverse, center, middle # Give your code some air --- # Cleaner code with "pipe" operator `%>%` ```r citations_raw %>% rename(journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') %>% mutate(journal = as.factor(journal)) ``` ``` ## # A tibble: 1,599 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Name object ```r *citations <- citations_raw %>% rename(journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') %>% mutate(journal = as.factor(journal)) ``` --- # Syntax with pipe * Verb(Subject,Complement) replaced by Subject %>% Verb(Complement) * No need to name unimportant intermediate variables * Clear syntax (readability) <img src="assets/img/logo_pipe.png" width="40%" style="display: block; margin: auto;" /> --- # Base R from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/) ```r white_and_yolk <- crack(egg, add_seasoning) omelette_batter <- beat(white_and_yolk) omelette_with_chives <- cook(omelette_batter,add_chives) ``` <img src="assets/img/piping_successive.jpg" width="500px" style="display: block; margin: auto;" /> --- # Piping from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/) ```r egg %>% crack(add_seasoning) %>% beat() %>% cook(add_chives) -> omelette_with_chives ``` <img src="assets/img/piping_piped.png" width="250px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Tidy, transform --- # Select columns ```r citations %>% select(journal, impactfactor, nbtweets) ``` ``` ## # A tibble: 1,599 x 3 ## journal impactfactor nbtweets ## <fct> <dbl> <dbl> ## 1 Ecology Letters 16.7 18 ## 2 Ecology Letters 16.7 15 ## 3 Ecology Letters 16.7 5 ## 4 Ecology Letters 16.7 9 ## 5 Ecology Letters 16.7 3 ## 6 Ecology Letters 16.7 27 ## 7 Ecology Letters 16.7 6 ## 8 Ecology Letters 16.7 19 ## 9 Ecology Letters 16.7 26 ## 10 Ecology Letters 16.7 44 ## # … with 1,589 more rows ``` --- # Drop columns ```r citations %>% select(-Volume, -Issue, -Authors) ``` ``` ## # A tibble: 1,599 x 9 ## journal impactfactor pubyear colldate pubdate nbtweets `Number of user… ## <fct> <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 Ecolog… 16.7 2014 2/1/2016 9/16/2… 18 16 ## 2 Ecolog… 16.7 2014 2/1/2016 10/13/… 15 12 ## 3 Ecolog… 16.7 2014 2/1/2016 10/21/… 5 4 ## 4 Ecolog… 16.7 2014 2/1/2016 8/28/2… 9 8 ## 5 Ecolog… 16.7 2014 2/1/2016 8/28/2… 3 3 ## 6 Ecolog… 16.7 2014 2/2/2016 7/28/2… 27 23 ## 7 Ecolog… 16.7 2014 2/2/2016 8/6/20… 6 6 ## 8 Ecolog… 16.7 2014 2/2/2016 6/17/2… 19 18 ## 9 Ecolog… 16.7 2014 2/2/2016 6/12/2… 26 23 ## 10 Ecolog… 16.7 2014 2/2/2016 7/17/2… 44 42 ## # … with 1,589 more rows, and 2 more variables: `Twitter reach` <dbl>, ## # woscitations <dbl> ``` --- # Split a column in several columns ```r citations %>% separate(pubdate,c('month','day','year'),'/') ``` ``` ## # A tibble: 1,599 x 14 ## journal impactfactor pubyear Volume Issue Authors colldate month day year ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9 16 2014 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10 13 2014 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10 21 2014 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8 28 2014 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8 28 2014 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7 28 2014 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8 6 2014 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6 17 2014 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6 12 2014 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7 17 2014 ## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, `Number of ## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Transform in Date format... ```r library(lubridate) citations %>% mutate(pubdate = mdy(pubdate), colldate = mdy(colldate)) ``` ``` ## # A tibble: 1,599 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17 ## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, `Number of ## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl> ``` --- # ...for easy manipulation of dates ```r library(lubridate) citations %>% mutate(pubdate = mdy(pubdate), colldate = mdy(colldate), * pubyear2 = year(pubdate)) ``` ``` ## # A tibble: 1,599 x 13 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17 ## # … with 1,589 more rows, and 5 more variables: nbtweets <dbl>, `Number of ## # users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl>, pubyear2 <dbl> ``` * Check out `?lubridate::lubridate` for more functions --- # How to join tables together? <blockquote class="twitter-tweet" data-lang="fr"><p lang="en" dir="ltr">More <a href="https://twitter.com/hashtag/dplyr?src=hash&ref_src=twsrc%5Etfw">#dplyr</a> 🔧 gifs! It took me a hella long time to wrap my head around the different types of joins when I first started learning them, so here's a few examples with some excellent mini datasets from <a href="https://twitter.com/hashtag/dplyr?src=hash&ref_src=twsrc%5Etfw">#dplyr</a> designed specifically for this purpose! <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://t.co/G56fWmIZSq">pic.twitter.com/G56fWmIZSq</a></p>— Nic Crane (@nic_crane) <a href="https://twitter.com/nic_crane/status/1064237554910806016?ref_src=twsrc%5Etfw">18 novembre 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> [![Watch the video](assets/mp4/dplyr_join.mp4)](assets/mp4/dplyr_join.mp4) --- ## <https://www.garrickadenbuie.com/project/tidyexplain/> <img src="assets/img/left-join.gif" width="70%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Easy character manipulation --- # Select rows corresponding to papers with more than 3 authors ```r citations %>% * filter(str_detect(Authors,'et al')) ``` ``` ## # A tibble: 1,280 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecolog… 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecolog… 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecolog… 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecolog… 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecolog… 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecolog… 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecolog… 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecolog… 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecolog… 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecolog… 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,270 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Get column with rows corresponding to papers with more than 3 authors ```r citations %>% * filter(str_detect(Authors,'et al')) %>% * select(Authors) ``` ``` ## # A tibble: 1,280 x 1 ## Authors ## <chr> ## 1 Morin et al ## 2 Jucker et al ## 3 Calcagno et al ## 4 Segre et al ## 5 Kaufman et al ## 6 Nasto et al ## 7 Tschirren et al ## 8 Barnechi et al ## 9 Pinto-Sanchez et al ## 10 Clough et al ## # … with 1,270 more rows ``` --- # Select rows corresponding to papers with less than 3 authors ```r citations %>% * filter(!str_detect(Authors,'et al')) ``` ``` ## # A tibble: 319 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecolog… 16.7 2014 17 6 Neutle… 2/15/20… 3/17/2… 8 ## 2 Ecolog… 16.7 2014 17 5 Kellne… 2/15/20… 2/20/2… 18 ## 3 Ecolog… 16.7 2014 17 4 Griffi… 2/15/20… 1/16/2… 4 ## 4 Ecolog… 16.7 2014 17 3 Gremer… 2/15/20… 1/17/2… 4 ## 5 Ecolog… 16.7 2014 17 2 Cavier… 2/15/20… 10/17/… 16 ## 6 Ecolog… 16.7 2014 17 2 Haegma… 2/15/20… 12/5/2… 9 ## 7 Ecolog… 16.7 2013 16 12 Kearney 2/15/20… 10/1/2… 13 ## 8 Ecolog… 16.7 2013 16 9 Locey … 2/15/20… 7/15/2… 28 ## 9 Ecolog… 16.7 2013 16 8 Quinte… 2/15/20… 6/26/2… 120 ## 10 Ecolog… 16.7 2013 16 3 Lesser… 2/15/20… 12/22/… 9 ## # … with 309 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter ## # reach` <dbl>, woscitations <dbl> ``` --- # Get column with rows corresponding to papers with less than 3 authors ```r citations %>% * filter(!str_detect(Authors,'et al')) %>% * select(Authors) ``` ``` ## # A tibble: 319 x 1 ## Authors ## <chr> ## 1 Neutle and Thorne ## 2 Kellner and Asner ## 3 Griffin and Willi ## 4 Gremer and Venable ## 5 Cavieres ## 6 Haegman and Loreau ## 7 Kearney ## 8 Locey and White ## 9 Quintero and Weins ## 10 Lesser and Jackson ## # … with 309 more rows ``` --- # Get column with rows corresponding to papers with less than 3 authors ```r citations %>% filter(!str_detect(Authors,'et al')) %>% * pull(Authors) %>% head(10) ``` ``` ## [1] "Neutle and Thorne" "Kellner and Asner" "Griffin and Willi" ## [4] "Gremer and Venable" "Cavieres" "Haegman and Loreau" ## [7] "Kearney" "Locey and White" "Quintero and Weins" ## [10] "Lesser and Jackson" ``` --- # Select rows corresponding to papers with less than 3 authors in journal with IF < 5 ```r citations %>% * filter(!str_detect(Authors,'et al'), impactfactor < 5) ``` ``` ## # A tibble: 77 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Molecu… 4.9 2014 14 6 Gautier 2/27/20… 5/14/2… 2 ## 2 Molecu… 4.9 2014 14 5 Gambel… 2/27/20… 3/7/20… 7 ## 3 Molecu… 4.9 2014 14 4 Kekkon… 2/27/20… 3/10/2… 4 ## 4 Molecu… 4.9 2014 14 3 Bhatta… 2/27/20… 12/8/2… 0 ## 5 Molecu… 4.9 2014 14 1 Christ… 2/28/20… 10/25/… 0 ## 6 Molecu… 4.9 2013 13 4 Villar… 2/28/20… 5/2/20… 0 ## 7 Molecu… 4.9 2013 13 4 Wang 2/28/20… 4/25/2… 0 ## 8 Molecu… 4.9 2012 12 1 Joly 2/28/20… 9/7/20… 3 ## 9 Animal… 3.21 2014 17 6 Plavsic 2/9/2016 4/17/2… 9 ## 10 Animal… 3.21 2014 17 Supp… Knox a… 2/11/20… 11/13/… 1 ## # … with 67 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter ## # reach` <dbl>, woscitations <dbl> ``` --- # Convert words to lowercase ```r citations %>% * mutate(authors_lowercase = str_to_lower(Authors)) %>% select(authors_lowercase) ``` ``` ## # A tibble: 1,599 x 1 ## authors_lowercase ## <chr> ## 1 morin et al ## 2 jucker et al ## 3 calcagno et al ## 4 segre et al ## 5 kaufman et al ## 6 nasto et al ## 7 tschirren et al ## 8 barnechi et al ## 9 pinto-sanchez et al ## 10 clough et al ## # … with 1,589 more rows ``` --- # Remove all spaces in journal names ```r citations %>% * mutate(journal = str_remove_all(journal," ")) %>% select(journal) %>% unique() %>% head(5) ``` ``` ## # A tibble: 5 x 1 ## journal ## <chr> ## 1 EcologyLetters ## 2 GlobalChangeBiology ## 3 GlobalEcologyandBiogeography ## 4 MolecularEcologyResources ## 5 DiversityandDistributions ``` --- # Explore 📦 stringr and regular expressions * Check out the [vignette on stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) for more examples on character manipulation and pattern matching functions. * Check out the [vignette on regular expressions](https://stringr.tidyverse.org/articles/regular-expressions.html) which are a concise and flexible tool for describing patterns in strings. --- class: inverse, center, middle # Basic exploratory data analysis --- # Count ```r citations %>% count(journal, sort = TRUE) ``` ``` ## # A tibble: 20 x 2 ## journal n ## <fct> <int> ## 1 New Phytologist 144 ## 2 Ecology 108 ## 3 Evolution 108 ## 4 Global Change Biology 108 ## 5 Global Ecology and Biogeography 108 ## 6 Journal of Biogeography 108 ## 7 Ecology Letters 106 ## 8 Diversity and Distributions 105 ## 9 Animal Conservation 102 ## 10 Methods in Ecology and Evolution 90 ## 11 Evolutionary Applications 74 ## 12 Functional Ecology 54 ## 13 Journal of Animal Ecology 54 ## 14 Journal of Applied Ecology 54 ## 15 Limnology and Oceanography 54 ## 16 Molecular Ecology Resources 54 ## 17 Conservation Letters 53 ## 18 Ecological Applications 48 ## 19 Fish and Fisheries 36 ## 20 Mammal Review 31 ``` --- # Count ```r citations %>% count(journal, pubyear) %>% head() ``` ``` ## # A tibble: 6 x 3 ## journal pubyear n ## <fct> <dbl> <int> ## 1 Animal Conservation 2012 18 ## 2 Animal Conservation 2013 18 ## 3 Animal Conservation 2014 66 ## 4 Conservation Letters 2012 17 ## 5 Conservation Letters 2013 18 ## 6 Conservation Letters 2014 18 ``` --- # Count sum of tweets per journal ```r citations %>% count(journal, wt = nbtweets, sort = TRUE) ``` ``` ## # A tibble: 20 x 2 ## journal n ## <fct> <dbl> ## 1 Ecology Letters 1538 ## 2 Animal Conservation 1268 ## 3 Journal of Applied Ecology 1012 ## 4 Methods in Ecology and Evolution 699 ## 5 Global Change Biology 613 ## 6 Conservation Letters 542 ## 7 New Phytologist 509 ## 8 Global Ecology and Biogeography 379 ## 9 Ecology 335 ## 10 Evolution 335 ## 11 Journal of Animal Ecology 323 ## 12 Fish and Fisheries 261 ## 13 Evolutionary Applications 238 ## 14 Journal of Biogeography 209 ## 15 Diversity and Distributions 200 ## 16 Mammal Review 166 ## 17 Functional Ecology 155 ## 18 Molecular Ecology Resources 139 ## 19 Ecological Applications 125 ## 20 Limnology and Oceanography 0 ``` --- # Group by variable to calculate stats ```r citations %>% * group_by(journal) %>% * summarise(avg_tweets = mean(nbtweets)) %>% head(10) ``` ``` ## # A tibble: 10 x 2 ## journal avg_tweets ## <fct> <dbl> ## 1 Animal Conservation 12.4 ## 2 Conservation Letters 10.2 ## 3 Diversity and Distributions 1.90 ## 4 Ecological Applications 2.60 ## 5 Ecology 3.10 ## 6 Ecology Letters 14.5 ## 7 Evolution 3.10 ## 8 Evolutionary Applications 3.22 ## 9 Fish and Fisheries 7.25 ## 10 Functional Ecology 2.87 ``` --- # Order stuff ```r citations %>% group_by(journal) %>% summarise(avg_tweets = mean(nbtweets)) %>% * arrange(desc(avg_tweets)) %>% # decreasing order (wo desc for increasing) head(10) ``` ``` ## # A tibble: 10 x 2 ## journal avg_tweets ## <fct> <dbl> ## 1 Journal of Applied Ecology 18.7 ## 2 Ecology Letters 14.5 ## 3 Animal Conservation 12.4 ## 4 Conservation Letters 10.2 ## 5 Methods in Ecology and Evolution 7.77 ## 6 Fish and Fisheries 7.25 ## 7 Journal of Animal Ecology 5.98 ## 8 Global Change Biology 5.68 ## 9 Mammal Review 5.35 ## 10 New Phytologist 3.53 ``` --- # What if we want to work on several columns? <img src="assets/img/dplyr_across.png" width="85%" style="display: block; margin: auto;" /> --- # Compute mean across all numeric columns for each journal ```r citations %>% * group_by(journal) %>% * summarize(across(where(is.numeric), mean)) %>% head() ``` ``` ## `summarise()` ungrouping output (override with `.groups` argument) ``` ``` ## # A tibble: 6 x 8 ## journal impactfactor pubyear Volume nbtweets `Number of user… `Twitter reach` ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Animal… 3.21 2013. 16.5 12.4 9.71 28345. ## 2 Conser… 6.4 2013. 6.02 10.2 8.85 23234. ## 3 Divers… 5.4 2013 19 1.90 1.77 2350. ## 4 Ecolog… 5.06 2013 23 2.60 2.5 5727. ## 5 Ecology 6.16 2013 94 3.10 2.87 6176. ## 6 Ecolog… 16.7 2013. 16.0 14.5 14.0 44748. ## # … with 1 more variable: woscitations <dbl> ``` --- ## <https://github.com/courtiol/Rguides> <img src="assets/img/dplyr_guide_for_one_table_part2.png" width="85%" style="display: block; margin: auto;" /> --- # Tidying tibbles <img src="assets/img/original-dfs-tidy.png" width="70%" style="display: block; margin: auto;" /> --- ## Going from **long** to **wide** format and vice-versa <img src="assets/img/tidyr-longer-wider.gif" width="70%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Visualize --- # Visualization with ggplot2 * The package ggplot2 implements a **g**rammar of **g**raphics * Operates on data.frames or tibbles, not vectors like base R * Explicitly differentiates between the data and its representation <img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" /> --- # The ggplot2 grammar Grammar element | What it is :---------------- | :----------------------------- **Data** | The data frame being plotted **Geometrics** | The geometric shape that will represent the data | (e.g., point, boxplot, histogram) **Aesthetics** | The aesthetics of the geometric object | (e.g., color, size, shape) <img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" /> --- # Scatterplots ```r *citations %>% * ggplot() + aes(x = nbtweets, y = woscitations) + geom_point() ``` * Pass in the data frame as your first argument --- # Scatterplots ```r citations %>% ggplot() + * aes(x = nbtweets, y = woscitations) + geom_point() ``` * Pass in the data frame as your first argument * Aesthetics maps the data onto plot characteristics, here x and y axes --- # Scatterplots ```r citations %>% ggplot() + aes(x = nbtweets, y = woscitations) + * geom_point() ``` * Pass in the data frame as your first argument * Aesthetics maps the data onto plot characteristics, here x and y axes * Display the data geometrically as points --- # Scatterplots ```r citations %>% ggplot() + aes(x = nbtweets, y = woscitations) + geom_point() ``` <img src="assets/chunks/unnamed-chunk-47-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Scatterplots, with colors ```r citations %>% ggplot() + aes(x = nbtweets, y = woscitations) + * geom_point(color = "red") ``` <img src="assets/chunks/unnamed-chunk-48-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Scatterplots, with species-specific colors ```r citations %>% ggplot() + * aes(x = nbtweets, y = woscitations, color = journal) + geom_point() ``` <img src="assets/chunks/unnamed-chunk-49-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> * Placing color inside aesthetic maps it to the data --- # Pick a few journals ```r citations_ecology <- citations %>% mutate(journal = str_to_lower(journal)) %>% # all journals names lowercase filter(journal %in% c('journal of animal ecology','journal of applied ecology','ecology')) # filter citations_ecology ``` ``` ## # A tibble: 216 x 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 ecology 6.16 2014 95 12 Maglia… 3/19/20… 12/1/2… 1 ## 2 ecology 6.16 2014 95 12 Soinen 3/19/20… 12/1/2… 6 ## 3 ecology 6.16 2014 95 12 Graham… 3/19/20… 12/1/2… 1 ## 4 ecology 6.16 2014 95 11 White … 3/19/20… 11/1/2… 9 ## 5 ecology 6.16 2014 95 11 Einars… 3/19/20… 11/1/2… 15 ## 6 ecology 6.16 2014 95 11 Haav a… 3/19/20… 11/1/2… 2 ## 7 ecology 6.16 2014 95 10 Dodds … 3/19/20… 10/1/2… 1 ## 8 ecology 6.16 2014 95 10 Brown … 3/19/20… 10/1/2… 1 ## 9 ecology 6.16 2014 95 10 Wright… 3/19/20… 10/1/2… 0 ## 10 ecology 6.16 2014 95 9 Ramahl… 3/19/20… 9/1/20… 27 ## # … with 206 more rows, and 3 more variables: `Number of users` <dbl>, `Twitter ## # reach` <dbl>, woscitations <dbl> ``` --- # Scatterplots, with species-specific shapes ```r citations_ecology %>% ggplot() + * aes(x = nbtweets, y = woscitations, shape = journal) + geom_point(size=2) ``` <img src="assets/chunks/unnamed-chunk-51-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Scatterplots, lines instead of points ```r citations_ecology %>% ggplot() + aes(x = nbtweets, y = woscitations) + * geom_line() + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-52-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Scatterplots, lines with sorting beforehand ```r citations_ecology %>% * arrange(woscitations) %>% ggplot() + aes(x = nbtweets, y = woscitations) + geom_line() + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-53-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Scatterplots, add points ```r citations_ecology %>% ggplot() + aes(x = nbtweets, y = woscitations) + geom_line() + * geom_point() + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-54-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Scatterplots, add linear trend ```r citations_ecology %>% ggplot() + aes(x = nbtweets, y = woscitations) + geom_point() + * geom_smooth(method = "lm") + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-55-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Scatterplots, add smoother ```r citations_ecology %>% ggplot() + aes(x = nbtweets, y = woscitations) + geom_point() + * geom_smooth() + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-56-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # aes or not aes? * If we are to establish a link between the values of a variable and a graphical feature, ie a mapping, then we need an aes(). * Otherwise, the graphical feature is modified irrespective of the data, then we do not need an aes(). <img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" /> --- # Histograms ```r citations_ecology %>% ggplot() + aes(x = nbtweets) + * geom_histogram() ``` <img src="assets/chunks/unnamed-chunk-58-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Histograms, with colors ```r citations_ecology %>% ggplot() + aes(x = nbtweets) + * geom_histogram(fill = "orange") ``` <img src="assets/chunks/unnamed-chunk-59-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Histograms, with colors ```r citations_ecology %>% ggplot() + aes(x = nbtweets) + * geom_histogram(fill = "orange", color = "brown") ``` <img src="assets/chunks/unnamed-chunk-60-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" /> --- # Histograms, with labels and title ```r citations_ecology %>% ggplot() + aes(x = nbtweets) + geom_histogram(fill = "orange", color = "brown") + * labs(x = "Number of tweets", * y = "Count", * title = "Histogram of the number of tweets") ``` <img src="assets/chunks/unnamed-chunk-61-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Histograms, by species ```r citations_ecology %>% ggplot() + aes(x = nbtweets) + geom_histogram(fill = "orange", color = "brown") + labs(x = "Number of tweets", y = "Count", title = "Histogram of the number of tweets") + * facet_wrap(vars(journal)) ``` <img src="assets/chunks/unnamed-chunk-62-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Boxplots ```r citations_ecology %>% ggplot() + aes(x = "", y = nbtweets) + * geom_boxplot() + scale_y_log10() ``` <img src="assets/chunks/unnamed-chunk-63-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Boxplots with colors ```r citations_ecology %>% ggplot() + aes(x = "", y = nbtweets) + * geom_boxplot(fill = "green") + scale_y_log10() ``` <img src="assets/chunks/unnamed-chunk-64-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Boxplots with colors by species ```r citations_ecology %>% ggplot() + * aes(x = journal, y = nbtweets, fill = journal) + geom_boxplot() + scale_y_log10() ``` <img src="assets/chunks/unnamed-chunk-65-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Get rid of the ticks on x axis ```r citations_ecology %>% ggplot() + aes(x = journal, y = nbtweets, fill = journal) + geom_boxplot() + scale_y_log10() + * theme(axis.text.x = element_blank()) + * labs(x = "") ``` <img src="assets/chunks/unnamed-chunk-66-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Boxplots, user-specified colors by species ```r citations_ecology %>% ggplot() + aes(x = journal, y = nbtweets, fill = journal) + geom_boxplot() + scale_y_log10() + * scale_fill_manual( * values = c("red", "blue", "purple")) + theme(axis.text.x = element_blank()) + labs(x = "") ``` <img src="assets/chunks/unnamed-chunk-67-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Boxplots, change legend settings ```r citations_ecology %>% ggplot() + aes(x = journal, y = nbtweets, fill = journal) + geom_boxplot() + scale_y_log10() + * scale_fill_manual( values = c("red", "blue", "purple"), * name = "Journal name", * labels = c("Ecology", "J Animal Ecology", "J Applied Ecology")) + theme(axis.text.x = element_blank()) + labs(x = "") ``` <img src="assets/chunks/unnamed-chunk-68-1.png" width="270cm" height="270cm" style="display: block; margin: auto;" /> --- # Ugly bar plots ```r citations %>% count(journal) %>% ggplot() + aes(x = journal, y = n) + * geom_col() ``` <img src="assets/chunks/unnamed-chunk-69-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Idem, with flipping ```r citations %>% count(journal) %>% ggplot() + * aes(x = n, y = journal) + geom_col() ``` <img src="assets/chunks/unnamed-chunk-70-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Idem, with factors reordering and flipping ```r citations %>% count(journal) %>% ggplot() + * aes(x = n, y = fct_reorder(journal, n)) + geom_col() ``` <img src="assets/chunks/unnamed-chunk-71-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Further cleaning ```r citations %>% count(journal) %>% ggplot() + aes(x = n, y = fct_reorder(journal, n)) + geom_col() + labs(x = "counts", y = "") ``` <img src="assets/chunks/unnamed-chunk-72-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # More about how to (tidy) work with factors * [Be the boss of your factors](https://stat545.com/block029_factors.html) and * [forcats, forcats, vous avez dit forcats ?](https://thinkr.fr/forcats-forcats-vous-avez-dit-forcats/). --- # Density plots ```r citations_ecology %>% ggplot() + aes(x = nbtweets, fill = journal) + * geom_density() + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-73-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Density plots, control transparency ```r citations_ecology %>% ggplot() + aes(x = nbtweets, fill = journal) + * geom_density(alpha = 0.5) + scale_x_log10() ``` <img src="assets/chunks/unnamed-chunk-74-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" /> --- # Change default background `B & W theme` ```r citations_ecology %>% ggplot() + aes(x = nbtweets, fill = journal) + geom_density(alpha = 0.5) + scale_x_log10() + * theme_bw() ``` <img src="assets/chunks/unnamed-chunk-75-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Change default background theme `classic theme` ```r citations_ecology %>% ggplot() + aes(x = nbtweets, fill = journal) + geom_density(alpha = 0.5) + scale_x_log10() + * theme_classic() ``` <img src="assets/chunks/unnamed-chunk-76-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # Change default background theme `dark theme` ```r citations_ecology %>% ggplot() + aes(x = nbtweets, fill = journal) + geom_density(alpha = 0.5) + scale_x_log10() + * theme_dark() ``` <img src="assets/chunks/unnamed-chunk-77-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" /> --- # More on data visualisation with ggplot2 * [Portfolio](https://www.r-graph-gallery.com/portfolio/ggplot2-package/) of ggplot2 plots * [CĂ©dric Scherer's portfolio](https://cedricscherer.netlify.app/top/dataviz/) of data visualisations * [Top](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html) ggplot2 visualizations * [Interactive](https://dreamrs.github.io/esquisse/) ggplot2 visualizations <img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" /> --- background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true) background-size: 550px background-position: 50% 50% --- # To dive even deeper in the tidyverse * [Learn the tidyverse](https://www.tidyverse.org/learn/): books, workshops and online courses * My selection of books: - [R for Data Science](https://r4ds.had.co.nz/) et [Advanced R](http://adv-r.had.co.nz/) - [Introduction Ă R et au tidyverse](https://juba.github.io/tidyverse/) - [Fundamentals of Data visualization](https://clauswilke.com/dataviz/) - [Data Visualization: A practical introduction](http://socviz.co/) * [Tidy Tuesdays videos](https://www.youtube.com/user/safe4democracy/videos) by D. Robinson chief data scientist at DataCamp * Material of the [2-day workshop Data Science in the tidyverse](https://github.com/cwickham/data-science-in-tidyverse) held at the RStudio 2019 conference * Material of the stat545 course on [Data wrangling, exploration, and analysis with R](https://stat545.com/) at the University of British Columbia * List of best R packages (with their description) on [data import, wrangling and visualization](https://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html) --- # [How to switch from base R to tidyverse?](https://www.significantdigits.org/2017/10/switching-from-base-r-to-tidyverse/) <img src="assets/img/switch_baseR_tidyverse.png" width="800px" style="display: block; margin: auto;" /> --- # The [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/) <img src="assets/img/cheatsheet_dplyr.png" width="600px" style="display: block; margin: auto;" /> --- class: title-slide-final, middle background-size: 55px background-position: 9% 15% # Thanks! ### I created these slides with [xaringan](https://github.com/yihui/xaringan) and [RMarkdown](https://rmarkdown.rstudio.com/) using the [rutgers css](https://github.com/jvcasillas/ru_xaringan) that I slightly modified. ### Credit: I used material from [CĂ©cile Sauder](https://github.com/cecilesauder/RLadiesTidyverse), [Stephanie J. Spielman](http://sjspielman.org/bio5312_fall2017/) and [Julien Barnier](https://juba.github.io/tidyverse/). | | | | :--------------------------------------------------------------------------------------------------------- | :-------------------------------- | |
| **olivier.gimenez@cefe.cnrs.fr** | |
| [**https://oliviergimenez.github.io/**](https://oliviergimenez.github.io/) | |
| [**@oaggimenez**](https://twitter.com/oaggimenez) | |
| [**@oliviergimenez**](https://github.com/oliviergimenez)