class: center, middle, inverse, title-slide # Introduction to the Tidyverse ## Data manipulation ### Olivier Gimenez ### last updated: 2022-05-16 --- # **Tidyverse** - **Ordocosme** in 🇫🇷 with _Tidy_ for "bien rangĂ©" and _verse_ for "univers" - A collection of R 📦 developed by H. Wickham and others at Rstudio <img src="assets/img/wickham_president.jpg" width="50%" style="display: block; margin: auto;" /> --- # **Tidyverse** * "A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier" (Julien Barnier). * Main characteristics of a tidy dataset: - each variable is a column - each observation is a raw - each value is in a different cell <img src="assets/img/tidydata.png" width="80%" style="display: block; margin: auto;" /> --- # Is this dataset tidy? ```r #> # A tibble: 12 x 4 #> country year type count #> <chr> <int> <chr> <int> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> 4 Afghanistan 2000 population 20595360 #> 5 Brazil 1999 cases 37737 #> 6 Brazil 1999 population 172006362 #> # … with 6 more rows ``` --- # Is this dataset tidy? Nop. ```r #> # A tibble: 12 x 4 #> country year type count #> <chr> <int> <chr> <int> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> 4 Afghanistan 2000 population 20595360 #> 5 Brazil 1999 cases 37737 #> 6 Brazil 1999 population 172006362 #> # … with 6 more rows ``` --- # Is this dataset tidy? ```r #> # A tibble: 6 x 3 #> country year rate #> * <chr> <int> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272 #> 6 China 2000 213766/1280428583 ``` --- # Is this dataset tidy? Nop. ```r #> # A tibble: 6 x 3 #> country year rate #> * <chr> <int> <chr> #> 1 Afghanistan 1999 745/19987071 #> 2 Afghanistan 2000 2666/20595360 #> 3 Brazil 1999 37737/172006362 #> 4 Brazil 2000 80488/174504898 #> 5 China 1999 212258/1272915272 #> 6 China 2000 213766/1280428583 ``` --- # Is this dataset tidy? ```r # Spread across two tibbles # cases #> # A tibble: 3 x 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 # population #> # A tibble: 3 x 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 19987071 20595360 #> 2 Brazil 172006362 174504898 #> 3 China 1272915272 1280428583 ``` --- # Is this dataset tidy? Nop. ```r # Spread across two tibbles # cases #> # A tibble: 3 x 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 745 2666 #> 2 Brazil 37737 80488 #> 3 China 212258 213766 # population #> # A tibble: 3 x 3 #> country `1999` `2000` #> * <chr> <int> <int> #> 1 Afghanistan 19987071 20595360 #> 2 Brazil 172006362 174504898 #> 3 China 1272915272 1280428583 ``` --- # Is this dataset tidy? ```r #> # A tibble: 6 x 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` --- # Is this dataset tidy? Yes! ```r #> # A tibble: 6 x 4 #> country year cases population #> <chr> <int> <int> <int> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` --- # Why is the tidy format useful? * Allows using a consistent format for which powerful tools work * Makes data manipulation pretty natural --- # **Tidyverse** is a collection of R 📦 * `ggplot2` - visualising stuff * `dplyr`, `tidyr` - data manipulation * `purrr` - advanced programming * `readr` - import data * `tibble` - improved data.frame format * `forcats` - working w/ factors * `stringr` - working w/ chain of characters --- # **Tidyverse** is a collection of R 📦 * [`ggplot2` - visualising stuff](https://ggplot2.tidyverse.org/) * [`dplyr`, `tidyr` - data manipulation](https://dplyr.tidyverse.org/) * `purrr` - advanced programming * [`readr` - import data](https://readr.tidyverse.org/) * [`tibble` - improved data.frame format](https://tibble.tidyverse.org/) * [`forcats` - working w/ factors](https://forcats.tidyverse.org/) * [`stringr` - working w/ chain of characters](https://stringr.tidyverse.org/) --- class: middle # Workflow in data science <img src="assets/img/data-science-workflow.png" width="100%" style="display: block; margin: auto;" /> --- class: middle # Workflow in data science, with **Tidyverse** <img src="assets/img/01_tidyverse_data_science.png" width="90%" style="display: block; margin: auto;" /> --- background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true) background-size: 100px background-position: 90% 3% # Load [tidyverse](www.tidyverse.org) 📦 ```r #install.packages("tidyverse") library(tidyverse) ``` --- class: middle ## Case study: # [Using Twitter to predict citation rates of ecological research](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166570) <img src="assets/img/paper_workflow.png" width="85%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Import --- # Import data **readr::read_csv** function: * creates `tibbles` instead of `data.frame` - no names to rows - allows column names with special characters (see next slide) - more clever on screen display than w/ data.frames (see next slide) - [no partial matching on column names](https://stackoverflow.com/questions/58513997/how-to-make-r-stop-accepting-partial-matches-for-column-names) - warning if attempt to access unexisting column * is daaaaaamn fast 🏎 --- # Import data ```r citations_raw <- read_csv('https://raw.githubusercontent.com/oliviergimenez/intro_tidyverse/master/journal.pone.0166570.s001.CSV') citations_raw ``` ``` ## # A tibble: 1,599 Ă— 12 ## `Journal identity` `5-year journal im…` `Year published` Volume Issue Authors ## <chr> <dbl> <dbl> <dbl> <chr> <chr> ## 1 Ecology Letters 16.7 2014 17 12 Morin … ## 2 Ecology Letters 16.7 2014 17 12 Jucker… ## 3 Ecology Letters 16.7 2014 17 12 Calcag… ## 4 Ecology Letters 16.7 2014 17 11 Segre … ## 5 Ecology Letters 16.7 2014 17 11 Kaufma… ## 6 Ecology Letters 16.7 2014 17 10 Nasto … ## 7 Ecology Letters 16.7 2014 17 10 Tschir… ## 8 Ecology Letters 16.7 2014 17 9 Barnec… ## 9 Ecology Letters 16.7 2014 17 9 Pinto-… ## 10 Ecology Letters 16.7 2014 17 9 Clough… ## # … with 1,589 more rows, and 6 more variables: `Collection date` <chr>, ## # `Publication date` <chr>, `Number of tweets` <dbl>, ## # `Number of users` <dbl>, `Twitter reach` <dbl>, ## # `Number of Web of Science citations` <dbl> ``` --- class: inverse, center, middle # Tidy, transform --- # Rename columns ```r citations_temp <- rename(citations_raw, journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') citations_temp ``` ``` ## # A tibble: 1,599 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Clean up column names ```r janitor::clean_names(citations_raw) ``` ``` ## # A tibble: 1,599 Ă— 12 ## journal_identity x5_year_journal_impact_… year_published volume issue authors ## <chr> <dbl> <dbl> <dbl> <chr> <chr> ## 1 Ecology Letters 16.7 2014 17 12 Morin … ## 2 Ecology Letters 16.7 2014 17 12 Jucker… ## 3 Ecology Letters 16.7 2014 17 12 Calcag… ## 4 Ecology Letters 16.7 2014 17 11 Segre … ## 5 Ecology Letters 16.7 2014 17 11 Kaufma… ## 6 Ecology Letters 16.7 2014 17 10 Nasto … ## 7 Ecology Letters 16.7 2014 17 10 Tschir… ## 8 Ecology Letters 16.7 2014 17 9 Barnec… ## 9 Ecology Letters 16.7 2014 17 9 Pinto-… ## 10 Ecology Letters 16.7 2014 17 9 Clough… ## # … with 1,589 more rows, and 6 more variables: collection_date <chr>, ## # publication_date <chr>, number_of_tweets <dbl>, number_of_users <dbl>, ## # twitter_reach <dbl>, number_of_web_of_science_citations <dbl> ``` --- # Create (or modify) columns ```r citations <- mutate(citations_temp, journal = as.factor(journal)) citations ``` ``` ## # A tibble: 1,599 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Create (or modify) columns ```r levels(citations$journal) ``` ``` ## [1] "Animal Conservation" "Conservation Letters" ## [3] "Diversity and Distributions" "Ecological Applications" ## [5] "Ecology" "Ecology Letters" ## [7] "Evolution" "Evolutionary Applications" ## [9] "Fish and Fisheries" "Functional Ecology" ## [11] "Global Change Biology" "Global Ecology and Biogeography" ## [13] "Journal of Animal Ecology" "Journal of Applied Ecology" ## [15] "Journal of Biogeography" "Limnology and Oceanography" ## [17] "Mammal Review" "Methods in Ecology and Evolution" ## [19] "Molecular Ecology Resources" "New Phytologist" ``` --- class: inverse, center, middle # Give your code some air --- # Cleaner code with "pipe" operator `%>%` which reads "and then" ```r citations_raw %>% rename(journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') %>% mutate(journal = as.factor(journal)) ``` ``` ## # A tibble: 1,599 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,589 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Name object ```r *citations <- citations_raw %>% rename(journal = 'Journal identity', impactfactor = '5-year journal impact factor', pubyear = 'Year published', colldate = 'Collection date', pubdate = 'Publication date', nbtweets = 'Number of tweets', woscitations = 'Number of Web of Science citations') %>% mutate(journal = as.factor(journal)) ``` --- # Syntax with pipe * Verb(Subject,Complement) replaced by Subject %>% Verb(Complement) * No need to name unimportant intermediate variables * Clear syntax (readability) <img src="assets/img/logo_pipe.png" width="40%" style="display: block; margin: auto;" /> --- # Base R from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/) ```r white_and_yolk <- crack(egg, add_seasoning) omelette_batter <- beat(white_and_yolk) omelette_with_chives <- cook(omelette_batter,add_chives) ``` <img src="assets/img/piping_successive.jpg" width="500px" style="display: block; margin: auto;" /> --- # Piping from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/) ```r egg %>% crack(add_seasoning) %>% beat() %>% cook(add_chives) -> omelette_with_chives ``` <img src="assets/img/piping_piped.png" width="250px" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Tidy, transform --- # Select columns ```r citations %>% select(journal, impactfactor, nbtweets) ``` ``` ## # A tibble: 1,599 Ă— 3 ## journal impactfactor nbtweets ## <fct> <dbl> <dbl> ## 1 Ecology Letters 16.7 18 ## 2 Ecology Letters 16.7 15 ## 3 Ecology Letters 16.7 5 ## 4 Ecology Letters 16.7 9 ## 5 Ecology Letters 16.7 3 ## 6 Ecology Letters 16.7 27 ## 7 Ecology Letters 16.7 6 ## 8 Ecology Letters 16.7 19 ## 9 Ecology Letters 16.7 26 ## 10 Ecology Letters 16.7 44 ## # … with 1,589 more rows ``` --- # Drop columns ```r citations %>% select(-Volume, -Issue, -Authors) ``` ``` ## # A tibble: 1,599 Ă— 9 ## journal impactfactor pubyear colldate pubdate nbtweets `Number of use…` ## <fct> <dbl> <dbl> <chr> <chr> <dbl> <dbl> ## 1 Ecology Lett… 16.7 2014 2/1/2016 9/16/2… 18 16 ## 2 Ecology Lett… 16.7 2014 2/1/2016 10/13/… 15 12 ## 3 Ecology Lett… 16.7 2014 2/1/2016 10/21/… 5 4 ## 4 Ecology Lett… 16.7 2014 2/1/2016 8/28/2… 9 8 ## 5 Ecology Lett… 16.7 2014 2/1/2016 8/28/2… 3 3 ## 6 Ecology Lett… 16.7 2014 2/2/2016 7/28/2… 27 23 ## 7 Ecology Lett… 16.7 2014 2/2/2016 8/6/20… 6 6 ## 8 Ecology Lett… 16.7 2014 2/2/2016 6/17/2… 19 18 ## 9 Ecology Lett… 16.7 2014 2/2/2016 6/12/2… 26 23 ## 10 Ecology Lett… 16.7 2014 2/2/2016 7/17/2… 44 42 ## # … with 1,589 more rows, and 2 more variables: `Twitter reach` <dbl>, ## # woscitations <dbl> ``` --- # Split a column in several columns ```r citations %>% separate(pubdate,c('month','day','year'),'/') ``` ``` ## # A tibble: 1,599 Ă— 14 ## journal impactfactor pubyear Volume Issue Authors colldate month day year ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 Ecology… 16.7 2014 17 12 Morin … 2/1/2016 9 16 2014 ## 2 Ecology… 16.7 2014 17 12 Jucker… 2/1/2016 10 13 2014 ## 3 Ecology… 16.7 2014 17 12 Calcag… 2/1/2016 10 21 2014 ## 4 Ecology… 16.7 2014 17 11 Segre … 2/1/2016 8 28 2014 ## 5 Ecology… 16.7 2014 17 11 Kaufma… 2/1/2016 8 28 2014 ## 6 Ecology… 16.7 2014 17 10 Nasto … 2/2/2016 7 28 2014 ## 7 Ecology… 16.7 2014 17 10 Tschir… 2/2/2016 8 6 2014 ## 8 Ecology… 16.7 2014 17 9 Barnec… 2/2/2016 6 17 2014 ## 9 Ecology… 16.7 2014 17 9 Pinto-… 2/2/2016 6 12 2014 ## 10 Ecology… 16.7 2014 17 9 Clough… 2/2/2016 7 17 2014 ## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, ## # `Number of users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Transform in Date format... ```r library(lubridate) citations %>% mutate(pubdate = mdy(pubdate), colldate = mdy(colldate)) ``` ``` ## # A tibble: 1,599 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> ## 1 Ecology Lett… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16 ## 2 Ecology Lett… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13 ## 3 Ecology Lett… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21 ## 4 Ecology Lett… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28 ## 5 Ecology Lett… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28 ## 6 Ecology Lett… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28 ## 7 Ecology Lett… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06 ## 8 Ecology Lett… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17 ## 9 Ecology Lett… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12 ## 10 Ecology Lett… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17 ## # … with 1,589 more rows, and 4 more variables: nbtweets <dbl>, ## # `Number of users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl> ``` --- # ...for easy manipulation of dates ```r library(lubridate) citations %>% mutate(pubdate = mdy(pubdate), colldate = mdy(colldate), * pubyear2 = year(pubdate)) ``` ``` ## # A tibble: 1,599 Ă— 13 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date> ## 1 Ecology Lett… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16 ## 2 Ecology Lett… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13 ## 3 Ecology Lett… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21 ## 4 Ecology Lett… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28 ## 5 Ecology Lett… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28 ## 6 Ecology Lett… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28 ## 7 Ecology Lett… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06 ## 8 Ecology Lett… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17 ## 9 Ecology Lett… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12 ## 10 Ecology Lett… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17 ## # … with 1,589 more rows, and 5 more variables: nbtweets <dbl>, ## # `Number of users` <dbl>, `Twitter reach` <dbl>, woscitations <dbl>, ## # pubyear2 <dbl> ``` * Check out `?lubridate::lubridate` for more functions --- # How to join tables together? <blockquote class="twitter-tweet" data-lang="fr"><p lang="en" dir="ltr">More <a href="https://twitter.com/hashtag/dplyr?src=hash&ref_src=twsrc%5Etfw">#dplyr</a> 🔧 gifs! It took me a hella long time to wrap my head around the different types of joins when I first started learning them, so here's a few examples with some excellent mini datasets from <a href="https://twitter.com/hashtag/dplyr?src=hash&ref_src=twsrc%5Etfw">#dplyr</a> designed specifically for this purpose! <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://t.co/G56fWmIZSq">pic.twitter.com/G56fWmIZSq</a></p>— Nic Crane (@nic_crane) <a href="https://twitter.com/nic_crane/status/1064237554910806016?ref_src=twsrc%5Etfw">18 novembre 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> [![Watch the video](assets/mp4/dplyr_join.mp4)](assets/mp4/dplyr_join.mp4) --- ## <https://www.garrickadenbuie.com/project/tidyexplain/> <img src="assets/img/left-join.gif" width="70%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Easy character manipulation --- # Select rows corresponding to papers with more than 3 authors ```r citations %>% * filter(str_detect(Authors,'et al')) ``` ``` ## # A tibble: 1,280 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18 ## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15 ## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5 ## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9 ## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3 ## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27 ## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6 ## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19 ## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26 ## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44 ## # … with 1,270 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Get column with rows corresponding to papers with more than 3 authors ```r citations %>% * filter(str_detect(Authors,'et al')) %>% * select(Authors) ``` ``` ## # A tibble: 1,280 Ă— 1 ## Authors ## <chr> ## 1 Morin et al ## 2 Jucker et al ## 3 Calcagno et al ## 4 Segre et al ## 5 Kaufman et al ## 6 Nasto et al ## 7 Tschirren et al ## 8 Barnechi et al ## 9 Pinto-Sanchez et al ## 10 Clough et al ## # … with 1,270 more rows ``` --- # Select rows corresponding to papers with less than 3 authors ```r citations %>% * filter(!str_detect(Authors,'et al')) ``` ``` ## # A tibble: 319 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Ecology … 16.7 2014 17 6 Neutle… 2/15/20… 3/17/2… 8 ## 2 Ecology … 16.7 2014 17 5 Kellne… 2/15/20… 2/20/2… 18 ## 3 Ecology … 16.7 2014 17 4 Griffi… 2/15/20… 1/16/2… 4 ## 4 Ecology … 16.7 2014 17 3 Gremer… 2/15/20… 1/17/2… 4 ## 5 Ecology … 16.7 2014 17 2 Cavier… 2/15/20… 10/17/… 16 ## 6 Ecology … 16.7 2014 17 2 Haegma… 2/15/20… 12/5/2… 9 ## 7 Ecology … 16.7 2013 16 12 Kearney 2/15/20… 10/1/2… 13 ## 8 Ecology … 16.7 2013 16 9 Locey … 2/15/20… 7/15/2… 28 ## 9 Ecology … 16.7 2013 16 8 Quinte… 2/15/20… 6/26/2… 120 ## 10 Ecology … 16.7 2013 16 3 Lesser… 2/15/20… 12/22/… 9 ## # … with 309 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Get column with rows corresponding to papers with less than 3 authors ```r citations %>% * filter(!str_detect(Authors,'et al')) %>% * select(Authors) ``` ``` ## # A tibble: 319 Ă— 1 ## Authors ## <chr> ## 1 Neutle and Thorne ## 2 Kellner and Asner ## 3 Griffin and Willi ## 4 Gremer and Venable ## 5 Cavieres ## 6 Haegman and Loreau ## 7 Kearney ## 8 Locey and White ## 9 Quintero and Weins ## 10 Lesser and Jackson ## # … with 309 more rows ``` --- # Get column with rows corresponding to papers with less than 3 authors ```r citations %>% filter(!str_detect(Authors,'et al')) %>% * pull(Authors) %>% head(10) ``` ``` ## [1] "Neutle and Thorne" "Kellner and Asner" "Griffin and Willi" ## [4] "Gremer and Venable" "Cavieres" "Haegman and Loreau" ## [7] "Kearney" "Locey and White" "Quintero and Weins" ## [10] "Lesser and Jackson" ``` --- # Select rows corresponding to papers with less than 3 authors in journal with IF < 5 ```r citations %>% * filter(!str_detect(Authors,'et al'), impactfactor < 5) ``` ``` ## # A tibble: 77 Ă— 12 ## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets ## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 Molecula… 4.9 2014 14 6 Gautier 2/27/20… 5/14/2… 2 ## 2 Molecula… 4.9 2014 14 5 Gambel… 2/27/20… 3/7/20… 7 ## 3 Molecula… 4.9 2014 14 4 Kekkon… 2/27/20… 3/10/2… 4 ## 4 Molecula… 4.9 2014 14 3 Bhatta… 2/27/20… 12/8/2… 0 ## 5 Molecula… 4.9 2014 14 1 Christ… 2/28/20… 10/25/… 0 ## 6 Molecula… 4.9 2013 13 4 Villar… 2/28/20… 5/2/20… 0 ## 7 Molecula… 4.9 2013 13 4 Wang 2/28/20… 4/25/2… 0 ## 8 Molecula… 4.9 2012 12 1 Joly 2/28/20… 9/7/20… 3 ## 9 Animal C… 3.21 2014 17 6 Plavsic 2/9/2016 4/17/2… 9 ## 10 Animal C… 3.21 2014 17 Supp… Knox a… 2/11/20… 11/13/… 1 ## # … with 67 more rows, and 3 more variables: `Number of users` <dbl>, ## # `Twitter reach` <dbl>, woscitations <dbl> ``` --- # Convert words to lowercase ```r citations %>% * mutate(authors_lowercase = str_to_lower(Authors)) %>% select(authors_lowercase) ``` ``` ## # A tibble: 1,599 Ă— 1 ## authors_lowercase ## <chr> ## 1 morin et al ## 2 jucker et al ## 3 calcagno et al ## 4 segre et al ## 5 kaufman et al ## 6 nasto et al ## 7 tschirren et al ## 8 barnechi et al ## 9 pinto-sanchez et al ## 10 clough et al ## # … with 1,589 more rows ``` --- # Remove all spaces in journal names ```r citations %>% * mutate(journal = str_remove_all(journal," ")) %>% select(journal) %>% unique() %>% head(5) ``` ``` ## # A tibble: 5 Ă— 1 ## journal ## <chr> ## 1 EcologyLetters ## 2 GlobalChangeBiology ## 3 GlobalEcologyandBiogeography ## 4 MolecularEcologyResources ## 5 DiversityandDistributions ``` --- # Explore 📦 stringr and regular expressions * Check out the [vignette on stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) for more examples on character manipulation and pattern matching functions. * Check out the [vignette on regular expressions](https://stringr.tidyverse.org/articles/regular-expressions.html) which are a concise and flexible tool for describing patterns in strings. --- class: inverse, center, middle # Basic exploratory data analysis --- # Count ```r citations %>% count(journal, sort = TRUE) ``` ``` ## # A tibble: 20 Ă— 2 ## journal n ## <fct> <int> ## 1 New Phytologist 144 ## 2 Ecology 108 ## 3 Evolution 108 ## 4 Global Change Biology 108 ## 5 Global Ecology and Biogeography 108 ## 6 Journal of Biogeography 108 ## 7 Ecology Letters 106 ## 8 Diversity and Distributions 105 ## 9 Animal Conservation 102 ## 10 Methods in Ecology and Evolution 90 ## 11 Evolutionary Applications 74 ## 12 Functional Ecology 54 ## 13 Journal of Animal Ecology 54 ## 14 Journal of Applied Ecology 54 ## 15 Limnology and Oceanography 54 ## 16 Molecular Ecology Resources 54 ## 17 Conservation Letters 53 ## 18 Ecological Applications 48 ## 19 Fish and Fisheries 36 ## 20 Mammal Review 31 ``` --- # Count ```r citations %>% count(journal, pubyear) ``` ``` ## # A tibble: 59 Ă— 3 ## journal pubyear n ## <fct> <dbl> <int> ## 1 Animal Conservation 2012 18 ## 2 Animal Conservation 2013 18 ## 3 Animal Conservation 2014 66 ## 4 Conservation Letters 2012 17 ## 5 Conservation Letters 2013 18 ## 6 Conservation Letters 2014 18 ## 7 Diversity and Distributions 2012 36 ## 8 Diversity and Distributions 2013 33 ## 9 Diversity and Distributions 2014 36 ## 10 Ecological Applications 2012 24 ## # … with 49 more rows ``` --- # Count sum of tweets per journal ```r citations %>% count(journal, wt = nbtweets, sort = TRUE) ``` ``` ## # A tibble: 20 Ă— 2 ## journal n ## <fct> <dbl> ## 1 Ecology Letters 1538 ## 2 Animal Conservation 1268 ## 3 Journal of Applied Ecology 1012 ## 4 Methods in Ecology and Evolution 699 ## 5 Global Change Biology 613 ## 6 Conservation Letters 542 ## 7 New Phytologist 509 ## 8 Global Ecology and Biogeography 379 ## 9 Ecology 335 ## 10 Evolution 335 ## 11 Journal of Animal Ecology 323 ## 12 Fish and Fisheries 261 ## 13 Evolutionary Applications 238 ## 14 Journal of Biogeography 209 ## 15 Diversity and Distributions 200 ## 16 Mammal Review 166 ## 17 Functional Ecology 155 ## 18 Molecular Ecology Resources 139 ## 19 Ecological Applications 125 ## 20 Limnology and Oceanography 0 ``` --- # Group by variable to calculate stats ```r citations %>% * group_by(journal) %>% * summarise(avg_tweets = mean(nbtweets)) ``` ``` ## # A tibble: 20 Ă— 2 ## journal avg_tweets ## <fct> <dbl> ## 1 Animal Conservation 12.4 ## 2 Conservation Letters 10.2 ## 3 Diversity and Distributions 1.90 ## 4 Ecological Applications 2.60 ## 5 Ecology 3.10 ## 6 Ecology Letters 14.5 ## 7 Evolution 3.10 ## 8 Evolutionary Applications 3.22 ## 9 Fish and Fisheries 7.25 ## 10 Functional Ecology 2.87 ## 11 Global Change Biology 5.68 ## 12 Global Ecology and Biogeography 3.51 ## 13 Journal of Animal Ecology 5.98 ## 14 Journal of Applied Ecology 18.7 ## 15 Journal of Biogeography 1.94 ## 16 Limnology and Oceanography 0 ## 17 Mammal Review 5.35 ## 18 Methods in Ecology and Evolution 7.77 ## 19 Molecular Ecology Resources 2.57 ## 20 New Phytologist 3.53 ``` --- # Order stuff ```r citations %>% group_by(journal) %>% summarise(avg_tweets = mean(nbtweets)) %>% * arrange(desc(avg_tweets)) # decreasing order (wo desc for increasing) ``` ``` ## # A tibble: 20 Ă— 2 ## journal avg_tweets ## <fct> <dbl> ## 1 Journal of Applied Ecology 18.7 ## 2 Ecology Letters 14.5 ## 3 Animal Conservation 12.4 ## 4 Conservation Letters 10.2 ## 5 Methods in Ecology and Evolution 7.77 ## 6 Fish and Fisheries 7.25 ## 7 Journal of Animal Ecology 5.98 ## 8 Global Change Biology 5.68 ## 9 Mammal Review 5.35 ## 10 New Phytologist 3.53 ## 11 Global Ecology and Biogeography 3.51 ## 12 Evolutionary Applications 3.22 ## 13 Ecology 3.10 ## 14 Evolution 3.10 ## 15 Functional Ecology 2.87 ## 16 Ecological Applications 2.60 ## 17 Molecular Ecology Resources 2.57 ## 18 Journal of Biogeography 1.94 ## 19 Diversity and Distributions 1.90 ## 20 Limnology and Oceanography 0 ``` --- # What if we want to work on several columns? <img src="assets/img/dplyr_across.png" width="85%" style="display: block; margin: auto;" /> --- # Compute mean across all numeric columns for each journal ```r citations %>% * group_by(journal) %>% * summarize(across(where(is.numeric), mean)) ``` ``` ## # A tibble: 20 Ă— 8 ## journal impactfactor pubyear Volume nbtweets `Number of use…` `Twitter reach` ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Animal… 3.21 2013. 16.5 12.4 9.71 28345. ## 2 Conser… 6.4 2013. 6.02 10.2 8.85 23234. ## 3 Divers… 5.4 2013 19 1.90 1.77 2350. ## 4 Ecolog… 5.06 2013 23 2.60 2.5 5727. ## 5 Ecology 6.16 2013 94 3.10 2.87 6176. ## 6 Ecolog… 16.7 2013. 16.0 14.5 14.0 44748. ## 7 Evolut… 5.25 2013 67 3.10 2.93 7762. ## 8 Evolut… 4.6 2013. 6.05 3.22 3.07 13185. ## 9 Fish a… 8.1 2013 14 7.25 6.19 12097. ## 10 Functi… 5.28 2013 27 2.87 2.74 3809. ## 11 Global… 8.7 2013 19 5.68 4.94 9652. ## 12 Global… 7.18 2013 22 3.51 3.15 8995. ## 13 Journa… 5.32 2013. 81.9 5.98 5.59 36112. ## 14 Journa… 5.93 2013 50 18.7 15.8 43839. ## 15 Journa… 4.59 2013 40 1.94 1.86 11632. ## 16 Limnol… 4.4 2013 58 0 0 0 ## 17 Mammal… 4.3 2013. 43.0 5.35 4.39 10485. ## 18 Method… 7.39 2013. 4.2 7.77 7.14 15867. ## 19 Molecu… 4.9 2013 13 2.57 2.24 19844. ## 20 New Ph… 7.8 2013 198. 3.53 3.15 6340. ## # … with 1 more variable: woscitations <dbl> ``` --- ## <https://github.com/courtiol/Rguides> <img src="assets/img/dplyr_guide_for_one_table_part2.png" width="85%" style="display: block; margin: auto;" /> --- # Tidying tibbles <img src="assets/img/original-dfs-tidy.png" width="70%" style="display: block; margin: auto;" /> --- ## Going from **long** to **wide** format and vice-versa <img src="assets/img/tidyr-longer-wider.gif" width="70%" style="display: block; margin: auto;" /> --- # To dive deeper in data manipulation with the tidyverse * [Learn the tidyverse](https://www.tidyverse.org/learn/): books, workshops and online courses * My selection of books: - [R for Data Science](https://r4ds.had.co.nz/) et [Advanced R](http://adv-r.had.co.nz/) - [Introduction Ă R et au tidyverse](https://juba.github.io/tidyverse/) * [Tidy Tuesdays videos](https://www.youtube.com/user/safe4democracy/videos) by D. Robinson * Material of the [2-day workshop Data Science in the tidyverse](https://github.com/cwickham/data-science-in-tidyverse) held at the RStudio 2019 conference * Material of the stat545 course on [Data wrangling, exploration, and analysis with R](https://stat545.com/) at the University of British Columbia * List of best R packages (with their description) on [data import, wrangling and visualization](https://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html) --- # [How to switch from base R to tidyverse?](https://www.significantdigits.org/2017/10/switching-from-base-r-to-tidyverse/) <img src="assets/img/switch_baseR_tidyverse.png" width="800px" style="display: block; margin: auto;" /> --- # The [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/) <img src="assets/img/cheatsheet_dplyr.png" width="600px" style="display: block; margin: auto;" />