Scientific research is all about networking

I read this awesome post (in French) by Baptiste Coulmont, a professor of sociology, who explored the French academic network in sociology. Coulmont used the composition of PhD committees to determine academic links between colleagues. The approach is very appealing because it uses public data available from the website theses.fr. Here, I used Coulmont's R code to produce the French academic network in ecology. This was a nice opportunity to illustrate how to work in the tidyverse and to do some web scraping with the rvest package.

Get the data

Load the packages we need:

library(RCurl)
library(tidyverse)
library(lubridate)
library(scales)
library(hrbrthemes)
library(data.table)
# devtools::install_github("privefl/bigreadr")
library(bigreadr)

We now prepare the URL requests. Around 88,000 PhDs were defended over the period 2015-2018. Because the website returns results in slices of 1,000 per page, we proceed in sequence:

i <- (0:87)*1000 # start indices of the slices: 0, 1000, ..., 87000 (start is 0-based)
URL <-paste0("http://www.theses.fr/?q=&fq=dateSoutenance:[2015-01-01T23:59:59Z%2BTO%2B2018-12-31T23:59:59Z]&checkedfacets=&start=",i,"&sort=none&status=&access=&prevision=&filtrepersonne=&zone1=titreRAs&val1=&op1=AND&zone2=auteurs&val2=&op2=AND&zone3=etabSoutenances&val3=&zone4=dateSoutenance&val4a=&val4b=&type=&lng=&checkedfacets=&format=csv")

Alternatively, the search can be done by hand directly from the theses.fr website. [François-Xavier Coudert](https://www.coudert.name/) also provides the search results for the 2015-2018 period.

We proceed with the requests, and store everything in a csv file:

SERP <- character(length(URL)) # to store the raw csv returned for each slice
for(j in seq_along(URL)){ # loop over the slices
  SERP[j] <- getURL(URL[j]) # download one slice of results
}
writeLines(SERP, "SERP_2.csv") # store everything in a single csv file
rm(SERP,i,j,URL)

We keep only the PhDs in the field (Discipline) of ecology. This is basically the only change I have made to Coulmont’s neat code.

theses <- read.csv("SERP_2.csv",sep=";",quote="",skip=1,stringsAsFactors = F)
#theses %>% 
#  pull(X..Discipline..) %>% 
#  unique()

ecology <- theses %>% filter(grepl("ecologie",X..Discipline..,ignore.case=T)) %>% # keep PhDs with Discipline == ecologie
  filter(X..Date.de.soutenance..!="") %>% # remove PhDs with missing dates of defense
  filter(X..Statut..=="soutenue") # keep only PhDs that have been defended

We now have the ids of all PhDs in ecology defended over the period 2015-2018. We will use these ids to get the composition of all PhD committees, which requires scraping the web page of each PhD. To do so, we use the rvest package (see the excellent posts by Maëlle Salmon for examples).

library(rvest)
identifiants <- ecology$X..Identifiant.de.la.these.. # get PhD ids
reseau_total <- tibble(noms_jury="", # tibble() replaces the deprecated data_frame()
                       liens_jury="",
                       these="",
                       directeurs="",
                       liens_directeurs="")

for (i in seq_along(identifiants)) {
  
  # get info on current PhD
  data_theses_eco <- read_html( paste0("http://www.theses.fr/",identifiants[i]) ) 

  # get name(s) of the PhD supervisor(s)
  directeurs <- bind_cols(
    directeurs = data_theses_eco  %>%
      html_nodes("div .donnees-ombre p") %>%
      .[[1]] %>%
      html_nodes("a") %>%
      html_text()
    ,
    liens_directeurs = data_theses_eco  %>%
      html_nodes("div .donnees-ombre p") %>%
      .[[1]] %>%
      html_nodes("a") %>%
      html_attr(name="href")
  ) %>% mutate(  these = identifiants[i] )
  
  # get names of people on committees
  jury <- bind_cols(
    noms_jury = data_theses_eco %>%
      html_nodes("div .donnees p a") %>%
      html_text()
    ,
    liens_jury = data_theses_eco %>%
      html_nodes("div .donnees p a") %>%
      html_attr(name="href")
  ) %>% mutate( these = identifiants[i] )

  # put everything together
  reseau <- jury %>% left_join(directeurs, by="these")
  reseau_total <- bind_rows(reseau_total, reseau)
}

save(reseau_total, file="reseau_total.RData") # save for the network-building step
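
Note that this loop sends one request per PhD. To be gentle with the theses.fr server, you may want to pause between iterations, e.g. with Sys.sleep(1) at the end of the loop body.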

Build the network

Load the packages we need, and the data we obtained in the previous step:

library(igraph)
library(ggraph)
library(ggrepel)
load('reseau_total.RData')

Coulmont defined a weighted link between two colleagues i and j as follows: 3 if i and j are both supervisors of the same PhD, 2 if i is a supervisor and j a member of the PhD committee, and 1 if both i and j are members of the same PhD committee. Weights accumulate across PhDs, so a pair of colleagues may add up several of them.
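
To make the weighting rule concrete, here is a toy illustration with made-up names (hypothetical, not from the data): a pair who co-supervised one PhD (weight 3) and also met as supervisor and committee member on another (weight 2) ends up with a total link weight of 5.

toy <- tribble(
  ~nom1,    ~nom2,    ~poids,
  "Dupont", "Martin", 3, # co-supervised the same PhD
  "Dupont", "Martin", 2  # supervisor / committee member on another PhD
)
toy %>% group_by(nom1, nom2) %>% summarize(poids = sum(poids)) # poids = 5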

directions_theses <- reseau_total %>% 
  select(these,directeurs) %>% 
  unique() %>% 
  group_by(these) %>% 
  mutate(N=n()) %>%
  filter(N==2) %>% # keep co-supervision w/ 2 supervisors 
  mutate(rang=rank(directeurs)) %>% 
  spread(key=rang,value=directeurs) %>% 
  ungroup() %>% 
  select(nom1=`1`,nom2=`2`) %>% 
  mutate(poids=3)

directions_jury <- reseau_total %>% 
  select(nom1=noms_jury,nom2=directeurs) %>% 
  filter( nom1 != "") %>%
  mutate(poids=2) %>%
  group_by(nom1,nom2) %>% 
  summarize(poids=sum(poids))

jury_jury <- reseau_total %>% 
  select(noms_jury,these) %>% 
  unique() %>% 
  filter(noms_jury!="")

g_j <- graph_from_data_frame(jury_jury, directed=F) # bipartite graph linking people to theses
V(g_j)$type <- V(g_j)$name %in% jury_jury$noms_jury # TRUE for people, FALSE for theses
g_j_1 <- bipartite_projection(g_j, which="true") # project onto the people
jurys <- as_long_data_frame(g_j_1) %>%
  select(nom1=`ver[el[, 1], ]`, nom2=`ver2[el[, 2], ]`, poids=weight) # note: these column names may differ with your igraph version
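
In case the bipartite projection step looks opaque: it turns the person-to-thesis table into person-to-person links, weighted by the number of committees shared. A minimal sketch with made-up names (hypothetical, not from the data):

toy_jury <- tibble(noms_jury = c("A","B","A","B"),
                   these     = c("t1","t1","t2","t2"))
g_toy <- graph_from_data_frame(toy_jury, directed=F)
V(g_toy)$type <- V(g_toy)$name %in% toy_jury$noms_jury
proj <- bipartite_projection(g_toy, which="true")
E(proj)$weight # 2: A and B sat together on two committees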

reseau_petit <- bind_rows(directions_theses,directions_jury,jurys) %>%
  group_by(nom1,nom2) %>% 
  summarize(poids=sum(poids)) # data.frame from which the network will be created

Each node in the network has a size proportional to its eigenvector centrality (we also compute degree, closeness and betweenness along the way). We also determine communities using the walktrap algorithm; each community gets its own color. The width of an edge is proportional to the strength of the link between the two corresponding nodes.

g <- graph_from_data_frame(reseau_petit, directed=F) # create network from data.frame
g <- simplify(g,edge.attr.comb = sum)
V(g)$degres <-  degree(g)
V(g)$label <- gsub("^\\S+\\s+(.+)$","\\1",V(g)$name)
V(g)$communaute <- as.character(cluster_walktrap(g, steps=15)$membership) # determine communities with the walktrap algorithm
V(g)$closeness <- (5*closeness(g))^10 # rescaled closeness (not used in the final plot)
V(g)$btwns <- betweenness(g) # betweenness centrality
V(g)$eigen_centr <- eigen_centrality(g)$vector # eigenvector centrality, used for node size
g <- delete_edges(g, which(E(g)$poids<5) ) # delete edges with weight < 5
V(g)$cluster_number <- clusters(g)$membership # connected component of each node
g <- induced_subgraph(g, V(g)$cluster_number== which( max(clusters(g)$csize) == clusters(g)$csize) ) # keep the largest connected component
E(g)$weight <- 1/E(g)$poids # inverse weight, used by the layout algorithm
V(g)$label <- ifelse(V(g)$degres<20,"",V(g)$label) # only display names of well-connected nodes

Plot the network

We now plot the network. For clarity, we only display the names of colleagues who sat on many PhD committees.

ggraph(g,layout="igraph",algorithm="fr") + 
  geom_edge_link(aes(width=.1*poids), alpha=.1, 
                 end_cap = circle(5, 'mm'), 
                 start_cap = circle(5, 'mm')) +
  geom_node_point(aes(size=eigen_centr), color="white",alpha=1) +
  geom_node_point(aes(color=communaute,size=eigen_centr), alpha=.5) +
  scale_size_area(max_size = 20) +
  geom_node_text(aes(label=label),size=3,repel=T,box.padding = 0.15) +
  labs(title="Réseaux des écologues",
       subtitle="Soutenances de thèses entre 2015 et 2018",
       caption="Sources : theses.fr \n Code par B. Coulmont, modifié par O. Gimenez") +
  theme_graph(foreground = 'white', fg_text_colour = 'white',
              base_family = "Helvetica") +
  theme(legend.position="none",
        text=element_text(size=16,family="Helvetica"),
        plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), units="line"))

# save
ggsave(filename = "ecology_network.pdf",width=30,height = 20)

I played around with the defaults Coulmont used to build and plot the network. It helped me get a better understanding of the network and the links between colleagues working in ecology. Overall, I indeed feel very much connected to my colleagues in Montpellier, Lyon and Grenoble. I should probably go out of my comfort zone and interact even more with my colleagues from La Rochelle, Marseille and Aix-en-Provence 😃

As always, data and code are available from GitHub.
