The new year is around the corner and you might be interested in figuring out what single-cell conferences to attend in 2020. A list of some single-cell conferences in 2020 came across my Twitter feed the other day and I started to peruse it. If you are interested in attending one of them, I thought I would write up a quick blogpost to help make some comparisons between the single-cell conferences.
Load packages
First we will load some packages.
suppressMessages({
library(here)
library(tidyverse)
library(rvest)
library(UpSetR)
library(gender)
})
here()
## [1] "/Users/shicks/Documents/github/websites/website-hicks-source"
Load data
First let’s create a dataframe with the short and long name for each conference and the url. I only considered conferences that had a list of names for the organizing committee and confirmed speakers available as of Dec 2019.
url_wellcome2020 <- "https://coursesandconferences.wellcomegenomecampus.org/our-events/single-cell-biology-2020/"
url_cellsymp <- "http://www.cell-symposia.com/conceptual-single-cells-2020/"
url_keystone <- "https://www.keystonesymposia.org/ks/Online/Events/2020F1/Details.aspx?EventKey=2020F1&Tabs=2#Tabs"
url_grc_scgenomics <- "https://www.grc.org/single-cell-genomics-conference/2020/"
url_grc_sccancerbio <- "https://www.grc.org/single-cell-cancer-biology-conference/2020/"
url_emrg_tech <- "https://www.vibconferences.be/events/emerging-technologies-in-single-cell-research#speakers"
url_confs <- tibble(name_conf = c("wellcome", "cell_symp", "keystone", "grc_scgenomics", "grc_sccancerbio", "emrg_tech"),
name_long = c("Wellcome Genome Campus: Single Cell Biology",
"Cell Symposia: The Conceptual Power of Single Cell Biology", "Keystone Symposia: Single Cell Biology", "Gordon Research Conference: Single-Cell Genomics", "Gordon Research Conference: Single-Cell Cancer Biology", "Emerging Technologies in Single Cell Research"),
url = c(url_wellcome2020, url_cellsymp, url_keystone, url_grc_scgenomics, url_grc_sccancerbio, url_emrg_tech))
url_confs
## # A tibble: 6 x 3
## name_conf name_long url
## <chr> <chr> <chr>
## 1 wellcome Wellcome Genome Campus: Sin… https://coursesandconferences.…
## 2 cell_symp Cell Symposia: The Conceptu… http://www.cell-symposia.com/c…
## 3 keystone Keystone Symposia: Single C… https://www.keystonesymposia.o…
## 4 grc_scgenom… Gordon Research Conference:… https://www.grc.org/single-cel…
## 5 grc_sccance… Gordon Research Conference:… https://www.grc.org/single-cel…
## 6 emrg_tech Emerging Technologies in Si… https://www.vibconferences.be/…
Next, I gathered the names of the organizing committees and speakers for each conference, either by scraping them with rvest or by adding them in by hand (depending on my frustration level with rvest and/or XML/HTML).
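As a rough illustration of the scraping approach, here is a minimal sketch that pulls names out of a page with a CSS selector. The HTML below is made up for the example and is not the real markup of any of the conference websites; only the `.name strong` selector is borrowed from the code used later in this post.

```r
library(rvest)

# Made-up HTML standing in for a speaker page (an assumption, not real markup).
page <- read_html('
<html><body>
  <div class="name"><strong>Ada Example</strong>, Some University</div>
  <div class="name"><strong>Bo Sample</strong>, Another Institute</div>
  <div class="footer"><strong>Not a speaker</strong></div>
</body></html>')

# Keep only <strong> elements nested inside elements with class "name".
speaker_names <- page %>%
  html_nodes(".name strong") %>%
  html_text()

speaker_names
```

The selector is the fragile part: when a site nests names inconsistently, typing the names in by hand can be quicker than tuning the selector.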
Wellcome Genome Campus: Single Cell Biology
wellcome_committee <- c("Ellen Rothenberg", "Sarah Teichmann",
"Fabian Theis", "Itai Yanai")
wellcome_speakers <- c("Kathy Cheah", "Polly Fordyce", "Eileen Furlong",
"Gillian Griffiths", "Guoji Guo", "Muzz Haniffa",
"Joakim Lundeberg", "Samantha Morris", "Mats Nilsson",
"Rahul Satija", "Timm Schroeder", "Fabian Theis",
"Barbara Treutlein", "Ludovic Vallier", "Roser Vento",
"Itai Yanai")
wellcome_all <- unique(c(wellcome_committee, wellcome_speakers))
Cell Symposia: The Conceptual Power of Single Cell Biology
h <- read_html(url_confs[which(url_confs$name_conf == "cell_symp"),]$url)
conf_names <- h %>%
html_nodes(".blue .bold") %>%
html_text()
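# Positions are hard-coded for the scraped page as of Dec 2019.
cell_symp_committee <- conf_names[c(22, 24, 25)]
# Drop the last two characters from the second and third committee entries.
cell_symp_committee[2:3] <- stringr::str_sub(string = cell_symp_committee[2:3], end = -3)
cell_symp_speakers <- conf_names[1:21]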
cell_symp_all <- unique(c(cell_symp_committee, cell_symp_speakers))
Keystone Symposia: Single Cell Biology
keystone_committee <- c("Charles Ansong", "Nikolaus Rajewsky",
"Massimiliano Pagani")
keystone_speakers <- c("Eileen E.M. Furlong", "Barbara Treutlein", "Ido Amit",
"Hans Clevers", "Charles Ansong", "Sarah Teichmann",
"Fabian Theis", "Matthias Mann", "Jeffrey A. Whitsett",
"Nikolaus Rajewsky", "Alexander Schier", "Stefano Piccolo",
"Julia Laskin", "Bernd Bodenmiller", "Evan Macosko",
"Massimiliano Pagani", "Evan W. Newell", "Peter Lichter",
"Alexander van Oudenaarden", "Iannis Aifantis", "Aron Jaffe",
"Sten Linnarsson", "Ana Pombo", "Bosiljka Tasic", "Liqun Luo",
"David Van Valen", "Jörg Vogel", "Angela Ciuffi")
keystone_all <- unique(c(keystone_committee, keystone_speakers))
Gordon Research Conference: Single-Cell Genomics
h <- read_html(url_confs[which(url_confs$name_conf == "grc_scgenomics"),]$url)
conf_names <- h %>%
html_nodes(".name strong") %>%
html_text()
grc_scgenomics_committee <- c("Xiaoliang Sunney Xie", "Stephen R. Quake",
"Xiaowei Zhuang", "Barbara Treutlein")
grc_scgenomics_speakers <- conf_names
grc_scgenomics_all <- unique(c(grc_scgenomics_committee, grc_scgenomics_speakers))
Gordon Research Conference: Single-Cell Cancer Biology
h <- read_html(url_confs[which(url_confs$name_conf == "grc_sccancerbio"),]$url)
conf_names <- h %>%
html_nodes(".name strong") %>%
html_text()
grc_sccancerbio_committee <- c("Kai Tan", "Nicholas Navin", "Mario Suva",
"Sohrab Shah")
grc_sccancerbio_speakers <- conf_names
grc_sccancerbio_all <- unique(c(grc_sccancerbio_committee, grc_sccancerbio_speakers))
Emerging Technologies in Single Cell Research
emrg_tech_committee <- c("Jean-Christophe Marine", "Diether Lambrechts",
"Stein Aerts", "Yvan Saeys", "Martin Guilliams",
"Ana Pombo", "Helen Parkinson", "Amos Tanay",
"Evy Vierstraete")
emrg_tech_speakers <- c("Leeat Keren", "Miao-Ping Chien", "Jop Kind",
"Klass Mulder", "Nitzan Mor", "Celine Vallot",
"Nikolaus Rajewsky", "Amos Tanay", "Ana Pombo",
"Oana Ursu")
emrg_tech_all <- unique(c(emrg_tech_committee, emrg_tech_speakers))
Find overlaps
Next, I combined all the organizing committee members and speakers into one big list and passed it to an UpSetR plot to find out how many confirmed individuals appear at more than one conference. Note that the default is nsets = 5, where nsets is the number of sets to consider. Here we have 6 conferences, so 6 sets, and I bumped it up to nsets = 6 to show the overlap between all 6 conferences.
data_list <- list(wellcome = wellcome_all, cell_symp = cell_symp_all,
keystone = keystone_all,
grc_scgenomics = grc_scgenomics_all,
grc_sccancerbio = grc_sccancerbio_all,
emrg_tech = emrg_tech_all)
upset(fromList(data_list), nsets = 6, order.by = "freq")
So you can see there are quite a few individuals who are confirmed speakers or on the organizing committee for multiple single-cell conferences in 2020, which confirms my quick scan of the websites. There are at least two individuals who are scheduled to attend three out of the six single-cell conferences!
Let’s do a bit of digging to see who these individuals are. Just think if you miss them at one conference, you might be able to catch them at another later in the year! :)
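To make concrete what an UpSet plot is counting, here is a small base R sketch on a made-up list of three conferences. The names `toy_list`, `conf_a`, `P1` and so on are placeholders for the example. Each person gets a 0/1 membership pattern across the sets, and each bar in the plot corresponds to the number of people sharing one pattern.

```r
# Made-up sets: three conferences and five people (placeholders only).
toy_list <- list(conf_a = c("P1", "P2", "P3"),
                 conf_b = c("P2", "P3", "P4"),
                 conf_c = c("P3", "P5"))

people <- unique(unlist(toy_list))

# One row per person, one 0/1 column per conference.
membership <- sapply(toy_list, function(s) as.integer(people %in% s))
rownames(membership) <- people

# Number of conferences per person; anyone above 1 is an overlap.
n_conf <- rowSums(membership)
n_conf[n_conf > 1]

# Number of people per exact membership pattern (one bar per pattern).
pattern <- apply(membership, 1, paste, collapse = "")
table(pattern)
```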
Data wrangling
Next, I converted the list of names for each conference into a dataframe with two columns: the first is the conference name and the second is the name of the individual.
df <- tibble(name_conf = names(unlist(data_list)),
name_ind = c(unlist(data_list)))
df
## # A tibble: 165 x 2
## name_conf name_ind
## <chr> <chr>
## 1 wellcome1 Ellen Rothenberg
## 2 wellcome2 Sarah Teichmann
## 3 wellcome3 Fabian Theis
## 4 wellcome4 Itai Yanai
## 5 wellcome5 Kathy Cheah
## 6 wellcome6 Polly Fordyce
## 7 wellcome7 Eileen Furlong
## 8 wellcome8 Gillian Griffiths
## 9 wellcome9 Guoji Guo
## 10 wellcome10 Muzz Haniffa
## # … with 155 more rows
I removed the numbers that unlist() appended to the end of each conference name.
df$name_conf <- gsub('[[:digit:]]+', '', df$name_conf)
head(df$name_conf)
## [1] "wellcome" "wellcome" "wellcome" "wellcome" "wellcome" "wellcome"
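Here is a small sketch of where those numbers come from, plus an alternative that avoids them altogether. The list below is made up; the set name `conf2020` is a hypothetical example chosen to show that stripping every digit would also damage a name that legitimately contains digits (not an issue for the six names used in this post).

```r
# Made-up list; "conf2020" is a hypothetical name that contains digits.
toy_list <- list(conf_a = c("P1", "P2"),
                 conf2020 = c("P3", "P4"),
                 solo = "P5")

flat <- unlist(toy_list)
names(flat)  # unlist() numbers the elements of each multi-element set

# Stripping all digits also strips the "2020" that belongs to the name.
stripped <- gsub("[[:digit:]]+", "", names(flat))

# Alternative: repeat each set name once per element, no cleanup needed.
conf <- rep(names(toy_list), lengths(toy_list))
conf
```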
I also had to do some manual text wrangling: I removed the middle initials of a few individuals and changed the names of two others, either to use the individual’s full name or because the last name was written differently across conference websites.
df[match(c("Eileen E.M. Furlong", "Jeffrey A. Whitsett", "Evan W. Newell",
"Stephen R. Quake"), df$name_ind), ]$name_ind <-
c("Eileen Furlong", "Jeffrey Whitsett", "Evan Newell", "Stephen Quake")
df[grep("Muzz Haniffa", df$name_ind),]$name_ind <- "Muzlifah Haniffa"
df[grep("Vento", df$name_ind),]$name_ind <- "Roser Vento-Tormo"
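One thing to keep in mind with this kind of manual recoding is that match() returns only the first position of each value, so it works when a spelling appears in a single row. A lookup table handles repeated spellings as well. The sketch below uses a made-up vector in which one name is deliberately repeated; the `lookup` vector is an illustration, not part of the analysis above.

```r
# Made-up input with one spelling repeated on purpose.
name_ind <- c("Eileen E.M. Furlong", "Muzz Haniffa",
              "Fabian Theis", "Muzz Haniffa")

# match() finds only the first occurrence.
match("Muzz Haniffa", name_ind)

# Named vector: names are the spellings to replace, values the replacements.
lookup <- c("Eileen E.M. Furlong" = "Eileen Furlong",
            "Muzz Haniffa" = "Muzlifah Haniffa")

# Replace every row whose spelling appears in the lookup table.
hit <- name_ind %in% names(lookup)
name_ind[hit] <- unname(lookup[name_ind[hit]])
name_ind
```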
Next, I split the full names of the individuals into first and last names and converted the data frame into a tibble.
df <- cbind(df, plyr::ldply(stringr::str_split(
df$name_ind, pattern = " ", n = 2)))
colnames(df)[match(c("V1", "V2"), colnames(df))] <-
c("name_first", "name_last")
df <- as_tibble(df)
df
## # A tibble: 165 x 4
## name_conf name_ind name_first name_last
## <chr> <chr> <chr> <chr>
## 1 wellcome Ellen Rothenberg Ellen Rothenberg
## 2 wellcome Sarah Teichmann Sarah Teichmann
## 3 wellcome Fabian Theis Fabian Theis
## 4 wellcome Itai Yanai Itai Yanai
## 5 wellcome Kathy Cheah Kathy Cheah
## 6 wellcome Polly Fordyce Polly Fordyce
## 7 wellcome Eileen Furlong Eileen Furlong
## 8 wellcome Gillian Griffiths Gillian Griffiths
## 9 wellcome Guoji Guo Guoji Guo
## 10 wellcome Muzlifah Haniffa Muzlifah Haniffa
## # … with 155 more rows
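Splitting on the first space only is a simplification, so it is worth seeing what it does with multi-part names. This sketch uses stringr::str_split_fixed(), which returns a character matrix directly, on three names taken from the lists above.

```r
library(stringr)

full_names <- c("Alexander van Oudenaarden",
                "Jean-Christophe Marine",
                "Xiaoliang Sunney Xie")

# n = 2 splits at the first space only; everything after it stays together.
parts <- str_split_fixed(full_names, pattern = " ", n = 2)
colnames(parts) <- c("name_first", "name_last")
parts
```

Hyphenated first names stay intact, while any middle name ends up in the name_last column. That is fine here because only name_first is used downstream.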
Exploratory data analysis
Let’s do some exploratory data analysis (EDA).
First, let’s see which individuals are attending multiple single-cell conferences in 2020 as confirmed speakers or organizing committee members.
df %>%
group_by(name_ind) %>%
summarize(tot = n()) %>%
filter(tot > 1) %>%
arrange(desc(tot)) %>%
as.data.frame()
## name_ind tot
## 1 Barbara Treutlein 3
## 2 Bosiljka Tasic 3
## 3 Amos Tanay 2
## 4 Ana Pombo 2
## 5 Bernd Bodenmiller 2
## 6 Celine Vallot 2
## 7 Charles Gawad 2
## 8 Dana Pe'er 2
## 9 Eileen Furlong 2
## 10 Ellen Rothenberg 2
## 11 Fabian Theis 2
## 12 Iannis Aifantis 2
## 13 Ido Amit 2
## 14 Joakim Lundeberg 2
## 15 Nicholas Navin 2
## 16 Nikolaus Rajewsky 2
## 17 Polly Fordyce 2
## 18 Rahul Satija 2
## 19 Roser Vento-Tormo 2
## 20 Samantha Morris 2
## 21 Sarah Teichmann 2
## 22 Xiaowei Zhuang 2
## 23 Zemin Zhang 2
Lots of great speakers on this list!
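If you also want to know which conferences each person is attending, the same grouping can carry that along. This is a sketch on a made-up tibble with placeholder names; the `confs` column is something I am adding for illustration.

```r
library(dplyr)

# Made-up long-format data: one row per (conference, person) pair.
toy <- tibble(
  name_conf = c("conf_a", "conf_b", "conf_c", "conf_a", "conf_b", "conf_c"),
  name_ind  = c("P1", "P1", "P1", "P2", "P2", "P3")
)

multi <- toy %>%
  group_by(name_ind) %>%
  summarize(tot = n(),
            confs = paste(name_conf, collapse = ", ")) %>%
  filter(tot > 1) %>%
  arrange(desc(tot))

multi
```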
I also wanted to get a rough idea of what the gender balance was for each of the conferences.
To do this, I used the gender R package, which infers state-recorded gender categories from first names using historical datasets.
Inferring the gender
Here I’m using the gender() function with method = "genderize", which uses the Genderize.io API. According to the documentation, this is based on “user profiles across major social networks.”
As there is an API limit on Genderize.io, I saved the dataset and load it in directly so I do not accidentally hit my limit each time I knit this R Markdown.
if(!file.exists(here("static", "data", "sc2020_genderize.csv"))){
df_genderize <- gender(unique(df$name_first), method = "genderize")
write_csv(df_genderize, here("static", "data", "sc2020_genderize.csv"))
} else {
df_genderize <- read_csv(here("static", "data", "sc2020_genderize.csv"))
}
## Parsed with column specification:
## cols(
## name = col_character(),
## gender = col_character(),
## proportion_female = col_double(),
## proportion_male = col_double()
## )
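The caching pattern can be wrapped in a small helper. The sketch below is base R only; `slow_lookup()` is a hypothetical stand-in for a rate-limited API call (it just counts characters) and `cached_lookup()` is a name I made up for the example.

```r
# Temporary location for the cache file in this example.
cache_file <- file.path(tempdir(), "toy_lookup_cache.csv")
unlink(cache_file)  # start from a clean slate

n_calls <- 0

# Hypothetical stand-in for an expensive or rate-limited API call.
slow_lookup <- function(x) {
  n_calls <<- n_calls + 1
  data.frame(name = x, n_char = nchar(x), stringsAsFactors = FALSE)
}

# Read the cached result if it exists; otherwise compute and save it.
cached_lookup <- function(x, path) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  res <- slow_lookup(x)
  write.csv(res, path, row.names = FALSE)
  res
}

first  <- cached_lookup(c("Ellen", "Sarah"), cache_file)
second <- cached_lookup(c("Ellen", "Sarah"), cache_file)
n_calls  # the stand-in ran only for the first call
```

One caveat with this simple version: the cache does not notice when the input names change, so the file has to be deleted by hand if the list of names is updated.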
df_genderize
## # A tibble: 129 x 4
## name gender proportion_female proportion_male
## <chr> <chr> <dbl> <dbl>
## 1 Ellen female 0.98 0.02
## 2 Sarah female 0.98 0.02
## 3 Fabian male 0.01 0.99
## 4 Itai male 0.0900 0.91
## 5 Kathy female 0.98 0.02
## 6 Polly female 0.9 0.100
## 7 Eileen female 0.98 0.02
## 8 Gillian female 0.92 0.0800
## 9 Guoji male 0 1
## 10 Muzlifah female 1 0
## # … with 119 more rows
For each first name, we get back the proportion of people with that name who are recorded as female and as male.
Then, I combine our data frame (df) above with the df_genderize data frame using the left_join() function from dplyr.
colnames(df_genderize)[1] <- "name_first"
df <- dplyr::left_join(df, df_genderize, by = "name_first")
df %>%
select(name_conf, name_ind, gender, proportion_female, proportion_male)
## # A tibble: 165 x 5
## name_conf name_ind gender proportion_female proportion_male
## <chr> <chr> <chr> <dbl> <dbl>
## 1 wellcome Ellen Rothenberg female 0.98 0.02
## 2 wellcome Sarah Teichmann female 0.98 0.02
## 3 wellcome Fabian Theis male 0.01 0.99
## 4 wellcome Itai Yanai male 0.0900 0.91
## 5 wellcome Kathy Cheah female 0.98 0.02
## 6 wellcome Polly Fordyce female 0.9 0.100
## 7 wellcome Eileen Furlong female 0.98 0.02
## 8 wellcome Gillian Griffiths female 0.92 0.0800
## 9 wellcome Guoji Guo male 0 1
## 10 wellcome Muzlifah Haniffa female 1 0
## # … with 155 more rows
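A left join keeps every row of the left table, so first names without a prediction are not dropped; they simply get NA in the joined columns. Here is a minimal sketch with two rows, where the `predictions` table deliberately lacks one of the names.

```r
library(dplyr)

people <- tibble(name_ind   = c("Ellen Rothenberg", "Leeat Keren"),
                 name_first = c("Ellen", "Leeat"))

# Toy prediction table that is missing "Leeat" on purpose.
predictions <- tibble(name_first = "Ellen", gender = "female")

joined <- left_join(people, predictions, by = "name_first")

# Rows with no match keep their place and get NA for gender.
joined %>% filter(is.na(gender))
```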
After doing some digging, I noticed two incorrectly predicted gender labels. For example, Xiaowei Zhuang is referred to as “she” on this Wikipedia page, so I modified the prediction here.
df[grep("Xiaowei", df$name_ind),]$gender <- "female"
df[grep("Xiaowei", df$name_ind),]$proportion_female <-
1 - unique(df[grep("Xiaowei", df$name_ind),]$proportion_female)
df[grep("Xiaowei", df$name_ind),]$proportion_male <-
1 - unique(df[grep("Xiaowei", df$name_ind),]$proportion_male)
df[grep("Xiaowei", df$name_ind),]
## # A tibble: 2 x 7
## name_conf name_ind name_first name_last gender proportion_fema…
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 cell_symp Xiaowei… Xiaowei Zhuang female 0.54
## 2 grc_scge… Xiaowei… Xiaowei Zhuang female 0.54
## # … with 1 more variable: proportion_male <dbl>
I also noticed Liqun Luo is referred to as “he” on the Wikipedia page, so I modified the label.
df[grep("Liqun", df$name_ind),]$gender <- "male"
df[grep("Liqun", df$name_ind),]$proportion_female <-
1 - unique(df[grep("Liqun", df$name_ind),]$proportion_female)
df[grep("Liqun", df$name_ind),]$proportion_male <-
1 - unique(df[grep("Liqun", df$name_ind),]$proportion_male)
df[grep("Liqun", df$name_ind),]
## # A tibble: 1 x 7
## name_conf name_ind name_first name_last gender proportion_fema…
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 keystone Liqun L… Liqun Luo male 0.38
## # … with 1 more variable: proportion_male <dbl>
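Since both corrections follow the same steps, they could be folded into a small helper. This is a sketch in base R on made-up rows; `flip_prediction()` is a hypothetical function name, and it assumes the label is either "female" or "male".

```r
# Swap the two proportions and the label for the selected rows.
# `rows` is a logical vector or a vector of row indices.
flip_prediction <- function(d, rows) {
  old_female <- d$proportion_female[rows]
  d$proportion_female[rows] <- d$proportion_male[rows]
  d$proportion_male[rows] <- old_female
  d$gender[rows] <- ifelse(d$gender[rows] == "female", "male", "female")
  d
}

# Made-up data: the first row holds the prediction to be flipped.
toy <- data.frame(name_ind = c("A Example", "B Sample"),
                  gender = c("male", "female"),
                  proportion_female = c(0.46, 0.98),
                  proportion_male = c(0.54, 0.02),
                  stringsAsFactors = FALSE)

fixed <- flip_prediction(toy, grepl("^A ", toy$name_ind))
fixed
```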
There were also some names that had no gender predictions returned.
df[is.na(df$proportion_female),]
## # A tibble: 5 x 7
## name_conf name_ind name_first name_last gender proportion_fema…
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 grc_scge… Chengha… Chenghang Zong <NA> NA
## 2 grc_scge… Fuchou … Fuchou Tang <NA> NA
## 3 grc_scca… Liynat … Liynat Jerby-Ar… <NA> NA
## 4 emrg_tech Leeat K… Leeat Keren <NA> NA
## 5 emrg_tech Miao-Pi… Miao-Ping Chien <NA> NA
## # … with 1 more variable: proportion_male <dbl>
I used some google-fu and my best judgement to guess what each individual’s gender might be. However, it is worth noting that gender is not binary, and I am only performing this part of the analysis to get a rough estimate of the gender balance for each conference as a whole.
df[match(c("Leeat Keren", "Miao-Ping Chien", "Liynat Jerby-Arnon"), df$name_ind),]$gender <- "female"
df[match(c("Chenghang Zong", "Fuchou Tang"), df$name_ind),]$gender <- "male"
Finally, I created a plot to show the gender balance of confirmed speakers and individuals on the organizing committees across the six conferences.
df %>%
left_join(url_confs, by = "name_conf") %>%
group_by(name_long, gender) %>%
summarize(total = n()) %>%
ggplot(aes(x = name_long, y = total, fill = gender)) +
geom_bar(stat="identity", position = "fill") + coord_flip() +
xlab("Conference") +
ylab("Proportion") +
ggtitle("Confirmed speakers and organizers at six single-cell conferences in 2020")
As you can see, there seems to be a difference in the (predicted) gender balance across the six conferences.
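To put numbers next to a plot like this, the proportions can be computed directly. The sketch below uses a made-up tibble with two placeholder conferences, so the values say nothing about the real data.

```r
library(dplyr)

# Made-up data: one row per person, with a conference and a predicted label.
toy <- tibble(
  name_conf = c("conf_a", "conf_a", "conf_a", "conf_b", "conf_b"),
  gender    = c("female", "male", "male", "female", "female")
)

props <- toy %>%
  count(name_conf, gender) %>%   # people per conference and label
  group_by(name_conf) %>%
  mutate(prop = n / sum(n)) %>%  # share within each conference
  ungroup()

props
```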
Anyways, I had fun exploring a bit of the landscape of the single-cell conferences coming up in 2020! Hopefully this was helpful for someone else too. :)
Happy holidays and Happy New Year!