Overview

I needed some information on all my publications for “bean counting” purposes related to preparing my promotion materials. In the past, I also needed similar information for NSF grant applications.

Instead of doing things by hand, there are nicer/faster ways using R. The following shows a few things one can do with the scholar package. I describe an alternative approach using the bibliometrix package in part 2 of this post.

The RMarkdown file to run this analysis is here.

Notes

• As of this writing, the scholar R package seems semi-dormant and not under active development. If Google changes their API for Scholar and the package isn’t updated, the below code might stop working.

• A problem I keep encountering with Google Scholar is that it starts blocking requests, even after what I consider are not that many attempts to retrieve data. I notice that when I try to pull references from Google Scholar using JabRef and also with the code below. If that happens to you, try a different computer, or clear cookies. This is a well known problem and if you search online, you find others complaining about it. I haven’t found a great solution yet, other than not using the Google Scholar data. I describe such an approach in part 2 of this post. However, some analyses are only able with Google Scholar information.

• To minimize chances of getting locked out by Google, I wrote the code below such that it only sends a request if there isn’t a local file already containing that data. To refresh data, delete the local files.

Required packages

library(scholar)
library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)


Get all citations for an individual

First, I’m using Google Scholar to get all citations for a specific author (in this case, myself).

#Define the person to analyze
scholar_id="bruHK0YAAAAJ"
#either load existing file of publications
#or get a new one from Google Scholar
#delete the file to force an update
if (file.exists('citations.Rds'))
{
} else {
#get citations
cites <- scholar::get_citation_history(scholar_id)
saveRDS(cites,'citations.Rds')
}


Compare citations for different time periods

For my purpose, I want to compare citations between 2 time periods (my Assistant Professor time and my Associate Professor time). I’m splitting them into 2. I’m doing this analysis at the beginning of 2020 and want only full years. The code snippets below give me what I need, two time periods 2009-2014 and 2014-2019.

period_1_start = 2009
period_2_start = 2015
cites_1 <- cites %>% dplyr::filter((year>=period_1_start & year<period_2_start ))
#remove last year since it's not a full year
cites_2 <- cites %>% dplyr::filter((year>=period_2_start & year<2020 ))


Fitting a linear model to both time segments to look at increase in citations over both periods.

fit1=lm(cites ~ year, data = cites_1)
fit2=lm(cites ~ year, data = cites_2)
inc1 = fit1$coefficients["year"] inc2 = fit2$coefficients["year"]
print(sprintf('Annual increase for periods 1 and 2 are %f, %f',inc1,inc2))

## [1] "Annual increase for periods 1 and 2 are 22.257143, 43.100000"


Making a figure to show citation count increases

# combine data above into single data frame
#add a variable to indicate period 1 and period 2
cites_1$group = "1" cites_2$group = "2"
cites_df = rbind(cites_1,cites_2)
xlabel = cites_df$year[seq(1,nrow(cites_df),by=2)] #make the plot and show linear fit lines p1 <- ggplot(data = cites_df, aes(year, cites, colour=group, shape=group)) + geom_point(size = I(4)) + geom_smooth(method="lm",aes(group = group), se = F, size=1.5) + scale_x_continuous(name = "Year", breaks = xlabel, labels = xlabel) + scale_y_continuous("Citations according to Google Scholar") + theme_bw(base_size=14) + theme(legend.position="none") + geom_text(aes(NULL,NULL),x=2010.8,y=150,label="Average annual \n increase 22%",color="black",size=5.5) + geom_text(aes(NULL,NULL),x=2017,y=150,label="Average annual \n increase 43%",color="black",size=5.5) #open a new graphics window #note that this is Windows specific. Use quartz() for MacOS ww=5; wh=5; windows(width=ww, height=wh) print(p1)  ## geom_smooth() using formula 'y ~ x'  dev.print(device=png,width=ww,height=wh,units="in",res=600,file="citations.png")  ## png ## 2  Getting list of publications Above I got citations, but not publications. This function retrieves all publications for a specific author and returns it as a data frame. #get all pubs for an author (or multiple) if (file.exists('publications.Rds')) { publications <- readRDS('publications.Rds') } else { #get citations publications <- scholar::get_publications(scholar_id) saveRDS(publications,'publications.Rds') }  Quick peek at publications glimpse(publications)  ## Rows: 90 ## Columns: 8 ##$ title   <fct> "Severe outcomes are associated with genogroup 2 genotype 4 no~
## $author <fct> "R Desai, CD Hembree, A Handel, JE Matthews, BW Dickey, S McDo~ ##$ journal <fct> "Clinical infectious diseases", "BMC public health", "Journal ~
## $number <fct> "55 (2), 189-193", "11 (S1), S7", "7 (42), 35-47", "3 (12)", "~ ##$ cites   <dbl> 163, 158, 129, 124, 123, 115, 105, 89, 71, 71, 55, 53, 52, 49,~
## $year <dbl> 2012, 2011, 2010, 2007, 2006, 2012, 2006, 2017, 2016, 2008, 20~ ##$ cid     <fct> 1979732925283755485, 10982184786304722425, 1038596204985444772~
## $pubid <fct> 5nxA0vEk-isC, _FxGoFyzp5QC, 9yKSN-GCB0IC, d1gkVwhDpl0C, u5HHmV~  This shows the variables obtained in the data frame. One thing I notice is that this contains more entries than I have peer-reviewed publications. Since most people’s Google Scholar profile (including my own) list items beyond peer-reviewed journal articles, one likely needs to do some manual cleaning before analysis. That is not ideal. I’ll do/show a few more possible analyses, but decided to do the analyses below using the approach in part 2. Making a table of journals and impact factors The scholar package has a function that allows one to get impact factors for journals. This data doesn’t actually come from Google Scholar, instead the package comes with an internal spreadsheet/table with impact factors. Looking a bit into the scholar package indicates that the data was taken from some spreadsheet posted on ResearchGate (probably not fully legal). Either way, let’s give it a try. #here I only want publications since 2015 pub_reduced <- publications %>% dplyr::filter(year>2014) ifdata <- scholar::get_impactfactor(pub_reduced$journal)

## The impact factor data is out-of-date and we may remove this function in future release.

#Google SCholar collects all kinds of 'publications'
#including items other than standard peer-reviewed papers
#this sorts and removes some non-journal entries
iftable <- ifdata %>% dplyr::arrange(desc(ImpactFactor) ) %>% tidyr::drop_na()
knitr::kable(iftable)

Journal Cites ImpactFactor Eigenfactor
CA-A CANCER JOURNAL FOR CLINICIANS 28839 244.585 0.06603
NATURE REVIEWS IMMUNOLOGY 39215 41.982 0.08536
NATURE 710766 41.577 1.35581
AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE 61024 15.239 0.08683
Nature Communications 178348 12.353 0.92656
PLOS BIOLOGY 28750 9.163 0.05868
eLife 25592 7.616 0.19592
eLife 25592 7.616 0.19592
PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES 41872 5.666 0.07129
Frontiers in Immunology 16999 5.511 0.06747
RHEUMATOLOGY 18744 5.245 0.03381
RHEUMATOLOGY 18744 5.245 0.03381
PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES 51704 4.847 0.08756
PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES 51704 4.847 0.08756
PLoS Computational Biology 23758 3.955 0.08279
PLoS Computational Biology 23758 3.955 0.08279
SOCIOLOGICAL METHODS & RESEARCH 4177 3.625 0.00462
Epidemics 576 3.364 0.00241
PATHOLOGY 2638 3.068 0.00446
PLoS One 582877 2.766 1.86235
PLoS One 582877 2.766 1.86235
PLoS One 582877 2.766 1.86235
PLoS One 582877 2.766 1.86235
PLoS One 582877 2.766 1.86235
PLoS One 582877 2.766 1.86235
BMC INFECTIOUS DISEASES 13612 2.620 0.04109
BMC IMMUNOLOGY 1784 2.615 0.00321
NEPHROLOGY 3115 2.178 0.00605
BULLETIN OF MATHEMATICAL BIOLOGY 3736 1.484 0.00447
JAPANESE JOURNAL OF INFECTIOUS DISEASES 1722 1.014 0.00242
COMPUTATIONAL STATISTICS 958 0.828 0.00337
JOURNAL OF APPLIED STATISTICS 2352 0.699 0.00379

OK so this doesn’t quite work. I know for instance that I didn’t publish anything in Cancer Journal for Clinicians and the 2 Rheumatology entries are workshop presentations. Oddly, when I look at publications$journal there is no Cancer Journal listed. Somehow this is a bug created by the get_impactfactor() function. I could fix that by hand. The bigger problem is what to do with all those publications that are not peer-reviewed papers. I could remove them from my Google scholar profile. But I kind of want to keep them there since some of them link to useful stuff. I could alternatively manually clean things at this step. This somewhat defeats the purpose of automation. Getting list of co-authors Another useful piece of information to have, e.g. for NSF grants, is a table with all co-authors. Unfortunately, get_publications() only pulls from the main Google Scholar page, which cuts off the author list. To get all authors, one needs to run through each paper using get_complete_authors(). The problem is that Google cuts off access if one sends too many queries. If you get error messages, it might be that Google blocked you. See the Notes section. allauthors = list() if (file.exists('allauthors.Rds')) { allauthors <- readRDS('allauthors.Rds') } else { for (n in 1:nrow(publications)) { allauthors[[n]] = get_complete_authors(id = scholar_id, pubid = publications[n,]$pubid)
}
saveRDS(allauthors,'allauthors.Rds')
}


Theoretically, if the above code runs without Google blocking things, I should end up with a list of all co-authors which I could then turn into a table. The problem is still that it pulls all entries on my Google Scholar profile, and not just peer-reviewed papers. With a bit of cleaning I could get what I need. But overall I don’t like this approach too much.

Discussion

While the scholar package has some nice features, it has 2 major problems:

• Google blocking the script if it decides too many requests are made (that can happen quickly).
• Since most people’s Google Scholar profile (including my own) list items beyond peer-reviewed journal articles, one likely needs to do some manual cleaning before analysis.

I do keep all my published, peer-reviewed papers in a BibTeX bibliography file in my reference manager (I’m using Zotero and/or Jabref). I know that file is clean and only contains peer-reviewed papers. Unfortunately, the scholar package can’t read in such data. In part 2 of this post series, I’ll use a different R package to produce the journal and author tables I tried making above.

The one feature only available through Google Scholar is the citation record and the analysis I did at the beginning if this post.

Andreas Handel
Professor

Data analysis and modeling with a focus on infectious diseases.