Using R to analyze publications - part 2

Some examples using bibliometrix

Overview

I needed some information on all my publications for “bean counting” purposes related to preparing my promotion materials. In the past, I also needed similar information for NSF grant applications.

Instead of doing things by hand, there are nicer/faster ways using R. In part 1, I did a few things using the scholar package. While some parts worked nicely, I encountered two problems. First, since my Google Scholar record lists items other than peer-reviewed journal articles, they show up in the analysis and need to be cleaned out. Second, Google Scholar doesn’t like automated queries and is quick to block them, at which point things stop working.

To get around these issues, I decided to give a different R package a try, namely bibliometrix. The workflow is somewhat different.

The RMarkdown file to run this analysis is here.

Required packages
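
The code chunk for this section did not survive in the text. Judging from the namespace prefixes and the `%>%` pipe used further down, the post presumably relies on bibliometrix, dplyr, knitr and (later) scholar. That list is an inference, not the original chunk. A minimal sketch that checks which of them are installed and attaches dplyr for the pipe:

```r
# Assumed package list, inferred from the calls used later in this post
pkgs <- c("bibliometrix", "dplyr", "knitr", "scholar")

# TRUE for each package that is installed and can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# dplyr needs to be attached so that the %>% pipe is available
if (installed[["dplyr"]]) library(dplyr)
```

The other packages are called with an explicit `package::` prefix in the code, so they only need to be installed, not attached.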

Loading data

I keep all references to my published papers in a BibTeX file, managed through Zotero/Jabref. I know this file is clean and correct. I’m loading it here for processing. If you don’t have such a file, make one using your favorite reference manager. Or create it through a saved search on a bibliographic database, as described in the bibliometrix vignette.

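#read bib file
#note: readFiles() is no longer needed in newer bibliometrix versions, where
#convert2df() takes the file path directly through its file argument
rawrefs <- readFiles("mypublishedpapers.bib")
#turn file of references into data frame
pubs <- bibliometrix::convert2df(rawrefs, dbsource = "isi", format = "bibtex")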
## 
## Converting your isi collection into a bibliographic dataframe
## 
## Articles extracted   64 
## Done!
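
A quick sanity check is to count the entries in the BibTeX file and compare that number with the "Articles extracted" line. The sketch below uses only base R. The helper name `count_bib_entries` and the demo file are made up for illustration, and the helper assumes that every entry starts on its own line with `@`.

```r
# Hypothetical helper: count BibTeX entries by looking for lines like "@type{"
count_bib_entries <- function(path) {
  lines <- readLines(path, warn = FALSE)
  is_entry <- grepl("^\\s*@[A-Za-z]+\\s*\\{", lines)
  # @comment, @preamble and @string blocks are not references
  is_other <- grepl("^\\s*@(comment|preamble|string)\\s*\\{", lines,
                    ignore.case = TRUE)
  sum(is_entry & !is_other)
}

# Small made-up file to demonstrate the helper
demo_bib <- tempfile(fileext = ".bib")
writeLines(c(
  "@article{demo2015,",
  "  author = {Doe, Jane and Roe, Richard},",
  "  title = {A made-up paper},",
  "  journal = {Journal of Examples},",
  "  year = {2015}",
  "}",
  "",
  "@comment{jabref-meta: databaseType:bibtex;}",
  "",
  "@article{demo2018,",
  "  author = {Doe, Jane},",
  "  title = {Another made-up paper},",
  "  journal = {Journal of Examples},",
  "  year = {2018}",
  "}"
), demo_bib)

n_entries <- count_bib_entries(demo_bib)
n_entries
```

If the count for the real file differs from the number of rows in `pubs`, some entries were probably dropped or merged during conversion and are worth a look.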

Each row of the data frame created by the convert2df function is a publication; the columns contain information for each publication. For a list of what each column variable codes for, see here.

Analyzing 2 time periods

For my purpose, I want to analyze 2 different time periods and compare them. Therefore, I split the data frame containing publications, then run the analysis on each.

#define the start years of the 2 time periods
period_1_start = 2009
period_2_start = 2015
#here I want to separately look at publications in the 2 time periods I defined above
pubs_old <- pubs %>% dplyr::filter((PY>=period_1_start & PY<period_2_start ))
pubs_new <- pubs %>% dplyr::filter(PY>=period_2_start)
res_old <- bibliometrix::biblioAnalysis(pubs_old, sep = ";") #perform analysis
res_new <- bibliometrix::biblioAnalysis(pubs_new, sep = ";") #perform analysis
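
The same split can be illustrated on toy data with base R, which makes the interval boundaries explicit and extends to more than two periods. The data frame below is made up and only mimics the `PY` (publication year) column. Using `cut()` and `split()` here is an alternative sketch, not what the code above does.

```r
# Made-up publication years standing in for the PY column
toy <- data.frame(TI = paste("paper", 1:6),
                  PY = c(2008, 2009, 2014, 2015, 2019, 2020))

period_1_start <- 2009
period_2_start <- 2015

# right = FALSE makes the intervals closed on the left:
# [2009, 2015) is "old" and [2015, Inf) is "new"
toy$period <- cut(toy$PY,
                  breaks = c(period_1_start, period_2_start, Inf),
                  labels = c("old", "new"),
                  right = FALSE)

# Years before period_1_start get NA and are dropped by split()
by_period <- split(toy, toy$period)
sapply(by_period, nrow)
```

Each element of `by_period` is a data frame for one period, so adding a third period only requires one more break and label.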

General information

The summary functions provide a lot of information in a fairly readable format. I apply them here to both time periods so I can compare.

Time period 1

summary(res_old, k = 10)
## 
## 
## Main Information about data
## 
##  Documents                             19 
##  Sources (Journals, Books, etc.)       12 
##  Keywords Plus (ID)                    0 
##  Author's Keywords (DE)                0 
##  Period                                2009 - 2014 
##  Average citations per documents       NaN 
## 
##  Authors                               45 
##  Author Appearances                    80 
##  Authors of single-authored documents  0 
##  Authors of multi-authored documents   45 
##  Single-authored documents             0 
## 
##  Documents per Author                  0.422 
##  Authors per Document                  2.37 
##  Co-Authors per Documents              4.21 
##  Collaboration Index                   2.37 
##  
## 
## Annual Scientific Production
## 
##  Year    Articles
##     2009        5
##     2010        2
##     2011        1
##     2012        3
##     2013        2
##     2014        6
## 
## Annual Percentage Growth Rate 3.713729 
## 
## 
## Most Productive Authors
## 
##    Authors        Articles Authors        Articles Fractionalized
## 1   HANDEL A            19 HANDEL A                          5.55
## 2   ANTIA R              6 ANTIA R                           1.78
## 3   DOHERTY PC           3 LONGINI IM                        1.00
## 4   LA GRUTA NL          3 DOHERTY PC                        0.56
## 5   LONGINI IM           3 LA GRUTA NL                       0.56
## 6   THOMAS PG            3 THOMAS PG                         0.56
## 7   PILYUGIN SS          2 BEAUCHEMIN CAA                    0.50
## 8   ROHANI P             2 LI Y                              0.50
## 9   STALLKNECHT D        2 ROHANI P                          0.50
## 10  TURNER SJ            2 ROZEN DE                          0.50
## 
## 
## Top manuscripts per citations
## 
##              Paper          TC TCperYear
## 1  BEAUCHEMIN CAA, 2011,    NA        NA
## 2  CUKALAC T, 2014,         NA        NA
## 3  DESAI R, 2012,           NA        NA
## 4  FUNG ICH, 2012,          NA        NA
## 5  HANDEL A, 2009,          NA        NA
## 6  HANDEL A, 2009, -a       NA        NA
## 7  HANDEL A, 2009, -a-b     NA        NA
## 8  HANDEL A, 2009, -a-b-c   NA        NA
## 9  HANDEL A, 2009, -a-b-c-d NA        NA
## 10 HANDEL A, 2010,          NA        NA
## 
## 
## Most Relevant Sources
## 
##                                     Sources        Articles
## 1  JOURNAL OF THE ROYAL SOCIETY INTERFACE                 3
## 2  JOURNAL OF THEORETICAL BIOLOGY                         3
## 3  PLOS ONE                                               2
## 4  PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES        2
## 5  THE JOURNAL OF IMMUNOLOGY                              2
## 6  BMC EVOLUTIONARY BIOLOGY                               1
## 7  BMC PUBLIC HEALTH                                      1
## 8  CLINICAL INFECTIOUS DISEASES                           1
## 9  EPIDEMICS                                              1
## 10 INFECTION GENETICS AND EVOLUTION                       1

Time period 2

summary(res_new, k = 10)
## 
## 
## Main Information about data
## 
##  Documents                             33 
##  Sources (Journals, Books, etc.)       24 
##  Keywords Plus (ID)                    0 
##  Author's Keywords (DE)                0 
##  Period                                2015 - 2020 
##  Average citations per documents       NaN 
## 
##  Authors                               211 
##  Author Appearances                    345 
##  Authors of single-authored documents  1 
##  Authors of multi-authored documents   210 
##  Single-authored documents             2 
## 
##  Documents per Author                  0.156 
##  Authors per Document                  6.39 
##  Co-Authors per Documents              10.5 
##  Collaboration Index                   6.77 
##  
## 
## Annual Scientific Production
## 
##  Year    Articles
##     2015        5
##     2016        7
##     2017        3
##     2018        7
##     2019        8
##     2020        3
## 
## Annual Percentage Growth Rate -9.711955 
## 
## 
## Most Productive Authors
## 
##    Authors        Articles Authors        Articles Fractionalized
## 1     HANDEL A          33    HANDEL A                      7.299
## 2     WHALEN CC          9    WHALEN CC                     0.828
## 3     MARTINEZ L         6    ANTIA R                       0.810
## 4     ANTIA R            5    EBELL MH                      0.571
## 5     ZALWANGO S         5    THOMAS PG                     0.571
## 6     LA GRUTA NL        4    MARTINEZ L                    0.561
## 7     SHEN Y             4    LA GRUTA NL                   0.540
## 8     THOMAS PG          4    MCKAY B                       0.533
## 9     KAKAIRE R          3    SHEN Y                        0.523
## 10    KIWANUKA N         3    ZALWANGO S                    0.513
## 
## 
## Top manuscripts per citations
## 
##            Paper          TC TCperYear
## 1  ANTIA A, 2018,         NA        NA
## 2  BIRD NL, 2015,         NA        NA
## 3  CASTELLANOS ME, 2018,  NA        NA
## 4  DALE AP, 2019,         NA        NA
## 5  DEVASIA T, 2015,       NA        NA
## 6  HANDEL A, 2015,        NA        NA
## 7  HANDEL A, 2015, -a     NA        NA
## 8  HANDEL A, 2017,        NA        NA
## 9  HANDEL A, 2018,        NA        NA
## 10 HANDEL A, 2018, -a     NA        NA
## 
## 
## Most Relevant Sources
## 
##                                                Sources        Articles
## 1  PLOS ONE                                                          4
## 2  PLOS COMPUTATIONAL BIOLOGY                                        2
## 3  THE LANCET GLOBAL HEALTH                                          2
## 4  THE LANCET RESPIRATORY MEDICINE                                   2
## 5  AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE        1
## 6  BMC IMMUNOLOGY                                                    1
## 7  BMC INFECTIOUS DISEASES                                           1
## 8  BULLETIN OF MATHEMATICAL BIOLOGY                                  1
## 9  COMPUTATIONAL STATISTICS                                          1
## 10 CURRENT OPINION IN SYSTEMS BIOLOGY                                1
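
The "Annual Percentage Growth Rate" values in both summaries can be reproduced from the yearly counts if one assumes a compound annual growth rate between the first and the last year of each period. That formula is an assumption that happens to match the numbers above. `growth_rate` is a hypothetical helper, not a bibliometrix function.

```r
# Compound annual growth rate in percent, from the first and last yearly counts
# (assumed formula; hypothetical helper, not part of bibliometrix)
growth_rate <- function(counts) {
  n_years <- length(counts)
  ((counts[n_years] / counts[1])^(1 / (n_years - 1)) - 1) * 100
}

articles_old <- c(5, 2, 1, 3, 2, 6)  # 2009-2014, copied from the first summary
articles_new <- c(5, 7, 3, 7, 8, 3)  # 2015-2020, copied from the second summary

growth_rate(articles_old)  # about 3.71
growth_rate(articles_new)  # about -9.71
```

Since only the first and last year enter the calculation, the negative value for the second period reflects the low count in 2020 rather than a decline across the whole period.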

Note that some values are reported as NA, e.g. the citations (which is presumably also why the average citations per document show up as NaN). Depending on which source you got the original data from, that information might be included or not. In my case, it is not.
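
The "Articles Fractionalized" column in the summaries is easier to read with a small example. Assuming it follows the usual definition of fractional counting, each paper gives every one of its authors a credit of 1/(number of authors), and the credits are summed per author. The author strings below are made up and only mimic the `;`-separated author column.

```r
# Made-up author column: one string per paper, authors separated by ";"
au <- c("HANDEL A;ANTIA R",
        "HANDEL A;ANTIA R;ROHANI P;LI Y",
        "HANDEL A")

authors_per_paper <- strsplit(au, ";", fixed = TRUE)

# Each author gets 1 / (number of authors) for every paper they appear on
credit <- unlist(lapply(authors_per_paper, function(a) {
  setNames(rep(1 / length(a), length(a)), a)
}))

# Sum the credits per author and sort from largest to smallest
fractional <- sort(vapply(split(credit, names(credit)), sum, numeric(1)),
                   decreasing = TRUE)
fractional
```

With this definition the fractional counts of all authors add up to the number of papers, which is a handy check.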

Getting a table of co-authors

This can be useful for NSF applications. For reasons nobody understands, that agency still asks for a list of all co-authors. An insane request in the age of modern science. If one wanted to do that, the following gives a table.

Here is the full table of my co-authors in the first period dataset.

#removing the 1st one since that's me
authortable = data.frame(res_old$Authors[-1])
colnames(authortable) = c('Co-author name', 'Number of publications')
knitr::kable(authortable)
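| Co-author name | Number of publications |
|:---|---:|
| ANTIA R | 6 |
| DOHERTY PC | 3 |
| LA GRUTA NL | 3 |
| LONGINI IM | 3 |
| THOMAS PG | 3 |
| PILYUGIN SS | 2 |
| ROHANI P | 2 |
| STALLKNECHT D | 2 |
| TURNER SJ | 2 |
| AKIN V | 1 |
| BEAUCHEMIN CAA | 1 |
| BIRD NL | 1 |
| BROWN J | 1 |
| CHADDERTON J | 1 |
| CUKALAC T | 1 |
| DESAI R | 1 |
| DICKEY BW | 1 |
| FUNG ICH | 1 |
| HALL AJ | 1 |
| HALL D | 1 |
| HEMBREE CD | 1 |
| JACKWOOD MW | 1 |
| KEDZIERSKA K | 1 |
| KJER-NIELSEN LARS KNL | 1 |
| KOTSIMBOS TC | 1 |
| LEBARBENCHON C | 1 |
| LEON JS | 1 |
| LEVIN BR | 1 |
| LI Y | 1 |
| LOPMAN B | 1 |
| MARGOLIS E | 1 |
| MATTHEWS JE | 1 |
| MCDONALD S | 1 |
| MIFSUD NA | 1 |
| MOFFAT JM | 1 |
| NGUYEN THO | 1 |
| PARASHAR UD | 1 |
| PELLICCI DG | 1 |
| ROWNTREE LC | 1 |
| ROZEN DE | 1 |
| WHALEN CC | 1 |
| YATES A | 1 |
| ZARNITSYNA V | 1 |
| ZHENG N | 1 |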
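
If the table needs to go into a spreadsheet or a form, it can be written to a CSV file with base R. The sketch below uses a two-row stand-in for the co-author table and a temporary file, both chosen only for illustration.

```r
# Stand-in for the co-author table (first two rows copied from the table above)
authortable_demo <- data.frame(c("ANTIA R", "DOHERTY PC"), c(6, 3))
colnames(authortable_demo) <- c("Co-author name", "Number of publications")

# Temporary path for the demo; replace it with a real file name
outfile <- tempfile(fileext = ".csv")
write.csv(authortable_demo, outfile, row.names = FALSE)

# Read it back to confirm the round trip;
# check.names = FALSE keeps the spaces in the column headers
roundtrip <- read.csv(outfile, check.names = FALSE)
roundtrip
```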

Since I have many more co-authors in the second period, I’m not printing the full table. Instead, I’m only showing those with whom I have more than 2 joint publications.

#removing the 1st one since that's me
authortable = data.frame(res_new$Authors[-1])
authortable <- authortable %>% dplyr::filter(Freq>2)
colnames(authortable) = c('Co-author name', 'Number of publications')
knitr::kable(authortable)
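| Co-author name | Number of publications |
|:---|---:|
| WHALEN CC | 9 |
| MARTINEZ L | 6 |
| ANTIA R | 5 |
| ZALWANGO S | 5 |
| LA GRUTA NL | 4 |
| SHEN Y | 4 |
| THOMAS PG | 4 |
| KAKAIRE R | 3 |
| KIWANUKA N | 3 |
| MCBRYDE ES | 3 |
| MCKAY B | 3 |
| SUMNER T | 3 |
| TRAUER JM | 3 |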

Making a table of journals

It can be useful to get a list of all journals in which you published. I’m doing this here for the second time period. With just the bibliometrix package, I can get a list of journals and how often I have published in each.

journaltable = data.frame(res_new$Sources)
#knitr::kable(journaltable) #uncomment this to print the table

It might also be nice to get some journal metrics, such as impact factors. While this is possible with the scholar package, the bibliometrix package doesn’t have it.

However, the scholar package doesn’t really get that data from Google Scholar. Instead, it has an internal spreadsheet/table with impact factors (according to the documentation, taken - probably not fully legally - from some spreadsheet posted on ResearchGate). We can thus access the impact factors stored in the scholar package without having to connect to Google Scholar. As long as the journal names stored in the scholar package are close to the ones we have here, we might get matches.

library(scholar)
ifvalues = scholar::get_impactfactor(journaltable[,1], max.distance = 0.1)
journaltable = cbind(journaltable, ifvalues$ImpactFactor)
colnames(journaltable) = c('Journal','Number of Pubs','Impact Factor')
knitr::kable(journaltable)
| Journal | Number of Pubs | Impact Factor |
|:---|---:|---:|
| PLOS ONE | 4 | 2.766 |
| PLOS COMPUTATIONAL BIOLOGY | 2 | 3.955 |
| THE LANCET GLOBAL HEALTH | 2 | NA |
| THE LANCET RESPIRATORY MEDICINE | 2 | 21.466 |
| AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE | 1 | 15.239 |
| BMC IMMUNOLOGY | 1 | 2.615 |
| BMC INFECTIOUS DISEASES | 1 | 2.620 |
| BULLETIN OF MATHEMATICAL BIOLOGY | 1 | 1.484 |
| COMPUTATIONAL STATISTICS | 1 | 0.828 |
| CURRENT OPINION IN SYSTEMS BIOLOGY | 1 | NA |
| ELIFE | 1 | 7.616 |
| EPIDEMICS | 1 | 3.364 |
| EPIDEMIOLOGY AND INFECTION | 1 | 2.044 |
| FRONTIERS IN IMMUNOLOGY | 1 | 5.511 |
| JOURNAL OF APPLIED STATISTICS | 1 | 0.699 |
| NATURE | 1 | 41.577 |
| NATURE COMMUNICATIONS | 1 | 12.353 |
| NATURE REVIEWS IMMUNOLOGY | 1 | 41.982 |
| PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B: BIOLOGICAL SCIENCES | 1 | 5.666 |
| PLOS BIOLOGY | 1 | 9.163 |
| PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES | 1 | 244.585 |
| SOCIOLOGICAL METHODS & RESEARCH | 1 | 3.625 |
| THE JOURNAL OF INFECTIOUS DISEASES | 1 | 5.186 |
| THE JOURNAL OF THE AMERICAN BOARD OF FAMILY MEDICINE | 1 | 2.515 |

OK, that’s not too bad. It couldn’t find The Lancet Global Health, Current Opinion in Systems Biology indeed does not have an impact factor (as of this writing), and the value for PNAS is clearly wrong. The others seem reasonable. But since I don’t know what year those impact factors are from, or whether the rest is fully reliable, I would take this with a grain of salt.
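
Wrong values like the one for PNAS are a general risk of approximate name matching. The sketch below illustrates the idea with base R’s `agrep()` and `adist()` on a made-up lookup table. Whether the scholar package works exactly this way internally is an assumption, and the impact factor numbers are placeholders, not real data.

```r
# Made-up lookup table; the numbers are placeholders, not real impact factors
lookup <- data.frame(
  Journal = c("Nature", "Nature Communications", "PLoS One"),
  ImpactFactor = c(1.1, 2.2, 3.3),
  stringsAsFactors = FALSE
)

query <- "NATURE"

# agrep() does approximate substring matching,
# so a short name can hit several journals at once
hits <- agrep(query, lookup$Journal, ignore.case = TRUE, max.distance = 0.1)
lookup$Journal[hits]

# Keep the candidate whose full name is closest to the query
# (smallest edit distance over the whole string)
d <- adist(query, lookup$Journal[hits], ignore.case = TRUE)
best <- hits[which.min(d)]
lookup[best, ]
```

Printing the matched journal name next to the query, as in the last line, makes it easy to spot entries where the match went to a different journal.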

Discussion

The bibliometrix package doesn’t suffer from the problems that I encountered in part 1 of this post when I tried the scholar package (and Google Scholar). The downside is that I can’t get some of the information, e.g. my annual citations. So it seems there is not (yet) a comprehensive solution, and using both packages seems best.

A larger overall problem is that a lot of this information is controlled by corporations (Google, Elsevier, Clarivate Analytics, etc.), which might or might not allow R packages and individual users (who don’t subscribe to their offerings) to access certain information. As such, R packages accessing this information will need to adjust to whatever the companies allow.

Andreas Handel
Associate Professor

Data Analysis and Modeling with a focus on infectious diseases.