Adventures in bibliometrics – the pitfalls of Google Scholar Citations

How do you measure the impact of research? It’s a huge question, and lots of people have tried to come up with answers, but it’s close to unanswerable. Believe me, I tried!

One of the core issues is how you define impact. Qualitative methods are extremely labor-intensive, and nobody has the time or funding for that. So we settle for quantitative measures, which is why impact factors are currency, even though they are problematic as hell.

Citations are crude: they only indicate how often a work has been cited. Not why it was cited, or whether that citation means anything, just that it was cited. For applied research, like transportation, this misses out on work that ends up in policy and practice. What is more impactful: a journal article that has been cited 150 times by other academic articles, or a report that was the basis for legislation that was adopted? Bibliometrics would tell you the former, because there aren’t really systems to capture the latter.

So with that background, I recently tried a very quick and dirty impact report of the work from my institute. I collected citations from our researchers into Zotero (with Publish or Perish as much as possible), and then tried to get a sense of the collected citation counts; this was about 2,500 publications from 2020 to 2022. The Zotero plugin for citation counts was able to add that info from CrossRef to the records, but that doesn’t paint a complete picture, and it leads to the question: what are CrossRef citations? (You know, they’re citations in CrossRef…) Google Scholar citations are better because they pull from a larger data source, but Google Scholar is also a black box, and it’s difficult to know what you’re really getting.

I used the Zotero Google Scholar Citation Count plugin to get the data. It took some fussing to throttle the plugin so Google Scholar wouldn’t think I was a bot and lock me out, but it worked! And then, when I crunched the data, I realized it was off. The most egregious example was a report that, according to the plugin, had 0010763 citations, but when we looked on Google Scholar proper it had 0. I raised the issue on GitHub, and the developer pointed out that the Google Scholar API isn’t just not great, it’s non-existent. (Citation counts are scraped from search results.) So we didn’t use Google Scholar citations for that project.
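Since there is no real API, a tool like this has to scrape: fetch a search-results page, pick the “Cited by N” number out of the HTML, and pace its requests so Google Scholar doesn’t flag it as a bot. A minimal sketch of that idea in Python (the function names, regex, and delays here are my own illustration, not the plugin’s actual code):

```python
import random
import re
import time

# Pattern for the "Cited by N" link text that appears in a search-results page.
CITED_BY = re.compile(r"Cited by (\d+)")

def parse_cited_by(html: str) -> int:
    """Return the first 'Cited by N' count found in a results page, or 0 if none."""
    match = CITED_BY.search(html)
    return int(match.group(1)) if match else 0

def throttled_counts(pages, min_delay=20.0, jitter=20.0):
    """Yield one citation count per fetched page, sleeping a randomized
    20-40 seconds between pages so the scraper looks less like a bot."""
    for html in pages:
        yield parse_cited_by(html)
        time.sleep(min_delay + random.random() * jitter)
```

The fragility is obvious: if the first search result isn’t the right publication, or the page layout changes, you get a wrong count with no warning at all.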

Then I decided to run the Google Scholar Citation Count plugin on a much smaller corpus: my institute’s 323 publications. I had a student worker validate the plugin’s results, and well… it wasn’t great. The counts pulled by the plugin differed from what Google Scholar itself showed for 189 of the 323 publications. More than half of the results were inaccurate, and in most cases the discrepancy was very large. You can see the data here.
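For the record, the arithmetic behind “more than half”:

```python
total = 323       # publications checked by hand
mismatched = 189  # plugin count differed from Google Scholar proper
print(f"{mismatched / total:.1%} of the counts were wrong")
# prints "58.5% of the counts were wrong"
```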

So what did this little exercise teach me? That scraping Google Scholar is an unmitigated mess, and that I should treat anything I get from it with suspicion. As long as the Zotero plugins rely on searching and scraping to collect data, I don’t think I can really trust the results. I wish the plugin worked better, but with such a high error rate I can’t trust it even for my crude bibliometric work. I thought this was interesting enough to share.

This post has been edited to correct my inaccuracies and misunderstandings around the non-existent Google Scholar API. Big thanks to Sebastian Karcher for pointing out my errors on Mastodon.




