Adventures in text mining part 1: Getting started

Fishing flickr photo by Homini:) shared under a Creative Commons (BY) license

Someday I may sit down and write this all up as a proper research paper I can shop around to open access academic venues, but I thought it would be cool, and possibly helpful to somebody out there, to talk about my research as I’m doing it. I decided it’s time to knuckle down and try different types of citation analysis. I will be testing out different tools and methods to come up with a framework for data collection and analysis. I’m starting in the classic tradition of the data-rich by going on a fishing expedition.

I’ve been collecting publication data for my institute since 2015, when we had an external academic review. The reviewers needed a bibliography of all academic publications from the institute in the last decade, so I brute-forced it in Zotero and came up with a library of a couple thousand citations. Since then I’ve used Zotero to keep track of our publications. I like using Zotero because it generates a fairly rich data set, which has been useful in a number of ways. I know which paper is the most cited of the last decade. I also know about 30% of our publications are through Elsevier, so I need to pay attention to changes in their OA policies. But is there anything else this data can tell me?

This is where I went fishing with the free text analysis platform Voyant. I put in a CSV of all of our publications from FY2015-2016 just to see what I’d get. It was this:

Yeah, I hate word clouds. This is garbage because I didn’t clean the data at all, instead just uploading the raw CSV file. Most of the top words are related to how Zotero organizes things, not the publications themselves. So I went about adding some stop words to mute these results. The default list in Voyant is quite good for text analysis of a literary corpus, which is not how I’m using it. It stands to reason that the stop words would need to be refined to make analysis of this data meaningful. So I nixed words like “storage”, “zotero”, “05”, “http”, “users”, and many other terms that seemed to be more about Zotero and the file systems. Here are the updated results:

This could use some more work, but now research topics are actually visible. Of course transportation is present, and I’m not surprised by “data”, “model”, “traffic”, “systems”, or “time.” Those are all common themes/terms used in our research. The surprise to me was “control” because I didn’t think control theory or control systems accounted for that much of our research, though they are fundamental to some areas of autonomous and connected vehicles. I just never realized we published so much about it. Of course this is probably a reflection of the publication rates of different disciplines, but that’s a different fishing trip.
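
For anyone who would rather do this kind of cleanup before uploading anything, here is a minimal Python sketch of the same idea. It is not what I did in Voyant, just one way it might be approached. It reads a CSV, looks only at the columns that describe the publications, and skips an extra list of stop words. The column names ("Title", "Abstract Note") are assumptions about what the export contains, the `top_terms` function is a made-up helper, and the sample row is invented, so check all of these against your own file.

```python
import csv
import io
import re
from collections import Counter

# The stop words named in the post; extend this set for your own data.
EXTRA_STOP_WORDS = {"storage", "zotero", "05", "http", "users"}

# Assumed column names; check the header row of your own export.
TEXT_COLUMNS = ("Title", "Abstract Note")


def top_terms(csv_file, columns=TEXT_COLUMNS, stop_words=EXTRA_STOP_WORDS, n=10):
    """Return the n most frequent words found in the chosen CSV columns."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for column in columns:
            text = (row.get(column) or "").lower()
            # Keep runs of letters and digits; punctuation is dropped.
            for word in re.findall(r"[a-z0-9]+", text):
                if word not in stop_words:
                    counts[word] += 1
    return counts.most_common(n)


# Tiny invented sample standing in for a real export.
sample = io.StringIO(
    "Title,Abstract Note,File Attachments\n"
    "Traffic control model,Data for traffic systems,"
    "C:\\Users\\me\\Zotero\\storage\\a.pdf\n"
)
print(top_terms(sample))
```

Restricting the count to the descriptive columns keeps file paths out of the results in the first place, and the stop word set catches whatever noise is left. A real run would also want a general English stop word list, which this sketch leaves out.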
