Adventures in text mining part 1: Getting started

Fishing flickr photo by Homini:) shared under a Creative Commons (BY) license

Someday I may sit down and write this all out into a good research paper I can shop around to open access academic venues, but I thought it would be cool and possibly helpful to somebody out there to talk about my research as I’m doing it. I decided it’s time to knuckle down and try different types of citation analysis. I will be testing out different tools and methods to come up with a framework for data collection and analysis. I’m starting in the classic tradition of the data rich by going on a fishing expedition.

I’ve been collecting publication data for my institute since 2015 when we had an external academic review. They needed a bibliography of all academic publications from the institute in the last decade, so I brute forced it in Zotero, coming up with a library of a couple thousand citations. Since then I’ve used Zotero to keep track of our publications. I like using Zotero because it generates a fairly rich data set which has been useful in a number of ways. I know what the most cited paper of the last decade is. I also know about 30% of our publications are through Elsevier, so I need to pay attention to changes in their OA policies. But is there anything else this data can tell me?

This is where I went fishing with the free text analysis platform Voyant. I put in a CSV of all of our publications from FY2015-2016 just to see what I’d get. It was this:

Yeah, I hate word clouds. This is garbage because I didn’t clean the data at all, instead just uploading the raw CSV file. Most of the top words are related to how Zotero organizes things, not the publications themselves. So I went about adding some stop words to mute these results. The default list in Voyant is quite good for text analysis of a literary corpus, which is not how I’m using it. It stands to reason that the stop words would need to be refined to make analysis of this data meaningful. So I nixed words like “storage”, “zotero”, “05”, “http”, “users”, and many other terms that seemed to be more about Zotero and the file systems. Here are the updated results:

This could use some more work, but now research topics are actually visible. Of course transportation is present, and I’m not surprised by “data”, “model”, “traffic”, “systems”, or “time.” Those are all common themes/terms used in our research. The surprise to me was “control” because I didn’t think control theory or control systems accounted for that much of our research, though they are fundamental to some areas of autonomous and connected vehicles. I just never realized we published so much about it. Of course this is probably a reflection of the publication rates of different disciplines, but that’s a different fishing trip.
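For anyone who wants to try the same stop-word cleanup outside Voyant, here is a minimal Python sketch of the idea. It is not how Voyant works internally, and the sample rows, file paths, and the tiny base stop list are made up for illustration. Only the Zotero-related stop words (“storage”, “zotero”, “05”, “http”, “users”) come from the list above.

```python
import re
from collections import Counter

# A tiny illustrative set of generic English stop words.
# Assumption: this is NOT Voyant's default list, just a stand-in.
BASE_STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to", "on"}

# Export noise named in the post: terms about Zotero and file systems.
ZOTERO_STOP_WORDS = {"storage", "zotero", "05", "http", "users"}

# Hypothetical rows standing in for a raw Zotero CSV export.
SAMPLE = (
    "Traffic control model for connected vehicles,"
    "/Users/me/Zotero/storage/AB12/paper.pdf,http://example.org/05\n"
    "A data model of traffic systems,"
    "/Users/me/Zotero/storage/CD34/paper.pdf,http://example.org/05\n"
)


def top_terms(text, stop_words, n=5):
    """Return the n most frequent lowercase tokens not in stop_words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return Counter(kept).most_common(n)


if __name__ == "__main__":
    # Raw counts: file-system and Zotero terms compete with the topics.
    print(top_terms(SAMPLE, set()))
    # Filtered counts: the research terms rise to the top.
    print(top_terms(SAMPLE, BASE_STOP_WORDS | ZOTERO_STOP_WORDS))
```

Even in this toy version some junk survives the filter (file names, URL fragments), so the stop list needs a few rounds of refinement before the output means much.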

Confusion Reigns With Open Access Mandates. Thanks, Elsevier.

flickr photo shared by Ape Lad under a Creative Commons ( BY-NC-ND ) license

One of the most tiring problems with Open Access (OA) policies is that the ground keeps shifting, making it extremely difficult to stay up to date. It’s easy to understand why so many authors (academics and researchers) roll their eyes and see these mandates as an obstacle, because they are, despite the efforts of many OA advocates. Even with the White House’s public access to research memo, the road to implementation has been prolonged. One reason for this is the way many of the big, established academic publishers have “embraced” OA: too often the policies are cryptic, either frustrating or deceiving authors. Take for example this week’s announcement from Elsevier on article sharing. The always reliable Kevin Smith breaks it down:

Two major features of this retreat from openness need to be highlighted.  First, it imposes an embargo of at least one year on all self-archiving of final authors’ manuscripts, and those embargoes can be as long as four years.  Second, when the time finally does roll around when an author can make her own work available through an institutional repository, Elsevier now dictates how that access is to be controlled, mandating the most restrictive form of Creative Commons license, the CC-BY-NC-ND license for all green open access.

Smith also links to Elsevier’s 50-page document listing all of the different embargo periods for its journals. It’s no wonder why people are confused and frustrated.

For perspective, let’s use my local users as an example. We have a UC OA policy and the local UC Berkeley guide. Not all of our funding comes from federal sources, so the OSTP mandate doesn’t cover all of our publications though the UC mandate will (but not grad students, yet). Then you have to look at where our researchers publish and want to publish – for transportation 6 of the top 10 journals ranked by impact factor are published by Elsevier. (I’m currently working on a data set to see how often we’ve published in these journals in the last decade, results forthcoming, but I can say from data collection it’s considerable.) Edited to add: I’ve run some preliminary data. Based on journals ITS researchers have published articles in 3 or more times since 2005, Elsevier accounts for 31% (150 articles), TRB is 27% (128), ACS is 8% (37), IEEE is 7% (32), and ASCE is 6% (27). So Elsevier’s OA policies are something I try to understand despite the confusion, and even I’m frustrated, though I’d say I’m a pretty optimistic OA advocate.
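For the curious, the publisher breakdown above is just counting. Here is a rough Python sketch of that kind of tally, not the actual process I used. The journal names and records are placeholders, and `publisher_shares` is a hypothetical helper; the only thing borrowed from the numbers above is the "3 or more articles per journal" cutoff.

```python
from collections import Counter


def publisher_shares(records, min_articles=3):
    """Tally articles per publisher for journals with enough articles.

    records: iterable of (journal, publisher) pairs, one pair per article.
    Returns {publisher: (article_count, rounded_percent)}, largest first.
    """
    records = list(records)
    # Count articles per journal so infrequent journals can be dropped.
    per_journal = Counter(journal for journal, _ in records)
    kept = [pub for journal, pub in records
            if per_journal[journal] >= min_articles]
    totals = Counter(kept)
    n = sum(totals.values())
    return {pub: (count, round(100 * count / n))
            for pub, count in totals.most_common()}


# Placeholder data (hypothetical journals, not real publication counts).
TOY_RECORDS = (
    [("Journal A", "Elsevier")] * 3
    + [("Journal B", "TRB")] * 4
    + [("Journal C", "IEEE")] * 1  # dropped: fewer than 3 articles
)

if __name__ == "__main__":
    for publisher, (count, pct) in publisher_shares(TOY_RECORDS).items():
        print(f"{publisher}: {pct}% ({count})")
```

The percentages are shares of the articles that survive the cutoff, not of everything in the library, which is worth keeping in mind when reading numbers like these.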

I’m not going to go as far as Smith and suggest it’s time for another boycott because I know my faculty won’t really go for it, but I do think we need to have a conversation about what their choices mean and the cycles of publishing and tenure. It would also be great to have more OA options for them to publish in. The Journal of Transport and Land Use and the Journal of Public Transportation are great, but they only cover limited areas. So hey transportation faculty, if you’re reading this, let’s make a difference. Consider publishing OA and maybe even starting a new journal. I’m here to help.

How can you innovate without research?



Yesterday I went down to Google HQ to see Secretary Foxx hold a fireside chat with Eric Schmidt to discuss DOT’s new 30-year plan Beyond Traffic.  I’ve never been to an event like this before, and it seemed the audience was industry more than the usual transportation wonks (though we were there).  There was a very active back channel on #BeyondTraffic on Twitter, connecting people in the room to those watching it online.

Foxx discussed the funding issues: MAP-21 won’t last much longer, and in the last 6 years Congress has passed 32 short-term measures to extend funding because they can’t actually pass long-term funding.  Yesterday Foxx announced the White House’s ambitious $94.7 billion transportation investment plan.  I’m not holding my breath.  I wouldn’t be surprised if we get back down to the wire in May when the Highway Trust Fund runs out of money. That’s politics. (This is also something I know the general public, like Eric Schmidt, doesn’t know a whole lot about but it’s vital.)

The whole Beyond Traffic blue paper is also politics: bold proclamations, neat infographics, but light on the details. Foxx hit most of the high points that appealed to the Silicon Valley crowd – UAVs and connected/autonomous vehicles, and regulations for them. I did appreciate that Foxx said that promoting a multimodal transportation system is about providing choices for people. He also stressed the importance of land use in transportation, which is hugely important for sustainability. (Which also led to Schmidt extolling the success of Google buses, ignoring their role in perpetuating terrible land-use patterns in the Bay Area.) Bike/ped stuff was largely absent from the discussion.

Also largely absent was talking about research.

Research is inherent to all of these innovations. How do you improve and develop new practices without it? The problem is that funding allocated to research keeps dwindling. Politicians want to fund highways, (maybe) rail, and self-driving cars, but not the research and required research infrastructure to get there. Which is why we have to constantly communicate and advocate for the value of research when it should be self-evident.

So I asked Foxx about this, about funding research and the required data and IT infrastructure to facilitate collaboration across modes. He replied like a true politician, that DOT is “bullish” about research despite funding cuts, and it’s still a priority. Not really an answer but as much as he could give. I mostly asked the question because I wanted to make sure it got on record that people do care about research funding (namely people working for research bodies) and to make sure there were at least some women represented in the question queue. (Two out of ten or so? That’s pretty shabby, but also another blog post.) Judging from the response of many of my colleagues on Twitter, they appreciated having the issue elevated.

Transportation has some unique funding issues, such as the failure and inability to raise the gas tax to sustainable funding levels, but this issue of funding research is happening across disciplines. Money talks, and subjects that can garner private sector investment, such as self-driving cars (hey Uber and CMU!), get funded, but what about topics that aren’t financially lucrative but no less important, such as rural transit? And what about paying for the infrastructure to conduct research, such as data centers and libraries? We have to constantly advocate and push for our cause even though the immediate ROI might not be evident. This new funding model and philosophy is very pragmatic, but also pretty shortsighted. Which is why I’m worried about these long range 30-year plans. Research programs and libraries have helped maintain that long view and memory to make sure we progress effectively and don’t duplicate efforts, but nobody wants to pay for it. I don’t think Beyond Traffic alone is going to change that.

All slow pop songs sound the same? On Fair Use and some such.

Have you heard Sam Smith’s “Stay With Me”? You must have. It’s all over the place. Everybody loves it. (Even if you can’t really dance to it.) Some people pointed out the chorus sounds a lot like another ubiquitous pop song of yesteryear – Tom Petty’s “I Won’t Back Down”. Not sure? Check this out:

Now you might say, “Oh, there’s nothing new under the sun! Chords are chords! Any similarity is incidental!” And you might be right. Smith was born three years after Petty’s song was all over radio, so he has some plausible deniability, maybe.

But this week it emerged that the two settled out of court in October and that Tom Petty and Jeff Lynne would receive song writing credits and royalties. If the song wins a Grammy this year, does this mean Petty and Lynne are included? We’ll see.

These sorts of lawsuits are fairly common, though Robin Thicke’s pre-emptive lawsuit against the estate of Marvin Gaye was a new twist. The remix of the two songs is pretty good, and really with enough drugs is everything a remix? In a somewhat ironic twist of fate, one of the most well known lawsuits of this kind is Bright Tunes Music vs. Harrisongs because George Harrison’s “My Sweet Lord” sounds an awful lot like The Chiffons’ “He’s So Fine”. What do you think? Harrison (as seen in the video) sang and played guitar on Petty’s “I Won’t Back Down”.

Fair Use with music is interesting because of the laws on what can and cannot be covered by copyright and the artistic intent. A counter example I’ve been obsessing over is the song “Respect”. The original 1965 version by Otis Redding is a stomper, driven by Al Jackson’s drumming and peppered with the Memphis Horns. It’s so clearly a Stax song. Of course the 1967 cover by Aretha Franklin is iconic – an anthem for women all over. It’s also a very different song, so different I think it’s foolish to compare the two. (I tried for most of 2014.) The words and the basic melodies are the same, but the arrangements made them stand apart. Check out the Ike & Tina Turner cover which is a pretty perfect combination of the two.

Lots of covers differ so much from the original that they could be considered completely different songs, but as long as the melody or the lyrics are used, they need permission. The words are obvious – as evident from “Respect” even though the music doesn’t fully match up. In the Smith/Petty case, it’s a little bit more uncertain but there’s enough there. What’s this mean for librarians? Nothing new really, but it’s interesting when the lines are less blurred. (Pun intended.) Of course, there could be a whole follow up on sampling.