I am very pleased to be attending the Workshop on Application Programming Interfaces for the Digital Humanities sponsored by SSHRC and hosted by the amazing Bill Turkel in his role as a member of NiCHE.
Here are a few things I’m thinking about going into Day 2:
In talking about APIs, we’re necessarily talking about access and the political and cultural issues that surround access to cultural heritage materials. It’s one thing for a library (say) to make some data collection available and to allow you to browse, search, and display it in various ways. It’s another thing to allow other people to come along and create their own ways of browsing, searching, viewing (which is what API access is really about). I think we need to insist on this form of access as essential to the future of digital work in the humanities and social sciences. At the same time, we need to be respectful of those who are understandably nervous about it. How do we articulate the benefits of this kind of access? How do we persuade content providers that this kind of access is good for the institutions that provide it, and not just for the people who take advantage of the new entry point?
There’s a movable wall when it comes to APIs. I heard a lot of people yesterday describing elaborate ideas about data mining with textual resources (or something similarly ambitious), but in every case, I noticed that the idea was predicated not on access to a series of data points, but on access to the entire dataset. This raises a fundamental question (for designers) of where you put the “wall” between the resource and the user. You could imagine an API that had a single function called “get_all()”. Call that, and you can mirror the entire dataset and do what you like. You could also have an API with dozens of highly granular hooks that return nicely formatted data structures, and so forth. The former is undoubtedly the most flexible, but it’s also the hardest to work with (particularly if you’re a novice programmer). But again, it’s a kind of shifting wall. If it’s data mining you’re after, you could do all that mining back on the archive side and make the results available through the (highly granular) API. These aren’t mutually exclusive, of course; Flickr, for example, offers both kinds. Still, I think thinking about this helps to highlight some of the design challenges one encounters with APIs in general.
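The two wall positions can be sketched side by side. This is a hypothetical illustration, not any real archive’s API; all the names here (`RECORDS`, `get_all`, `get_record`, `search_by_author`) are invented for the example:

```python
# A toy "archive" standing in for a cultural heritage dataset.
RECORDS = {
    "doc1": {"title": "Diary, 1873", "author": "Anon."},
    "doc2": {"title": "Letters", "author": "J. Smith"},
}

# Wall position 1: one coarse hook. Maximally flexible, but the caller
# must mirror everything and do all the mining on their own side.
def get_all():
    """Return a copy of the entire dataset."""
    return dict(RECORDS)

# Wall position 2: many granular hooks that return nicely formatted
# structures. Easier for a novice programmer, but the archive decides
# in advance which questions can be asked.
def get_record(record_id):
    """Return a single record, or None if the id is unknown."""
    return RECORDS.get(record_id)

def search_by_author(name):
    """Return the ids of records whose author matches exactly."""
    return [rid for rid, rec in RECORDS.items() if rec["author"] == name]
```

The point is that both styles can sit over the same data; the design question is which side of the wall does the work.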
I think we need to think more carefully about “impedance mismatches” between data sources. There was a lot of talk yesterday about mashing this humanities resource to that humanities resource, but I think there were also some hand-waving assumptions (I was as guilty as anyone) about the degree to which that data is tractable from an interoperability standpoint. Some of the most successful web service APIs are successful, I think, because the data is simple and easy to work with (lat/longs, METAR data, stats arranged as key-value pairs, etc.). Humanities resources are often quite a bit more complicated, and there’s far less agreement about how that data should be formatted. It’s true that the TEI (for example) provides a degree of metadata standardization, but it’s mostly silent about how the content itself should be formatted. That is, when you actually look at the content of the “tags” (whether it’s XML or something else entirely), you find that people are defining things at radically different levels of granularity and with different ordering schemes. I don’t want to declare that the sky is falling; I just want to point out that some of this might be quite a bit more difficult than it sounds. And it’s a tough problem, because defining complicated interoperability standards in this space really does, in my opinion, run against the spirit of the thing.
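The mismatch is easy to see in miniature. The sketch below is invented for illustration (the field names and record shapes are assumptions, not drawn from any real archive): weather-style key-value data mashes up trivially, while two archives encoding the same creator at different levels of granularity force anyone mashing them up to write normalization code first:

```python
# Simple data: flat key-value pairs. Two such sources mash up trivially.
weather = {"lat": 43.0, "lon": -81.3, "temp_c": 4.5}

# Humanities data: the same information, two granularities.
# Archive A records a creator as a single string...
record_a = {"creator": "Turner, Joseph Mallord William"}

# ...while Archive B splits the same creator into structured parts.
record_b = {"creator": {"family": "Turner", "given": "Joseph Mallord William"}}

def creator_family_name(record):
    """Normalize both shapes so records can actually be compared.
    Every mashup ends up accumulating shims like this one."""
    creator = record["creator"]
    if isinstance(creator, dict):
        return creator["family"]
    # Assume "Family, Given" ordering in the string form.
    return creator.split(",")[0].strip()
```

Multiply that shim across dates, places, titles, and ordering schemes, and the interoperability problem starts to look less like hand-waving and more like real work.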
I’ve had a wonderful time at this gathering, which includes so many talented librarians, scholars, and hackers (many of whom manage to combine all three skill sets). I can’t help but think that great things will come of this.