High Performance Computing for English Majors
[HPC has been coming up a lot lately in conversations I’ve been having with other DH specialists. Or it was, before I went in for sinus surgery a week ago. I’m still recovering from that, and I’m not really sure about my ability to blog coherently. So please accept this essay from the archives. It’s from a talk I gave at MLA in 2006.]
There are people in this world who spend untold amounts of time tweaking and tuning their cars for some perceived need for high performance that seldom materializes on roadways intended for passenger automobiles. They spend hours “modding” their rides: changing the gas-to-air ratio, boring out the cylinders, fiddling with the feathers and springs on the shock absorbers, and injecting nitrous oxide into the fuel line in order to get “Dude, like 450 horsepower” out of a sedan principally designed to ferry children to and from school.
I have precisely this relationship with computers. The latest chip, the fastest disks, the most efficient bus architectures all fill me with a kind of atavistic frisson. And once I lay my hands on the geek equivalent of NOS, I start rebuilding the kernel, changing the shared memory footprint, altering the thread model, reconfiguring the drive geometry, and adding optimization flags to my C compiler. It is true that my machines are often on the verge of melting, but that’s the price of perfection. There’s even a special version of the Linux kernel for bleeding-edge speed freaks called the “Love Kernel.” It’s essentially the standard Linux kernel with hundreds of high-speed performance patches applied indiscriminately. Here’s a quote from the readme for the Love kernel:
IMPORTANT: steel300 and OneOfOne remind you that the patches here are sometimes experimental and could explode upon impact, make your [soda|pop] really bland, or other badness. We aren’t responsible for that, but we will mention that these patches will also make your kernel ROCK LIKE NINJA.
And that’s what I want to do. I want my computers to rock like ninja.
In a sense, ordinary training in software design is responsible for creating this insane desire for speed. The entire study of algorithms and data structures is framed by a concern with the trade-offs between time and space. If you undertake formal study of these matters, you find that much of what you’re doing is calculating the best and worst case scenarios for storage and retrieval within a particular data structure or under the strictures of a certain algorithm. After a while, you can’t help but equate faster, smaller, and more scalable with better.
But if you study software engineering and design methodology at any level of detail — or better yet, start writing production code — you quickly discover that this equation is downright dangerous. Code optimization is fine when you’re talking about a fake implementation of a sorting algorithm. In a large, complex system intended for actual users, however, premature optimization is more than likely to result in brittle, unreadable code. And this assumes that you understand where the bottlenecks are in the first place. This is why even a brief foray as a computational test pilot will cause one to develop certain rational instincts about code efficiency. You begin to lower the bar to something like “fast enough” in order to create code that is more easily maintained and understood. You begin to distrust any optimization that isn’t completely verifiable using profilers and benchmarking tools. You begin to realize that it might be safer and more efficient to drive the kids to school in a minivan. Or at least you realize that this is the rational position, even as you irrationally try to break the sound barrier.
I have been writing software for use in the context of digital humanities for about ten years. During that time, I have written thousands of lines of code, but all of it has fallen neatly into one of two categories. Either it was intended to deliver data to the Web, or it was intended to perform some kind of data analysis operation offline. That covers a lot of different types of systems, of course. Sometimes the data being delivered to the Web consisted of reams of GIS data that had to be paired with text, styled, and delivered to a client framework that would render a real-time animated map. Sometimes the offline data analysis consisted of computing complex graph theoretical algorithms for the purpose of studying relationships within a corpus. But in the former case, network latency had the effect of making most of my shrewd optimizations seem futile. Why work for hours on some little speed hack when the processing that occurs prior to network delivery and rendering is only a small fraction of the total end-to-end userspace time? In the latter case, it really didn’t matter how long the analysis took. I was the only one who needed the data, and there really wasn’t any particular rush. Who cares if it takes fifteen minutes, or even fifteen hours, to crunch the numbers?
For the last few years, I have been giving talks in which I proclaim an “age of tools” in digital humanities, and the evangelium goes something like this: Over the last twenty years, we have spent millions digitizing texts and putting them online. The resulting digital full-text archives are among the greatest achievements in digital humanities. Yet for all their wonder, they remain committed to a vision of digital textuality firmly ensconced within the metaphor of the physical library. You can browse the text, read the text, search the text, and even download the text, but you can’t really do much beyond that. It is time to start thinking of ways to exploit this data with analytical tools and visualizations. Ideally, such tools should be an integral part of the experience of working with Web-based text collections.
Several of my colleagues in the field are working on something like this, including my fellow panelists [Greg Crane and Geoff Rockwell - ed.]. My own contribution is as a member of the Nora Project, which endeavors to implement the credo outlined above with an emphasis on particular varieties of text analysis — including, most significantly, data mining and machine learning algorithms. I won’t speak for Geoff and Greg, but I think I know why I’m here today talking about high-performance. It’s because for the first time in my career, caffeine-addled speed optimizations seem not only warranted, but necessary.
They’re necessary, because when we talk about large, full-text archives empowered by text analytical tools and visualizations, we’re really talking about trying to take procedures traditionally thought of as batch-processing jobs and importing them into a world in which, as Jakob Nielsen famously noted, you have eight seconds to do something interesting.
Our data mining operations rely on massive matrices of data drawn from text corpora. For example, we might have a giant table (consisting of millions of cells) where one column is filled with word frequency counts, another one is filled with markers indicating the presence or absence of a certain feature, another is filled with ratios between nouns and verbs, and so on. We start out not knowing what any of this data really means, but we do know that texts (or parts of texts) in the corpus cluster in certain ways. There are genre distinctions, years of composition, different authors, different countries of origin. So we add one more column of data indicating the “label” for the particular text or text section. Text classification is the process of using statistics to figure out what patterns of low-level features conspire to make a text fit a particular label. So the usual method involves having a domain expert label some of the texts, and then setting the data mining algorithms loose on the rest of the matrix, so it can generate a set of predictive rules. If the rules are robust (and this is the exciting part) you should have a system that can correctly assign labels for texts it has never seen before. And, of course, the labels can be anything at all.
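To make the idea of a labeled matrix concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three feature columns, the numbers, the labels, and the technique itself. It uses a nearest-centroid classifier, which is about the simplest thing that could work, rather than the rule-generating algorithms described above, and it is not the system we actually built.

```python
from collections import defaultdict
from math import dist

# Each row: hypothetical low-level features for one text
# (word-frequency score, feature present?, noun-to-verb ratio),
# followed by the label a domain expert assigned to it.
labeled_rows = [
    ([0.12, 1.0, 1.8], "comedy"),
    ([0.10, 1.0, 2.0], "comedy"),
    ([0.31, 0.0, 0.9], "tragedy"),
    ([0.29, 0.0, 1.1], "tragedy"),
]


def train(rows):
    """Average the feature vectors for each label into one centroid."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid lies closest to the new row."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(labeled_rows)
unseen_text = [0.11, 1.0, 1.7]  # a text the system has never seen
predicted_label = predict(centroids, unseen_text)
```

The point of the sketch is only the shape of the problem: rows of numbers, one column of labels, and a procedure that generalizes from the labeled rows to the unlabeled ones.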
We’ve used data mining to create things like systems that can detect eroticism and sentimentality in English poetry and prose. And as soon as we say that, two objections emerge immediately. First, “Do we really need a system that can tell us that a particular Shakespeare play is a history? Don’t we already know that?” And second, “Who decides what passages are erotic or sentimental in the first place?” The first objection is an entirely sensible one, but what really intrigues us is the fact that the system often gets it “wrong” in some thoroughly thrilling way. The first time we ran a data mining operation on Shakespeare, it calmly informed us that both Romeo and Juliet and Othello are comedies. The computer scientists on the team were ready to go back to the drawing board, but the literary critics were more excited than ever, because, of course, a number of influential critics have noted that these two plays follow the basic dramatic structure of comedy, and all we wanted to do was look at the generated rules to see what low-level features are complicit in this subtle moment of generic ambiguity. The second objection — “who decides what the labels are” — is also a sensible objection, but we have an easy answer to that one. The user should decide. The user should be able to choose what vectors go into the matrix, and choose the labels.
And that brings me, at long last, to the main topic of this panel. Because until recently, no one has thought of data mining as a live, interactive process. To undertake meaningful data mining on full-text archives of literary texts, you need to parse the XML documents, tokenize them, run a series of natural language processing algorithms (to determine things like parts of speech), check them against a gazetteer (for named-entity resolution), and then crunch all the numbers. Then you need to assemble all of that data into a matrix. Then you need to run the actual data mining algorithm. Then you need to deliver the results to the client and render them. This always takes hours, and it occasionally takes days. If you’re offline, it doesn’t matter (though even offline, you want to come to this problem fully armed with high-performance equipment). Online, it violates Nielsen’s eight-second rule in a way that borders on the grotesque.
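A toy version of the front half of that pipeline might look something like the following Python sketch. The two-document corpus, the element names, the vocabulary, and the function names are all hypothetical, and the sketch skips part-of-speech tagging and the gazetteer entirely. It only parses, tokenizes, counts, and assembles a matrix.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-document corpus; a real archive would be vastly larger.
corpus = {
    "doc1": "<text><l>Love is not love</l><l>which alters</l></text>",
    "doc2": "<text><p>The war is over and the king is dead</p></text>",
}


def tokenize(xml_string):
    """Parse the XML, drop the markup, and split into lowercase word tokens."""
    root = ET.fromstring(xml_string)
    plain_text = " ".join(root.itertext())
    return re.findall(r"[a-z']+", plain_text.lower())


def build_matrix(documents, vocabulary):
    """One row per document, one column per vocabulary word (relative frequency)."""
    matrix = {}
    for doc_id, xml_string in documents.items():
        tokens = tokenize(xml_string)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty document
        matrix[doc_id] = [counts[word] / total for word in vocabulary]
    return matrix


vocabulary = ["love", "is", "king"]
matrix = build_matrix(corpus, vocabulary)
```

Even at this scale you can see where the time goes: every stage touches every token of every document before the data mining algorithm has done anything at all.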
It’s possible to approach the optimization of this process in a thoroughly rational manner. First, you look at the whole end-to-end system and try to divide the operation into things that bind early and things that bind late. There’s no reason to parse the XML data and do the feature extraction live. All of that can be done at the pre-processing stage and loaded into a datastore of some kind. It might take days to do that, but if you’re clever, you can get a ton of “canned” data ready to be loaded into a matrix for analysis. After you’ve done that, you can think about ways to minimize the amount of data the system has to analyze, perhaps by segmenting the data in such a way that the system has less material to sort through as it loads the matrix. You might then look for obvious inefficiencies in the analysis layer itself, and try to optimize those as much as you can (without creating brittle, difficult-to-understand code). Finally, you can figure out ways to distribute the analytical process across multiple processors.
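Those steps can be sketched in a few lines of Python, with the usual caveats. The in-memory `datastore`, the segment names, and the column-summing "analysis" are stand-ins invented for this example. The thread pool is there only to show the shape of the fan-out: in CPython, a CPU-bound analysis would more likely use `ProcessPoolExecutor` from the same module, since threads share a single interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

# Early binding: done once, offline. A hypothetical datastore of
# pre-extracted ("canned") feature rows, keyed by segment so that a
# query loads only the slice it needs.
datastore = {
    "poetry": [[0.2, 1.0], [0.4, 0.0], [0.6, 1.0]],
    "prose": [[0.1, 0.0], [0.3, 1.0]],
}


def load_segment(segment):
    """Late-binding step: fetch only the canned rows for one segment."""
    return datastore[segment]


def chunk(rows, size):
    """Split rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def column_sums(rows):
    """Stand-in for the analysis step: sum each feature column in a chunk."""
    return [sum(col) for col in zip(*rows)]


def analyze(segment, workers=2, chunk_size=2):
    """Analyze one segment by farming chunks out to a pool of workers."""
    chunks = chunk(load_segment(segment), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(column_sums, chunks))
    # Combine the per-chunk partial results into one answer.
    return [sum(col) for col in zip(*partials)]


poetry_totals = analyze("poetry")
```

The design choice worth noticing is the split between what happens before anyone asks a question (building the datastore) and what happens after (loading one segment, chunking it, and combining partial results).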
We’ve done all of that. We’ve canned it, chunked it, speed-hacked it, and even figured out a way to multithread the process across any arbitrary number of processors. The resulting system is dazzlingly fast. It’s just not fast enough for the Web. And so it is time, we think, to turn to some serious hardware.
And when we say serious, we’re not talking about expensive servers (we’ve got those). We’re talking about seriously expensive servers — distributed clusters of the sort that are used for things like particle physics, weather simulation, and the video rendering for Attack of the Clones. And that’s a problem.
It’s a problem, because in the context of a university, “high-performance computing” isn’t a technical term at all. It’s a financial act of faith made by very senior members of the administration, and a site of intense territorial protection by the “hard” scientists who help to make that act of faith seem less fraught with religious peril. A bunch of English professors who want to get into high-performance computing need to convince administrators that they should get a piece of the pie, and they need to convince the physicists that literary critics have just as much of a right to these resources as anyone else. Which should be an easy matter. All we need to do is talk to the people who are exploring the origins of the universe, and ask them to step aside for a moment while we look for dirty words in Dickinson.
And, of course, we won’t be asking them to step aside “for a moment.” Nearly everything done on these systems represents a batch job. The experiment (or the video rendering task) might take a long time, but it usually has a beginning and an end. We’re talking about ongoing processes running on a kind of supercollider Web server. Perhaps we need our own high-performance cluster? But then, who pays for such a thing? Digital humanities can bring in grant dollars, but most of the funding agencies we deal with are loath to fund even moderate amounts of overhead. Perhaps we are in over our heads.
Now, I’ve already confessed to being a semi-delusional, speed-obsessed maniac. Perhaps all of this represents nothing more than the idle fantasy of someone who wants “Dude, like 450 million words per second.” Surely, there’s much that we can do to bring about the age of tools without pouring millions of dollars into hardware. Why be so ambitious at this early stage? Do we really need to be thinking about high-performance computing for English majors?
I think we do need to be thinking about it, not because it’s a thing we need to have today, but because it’s a battle we’re going to need to fight tomorrow. To get where we are now in terms of text collections, we had to fight for resources that were unheard of among humanists. We were successful in that effort, not because we came up with outstanding technical arguments, but because we succeeded in effecting a cultural change at our institutions. We were able to convince Vice Presidents for Research that we could attract students and grant dollars. We were able to convince University Presidents that digital humanities was something of wide interest to the public (not to mention donors). We were able to convince library deans that research efforts in this area could pay dividends in terms of prestige. And finally, we were able to convince our own professional societies (including the MLA) that scholarship in this area was essential to the future of the academy (witness, for example, that most astonishing of documents, the “Guidelines for Evaluating Work with Digital Media” put out by the MLA this year).
Of course, one need not act like a ninja in order to rock like one. The best way to get into the high-stakes game of high-performance computing is to create compelling reasons to participate. I continue to believe that bringing analytical procedures to existing digital archives (particularly procedures that are as easy to use as search engines) is a worthy, if ambitious, goal. Shadetree mechanics might have little hope of building their own highways, but clever digital humanists, by remaining committed to broad visions of the power of full-text archives, might well create the conditions in which high-performance computing becomes an ordinary part of our work as a discipline.