Stephen Ramsay

MorphExtractor

As it happens, I am on leave this semester (for the first time ever), and I’ve basically decided that I’m going to write software the entire time.

I’m not quite ready to say what I’m working on, because some of it involves unfamiliar territory. For the moment, I’m mostly reading APIs and language manuals. But even I can’t do that all day, and so I’ve been trying to knock off a couple of quick tools that I’ve longed for while working on other things.

The first of these is MorphExtractor.

Here at CDRH, we do a lot of work with MorphAdorner – the amazing program that Phil “Pib” Burns developed when we worked together on the MONK project. Essentially, MorphAdorner takes XML files and tags sentence boundaries and tokens while “adorning” them with morphological data (part of speech, lemma, etc.). This is particularly useful to us, since my colleague Brian Pytlik-Zillig can do positively anything with XSLT – including building an entire text analysis system using only XSLT.

It’s surely nothing for Brian to hack out any kind of stylesheet he wants in a few seconds, but what I wanted was a grep-like tool that could extract tokens from MorphAdorner files quickly and easily, so I could munge them around and throw them into R, or MALLET, or whatever. I admit there aren’t a lot of people in this situation, but for the few who are…

It’s written in C, and requires only libxml2 and pkg-config. It should build out of the box on any Linux machine, and on OS X with the right dependencies installed via Homebrew or something similar. I’ve labeled it a beta, because I haven’t tested it very extensively or built it on a wide variety of platforms, but it appears to be stable and is feature complete.
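To give a rough sense of what "extracting tokens" from an adorned file means, here is a minimal sketch in Python. It is not MorphExtractor itself (which is C and built on libxml2), and it does not reproduce the tool's options or output format. It assumes that each adorned token is a `w` element carrying `lem` and `pos` attributes; the sample fragment, the tag values, and the `extract_tokens` function are all hypothetical, and real MorphAdorner output may use namespaces and additional attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical adorned fragment. The <w> element and the lem/pos
# attribute names are assumptions made for illustration only.
SAMPLE = """<text>
  <s>
    <w lem="the" pos="dt">The</w>
    <w lem="dog" pos="n2">dogs</w>
    <w lem="bark" pos="vvd">barked</w>
  </s>
</text>"""


def extract_tokens(xml_text, pos_prefix=None):
    """Return (token, lemma, pos) tuples for every <w> element.

    If pos_prefix is given, keep only tokens whose part-of-speech
    tag starts with that prefix.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        pos = w.get("pos", "")
        if pos_prefix is not None and not pos.startswith(pos_prefix):
            continue
        rows.append((w.text or "", w.get("lem", ""), pos))
    return rows


if __name__ == "__main__":
    # Tab-separated output is easy to load into R or similar tools.
    for tok, lem, pos in extract_tokens(SAMPLE):
        print(f"{tok}\t{lem}\t{pos}")
```

Calling `extract_tokens(SAMPLE, "n")` would keep only the tokens whose tag begins with `n`, which is the sort of filtering a grep-like tool makes convenient.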

If autotools is your idea of fun, you can check out the GitHub repo. autoreconf -i, etc.
