I started my THATcamp Bay Area weekend in a bootcamp session on Text Mining with Aditi Muralidharan, a graduate student at UC Berkeley (@silverasm & http://mininghumanities.com). Links to the slides from the session are here. The session was geared toward people who collect data and then ask “what do I do with all this stuff?!?” This definitely describes me. I have hours and hours of collected oral histories, plus a few diaries and log books, I’d love to analyze.
I’ve never done anything remotely close to text mining, which is why I attended this session. Here’s what I learned:
- In its most basic definition, text is words.
- N-grams are sequences of one, two, three, etc. words. In the sentence “I had a great time and learned so much at THATcamp” the 1-grams are I, had, a, great, time, and, learned, so, much, at, and THATcamp. The 2-grams are “I had”, “had a”, “a great”, “great time”, “time and”, “and learned”, “learned so”, “so much”, “much at”, “at THATcamp.” The 3-grams are “I had a”, “had a great”, “a great time”, etc. Text mining can count how many times an n-gram appears, and sentences can be built by stringing n-grams together – though they sometimes sound a bit off. (See the n-gram sketch just after this list.)
- Words have roles as parts of speech: nouns, verbs, adjectives, etc. (I think I spent half my elementary school years diagramming sentences in English and then in French.) Each part of speech behaves in a particular way in relation to the other words in a sentence. In part-of-speech (POS) tagging, every word in a corpus is tagged with its role. The corpus used to train a parser matters: some have been trained on the Wall Street Journal; others on much older texts. Once a parser has been trained, it can be used to analyze a specific text or set of texts – for example, to return all the adjectives in sentences that contain the words “woman” + “should.” (See the tagging sketch after this list.)
- Stanford Tregex was given as an example of a tool for visualizing parser output as trees, which makes the results much easier to analyze.
- Metaphors, irony, sarcasm, emoticons, etc. are hard for parsers to spot – and that’s basically all of literature. Computational linguists haven’t dealt with this yet, but perhaps this is where digital humanists come into the conversation.
- Computational musicologists have been analyzing sound bites since the late 1960s, but there doesn’t seem to be much cross-pollination between the fields yet.
- Topic modeling is a way to group words that are frequently used together; these groups are often semantically coherent. “Dynamic topic models” show the topics plus how they have changed over time. A popular toolkit for topic modeling is Mallet. (A topic-modeling sketch follows this list.)
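To make the n-gram idea concrete, here’s a minimal Python sketch. This is my own illustration, not something from the session; the ngrams() helper and the use of the example sentence are mine:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I had a great time and learned so much at THATcamp".split()

print(ngrams(tokens, 2))  # [('I', 'had'), ('had', 'a'), ('a', 'great'), ...]
print(ngrams(tokens, 3))  # [('I', 'had', 'a'), ('had', 'a', 'great'), ...]

# Counting how many times each n-gram appears is just a tally:
print(Counter(ngrams(tokens, 2)).most_common(3))
```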
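And here’s a hedged sketch of the “woman” + “should” example from the POS bullet. The session didn’t name a specific library, so I’m assuming NLTK here, and the tiny corpus is invented for illustration:

```python
import nltk

# NLTK needs its tokenizer and tagger models downloaded once.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

corpus = (
    "A woman should be free to travel. "
    "Some said a woman should be quiet and obedient. "
    "The weather was cold that winter."
)

# Pull the adjectives (Penn Treebank tags starting with "JJ") out of
# sentences that contain both "woman" and "should".
for sentence in nltk.sent_tokenize(corpus):
    words = nltk.word_tokenize(sentence)
    lowered = [w.lower() for w in words]
    if "woman" in lowered and "should" in lowered:
        tagged = nltk.pos_tag(words)
        print([word for word, tag in tagged if tag.startswith("JJ")])
```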
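The session pointed to Mallet (a Java toolkit); as a stand-in illustration of what topic modeling actually does, here’s a minimal LDA sketch using scikit-learn instead, with four invented “documents”:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the ship sailed from the harbor at dawn",
    "the captain logged the wind and weather each day",
    "grandmother baked bread and preserved fruit for winter",
    "the kitchen garden gave beans squash and fruit",
]

# Turn the documents into word counts, dropping common English stopwords.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model to the word counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words in each topic: words that tend to occur together.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {', '.join(top)}")
```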
A variety of tools were suggested:
An example of text mining a historical diary, done by Cameron Blevins (@historying) at StanfordU:
Some limitations of text mining:
- So far the emphasis has been on English-language tools, so tools for other languages are not very good yet. Arabic and French language tools are getting there, but I didn’t catch which ones were worth checking out.
- The first step of text mining is to digitize your records. OCR, Mechanical Turk, and grad students were suggested as possibilities for getting through this stage (a small OCR sketch follows this list).
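For the digitization step, here’s a minimal OCR sketch, assuming the Tesseract engine plus the pytesseract and Pillow packages are installed; “diary_page.png” is a hypothetical scan of mine, not anything from the session:

```python
from PIL import Image
import pytesseract

# Run OCR on one scanned page and dump the recognized text.
text = pytesseract.image_to_string(Image.open("diary_page.png"))
print(text)
```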
I’ll be working through this list (just as soon as I get my text in a digital format that can be processed).