I started my THATcamp Bay Area weekend in a bootcamp session on Text Mining with Aditi Muralidharan, a graduate student at UC Berkeley (@silverasm & http://mininghumanities.com). Links to the slides from the session are here. The session was geared toward people who collect data and then ask “what do I do with all this stuff?!?” That definitely describes me. I have hours and hours of collected oral histories, plus a few diaries and log books, that I’d love to analyze.
I’ve never done anything remotely close to text mining, which is why I attended this session. Here’s what I learned:
- In its most basic definition, text is words.
- N-grams are sequences of 1, 2, 3, or more words. In the sentence “I had a great time and learned so much at THATcamp” the 1-grams are I, had, a, great, time, and, learned, so, much, at, and THATcamp. The 2-grams are “I had”, “had a”, “a great”, “great time”, “time and”, “and learned”, “learned so”, “so much”, “much at”, “at THATcamp.” The 3-grams are “I had a”, “had a great”, “a great time”, etc. Text mining can examine how many times an n-gram appears, and sentences can be built by stringing n-grams together – though they sometimes sound a bit off. (See the first sketch after this list.)
- Words have roles as parts of speech, e.g. nouns, verbs, adjectives, etc. (I think I spent half my elementary school years diagramming sentences in English and then in French.) Each part of speech (POS) has a behaviour in relation to the other words in the sentence. In POS tagging, every word in a corpus is tagged with its part of speech. The corpus used to train a parser matters: some parsers have been trained on the Wall Street Journal; others on much older texts. Once a parser has been trained, it can be used to analyze a specific text or set of texts – for example, to return all the adjectives in sentences that contain the words “woman” + “should.” (See the second sketch after this list.)
- Stanford Tregex was given as an example of a tool for visualizing parser results as trees, which makes them much easier to analyze.
- Metaphors, irony, sarcasm, emoticons, etc. are hard for parsers to spot. Basically this is all of literature. 😉 Computational linguists haven’t dealt with this yet, but perhaps this is where digital humanists come into the conversation.
- Computational musicologists have been analyzing sound bites since the late 1960s, but there doesn’t seem to be much cross-pollination there yet.
- Topic modeling is a way to group words that are frequently used together; these groups are often semantically coherent. “Dynamic topic models” show topics plus how they have changed over time. A popular toolkit for topic modeling is MALLET. (See the third sketch after this list.)
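To make n-grams concrete, here is a minimal sketch in plain Python (no libraries beyond the standard library; the sentence is the one from the session, and the tokenization is a naive whitespace split):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I had a great time and learned so much at THATcamp".split()
print(ngrams(tokens, 2))            # [('I', 'had'), ('had', 'a'), ('a', 'great'), ...]
print(Counter(ngrams(tokens, 1)))   # how many times each 1-gram appears
```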
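The “woman” + “should” query can be sketched with NLTK’s default tokenizer and tagger (my own illustration, not a tool named in the session; it assumes NLTK is installed via pip, that the punkt and averaged_perceptron_tagger models have been downloaded once with nltk.download, and the two-sentence corpus is invented):

```python
import nltk

corpus = [
    "A woman should be free to choose her own path.",
    "He thought the old house should be demolished.",
]

for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    if "woman" in words and "should" in words:
        tagged = nltk.pos_tag(words)  # [(word, POS tag), ...]
        # Penn Treebank adjective tags all start with "JJ"
        print([w for w, tag in tagged if tag.startswith("JJ")])  # e.g. ['free', 'own']
```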
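And a minimal MALLET run, assuming MALLET is installed and that ~/texts (a hypothetical directory) holds one plain-text file per document. The first command imports the texts; the second trains a 20-topic model and writes the top words per topic to a file:

```bash
bin/mallet import-dir --input ~/texts --output texts.mallet \
    --keep-sequence --remove-stopwords
bin/mallet train-topics --input texts.mallet --num-topics 20 \
    --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```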
A variety of tools were suggested:
An example of text-mining a historical diary, done by Cameron Blevins (@historying) at Stanford:
Some limitations of text mining:
- So far the emphasis has been on English-language tools, so tools for other languages are not very good yet. Arabic and French tools are getting there, but I didn’t catch which ones are worth checking out.
- The first step of text mining is to digitize records. OCR, Mechanical Turk, and grad students were suggested as possibilities for getting through this stage. (See the OCR sketch after this list.)
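For the OCR route, here is a minimal sketch using the Tesseract engine via the pytesseract wrapper (my example, not one from the session; it assumes Tesseract is installed along with the pytesseract and pillow packages, and “page_001.png” is a hypothetical scanned page):

```python
from PIL import Image
import pytesseract

# Run OCR on one scanned page and print the extracted text
text = pytesseract.image_to_string(Image.open("page_001.png"))
print(text)
```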
I’ll be working through this list (just as soon as I get my text in a digital format that can be processed).