Experiments in Text Mining Tools


My understanding of text mining, based on our readings, is that it offers a “zoomed out,” or “macroscopic” (Weingart), perspective on the collection or text in question. That perspective can reveal patterns that only become apparent when one views a data set as a whole rather than as individual instances. Seeing those trends, with the help of technology, is a means of better understanding a text or of better telling a narrative about it.

I used a couple of different texts and search terms for my experiments with text mining tools. For Voyant, I used the text of Persuasion by Jane Austen. I chose this text because I’m teaching it right now to a class of undergrad English lit majors, and I was curious to see what patterns text mining would reveal and whether they would contribute to my own and my students’ understanding of the text. Ease of access was also a factor; the entire text is in the public domain.
My other experiments used words/terms related to feminist publishing, as that is what I (presently) am planning to write my dissertation on.

Voyant
Producing a word cloud through Voyant indicated that the most frequently used words in the text of Persuasion are the names of the main characters. This did not surprise me, but it did reinforce the idea that this is a novel about relationships; its primary focus is not action or drama, but rather people and their connections. This is backed up by the phrase counter, which indicates that many of the novel’s most frequently used phrases are prepositional phrases. Once again, this isn’t surprising, but it does point to an interesting phenomenon in writing (one that I doubt is unique to Jane Austen): how often we write or speak of things and people in relation to other things and people (e.g. “on the subject,” “at hand”).
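For anyone curious what this kind of counting looks like outside of Voyant, here is a minimal Python sketch. It is not Voyant's actual implementation; the stopword list, the function names, and the sample sentence are all my own assumptions, included only to illustrate word and phrase frequency counts.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "at",
             "was", "her", "she", "he", "had", "it", "too"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_words(text, n=5):
    """Return the n most frequent words that are not stopwords."""
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return Counter(words).most_common(n)

def top_phrases(text, size=3, n=5):
    """Return the n most frequent runs of `size` consecutive words."""
    tokens = tokenize(text)
    # Line up the token list against shifted copies of itself
    # to build every run of `size` adjacent words.
    runs = zip(*(tokens[i:] for i in range(size)))
    return Counter(" ".join(run) for run in runs).most_common(n)

# Made-up sample text, not a quotation from the novel.
sample = ("Anne spoke on the subject. Anne listened. "
          "Wentworth spoke on the subject too.")
print(top_words(sample, 3))
print(top_phrases(sample, 3, 2))
```

To try it on the full novel, one could read a plain-text copy of Persuasion into a string and pass it to both functions. Stopwords are kept in the phrase counts on purpose, since prepositional phrases like “on the subject” would otherwise disappear.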

I intend to show the results to my students to see what kinds of meaning they pull from the word map and other stats available on Voyant.

Google Ngram Viewer
Because I’m interested in the history of publishing, I searched the words “writing” and “publishing” together in the Ngram Viewer. I was surprised at how much more frequently “writing” appeared than “publishing.” I experimented with time frames, starting with 1800-1920 and eventually widening to 1700-2000, thinking that more advanced technology might make “publishing” a more popular term as time progressed. In fact, over the longer span of time, the use of “writing” increased relative to “publishing.”

I added the term “books,” which turned up some interesting results: between approximately 1770 and 1980, “books” appears more frequently (to varying degrees) than “writing.” Around 1980, however, the two terms basically swap places, with “writing” becoming more prominent.
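A swap like this can be pinned down as the first year in which one term's frequency rises above the other's. Here is a minimal Python sketch, assuming the yearly relative frequencies have already been copied into two dictionaries. The function name is hypothetical, and the numbers are invented placeholders, not actual Ngram Viewer data.

```python
def first_crossover(series_a, series_b):
    """Return the first year in which series_a rises above series_b
    after being at or below it in the previous shared year, else None."""
    years = sorted(set(series_a) & set(series_b))
    for prev, curr in zip(years, years[1:]):
        was_below = series_a[prev] <= series_b[prev]
        now_above = series_a[curr] > series_b[curr]
        if was_below and now_above:
            return curr
    return None

# Invented placeholder values, NOT real Ngram Viewer results.
writing = {1960: 0.010, 1970: 0.011, 1980: 0.013, 1990: 0.015}
books = {1960: 0.014, 1970: 0.013, 1980: 0.013, 1990: 0.012}

print(first_crossover(writing, books))
```

With real yearly values in place of the placeholders, the returned year would give a concrete date to start digging around for possible causes.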

This kind of anomaly makes me think this tool can be useful (a literary scholar’s macroscope, if you will) in revealing unexpected patterns that I can use as a starting point for research; for instance, what factors change to make one term more popular than another? Drilling down to specifics may or may not turn up anything interesting, but such is the nature of research (dead ends are part of the process), so the Ngram Viewer seems like a great starting point for further investigation.

JSTOR DfR

I found this tool useful more as a research aid than as a text mining tool, largely (I’m guessing) because of my area of research: I want to understand what scholars say about women and publishing together, not necessarily the sheer number of mentions those terms get together in JSTOR’s primary sources. That being said, even with a 21-year time span, narrowed to my discipline and limited to book chapters and articles, my search still returned more than 30,000 results, which indicates that there is a lot of discussion on this topic among scholars.
Scrolling through the results gave me insight into the kinds of books and journals that would be great starting points for my research.

I also tried mining just the 19th-century pamphlets for these terms. The results may not be useful to me, since I don’t necessarily need primary historical records for my research, but the findings were interesting nonetheless: many reports on court proceedings and pamphlets on women’s suffrage, for instance.


Weingart, Scott. “The Joys of Big Data for Historians.” The Historian's Macroscope: Big Digital History, 8 Dec. 2014, http://www.themacroscope.org/?page_id=595.

