At the start of this project I was unfamiliar with text analysis in general, outside of some basic procedures that Dr. Cameron Blevins walked me through as part of his “Introduction to Digital Humanities” class, which I took in the fall of 2019. In researching various text analysis techniques, my goal was to identify methods where I could actually understand what the code was doing, rather than trusting blindly in some mythical algorithm to give me results. Two of the analysis methods I finally decided on, word frequency and keyword in context, are basic methods that use a computer’s processing power to do something that could be done manually, even if it would take a long time. I also wanted to try one slightly more complex technique. Benjamin Schmidt’s essay, “Do Digital Humanists Need to Understand Algorithms?” suggests that the answer is not always, as long as they “understand the transformations that algorithms attempt to bring about.” With this in mind I selected word vector analysis as a third method. The implementation of word vector analysis relies on neural network architecture and a lot of very complex math. However, I felt that understanding how it transforms words, and then making connections between the relationships it reveals, was within my reach. Essentially, word vector models transform each word in a document into a vector in space. The spatial relationships between those vectors can then be expressed mathematically, capturing how often words co-occur or appear near one another in the text. My very vague memories of trigonometry and matrix math made this process something I could visualize, even if there is no way I could have performed these calculations on my own. Obviously this is an oversimplification of what the word vector model is doing, but by examining the texts themselves I hoped I could uncover why those connections were being made by the model.
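To make the spatial idea concrete, the toy example below computes the cosine similarity between a few made-up word vectors; the words, the two-dimensional vectors, and the numbers are purely illustrative and are not output from any model used in this project.

```python
# A minimal sketch of the "spatial relationship" idea: each word becomes a list
# of numbers (a vector), and the angle between two vectors measures how similar
# their contexts are. The vectors below are invented for illustration only.
import numpy as np

quarry = np.array([0.9, 0.1])
granite = np.array([0.8, 0.2])
safety = np.array([0.1, 0.9])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar contexts."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(quarry, granite))  # high: the two words share contexts
print(cosine_similarity(quarry, safety))   # lower: the contexts overlap less
```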
I performed my data analysis using the Python programming language within Jupyter notebooks. I chose this language as it was the language I learned in both Dr. Blevins’s class and in “Analyzing Complex Digitized Data,” a course I took with Dr. Laura Nelson in the fall of 2020. I am very grateful to Dr. Nelson for allowing me to work with my own data set for class assignments; it was in her course that I worked out the text analysis techniques used for this project, and her feedback was helpful as I started my initial explorations. Using Python allowed me to take advantage of the Natural Language Toolkit (NLTK) for text processing and the Gensim library for the word vector analysis portion of the project.
All of the Jupyter notebooks with my project code can be found in the project’s GitHub Repository.
Data Preprocessing
The scanned text arrived from HathiTrust in a “Pairtree” format. According to the Pairtree project description, Pairtree is “a filesystem hierarchy for holding objects that are located by mapping identifier strings to object directory (or folder) paths two characters at a time.” The text arrived as many hundreds of separate small text files within 28 numbered folders (one for each journal volume). These files generate the text-searchable portion of the digitized journal images. The small text files are numbered sequentially within the folders. To reassemble the journals I wrote a Python program that sorts the files in each folder by their file number, then opens each file and appends its contents to a single text document. Journal issues are in chronological order within the volumes, but some volumes only contain partial years and a few span two years.
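As an illustration of this reassembly step, the sketch below sorts the page-level text files in one volume folder by the numbers in their file names and concatenates them into a single document. The folder and file names are placeholders rather than the actual Pairtree paths, and the script is a simplified stand-in for the program described above.

```python
# A minimal sketch of the reassembly step; paths and names are hypothetical.
import os

def reassemble_volume(folder_path, output_path):
    """Concatenate the small OCR text files in one volume folder into a single file."""
    # Sort the files by the numeric portion of their names so pages stay in order.
    files = sorted(
        (f for f in os.listdir(folder_path) if f.endswith(".txt")),
        key=lambda name: int("".join(ch for ch in name if ch.isdigit()) or 0),
    )
    with open(output_path, "w", encoding="utf-8") as out:
        for name in files:
            with open(os.path.join(folder_path, name), encoding="utf-8") as page:
                out.write(page.read())
                out.write("\n")

# Example: rebuild one of the 28 volume folders into a single text document.
reassemble_volume("volume_01", "volume_01_full.txt")
```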
Over the summer of 2020 I manually separated the volumes into individual monthly issues. This was a long and painful process due to the poor quality of the scanned text, which made searching for particular keywords (like month names) an unreliable way of finding issue breaks. I investigated additional pre-processing, but decided not to edit the contents of the files beyond splitting them into issues. All other pre-processing was done within Python as part of the code for each analysis method.
Text Analysis Methods
Word Frequency Over Time.
As part of this work I made lists of words to use as “safety,” “health,” and “environmental” keywords. I developed these lists by looking through sample issues to identify articles that I felt were related to these topics and the kinds of words they used, as well as by running general searches with modern words associated with these topics. I also looked for synonyms and antonyms for those initial words to use in additional searches. One limitation of this approach is that many words can be used in multiple contexts that are extremely different from one another. For this reason my keyword lists were iterative and involved more than one analysis method. I looked at a large number of words in my initial lists and then narrowed down the final lists using keyword in context to remove some of the more problematic multiple-use words (for instance “crushed,” which was also pulling up “crushed stone,” part of the quarry industry’s product line, and thus was appearing frequently). I also removed words from the lists if they produced few or no results, unless they were words I was particularly interested in (for instance “phthisis,” a term sometimes used in place of silicosis).
The Python code for this process opens each document, converts the text to lowercase, uses NLTK tools to split the text into individual words (with punctuation separated from words), and searches the resulting list of words for matches to each word in the keyword list. The output of this code is a count of how many times each keyword appears within each issue. I entered the final counts into an Excel spreadsheet (using the OpenRefine program to tidy up the data). In the spreadsheet I normalized the word counts to occurrences per 1,000 words, since not all of the issues contain the same number of words; this allows the counts for each issue to be compared directly with those of other issues. I then converted the spreadsheets to .csv files that I reloaded into Python to make visualizations of word use over time using the “Matplotlib” and “Seaborn” visualization packages.
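The sketch below illustrates the counting and normalization steps described above. The file name and the short keyword list are placeholders, and the code is a simplified version of the approach rather than the exact project script.

```python
# A sketch of the word frequency step, assuming each issue is a plain-text file.
from collections import Counter
from nltk.tokenize import word_tokenize

keywords = ["dust", "accident", "phthisis"]  # illustrative subset of the keyword lists

def keyword_counts(path, keywords):
    """Count keyword occurrences in one issue, normalized per 1,000 words."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    tokens = word_tokenize(text)  # NLTK separates punctuation from words
    counts = Counter(tokens)
    total = len(tokens)
    # Normalizing per 1,000 words lets issues of different lengths be compared.
    return {w: counts[w] / total * 1000 for w in keywords}

print(keyword_counts("issue_1905_06.txt", keywords))  # hypothetical file name
```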
Results for this method can be found here
Keywords In Context.
For my keyword in context text analysis, the Python code uses NLTK’s “concordance” function, which searches the text for a specific word and returns each instance of that word along with approximately 30 characters of context on either side. In the code I wrote for this exercise I converted the text to all lowercase, but did not remove strange characters or punctuation. (NLTK’s built-in tokenizing function separates punctuation from words, so punctuation does not affect the word search process.)
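A minimal version of this step might look like the sketch below; the file name and search term are placeholders, and the window settings are illustrative rather than the exact values I used.

```python
# A keyword in context sketch using NLTK's concordance tools.
import nltk
from nltk.tokenize import word_tokenize

with open("issue_1905_06.txt", encoding="utf-8") as f:  # hypothetical file name
    tokens = word_tokenize(f.read().lower())

text = nltk.Text(tokens)
# Print every occurrence of the keyword with a window of surrounding characters.
text.concordance("phthisis", width=79, lines=100)

# concordance_list() returns the same matches as objects, which makes it easier
# to write the results out to a spreadsheet instead of only printing them.
for hit in text.concordance_list("phthisis", width=79, lines=100):
    print(" ".join(hit.left), hit.query, " ".join(hit.right))
```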
The word frequency over time exercise was invaluable in telling me which keywords were worth looking at in context. After I ran my code on the text of the journal issues, I moved the results into spreadsheets so that I could store and evaluate them in one place and see whether the extracted words were related to the topic of my study. I could have also done this in a text file, but found spreadsheet cells easier to work with.
Keyword in context analysis also proved essential for interpreting the results of the word vector analysis. I often ran the “similar” words produced in the word vector analysis through my keyword in context program to see why they might be turning up in similar contexts to my target word.
Results for this method can be found here
Word Vector Analysis.
For this portion of the project I trained three Word2Vec models using the Gensim (“generate similar”) natural language processing library in Python. Word2Vec models transform the words in a document into spatial relationships, which makes it possible to express relationships between words based on how often they co-occur. Querying a model identifies semantic relationships between a keyword and other words that are used in similar contexts.
For the first model I loaded all 266 texts (approximately 52,125,670 words total, although it is likely many of those “words” are OCR errors), processed them to lowercase, removed punctuation, and turned them into a list of individual words by sentence. My preliminary results turned up several fragments of hyphenated words (“effi-” and “ciency,” “en-” and “gine”). For my final models I did a “find and replace” during preprocessing to remove the hyphens and rejoin the word pieces. Once the data was cleaned, I trained a skip-gram model using the Gensim library. I then queried the model with words from my keyword lists to see what other words were used in the same semantic contexts, and created scatterplot visualizations using tools from the “scikit-learn” Python data analysis package.
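A simplified sketch of the cleaning, training, and querying steps might look like the following. The hyperparameters, file names, and the hyphen handling are illustrative assumptions (written against Gensim 4.x), not the exact settings used for the project’s models.

```python
# A sketch of training a skip-gram Word2Vec model on the reassembled volumes.
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

def to_sentences(path):
    """Lowercase one text file and return it as a list of token lists, one per sentence."""
    with open(path, encoding="utf-8") as f:
        # Rejoining words hyphenated across line breaks approximates the
        # "find and replace" cleaning step described above.
        text = f.read().lower().replace("-\n", "")
    return [
        [w for w in word_tokenize(s) if w.isalpha()]  # drop punctuation tokens
        for s in sent_tokenize(text)
    ]

sentences = []
for path in ["volume_01_full.txt", "volume_02_full.txt"]:  # placeholder file names
    sentences.extend(to_sentences(path))

# sg=1 selects the skip-gram architecture rather than CBOW.
model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=5, workers=4)

# Query the model for words used in contexts similar to a keyword.
print(model.wv.most_similar("safety", topn=10))
```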
Because I was interested in whether the language around safety changed over time, I decided to perform the same keyword investigations on two smaller models: one trained on issues from 1888 through 1910, and the other on all issues from 1911 through 1922. Because some issues were missing, these two training sets were roughly the same size, about 20 million words each.
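A sketch of that comparison, reusing the to_sentences helper from the previous example, could look like the following; the combined period files and the keyword are hypothetical.

```python
# Train one model per period and compare the neighbors of the same keyword.
from gensim.models import Word2Vec

early_sentences = to_sentences("issues_1888_1910.txt")  # hypothetical combined files
late_sentences = to_sentences("issues_1911_1922.txt")

early_model = Word2Vec(early_sentences, sg=1, vector_size=100, window=5, min_count=5)
late_model = Word2Vec(late_sentences, sg=1, vector_size=100, window=5, min_count=5)

# If the nearest neighbors of a keyword differ between the two models, the
# language around that keyword may have shifted between the periods.
print(early_model.wv.most_similar("safety", topn=10))
print(late_model.wv.most_similar("safety", topn=10))
```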