Text Corpus Analysis

1. Explain what documents or document collections you are working with.

The text documents I’m working with are “fc.txt” and “dc.txt”. I will use two tools, Voyant and AntConc, to analyze the texts.

2. Post at least two screen captures on your website from the text analysis tools (for each text, or group of texts) that suggest to you there might be something interesting to compare or contrast.

dc.txt:

fc.txt:

3. Start drafting some explanation of what you can see in screen captures. What ideas do you have about how these documents/document sets compare/contrast?

From the screen captures above, we can read off specific word and phrase data for the two documents, such as word frequency, phrase frequency, the most and least frequent words, the trend of a specific word across the text, and phrase frequency at different N-Gram sizes.

Voyant:

In “dc.txt”, the most frequent word is “said” (570), which can also be seen in “Cirrus”; the document has 162013 total words and 9418 unique word forms. I picked “wonderful” as an interesting word to compare. The document frequency of “wonderful” is 28 and it has 173 relatives; its trend is highest in document segments 6-7 and lowest in segments 2-3. The words to the left of “wonderful” include “a”, “is”…, and the words to its right include “place”, “when”….
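The summary numbers above (total words, unique word forms, most frequent word) could be roughly cross-checked outside Voyant with a short Python sketch like the one below. This is not how Voyant computes them: the tokenizer, the tiny stopword list, and the sample sentence are all assumptions made for illustration, so the counts on a real file may differ from Voyant's.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "it", "i"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_summary(text, stopwords=STOPWORDS):
    """Return total words, unique word forms, and the top non-stopword counts."""
    tokens = tokenize(text)
    counts = Counter(t for t in tokens if t not in stopwords)
    return {
        "total_words": len(tokens),
        "unique_forms": len(set(tokens)),
        "top_words": counts.most_common(3),
    }

# Hypothetical sample text; for a real run, read the file instead, e.g.
# text = open("dc.txt", encoding="utf-8").read()
sample = "He said it is a wonderful place. She said the place is wonderful, he said."
summary = corpus_summary(sample)
print(summary)
```

Without the stopword filter, words like “the” would almost certainly come out on top, which is why the filter is included here.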

In “fc.txt”, the most frequent word is “man” (131), and its frequency is highest in segments 4-6; the document has 75275 total words and 7027 unique word forms. For the same word of interest as in “dc.txt”, the document frequency of “wonderful” is 17 and it has 226 relatives; the word’s trend goes lower and lower as the text goes on. The words to the left of “wonderful” include “of”, “these”…, and the words to its right include “man”, “facts”….
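The word trend and the left/right neighbors of a word can be sketched in Python as follows. The segment count, the function names, and the sample sentence are assumptions for illustration only; Voyant's own segmentation and context settings may work differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def segment_trend(tokens, word, segments=10):
    """Count occurrences of `word` in each of `segments` roughly equal slices."""
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    return [tokens[i:i + size].count(word) for i in range(0, len(tokens), size)]

def neighbours(tokens, word):
    """Count the words immediately to the left and to the right of `word`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right

# Hypothetical sample text with 12 tokens, split into 3 segments of 4 tokens.
sample = "A wonderful, wonderful place. The night was cold and dark. How wonderful!"
tokens = tokenize(sample)
trend = segment_trend(tokens, "wonderful", segments=3)
left, right = neighbours(tokens, "wonderful")
print(trend)
print(left.most_common(), right.most_common())
```

A list that falls from its first value to its last would match the “lower and lower” trend described for “fc.txt”.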

In conclusion, the frequency of the most frequent word in “dc.txt” is much higher than in “fc.txt” (570 to 131). “dc.txt” also has more total words and more unique word forms; since it is more than twice as long, part of the frequency difference comes from text length. For “wonderful”, the document frequency in “dc.txt” is higher (28 to 17), but it has fewer relatives (173 to 226). The word also has a different trend and different left/right neighboring words in the two documents.

AntConc:

In “dc.txt”, when I set the N-Gram size to 2, the most frequent two-word phrase is “of the”, with a frequency of 865. When I set the N-Gram size to 3, the most frequent three-word phrase is “I could see”, with a frequency of 73. When I set the N-Gram size to 4, the most frequent four-word phrase is “dr seward s diary”, with a frequency of 39. When I set the Min. Freq to 10, the least frequent phrase (N-Gram size 2) is “madam mina”, with a frequency of 87. I chose the phrase “of the” as the element to compare; the words that show up with “of the” are “great”, “count”, “night”… The word “great” is the one that shows up with it most often.
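The N-Gram counts with a minimum frequency can be sketched in Python as below. This is only an approximation of what AntConc does: the tokenizer (letters only, so an apostrophe splits a word, which would produce a phrase like “dr seward s diary”), the function names, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_counts(tokens, n, min_freq=1):
    """Return (phrase, count) pairs for all n-word phrases, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [(g, c) for g, c in counts.most_common() if c >= min_freq]

# Hypothetical sample text, not taken from either document.
sample = (
    "Dr. Seward's Diary. I could see the top of the hill. "
    "Dr. Seward's Diary. I could see the end of the road."
)
tokens = tokenize(sample)
four_grams = dict(ngram_counts(tokens, 4))
frequent_bigrams = dict(ngram_counts(tokens, 2, min_freq=2))
print(four_grams["dr seward s diary"])
print(frequent_bigrams)
```

Raising `min_freq` shortens the list from the bottom, so the last entry is the least frequent phrase that still meets the threshold.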

In “fc.txt”, when I set the N-Gram size to 2, the most frequent two-word phrase is again “of the”, with a frequency of 527. When I set the N-Gram size to 4, the most frequent four-word phrase is “but I did not”, with a frequency of 10. When I set the Min. Freq to 10, the least frequent phrase (N-Gram size 2) is “as if”, with a frequency of 33. For the same phrase “of the”, the words that show up with it are “most”, “same”, “old”… The word “most” is the one that shows up with it most often.
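The words that show up with “of the” can be counted with a sketch like the one below. It assumes that “show up with” means the word immediately after the phrase, which may not match the AntConc view used for the screen captures; the function name and the sample sentence are also made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def words_after(tokens, phrase):
    """Count the word that immediately follows each occurrence of `phrase`."""
    target = phrase.lower().split()
    n = len(target)
    following = Counter()
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == target:
            following[tokens[i + n]] += 1
    return following

# Hypothetical sample text, not taken from either document.
sample = "The fear of the great Count, the chill of the night, the work of the great city."
after = words_after(tokenize(sample), "of the")
print(after.most_common())
```

Running the same function on both files and comparing the top entries would give the kind of contrast described above (“great” in one document, “most” in the other).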

In conclusion, the two documents have the same most frequent phrase at N-Gram size 2, which is “of the”; its frequency in “dc.txt” is higher than in “fc.txt” (865 to 527). At N-Gram size 4, they have different most frequent phrases with different frequencies; the frequency in “fc.txt” is lower than in “dc.txt” (10 to 39). With the Min. Freq set to 10, the frequency of the least frequent phrase at N-Gram size 2 also differs; it is much higher in “dc.txt” (87 to 33). For the compared phrase “of the”, the two documents have different words that show up with it and a different most frequent accompanying word (“great” versus “most”).