Word Clouds with Mathematica

One of the examples in my introductory classes is to create a word cloud from a text file containing the text of one of the students' reading assignments. A word cloud is a graphical representation of the frequency with which words occur in a section of text. The underlying assumption is that the more often a word is used, the more important it is to the topic. For that assumption to hold, we have to remove common words such as “the”, “and”, and “a”, and combine different forms of a word, such as plural and singular versions, at least for the important terms.

The example below uses the text of an introductory chapter on plate tectonics that is assigned early in the course. The overall goal is to show students a use of computational thinking (and computing) that helps in the analysis of something other than mathematical computations.

Before making the word cloud we have to scrub the text to remove non-words such as numbers, and to remove words that carry little information about the topic, such as “the” – Mathematica calls these terms stop words. I use the following code to make a word cloud and then adapt some of the scrubbing functions to better clean the specific text that I am processing.

Import the text file

I created the text file using the scanned images of the textbook and the Mac OSX tool pdfPenPro. I copied and pasted the text into an editor and saved the file. We start by reading the text into a string variable called d0, which we can use to manipulate the text.

(* You will have to replace the path to the text file using
   the menu command: Insert->File Path...
   We load the text and store it in a variable we'll call d0 *)
d0 = Import["Chapter02.txt"];
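If the import fails, every later step fails in confusing ways, so it can help to check the result first. The lines below are a minimal sketch, not part of my original notebook. They assume that the notebook has been saved and that Chapter02.txt sits in the same folder; the variable name file is just for illustration.

```mathematica
(* Sketch: build the path explicitly and confirm the import worked.
   Assumes a saved notebook with Chapter02.txt in the same folder. *)
file = FileNameJoin[{NotebookDirectory[], "Chapter02.txt"}];
d0 = Import[file, "Text"];

If[StringQ[d0],
  (* report the size and preview the beginning of the text *)
  Print["Characters read: ", StringLength[d0]];
  Print[StringTake[d0, UpTo[200]]],
  Print["Could not read ", file]
]
```

Import returns $Failed when the file cannot be found, so the StringQ test separates a successful read from a bad path.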

Replace common plurals in the text

Edit the list of important plural words to ensure that the importance of a term is not diminished because it is sometimes pluralized. I examined the initial word cloud to identify plural and singular terms that should be combined, then added them to the command below, which replaces the plurals in the text with their singular forms. Note that StringReplace matches the strings exactly as written, so a capitalized “Plates” at the start of a sentence is not replaced. I am sure there are more elegant ways to do this with Mathematica, but this works and, for simple applications, does not require much effort.

d0 = StringReplace[d0,{
      "plates"->"plate",
      "earthquakes"->"earthquake",
      "tectonics"->"tectonic"}];
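One possible refinement, shown here as a sketch rather than as the code I used, is to match only whole words and to ignore capitalization. It relies on the WordBoundary pattern and the IgnoreCase option of StringReplace. The sample string and the names singular and rules are made up for illustration.

```mathematica
(* hypothetical sample text standing in for d0 *)
sample = "Plates move. Two plates collide, and earthquakes follow. Templates stay.";

(* the same plural -> singular pairs as above *)
singular = {"plates" -> "plate",
   "earthquakes" -> "earthquake",
   "tectonics" -> "tectonic"};

(* wrap each plural in WordBoundary so that only whole words match *)
rules = Map[(WordBoundary ~~ First[#] ~~ WordBoundary) -> Last[#] &, singular];

(* IgnoreCase -> True also catches capitalized forms such as "Plates" *)
cleaned = StringReplace[sample, rules, IgnoreCase -> True]
```

In this sample both “Plates” and “plates” are replaced, while “Templates” is left alone because the match does not start at a word boundary. The replacement text is lower case, which is harmless here because the words are converted to lower case in the next step anyway.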

Delete uninformative words in the text

Not all of the words in the file carry useful information for a word cloud. We can use the DeleteStopwords command to remove common words like “the” and “is”. I first convert the string d0 to a list of words with TextWords, then convert the words to lower case with ToLowerCase, and then delete the stop words. Next I Select the words in the list that contain only letters, removing items with other characters (^(,[]…) – be aware that this will also remove words with apostrophes. Finally, I remove a custom list of uninformative words that I noticed in earlier versions of the cloud. I am sure there are more elegant ways to do this, and hand-picking words would not be possible in an automated process, but that’s not my focus.

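(* split into words, convert to lower case, drop the standard stop words *)
w0 = DeleteStopwords[ToLowerCase[TextWords[d0]]];
(* keep only items made up entirely of letters *)
w1 = Select[w0, LetterQ[#] &];
(* drop words that are frequent in this chapter but uninformative *)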
words = Select[ w1, # != "new"
     && # != "million"
     && # != "found"
     && # != "figure"
     && # != "occur"
     && # != "km"
     && # != "mi"
     && # != "relatively"
     && # != "called"
     && # != "away"
     && # != "chapter"
     && # != "shown"
     && # != "ing"
    &];
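One of the more compact alternatives is to keep the custom words in a list and remove them in a single step. The sketch below does that with DeleteCases and then counts the remaining words, which is a quick way to spot further candidates for the list without drawing the cloud. The sample list and the names customStops and kept are made up for illustration; in the notebook you would apply the same calls to w1.

```mathematica
(* hypothetical sample list standing in for w1 *)
sample = {"plate", "figure", "mantle", "km", "plate",
   "crust", "called", "plate", "mantle"};

(* the custom stop list from above, kept in one place *)
customStops = {"new", "million", "found", "figure", "occur", "km", "mi",
   "relatively", "called", "away", "chapter", "shown", "ing"};

(* remove every word that matches any entry in the list *)
kept = DeleteCases[sample, Alternatives @@ customStops];

(* most frequent remaining words, largest counts first *)
TakeLargest[Counts[kept], UpTo[5]]
```

Adding a word to the stop list is then a one-word change rather than another line of the && chain.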

Build the word-cloud graphic

Mathematica has a built-in function to create the word cloud, so once we have the list of words, the rest is easy.

WordCloud[words]

Word cloud created from the text of an introductory textbook’s chapter on plate tectonics.

You can customize the cloud, but the default settings are fine – and we are done.
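For anyone who does want to adjust the result, here is a small sketch of the kind of customization that is available. It is not something I used for the figure above. The sample list, the option values, and the file name wordcloud.png are arbitrary choices for illustration; in the notebook you would pass words instead of sample.

```mathematica
(* hypothetical sample list standing in for words *)
sample = Flatten[{
    ConstantArray["plate", 12], ConstantArray["mantle", 7],
    ConstantArray["crust", 5], ConstantArray["ridge", 3],
    ConstantArray["trench", 2]}];

cloud = WordCloud[sample,
   MaxItems -> 50,                           (* show at most 50 words *)
   WordOrientation -> "HorizontalVertical",  (* mix horizontal and vertical words *)
   ImageSize -> 600];                        (* width in printer's points *)

(* save the graphic for use in slides or a handout *)
Export["wordcloud.png", cloud]
```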

Credits

I used the syntax highlighting tool at https://andrewsun.com/tools/syntax-highlighter/ to typeset the Mathematica code.