Monthly Archives: December 2010

Culturomics meets Darth Vader

A few days ago, Google introduced a site they called Culturomics in which users can enter two or more terms and get a graph of their frequency of occurrence within the Google Books archive. Depending on your word selection, you can get some interesting results. For instance, the verb form ain’t has been attested since 1840 with the now “correct” isn’t not appearing widely until 1840.

I can tell this service has grabbed the popular imagination, since some sci-fi fans have put in some genre-themed word pairs. Apparently Darth Vader IS more popular than Luke Skywalker. However, there are some issues to consider.

One critique comes from Mark Davies of the Corpus of Historical American English (COHA). One feature of the COHA interface that the Google n-gram tool lacks is a list of frequently co-occurring words (or “collocates”). For instance, in 1900, the word gay may frequently occur alongside words such as “happiness, light, carefree”. Today it is much more likely to co-occur with words such as “rights” or “marriage” (especially in news articles).
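To make the idea of collocates concrete, here is a minimal sketch in Python that counts the words appearing within a few tokens of a target word. It is not how COHA computes its lists; the `collocates` function, the window size, and the example sentences are all invented for illustration.

```python
from collections import Counter
import re


def collocates(texts, target, window=4):
    """Count words that appear within `window` tokens of `target`."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Invented sentences standing in for two time slices of a corpus.
texts_1900 = [
    "she felt gay and carefree",
    "gay light and happiness filled the room",
]
texts_2010 = [
    "the court ruled on gay marriage",
    "advocates for gay rights and gay marriage",
]

old = collocates(texts_1900, "gay")
new = collocates(texts_2010, "gay")
print(old.most_common(5))
print(new.most_common(5))
```

A real collocate list would also filter out function words such as “the” and “and” and would weight the counts statistically, but even this toy version shows the neighbors of the word shifting between the two slices.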

Davies also notes that the Google tool doesn’t yet distinguish between parts of speech or differences in usage/meaning, and this can be very important. For instance, a chart of twitter shows a peak circa 1900, but at that time it referred to sounds a bird might make (or perhaps the sound of gossipy chit chat). Today it generally refers to the Twitter service – but there is no way to distinguish this use.

Similarly, the tool also doesn’t let you view the types of passages in which a word occurs. For instance, the word ain’t continues to be found in written text, but it may be that after 1840, the context for a lot of the uses is in writing guides saying to AVOID “ain’t”. That makes a difference in how to analyze usage of “ain’t” over time.

That’s not to say that there is no use to the Culturomics tool, especially if the terms are very specific and unambiguous (e.g. Darth Vader), but I do have to agree with Davies that it doesn’t let you track the subtleties very well. Fortunately, there are other tools out there that linguists can use, although I will have to admit that their interfaces are more complex.

Postscript: Jan 4, 2011

There have been several experiments online, especially on Language Log (e.g. northeaster vs nor’easter), working to see how the Google engine works, and Google is responding. One feature I initially missed is that you can narrow your corpus a bit. For instance, I ran the isn’t/ain’t pair again but restricted it to the “English fiction” corpus – this should rule out pesky grammar books (although it would not distinguish quotes written in “dialect”). It’s still interesting to note that “isn’t” is not a clear winner until sometime after 1900.

Postscript 2: Jan 7, 2011

One issue that could also be problematic is homonyms/heteronyms as well as multiple usages. For instance, a report of dove may not distinguish the irregular past tense of dive from the bird of peace.

Pragmatics and Statistics

One of my Listservs led me to this interesting article by John Allen Paulos about the distinction between the “literary and scientific cultures”. As part of the discussion, Paulos discusses some cases where knowing a narrative background affects how probability is assessed.

Consider the following two statements. Which is more probable?

  1. Sarah is a bank teller.
  2. Sarah is a bank teller and has a philosophy degree.

The answer is that the first option is more probable because only one condition needs to be met. In order for the second to be true, two conditions are required – being a bank teller and having a degree in philosophy.
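The rule at work here is that P(A and B) can never exceed P(A). As a rough sketch, the Python below builds a made-up population and counts both groups. The rates `P_TELLER` and `P_PHILOSOPHY` are hypothetical, and the two traits are drawn independently only to keep the example short; the inequality holds whether or not they are independent.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical rates, invented for illustration only.
P_TELLER = 0.02
P_PHILOSOPHY = 0.01

population = [
    {
        "teller": random.random() < P_TELLER,
        "philosophy": random.random() < P_PHILOSOPHY,
    }
    for _ in range(100_000)
]

tellers = sum(p["teller"] for p in population)
tellers_with_degree = sum(p["teller"] and p["philosophy"] for p in population)

p_a = tellers / len(population)
p_a_and_b = tellers_with_degree / len(population)

# Everyone counted in the second group is also in the first,
# so the conjunction can never be the more probable statement.
print(p_a, p_a_and_b)
```

Whatever rates you plug in, the second count is a subset of the first, which is all the probability argument needs.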

Now consider this version from Paulos in which the teller is given a brief bio:

Linda is single, in her early 30s, outspoken, and exceedingly smart. A philosophy major in college, she has devoted herself to issues such as nuclear non-proliferation. So which of the following is more likely?:

  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.

The finding is that more people will say that the second option is the more likely one – i.e. that Linda is a bank teller and in the feminist movement, even though it requires the fulfillment of two conditions.

There are several philosophical tacks one can take to the problem, but I think one factor is that the story along with the presentation of the information affects the construction of the model used to evaluate the statements.

Someone reading the first scenario without the narrative probably constructs the intended model where the probability of being a bank teller versus a bank teller with a philosophy degree is evaluated across all adult women. It’s easy to see that fulfilling condition A is more probable than fulfilling conditions A and B.

The second scenario with Linda, though, probably causes most people to build a model not across all adult women but across all adult women who have a philosophy degree and who were activists in their youth. It’s NOT the same pool of candidates, and there is a legitimate reason to think probability judgments COULD be different. Interestingly, if you presented the two Linda options as

  1. Linda is a bank teller.
  2. Linda is active in the feminist movement.

then the conclusion would likely be that Linda being active in the feminist movement is more likely than her being a bank teller. In other words, readers could be using the narrative to build a stereotyped persona where someone who was politically active in college remains active. In the same vein, most people likely assume that someone with a philosophy degree becomes a teller only as a last resort and that most tellers have a degree in accounting or another related field. This is one possible source of the fallacy.

I would also argue that the presentation of the options causes the pragmatic engine to introduce another logical trap. Because both options allow that Linda is a bank teller, this could mean that readers assume that Linda ends up as a bank teller (even though that’s not what the option says). Thus, readers could be interpreting the Linda options as:

  1. Linda is a bank teller who is not active in the feminist movement.
  2. Linda is a bank teller who is active in the feminist movement.

There is a further pragmatic interpretation that option 1, “Linda is a bank teller”, means that she is not politically active at all. That’s not literally the case (for instance, option 1 does allow that Linda could still be active in the anti-nuclear-proliferation movement, but not the feminist movement). In pragmatic land, though, omitting information is interpreted as meaning it doesn’t exist. That’s why people often consider not saying something to be “lying.”
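The difference between the literal and the pragmatic readings can be put into numbers. The sketch below uses invented counts for a pool of 1,000 women who match Linda’s description; the figures and the `prob` helper are hypothetical, and only the comparisons matter.

```python
# Hypothetical counts for 1,000 women matching Linda's description.
# The numbers are invented; only the comparisons matter.
pool = {
    ("teller", "feminist"): 15,
    ("teller", "not feminist"): 5,
    ("not teller", "feminist"): 680,
    ("not teller", "not feminist"): 300,
}
total = sum(pool.values())


def prob(job=None, activism=None):
    """Share of the pool matching the given job and/or activism."""
    hits = sum(
        n
        for (j, a), n in pool.items()
        if (job is None or j == job) and (activism is None or a == activism)
    )
    return hits / total


feminist_only = prob(activism="feminist")                  # "is active in the feminist movement"
literal_1 = prob(job="teller")                             # literal option 1
option_2 = prob(job="teller", activism="feminist")         # option 2
pragmatic_1 = prob(job="teller", activism="not feminist")  # option 1 as readers may hear it

print(feminist_only, literal_1, option_2, pragmatic_1)
```

With these made-up counts, option 2 loses to the literal option 1, as the probability rule requires, but it beats the pragmatic reading of option 1. Under that reading, picking option 2 is not a mistake at all; readers are simply answering a different question from the one that was literally asked.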

So to summarize, I think the skewed probability judgments aren’t just a result of people being sucked into a mini soap opera, but of two factors the narrative introduces – 1) narrowing the set of women to those with philosophy degrees, which leads to different stereotypes, and 2) the options leading to misconstrued pragmatics which differ from the literal meaning.

The ability to reasonably construct a pragmatic meaning behind a literal statement is critical for social relations and reducing conversational length. But it can lead to some glitches like the narrative above.

Weird Bronze Age Bull Dancing Moment

If you’ve ever had a Greek archaeology or Bronze Age Mediterranean archaeology course, then you’ve probably seen the bull leapers in ancient frescoes from Crete.

Bull with three men - one somersaulting off back, one at horns and one preparing to leap
Photo courtesy of Dimitris Agelakis. Licensed under Creative Commons.

I’ve heard some comparisons of this Cretan bull leaping to Spanish bullfighting, but clearly flipping over a bull is not the same as waving a cape and stabbing it…or is it?

It turns out that there is a type of bull sport from the Gascony region of France called course landaise, which includes sauteurs (literally “leapers”) who do, in fact, flip over a bull (or, technically, a horned cow). It’s quite elegant and rarely injures the bovine (although leapers can get banged up). This is a kind of bull sport I can get into.

In terms of the overall origin of Iberian/Southern French bullfighting, a connection is made to the Roman gladiatorial games, and it does make sense considering that some forms of bullfighting involve combat. Still, the bull dancing is tantalizing, because the dancing version is also said to exist in Southern Spain. Prehistoric Iberia and Crete are two places with lots of archaeology and few written records, but lots of cross-cultural contacts (especially Iberia).

Even if no direct connection exists, the course landaise shows what the Cretan version may have been like and why it was worth preserving in a fresco in ancient Knossos.