"Word Bang", The Evolution of Words and Language

From Santa Fe Institute Events Wiki

"Word Bang", The Evolution of Words and Language

Nicholas Foti, Julie Granka, Erika Fille Legara, Thomas Maillart, Giovanni Petri

The evolution of words and language is thought to reflect the evolution of society. When new conceptual jumps (in technology, art, philosophy) occur, a new batch of words is very often needed; these words are introduced into the common language and eventually into official dictionaries.

By mining two datasets, we intend to uncover mechanisms of the evolution of language (i) in general and over a long period (Gutenberg Project) and (ii) for a specific, yet fast-evolving, subset of language (MIT Technology Review).

Initial Project Idea (by Dan Rockmore)

(Dan Rockmore) - In a class on complex systems that I teach at Dartmouth, one of the final projects seemed to indicate, from a small and somewhat biased sample of English words, that word origins (as dated by one of the online dictionaries) are clustered at certain times. As a start, I would propose mining this information from some online dictionary, performing some initial analysis to see if "there is a there there", and if so, keeping on going.


Julie: I'm (broadly) interested in looking at changes in word frequencies over time. Most simply, it might be interesting to see whether there are words with drastic frequency changes: from very low to high frequency, or from intermediate frequency to extinction. Then we could see whether there are any reasons that explain these patterns (this would probably be more interpretable for the technology data).

Erika: Tracking these changes in articles taken from the Gutenberg Project might not tell us anything, since each book talks about a particular story/plot. What do you think? I have to agree, though, that these patterns might be more common in technology blogs, where the theme (broadly: technology) is uniform throughout our data set.
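As a concrete starting point for the frequency-change scan Julie describes, here is a minimal sketch, assuming the corpus has already been tokenized and grouped into chronological time periods (function names and the change-factor threshold are illustrative choices, not settled decisions):

```python
from collections import Counter

def frequency_trajectories(periods):
    """Relative frequency of each word in each time period.

    periods: list of token lists, one per period, in chronological order.
    Returns {word: [freq_in_period_0, freq_in_period_1, ...]}.
    """
    counts = [Counter(tokens) for tokens in periods]
    totals = [sum(c.values()) for c in counts]
    vocab = set().union(*counts)
    return {w: [c[w] / t if t else 0.0 for c, t in zip(counts, totals)]
            for w in vocab}

def drastic_changes(trajectories, factor=10.0, floor=1e-6):
    """Flag words whose frequency rose or fell by more than `factor`
    between the first and last period (floor avoids division by zero
    for words absent from one end of the corpus)."""
    flagged = {}
    for word, traj in trajectories.items():
        first, last = max(traj[0], floor), max(traj[-1], floor)
        ratio = last / first
        if ratio >= factor or ratio <= 1.0 / factor:
            flagged[word] = ratio
    return flagged
```

On a toy two-period corpus where "engine" displaces "horse", both words would be flagged: one with a ratio well above 1, the other well below.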

Julie: I think it would also be interesting to see whether there are any correlations between words; i.e., pairs of words whose frequencies are positively or negatively correlated over time. It might also be interesting to implement a clustering approach like the one Jure Leskovec used when he clustered volume curves based on their shapes, but applied to word frequency changes over time. We could then see whether these results make sense given the word co-occurrence networks. I'm thinking that looking at the frequency data on such a fine scale may be difficult with a very large number of words and sampling points; maybe we should think about whether there is a particular set of words we would like to look at?

Erika: Regarding the co-occurrence network, the issue raised is well-founded. There could be a way around it through some filtering methods. I believe Nick is familiar with techniques in which certain edges deemed "weak" are filtered out of the network, leaving us with "strongly" connected nodes. Looking at certain words is also a possibility, but I'm wondering how we would pick them out of a wide range of other content words. This might prove difficult and could easily turn into cherry-picking. :)
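The pairwise-correlation idea above can be sketched directly: compute the Pearson correlation between every pair of frequency trajectories and keep the strongly correlated (or anti-correlated) pairs. This is a naive O(n²) sketch for illustration; the threshold and the all-pairs loop would both need rethinking for a large vocabulary, which is exactly the scaling concern Julie raises:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlated_pairs(trajectories, threshold=0.8):
    """All word pairs whose frequency series are strongly positively
    or negatively correlated (|r| >= threshold).

    trajectories: {word: [freq_t0, freq_t1, ...]} as built earlier.
    """
    words = sorted(trajectories)
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            r = pearson(trajectories[a], trajectories[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs
```

The surviving pairs could then feed the Leskovec-style shape clustering, or serve as the edge list of a correlation network.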

Julie: I think we should also look at changes in word-network statistics over time, calculated from networks built by those of us who actually know how to construct networks.

Erika: Yes! We could track changes in the networks' topological properties through time.

Gio: I think it would be interesting to look at possible "towing" effects, which more or less fall under the co-occurrence + clustering argument above: basically, whether a peak in the use of a certain word correlates with secondary peaks of others (more or less a lagged correlation between frequencies). If we can agree on some sort of similarity measure between words or meanings, we might calculate correlation matrices of frequencies over time (say, one time snapshot per year for Gutenberg, one per month for Technology Review) and from them build networks whose topological properties one could study. Second, I would be very, very interested in looking at memory/recurrence effects in the frequencies of words/topics. The easiest method would be looking at the power-spectrum tilt of the time series of a chosen set of words (basically frequencies of frequencies, physics madness!). If we do see something 1/f-like, it would be a nice first hint that there are long-range temporal memory effects (so basically we always talk about the same stuff). Then one could look at the typical timescale for words related to different topics/subjects. I'll put in some more stuff as I think about it. Does anyone have ideas about sensible word metrics? Common roots? Spatial proximity in texts? Enlighten me!
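The "towing" effect Gio describes is a lagged cross-correlation: correlate word A's frequency series at time t with word B's at time t + lag, and look for a peak at some positive lag. A minimal sketch (the function names are ours, and a real analysis would also need a significance test for the peak):

```python
import math

def _pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def best_lag(leader, follower, max_lag=5):
    """Correlate `leader` at time t with `follower` at time t + lag.

    A correlation peak at lag > 0 suggests peaks in `leader` precede
    (i.e. 'tow') peaks in `follower` by that many snapshots.
    Returns (lag_with_highest_correlation, {lag: correlation}).
    """
    scores = {}
    for lag in range(max_lag + 1):
        a, b = (leader[:-lag], follower[lag:]) if lag else (leader, follower)
        if len(a) >= 2:
            scores[lag] = _pearson(a, b)
    return max(scores, key=lambda k: scores[k]), scores
```

The memory-effect idea would be a separate step: take the power spectrum of a word's frequency series (e.g. via an FFT) and fit the low-frequency tilt to test for 1/f-like behavior.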

Thomas: To reduce the complexity, and maybe the convergence time, of our project, I would propose developing an algorithm that picks out a set of words (e.g., 10 words, or a variable number) that are sufficient to describe/summarize the topic of a piece of text and help distinguish it from pieces of text that would otherwise be similar (e.g., two blog posts on a similar project). Of course, this algorithm would at first extract the most frequent words (excluding common words like "the", "a", etc.). In order to extract the distinctive concepts, it will be necessary to find words that are not necessarily frequent but are still important. For that, I am thinking of using clustering of words: if two words are close to each other, they probably won't be distinctive enough and would thus appear redundant in the keyword list. While this is very much an "engineering" approach, I believe it is closely related to the evolutionary question and to correlations between words.
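The "frequent but distinctive" part of Thomas's proposal resembles a TF-IDF weighting, sketched below; the stopword list is a tiny illustrative stand-in, and the clustering-based redundancy filter he describes would be a further step on top of this ranking, not something this sketch implements:

```python
from collections import Counter
import math

# Illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def keywords(doc_tokens, corpus, k=10):
    """Pick up to k words that are frequent in `doc_tokens` but rare
    across `corpus` (a list of other documents' token lists), i.e.
    distinctive rather than merely common.

    Score = term frequency * log(n_docs / (1 + document frequency)),
    a standard TF-IDF-style weighting.
    """
    tf = Counter(w for w in doc_tokens if w not in STOPWORDS)
    n_docs = len(corpus) + 1
    df = Counter()
    for doc in corpus:
        for w in set(doc):
            df[w] += 1
    scored = {w: c * math.log(n_docs / (1 + df[w])) for w, c in tf.items()}
    return [w for w, _ in sorted(scored.items(),
                                 key=lambda kv: kv[1], reverse=True)[:k]]
```

Words that appear in every document score near zero regardless of how frequent they are, which gives the "important but not necessarily frequent" behavior Thomas is after.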