
"Word Bang", The Evolution of Words and Language

From Santa Fe Institute Events Wiki

Revision as of 16:46, 14 June 2010 by Gpetri

"Word Bang", The Evolution of Words and Language

Nicholas Foti, Julie Granka, Erika Fille Legara, Thomas Maillart, Giovanni Petri

The evolution of words and language is thought to reflect the evolution of society. When new conceptual jumps occur (in technology, art, or philosophy), a new batch of words is very often needed. These words are introduced into the common language and eventually into official dictionaries.

By mining two datasets, we intend to uncover mechanisms of language evolution (i) in general and over a long period (Gutenberg Project) and (ii) for a specific, yet fast-evolving, subset of language (MIT Technology Review).


Initial Project Idea (by Dan Rockmore)

(Dan Rockmore) - In a class on complex systems that I teach at Dartmouth, one of the final projects seemed to indicate, from a small and somewhat biased sample of English words, that word origins (as given by one of the online dictionaries) are clustered at certain times. As a start, I would propose mining this information from an online dictionary, performing some initial analysis, and seeing if "there is a there, there..". If so, keep on going. http://tuvalu.santafe.edu/events/workshops/index.php?title=CSSS_2010_Santa_Fe-Projects_%26_Working_Groups&action=edit
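One rough way to check whether word origins are "clustered at certain times" is to bin first-attestation years and compare the spread of the bin counts with their mean. The sketch below is only an illustration: the years are invented, the function names `counts_per_bin` and `dispersion_index` are hypothetical, and the variance-to-mean ratio is one possible statistic, not one the project has settled on. A ratio well above 1 would hint at clustering relative to a uniform random scatter.

```python
from statistics import mean, pvariance

def counts_per_bin(years, start, end, width):
    """Count how many years fall in each bin [start + k*width, start + (k+1)*width)."""
    n_bins = (end - start) // width
    counts = [0] * n_bins
    for y in years:
        if start <= y < start + n_bins * width:
            counts[(y - start) // width] += 1
    return counts

def dispersion_index(counts):
    """Population variance of the bin counts divided by their mean."""
    m = mean(counts)
    return pvariance(counts) / m if m else 0.0

# Invented first-attestation years, for illustration only (not real dictionary data).
toy_years = [1605, 1610, 1612, 1620, 1850, 1855, 1860, 1990]

century_counts = counts_per_bin(toy_years, start=1600, end=2000, width=100)
index = dispersion_index(century_counts)
print(century_counts, index)
```

The bin width matters: very narrow bins leave most counts at 0 or 1, and very wide bins hide any bursts, so it would be worth trying several widths on the real data.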

Ideas/Developments

Julie: I'm (broadly) interested in looking at changes in word frequencies over time. Most simply, it might be interesting to see whether some words show drastic frequency changes: from very low to high frequency, or from intermediate frequency to extinction. Then, we could see if there are any reasons to explain these patterns (this would probably be more interpretable for the technology data).

Erika: Tracking these changes in texts taken from the Gutenberg Project might not tell us anything, since each book follows its own particular story/plot. What do you think? I do have to agree, though, that these patterns might be more common in technology blogs, where the theme (broadly, technology) is uniform throughout our data set.
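The frequency comparison discussed above could start from something like the following sketch. It assumes the texts have already been grouped by period; the toy corpus, the tokenizer regex, and the function names `relative_frequencies` and `largest_changes` are all made up for illustration. Using relative rather than raw counts partly addresses the concern about books of different lengths, though not the concern about differing topics.

```python
import re
from collections import Counter

def relative_frequencies(texts_by_period):
    """Map each period to {word: count / total number of words in that period}."""
    freqs = {}
    for period, text in texts_by_period.items():
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        freqs[period] = {w: c / total for w, c in counts.items()} if total else {}
    return freqs

def largest_changes(freqs, start, end, top=5):
    """Rank words by absolute change in relative frequency between two periods."""
    vocab = set(freqs[start]) | set(freqs[end])
    change = {w: freqs[end].get(w, 0.0) - freqs[start].get(w, 0.0) for w in vocab}
    return sorted(change.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Toy corpus, invented purely for illustration.
corpus = {
    1990: "the telegraph and the telephone carry the news",
    2010: "the internet and the blog carry the news the internet grows",
}

freqs = relative_frequencies(corpus)
ranked = largest_changes(freqs, 1990, 2010)
for word, delta in ranked:
    print(f"{word}: {delta:+.3f}")
```

Words that appear in only one period get a frequency of 0 in the other, so both "birth" and "extinction" show up as large changes.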

Julie: I think it would also be interesting to see if there are any correlations between words, i.e., pairs of words whose frequencies are positively or negatively correlated over time. It might also be interesting to implement a clustering approach like the one Jure Leskovec used when he clustered volume curves based on their shapes, but using word frequency changes over time instead. We could see if these results make sense given the word co-occurrence networks. I'm thinking that looking at the frequency data on a finer scale like this may be difficult to do with a very large number of words and sampling points. Maybe we should think about whether there is a particular set of words we would like to look at?

Erika: Regarding the co-occurrence network, the issue raised is well-founded. There could be a way out of this through some filtering methods. I believe that Nick is familiar with such techniques, in which edges deemed "weak" are filtered out of the network, leaving us with "strongly" connected nodes. Looking at certain words is also a possibility, but I'm just wondering how we are going to pick them from a wide range of other content words. This might prove to be difficult and could easily turn into cherry picking. :)
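Both ideas in this exchange can be sketched in a few lines. The example below computes a Pearson correlation between two word-frequency series and builds a sentence-level co-occurrence network with a plain weight threshold. The threshold is only a stand-in: it is not necessarily the filtering technique Erika has in mind, and the data, the cutoff value, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def cooccurrence_edges(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    edges = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges

def filter_weak_edges(edges, min_weight):
    """Keep only edges whose weight reaches the (assumed) cutoff."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

# Invented frequency series over four time points, for illustration only.
series = {"radio": [1, 2, 3, 4], "telegraph": [4, 3, 2, 1], "internet": [2, 4, 6, 8]}
r_up = pearson(series["radio"], series["internet"])
r_down = pearson(series["radio"], series["telegraph"])

# Invented sentences, for illustration only.
edges = cooccurrence_edges(["web page link", "web page", "web link server"])
strong = filter_weak_edges(edges, min_weight=2)
print(r_up, r_down, strong)
```

Computing correlations for every pair of words scales quadratically with vocabulary size, which is one concrete form of the difficulty Julie raises. Restricting the vocabulary first (for example by a minimum total count) is one option that avoids hand-picking words.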

Julie: I think we should also look at changes in word network statistics over time, calculated from networks built by those of us who know how to construct them.

Erika: Yes! We could track changes in the networks' topological properties through time.
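As a starting point for tracking topological properties through time, the sketch below computes a few basic statistics (node count, edge count, density, mean degree) for one undirected network per period. The edge lists are invented and the function name `network_stats` is hypothetical. A graph library would be the natural choice for richer measures such as clustering or path lengths.

```python
def network_stats(edges):
    """Basic statistics for an undirected network given as an iterable of word pairs."""
    # Normalise each pair so that (a, b) and (b, a) count as the same edge.
    undirected = {tuple(sorted(edge)) for edge in edges}
    nodes = {word for edge in undirected for word in edge}
    n, m = len(nodes), len(undirected)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    mean_degree = 2 * m / n if n else 0.0
    return {"nodes": n, "edges": m, "density": density, "mean_degree": mean_degree}

# Invented co-occurrence edges per period, for illustration only.
edges_by_period = {
    1990: [("phone", "wire"), ("phone", "call")],
    2010: [("phone", "app"), ("phone", "call"), ("app", "web"), ("web", "phone")],
}

stats = {period: network_stats(edges) for period, edges in edges_by_period.items()}
for period in sorted(stats):
    print(period, stats[period])
```

Because network size usually grows with corpus size, comparing raw statistics across periods can be misleading. Sampling a fixed number of words per period before building each network is one possible control.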