Lexical networks

CSSS Santa Fe 2007


(suggested by Sayan)

Interested people: (please add yourself if you are interested)

1. Sayan_Bhattacharyya

2. Hannah_Cornish

3.


Lexical networks are graphs encoding the co-occurrence of words in large texts. (If the text is sufficiently large, we can pretend that the network encodes the entire language).

In the graph, two words that co-occur are connected by an edge (they are adjacent), and the degree of a given word is the number of edges that connect it with other words. You can take a look at this short, readable paper to see what lexical networks look like: "The small world of human language" by Ferrer i Cancho and Solé.
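To make the construction concrete, here is a minimal sketch of building a co-occurrence network from raw text. Everything here beyond the definition above is our own assumption: Python with networkx, whitespace tokenization, lowercasing, and a window that links only adjacent words.

 # Minimal sketch: build a lexical (co-occurrence) network from raw text.
 # Assumed choices (not from the project description): whitespace
 # tokenization, lowercasing, and a co-occurrence window of 2.
 import networkx as nx
 def build_lexical_network(text, window=2):
     """Link every pair of words that co-occur within `window` tokens."""
     tokens = text.lower().split()
     g = nx.Graph()
     for i, word in enumerate(tokens):
         for other in tokens[i + 1 : i + window]:
             if other != word:
                 g.add_edge(word, other)
     return g
 g = build_lexical_network("the small world of human language is a small world")
 print(g.number_of_nodes(), g.number_of_edges())
 print("degree of 'small':", g.degree["small"])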


We are thinking of going in one of two possible directions for the project (but are open to other suggestions):

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks (e.g. tolerance to knocking out some nodes in the network). [An interesting paper to look at here may be: Albert, R., Jeong, H., & Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378-381.] Evidently, this can have some interesting connections to the Healing strategies for networks project, as well.

A motivation for thinking about (1) is that questions about mechanisms of analogy-making and comparison-making, at all levels of cognition, tend to be very interesting, and so (1) fits in well with some broader questions in that regard.

A motivation for thinking about (2) is that several people in the summer school are thinking of working on projects on fault tolerance / attack tolerance in small-world networks -- e.g. in biological/metabolic networks, in neural networks, etc. -- and so (2) would "mesh" well with projects on other kinds of scale-free networks that others in the summer school are thinking of working on, leading to exchange of ideas, etc.

Now, of course, (1) and (2) above could actually end up being pursued within the same project: for example, the identification of synonyms within a lexical network could feed into attack tolerance (e.g. how to design a "self-healing" lexical network so that, if some of the nodes in the network are taken out, synonyms can step in to take over for the words thus taken out...)
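As a concrete handle on option (2), here is a hedged sketch of the standard knock-out experiment in the spirit of Albert, Jeong & Barabási: remove a fraction of nodes, either at random or highest-degree first, and watch the size of the largest connected component. The use of networkx and of a Barabási-Albert test graph (standing in for a real lexical net) are our own assumptions.

 # Sketch of an attack-tolerance experiment: delete nodes and track how
 # the largest connected component shrinks. The test graph is a stand-in
 # for a lexical network, which is also roughly scale-free.
 import random
 import networkx as nx
 def giant_component_size(g):
     return max((len(c) for c in nx.connected_components(g)), default=0)
 def attack(g, fraction=0.2, targeted=True):
     """Remove a fraction of nodes; targeted = highest degree first."""
     g = g.copy()
     n_remove = int(fraction * g.number_of_nodes())
     if targeted:
         victims = sorted(g.nodes, key=lambda v: g.degree[v], reverse=True)[:n_remove]
     else:
         victims = random.sample(list(g.nodes), n_remove)
     g.remove_nodes_from(victims)
     return giant_component_size(g)
 g = nx.barabasi_albert_graph(1000, 2, seed=42)
 print("targeted attack:", attack(g, targeted=True))   # hubs removed: fragile
 print("random failure: ", attack(g, targeted=False))  # robust to random loss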

An interesting exchange with John Mahoney on June 8

On 6/8/07, John Mahoney <jrmahone (at) ucdavis (dot) edu> wrote:

> might be interesting to try to model a lex net where synonyms display resistance to becoming equally used. attempting to use ideas from Page's talk on consistency and coherence. some say ketchup.. some say catsup ?
>
> so basically thinking about the difference between semantic equivalence and equivalent use?
>
> -john


Hi John,

Thanks for this interesting idea.

It makes me think of each node (word) maybe as a "basin of attraction", drawing meanings into it, with some (a few) meanings perched precariously on the rim between two basins and capable of falling into either word's basin.

We might also have a natural tendency to "compartmentalize" the world into discrete, non-overlapping categories (this would make sense from the viewpoint of evolutionary history, I think -- it might simply pay to carve up the world into discrete categories if you're a hunter-gatherer on the savannah trying to make split-second decisions). And so maybe we can say, from one point of view, that "language abhors synonyms"?

And there is a competing pressure, for reasons of building fault tolerance into the system, to get some redundancy in there, too.

So from the point of view of a lexical network, there are good reasons, perhaps, for nodes to be similar (what you called "semantic equivalence") as well as for nodes to be dissimilar (what you called "non-equivalence of use").

Thanks again, am putting this up on the Wiki.

More thoughts from Sayan and Hannah on June 12

Earlier, we mentioned two possible directions for the project:

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks


After talking over the past few days, and also attending Mark Newman's talk today, we are now tentatively thinking of settling on option (1) above (synonyms) rather than option (2).

Narrowing down our ideas after further discussion, here is our thinking at this point:

(1) We'll create a lexical net with data (text, i.e. a corpus) from a given domain, and then "grow" the net thus built with more data from the domain. The second set of data should be similar enough to the first, but with some systematic variation.

Specifically, we are thinking of using the abstracts (or maybe even full papers) of the Santa Fe Institute working paper series for this purpose. Our thinking behind this is as follows: in the early days, most working papers at SFI were from the hard sciences, while, with time, more and more papers have been from the social sciences. So, if we create a lexical net using the text of the earlier papers as a corpus, and then "grow" the net into a new net using the text of the later papers as a corpus, our working hypothesis is that we may end up getting subtle but interesting differences between these two lexical nets.
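To make the "growing" step concrete: under the same assumptions as the earlier sketch, the second corpus can simply be folded into a copy of the first network, leaving the early net and the grown net to be compared side by side. The function build_lexical_network refers to the sketch earlier on this page, and the two tiny example corpora are, of course, made up.

 # Sketch of "growing" an existing net with a second corpus: add the
 # later text's co-occurrence edges to a copy of the early network.
 # Reuses build_lexical_network from the earlier sketch on this page.
 def grow(g, text, window=2):
     grown = g.copy()
     tokens = text.lower().split()
     for i, word in enumerate(tokens):
         for other in tokens[i + 1 : i + window]:
             if other != word:
                 grown.add_edge(word, other)
     return grown
 early = build_lexical_network("spin glass energy landscape dynamics")
 late = grow(early, "agent based models of market dynamics")
 print(early.number_of_edges(), "->", late.number_of_edges())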

Synonymy is an issue that Hannah is interested in, and Sayan is also interested in synonymy because he is generally interested in analogy and comparison (wherever they occur). So, we discussed that what we might both be interested in is finding measures of "synonymy" in the two lexical networks thus built. We might find some interesting differences, given that the nature of the texts upon which the earlier lexical network would be built (more hard-science papers) is subtly different from the nature of the texts upon which the later lexical network would be built (relatively more social-science papers), while both sets of papers belong to a common domain (all are texts about "complex systems", after all).

As to a suitable metric/measure for "synonymy", we're thinking of using one of the standard measures of similarity of nodes in a network (such as cosine similarity, for example), on selected, salient nodes. (Actually, neither of us knew anything about statistical tools for measuring similarity of nodes in a network before Mark Newman gave his talk about networks today here at the summer school, so we may well be erring or being somewhat naive in the choice of measure here: "cosine similarity" just seems like a good thing. If you have any suggestions as to a better way of thinking about synonymy, or perhaps measures more suited specifically to this type of network, we're very interested in hearing about it.)
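As a worked version of the measure we have in mind: one standard definition of cosine similarity between two nodes is the number of neighbors they share, divided by the geometric mean of their degrees, so two words that never co-occur with each other but always co-occur with the same other words score 1.0. A minimal sketch, again assuming Python/networkx (our choice, not part of the project):

 # Cosine similarity between nodes u and v: |common neighbors| divided
 # by sqrt(k_u * k_v). High values flag words used in the same contexts,
 # which is roughly the "synonym" signature we are after.
 import math
 import networkx as nx
 def cosine_similarity(g, u, v):
     if g.degree[u] == 0 or g.degree[v] == 0:
         return 0.0
     common = len(set(g[u]) & set(g[v]))
     return common / math.sqrt(g.degree[u] * g.degree[v])
 g = nx.Graph([("ketchup", "tomato"), ("ketchup", "fries"),
               ("catsup", "tomato"), ("catsup", "fries")])
 print(cosine_similarity(g, "ketchup", "catsup"))  # -> 1.0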

Modified ideas

[Sayan sent off this email to Mark Newman today. (It incorporates some additional/new thoughts.)]

Hi Prof. Newman,

This is Sayan, a grad student at U-M and one of the participants in the current Santa Fe Institute Summer School. First of all, many thanks for your interesting talks here earlier this week. They gave me some ideas for the group project on lexical networks which I and a couple of others are thinking of doing here at the Summer School.

I am writing to see if you might possibly have the references for the two papers that you mentioned towards the end of your last lecture (the paper about bipartite networks, and the paper about algorithms to find groups of vertices; I *think* the latter was Newman and Light (?) but I'm not sure -- I didn't copy down the references, thinking that the slides would be on the Summer School Wiki, but they're not there yet). Do you by any chance have the references?

Our idea for the project, briefly, is the following: we'll create a lexical net with data (text, i.e. a corpus) from a given domain, and then "grow" the net thus built with more data from the domain. We want the second set of data to be similar enough to the first, but with some systematic variation.

Specifically, we are thinking of using the abstracts (or maybe even full papers) of the Santa Fe Institute working paper series for this purpose. Our thinking behind this is as follows: in the early days, most working papers at SFI were from the hard sciences, while more recently more and more papers have been from the social sciences. So, if we create a lexical net using the text of the earlier papers as a corpus, and then "grow" the net into a new net using the text of the later papers as a corpus, our working hypothesis is that we may end up getting subtle but interesting differences between these two lexical nets. We would expect the first lexical net to have more "hard-science" qualities somehow, and the second lexical net to have (relatively) more "social-science" qualities somehow.

We haven't quite thought everything through, but it does seem like algorithms that have to do with partitions in networks, or finding groups of vertices within networks, might be handy -- hence the request. (As this is just an exploratory project, we're mainly interested in using it to play with and get insights about lexical networks.)

I know you must be very busy and so I apologize for making the request, since I should have noted down the references myself at the time of the lecture. In case you have the references lying about, they would be very helpful for us. (And in case you have any thoughts/suggestions about the project based on what I wrote above, that would be very helpful too!)

Thanks very much for your help,

-Sayan.
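The "groups of vertices" step mentioned in the email can be prototyped even before the references arrive. Here is a hedged placeholder using the modularity-based community detection that ships with networkx; whether this matches the algorithm from the lecture is an open question, and the karate-club graph is just a stand-in for a real lexical net.

 # Placeholder for the "finding groups of vertices" step: greedy
 # modularity-based community detection (Clauset-Newman-Moore), as
 # implemented in networkx. Whether this is the algorithm from the
 # lecture is unknown; it stands in for whatever the references describe.
 import networkx as nx
 from networkx.algorithms.community import greedy_modularity_communities
 g = nx.karate_club_graph()  # stand-in for a lexical network
 for i, group in enumerate(greedy_modularity_communities(g)):
     print(f"group {i}:", sorted(group))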