
Lexical networks

From Santa Fe Institute Events Wiki


Revision as of 05:07, 13 June 2007

CSSS Santa Fe 2007


(suggested by Sayan)

Interested people: (please add yourself if you are interested)

1. Sayan_Bhattacharyya

2. Hannah_Cornish

3.


Lexical networks are graphs encoding the co-occurrence of words in large texts. (If the text is sufficiently large, we can pretend that the network encodes the entire language.)

In the graph, two connected words are adjacent, and the degree of a given word is the number of edges connecting it to other words. To see what lexical networks look like, you can take a look at this short, readable paper: "The small world of human language" by Ferrer i Cancho and Solé.
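As a concrete sketch of these definitions, a minimal co-occurrence network can be built from raw text with a sliding window. (The window size of 2, the helper name `build_lexical_network`, and the toy sentence are illustrative choices, not anything from the paper.)

```python
from collections import defaultdict

def build_lexical_network(text, window=2):
    """Build an undirected co-occurrence graph: an edge links two distinct
    words that appear within `window` positions of each other."""
    words = text.lower().split()
    edges = set()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if w != words[j]:
                edges.add(tuple(sorted((w, words[j]))))
    # Adjacency dict; the degree of a word is the size of its neighbour set.
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

net = build_lexical_network("the quick fox jumps over the lazy dog")
print(len(net["the"]))  # degree of "the": 6
```

With a real corpus one would also lowercase consistently, strip punctuation, and perhaps weight edges by co-occurrence counts; this sketch keeps only the unweighted structure the degree definition needs.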


Thinking of going along either of the two possible directions for the project (but open to other suggestions):

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks (e.g. tolerance to knocking out some nodes in the network). [An interesting paper to look at here may be: Albert, R., Jeong, H., & Barabási, A.-L. (2000). "Error and attack tolerance of complex networks." Nature, 406, 378-381.] Evidently, this can have some interesting connections to the Healing strategies for networks project, as well.

A motivation for thinking about (1) is that questions/mechanisms of analogy-making and comparison-making at all levels of cognition tend to be very interesting questions, and so (1) fits in well with some broader questions in that regard.

A motivation for thinking about (2) is that several people in the summer school are thinking of working on projects on fault tolerance / attack tolerance in small-world networks -- e.g. in biological/metabolic networks, in neural networks, etc. -- and so (2) would "mesh" well with similar projects on other kinds of scale-free networks that others in the summer school are thinking of working on, leading to exchange of ideas, etc.

Now, of course, (1) and (2) above could actually end up being pursued within the same project: for example, the identification of synonyms within a lexical network could feed into attack tolerance (e.g. how to design a "self-healing" lexical network so that, if some of the nodes in the network are taken out, synonyms can step in for the words thus removed...).
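The "attack" in the Albert et al. sense can be sketched directly: repeatedly delete the current highest-degree node and watch what happens to the largest connected component. This is a generic illustration (the function names and the toy star graph are invented for the example), not a model of any particular lexical network:

```python
def largest_component(graph):
    """Size of the largest connected component, by depth-first search
    over an adjacency dict {node: set of neighbours}."""
    seen, best = set(), 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nb in graph[node]:
                if nb in graph and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best

def attack(graph, n_removals):
    """Delete the highest-degree node n_removals times (a targeted
    'attack', as opposed to random 'error') and return the size of the
    surviving largest component. Works on a copy of the graph."""
    g = {node: set(nbs) for node, nbs in graph.items()}
    for _ in range(n_removals):
        target = max(g, key=lambda n: len(g[n]))
        for nb in g.pop(target):
            g[nb].discard(target)
    return largest_component(g)

star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(largest_component(star), attack(star, 1))  # 4 1
```

The star graph makes the point in miniature: one targeted removal shatters the network, which is exactly the fragility of scale-free networks under attack that the Nature paper reports.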

An interesting exchange with John Mahoney on June 8

On 6/8/07, John Mahoney <jrmahone (at) ucdavis (dot) edu> wrote:

> might be interesting to try to model a lex net where synonyms display resistance to becoming equally used. attempting to use ideas from Page's talk on consistency and coherence. some say ketchup.. some say catsup ?
>
> so basically thinking about the difference between semantic equivalence and equivalent use.
>
> -john


Hi John,

Thanks for this interesting idea.

It makes me think of each node (word) maybe as a "basin of attraction", drawing meanings into it, with some (a few) meanings that are perched precariously on the rim between two basins, and capable of going down the direction of either word.

We might also have a natural tendency to "compartmentalize" the world into discrete, non-overlapping categories (this would make sense from the viewpoint of evolutionary history, I think -- it might simply make sense to carve up the world into discrete categories if you're a hunter-gatherer on the savannahs trying to make split-second decisions). And so maybe we can say, from one point of view, that "language abhors synonyms" ?

And there is the competing pressure, for reasons of building fault tolerance into the system, etc., to get some redundancy in there, too.

So from the point of view of a lexical network, there are good reasons, perhaps, for nodes to be similar (what you called "semantic equivalence") as well as for nodes to be dissimilar (what you called "non-equivalence of use").

Thanks again, am putting this up on the Wiki.

More thoughts from Sayan and Hannah on June 12

Earlier, we mentioned going along either of the two possible directions for the project:

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks


After talking over the past few days, and also attending Mark Newman's talk today, we are now tentatively thinking of settling on option (1) above (synonyms) rather than option (2).

Narrowing down our ideas after further discussion, here is our thinking at this point:

(1) We'll create a lexical net with data (text, i.e. a corpus) from a given domain, and then "grow" the net thus built with more data from the domain. The second set of data should be similar enough to the first set of data, but with systematic variation from the former.

Specifically, we are thinking of using the abstracts (or maybe even the full papers) of the Santa Fe Institute working paper series for this purpose. Our thinking behind this is as follows: in the early days, most SFI working papers were from the hard sciences, while, with time, more and more papers have come from the social sciences. So, if we create a lexical net using the text of the earlier papers as a corpus, and then "grow" the net into a new net using the text of the later papers as a corpus, our working hypothesis is that we may end up getting subtle but interesting differences between these two lexical nets.
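A hypothetical sketch of the comparison step, assuming each network is stored as an adjacency dict mapping a word to its set of neighbours (as a co-occurrence builder would naturally produce): rank the shared vocabulary by how much each word's degree changes between the early and late nets. The function name and the toy data are invented for illustration.

```python
def degree_shift(net_early, net_late):
    """Rank words present in both lexical networks by the change in
    their degree (number of neighbours) from the early to the late net."""
    shared = set(net_early) & set(net_late)
    delta = {w: len(net_late[w]) - len(net_early[w]) for w in shared}
    return sorted(shared, key=lambda w: delta[w], reverse=True)

# Toy example: "agent" gains connections in the later (more social-
# science) corpus, while "energy" loses them.
net_early = {"agent": {"model"}, "energy": {"state", "field"}}
net_late = {"agent": {"model", "market", "norm"}, "energy": {"state"}}
print(degree_shift(net_early, net_late))  # ['agent', 'energy']
```

A degree shift is only the crudest possible signal of the hypothesized drift between the two corpora, but it gives a first sanity check before moving to node-similarity measures.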

Synonymy is an issue that Hannah is interested in, and Sayan is also interested in synonymy through a more general interest in analogy and comparison (wherever they occur). So we discussed that what we might both be interested in is finding measures of "synonymy" in the two lexical networks thus built. We might find some interesting differences, given that the texts upon which the earlier lexical network would be built (more hard-science papers) are subtly different in character from the texts upon which the later lexical network would be built (relatively more social-science papers), while both sets of papers belong to a common domain (all are texts about "complex systems", after all).

As to a suitable metric/measure for "synonymy", we're thinking of using one of the standard measures of similarity of nodes in a network (such as cosine similarity, for example), on selected, salient nodes. (Actually, neither of us knew anything about statistical tools for measuring similarity of nodes in a network before Mark Newman gave his talk about networks today here at the summer school, so we may well be erring or being somewhat naive in the choice of measure here: "cosine similarity" just seems like a good thing. If you have any suggestions as to a better way of thinking about synonymy, or perhaps measures more suited specifically to this type of network, we're very interested in hearing about it.)
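For an unweighted graph, the cosine similarity of two nodes' 0/1 adjacency vectors reduces to n_uv / sqrt(k_u * k_v), where n_uv is the number of neighbours the two nodes share and k_u, k_v are their degrees. A small sketch of that formula (the toy ketchup/catsup graph is invented, in the spirit of John's example):

```python
from math import sqrt

def cosine_similarity(graph, u, v):
    """Cosine similarity of two nodes' 0/1 adjacency vectors:
    (common neighbours) / sqrt(degree_u * degree_v)."""
    nu, nv = graph[u], graph[v]
    if not nu or not nv:
        return 0.0
    return len(nu & nv) / sqrt(len(nu) * len(nv))

g = {
    "ketchup": {"bottle", "tomato", "fries"},
    "catsup": {"bottle", "tomato", "jar"},
}
print(round(cosine_similarity(g, "ketchup", "catsup"), 3))  # 0.667
```

High similarity here flags structurally equivalent nodes -- words that keep the same company without necessarily co-occurring with each other -- which is one plausible network-level reading of "synonymy", though whether it tracks linguistic synonymy is exactly the open question above.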