Lexical networks

From Santa Fe Institute Events Wiki

Latest revision as of 07:28, 21 June 2007

CSSS Santa Fe 2007


(suggested by Sayan)

Interested people (please add yourself if you are interested):

1. Sayan_Bhattacharyya

2. Hannah_Cornish

3. Kathryn_Cooper


Lexical networks are graphs encoding the co-occurrence of words in large texts. (If the text is sufficiently large, we can pretend that the network encodes the entire language).

In the graph, two words joined by an edge are adjacent, and the degree of a given word is the number of edges that connect it with other words. To see what lexical networks look like, take a look at this short, readable paper: "The small world of human language" by Ferrer i Cancho and Solé (http://complex.upf.es/~ricard/SWPRS.pdf).
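As a rough illustration of the definition above, here is a minimal sketch of building a lexical network from raw text and reading off a word's degree. The function names (`build_lexical_network`, `degree`), the toy sentences, and the choice to link words that fall within two positions of each other in the same sentence are all assumptions made for this sketch, not something fixed by this page.

```python
import re
from collections import defaultdict


def build_lexical_network(text, window=2):
    """Return {word: set of neighbours}.

    Assumption: two words are linked when they occur within `window`
    positions of each other inside the same sentence.
    """
    graph = defaultdict(set)
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i, word in enumerate(words):
            graph[word]  # make sure every word appears as a node
            for other in words[i + 1 : i + 1 + window]:
                if other != word:  # no self-loops
                    graph[word].add(other)
                    graph[other].add(word)
    return dict(graph)


def degree(graph, word):
    """Number of distinct words linked to `word` (0 if the word is absent)."""
    return len(graph.get(word, ()))


# Toy text, invented for the example.
sample = "The cat sat on the mat. The dog sat on the log."
net = build_lexical_network(sample)
print(sorted(net["sat"]), degree(net, "sat"))
```

Storing the network as a dictionary of neighbour sets keeps the sketch dependency-free; a graph library would do the same job for a real corpus.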


We are thinking of taking one of two possible directions for the project (but are open to other suggestions):

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks (e.g. tolerance to knocking out some nodes in the network). [An interesting paper to look at here may be: Albert, R., Jeong, H., & Barabasi, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378-381.] This could also connect in interesting ways to the Healing strategies for networks project (http://www.santafe.edu/events/workshops/index.php/Healing_strategies_for_networks).

A motivation for thinking about (1) is that questions/mechanisms of analogy-making and comparison-making at all levels of cognition tend to be very interesting questions, and so (1) fits in well with some broader questions in that regard.

A motivation for thinking about (2) is that several people in the summer school are thinking of working on projects on fault tolerance / attack tolerance in small-world networks (e.g. in biological/metabolic networks, in neural networks, etc.), and so (2) would "mesh" well with the similar projects on other kinds of scale-free networks that others in the summer school are considering, leading to an exchange of ideas.

Now, of course (1) and (2) above could actually end up being pursuable in terms of / within the same project: for example, the identification of synonyms within a lexical network could lead to attack tolerance (e.g. how to design a "self-healing" lexical network so that, if some of the nodes in the network are taken out, synonyms can step in to take over for the words thus taken out...)
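To make direction (2) a little more concrete, here is a minimal sketch of a knock-out experiment: remove a fraction of the words, either the best-connected ones first ("attack") or a random selection ("error"), and measure the size of the largest connected piece that remains. The helper names, the toy hub-and-spoke network, and the decision to rank words once by their initial degree are assumptions made for this sketch; they are not taken from the Albert, Jeong and Barabasi paper.

```python
import random


def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first search over the surviving nodes
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def knock_out(graph, fraction, targeted, seed=0):
    """Delete a fraction of the nodes and return the largest component size.

    targeted=True  removes the highest-degree words first (attack);
    targeted=False removes words uniformly at random (error).
    """
    k = int(len(graph) * fraction)
    if targeted:
        victims = sorted(graph, key=lambda n: len(graph[n]), reverse=True)[:k]
    else:
        victims = random.Random(seed).sample(sorted(graph), k)
    return largest_component_size(graph, frozenset(victims))


# Hypothetical toy network: one hub word linked to four others.
toy = {
    "the": {"cat", "dog", "mat", "log"},
    "cat": {"the"},
    "dog": {"the"},
    "mat": {"the"},
    "log": {"the"},
}
print(knock_out(toy, 0.2, targeted=True))   # the hub is removed
print(knock_out(toy, 0.2, targeted=False))  # one randomly chosen word is removed
```

A "self-healing" variant could reconnect the neighbours of a removed word to its closest synonym before measuring the component size, but that step is left out here.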

An interesting exchange with John Mahoney on June 8

On 6/8/07, John Mahoney <jrmahone (at) ucdavis (dot) edu> wrote:

> might be interesting to try to model a lex net where synonyms display resistance to becoming
> equally used. attempting to use ideas from Page's talk on consistency and coherence. some say
> ketchup.. some say catsup ?
>
> so basically thinking about the difference between semantic equivalence and equivalent use.
>
> ?-john


Hi John,

Thanks for this interesting idea.

It makes me think of each node (word) maybe as a "basin of attraction", drawing meanings into it, with some (a few) meanings that are perched precariously on the rim between two basins, and capable of going down the direction of either word.

We might also have a natural tendency to "compartmentalize" the world into discrete, non-overlapping categories (this would make sense from the viewpoint of evolutionary history, I think -- it might simply make sense to carve up the world into discrete categories if you're a hunter-gatherer on the savannahs trying to make split-second decisions). And so maybe we can say, from one point of view, that "language abhors synonyms" ?

And there is the competing pressure, for reasons of building fault-tolerance into the system, etc, to get some redundancy in there, too.

So from the point of view of a lexical network, there are good reasons, perhaps, for nodes to be similar (what you called "semantic equivalence") as well as for nodes to be dissimilar (what you called "non-equivalence of use").

Thanks again, am putting this up on the Wiki.

More thoughts from Sayan and Hannah on June 12

Earlier, we mentioned going along either of the two possible directions for the project:

(1) Perhaps, something along the lines of how to identify synonyms from within a lexical network / exploring suitable "metrics" for synonyms

(2) Perhaps, something along the lines of exploring attack tolerance in lexical networks


After talking over the past few days, and also attending Mark Newman's talk today, we are now tentatively thinking of settling on option (1) above (synonyms) rather than option (2).

Narrowing down our ideas after further discussion, here is our thinking at this point:

(1) We'll create a lexical net with data (text, i.e. a corpus) from a given domain, and then "grow" the net thus built with more data from the domain. The second set of data should be similar enough to the first set of data, but with systematic variation from the former.

Specifically, we are thinking of using the abstracts (or maybe even full papers) of the Santa Fe Institute working paper series for this purpose. Our thinking is as follows: in the early days, most SFI working papers were from the hard sciences, while, with time, more and more papers have come from the social sciences. So, if we create a lexical net using the text of the earlier papers as a corpus, and then "grow" it into a new net using the text of the later papers as a corpus, our working hypothesis is that we may end up getting subtle but interesting differences between these two lexical nets.
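The build-then-grow idea can be sketched in a few lines. The two one-sentence "corpora" below are invented stand-ins for the earlier and later working papers, and the `add_text` helper and its two-word co-occurrence window are assumptions made only for this illustration.

```python
import copy
import re
from collections import defaultdict


def add_text(graph, text, window=2):
    """Add co-occurrence edges from `text` to `graph` (a defaultdict of sets)."""
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i, word in enumerate(words):
            graph[word]  # register the word even if it gains no edges
            for other in words[i + 1 : i + 1 + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


# Invented stand-ins for the early (hard-science) and later (social-science) texts.
early_text = "Spin glass models show phase transitions."
late_text = "Market models show social phase transitions."

early_net = add_text(defaultdict(set), early_text)
# Grow a copy so the early net stays available for comparison.
grown_net = add_text(copy.deepcopy(early_net), late_text)

# Words that only appear once the later text is added.
new_words = set(grown_net) - set(early_net)
# How many neighbours each early word gained during growth.
gained = {w: len(grown_net[w]) - len(early_net[w]) for w in early_net}

print(sorted(new_words))
print(gained)
```

Keeping both nets side by side makes it easy to ask which words changed their neighbourhoods most, which is one simple way of looking for the "subtle but interesting differences" mentioned above.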

Synonymy is an issue that Hannah is interested in, and Sayan is interested in it too, as part of a general interest in analogy and comparison (wherever they occur). So, we discussed that what we might both be interested in is finding measures of "synonymy" in the two lexical networks thus built. We might find some interesting differences, given that the nature of the texts upon which the earlier lexical network would be built (more hard-science papers) is subtly different from the nature of the texts upon which the later lexical network would be built (relatively more social science papers), while both sets of papers belong to a common domain (all are texts about "complex systems", after all).

As to a suitable metric/measure for "synonymy", we're thinking of using one of the standard measures of similarity of nodes in a network (such as cosine similarity, for example), on selected, salient nodes. (Actually, neither of us knew anything about statistical tools for measuring similarity of nodes in a network before Mark Newman gave his talk about networks today here at the summer school, so we may well be erring or being somewhat naive in the choice of measure here: "cosine similarity" just seems like a good thing. If you have any suggestions as to a better way of thinking about synonymy, or perhaps measures more suited specifically to this type of network, we're very interested in hearing about it.)
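As a sketch of what such a measure could look like, the snippet below computes the cosine similarity of two nodes as the number of neighbours they share divided by the square root of the product of their degrees. The toy network (built around the ketchup/catsup example from the exchange above) and the function name are assumptions for illustration; whether this measure actually tracks synonymy is exactly the open question.

```python
import math


def cosine_similarity(graph, a, b):
    """Shared neighbours of a and b, divided by sqrt(degree(a) * degree(b))."""
    degree_a = len(graph.get(a, ()))
    degree_b = len(graph.get(b, ()))
    if degree_a == 0 or degree_b == 0:
        return 0.0  # a word with no neighbours is treated as similar to nothing
    shared = len(graph[a] & graph[b])
    return shared / math.sqrt(degree_a * degree_b)


# Hypothetical toy network: "ketchup" and "catsup" keep similar company.
toy = {
    "ketchup": {"fries", "tomato", "bottle"},
    "catsup": {"fries", "tomato"},
    "fries": {"ketchup", "catsup", "salt"},
    "tomato": {"ketchup", "catsup"},
    "bottle": {"ketchup"},
    "salt": {"fries"},
}

print(cosine_similarity(toy, "ketchup", "catsup"))  # 2 shared / sqrt(3 * 2)
print(cosine_similarity(toy, "bottle", "salt"))     # no shared neighbours
```

Note that two words can score highly without being directly linked to each other: the measure only looks at whether they keep the same neighbours.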

Modified ideas

[Sayan sent off this email to Mark Newman today. It incorporates some additional/new thoughts.]

Hi Prof. Newman,

This is Sayan, a grad student at U-M and one of the participants of the current Santa Fe Institute Summer School. First of all, many thanks for your interesting talks here earlier this week. It gave me some ideas for the group project on lexical networks which I and a couple of others are thinking of doing here at the Summer School.

I am writing to you to see if you might possibly have the reference for the two papers that you mentioned towards the end of your last lecture (the paper about bipartite networks, and the paper about algorithms to find groups of vertices; I *think* the latter was Newman and Light (?) but I'm not sure -- I didn't copy down the references thinking that the slides will be on the Summer School Wiki, but they're not there yet.) Do you by any chance have the references?

Our idea for the project, briefly, is the following: we'll create a lexical net with data (text, i.e. a corpus) from a given domain, and then "grow" the net thus built with more data from the domain. We want the second set of data to be similar enough to the first set of data, but with some systematic variation from the former.

Specifically, we are thinking of using the abstracts (or maybe even full papers) of the Santa Fe Institute working paper series for this purpose. Our thinking behind this is this: in the early days, most working papers in SFI used to be from the hard sciences, while, with time, more and more papers have recently been from the social sciences. So, if we create a lexical net using the text of the earlier papers as a corpus, and then "grow" the net into a new net using the text of the later papers as a corpus, our working hypothesis is that we may end up getting subtle but interesting differences between these two lexical nets. We would expect the first lexical net to have more "hard-science" qualities somehow, and the second lexical net to have (relatively) more "social science" qualities somehow.

We haven't quite thought everything through, but it does seem like algorithms that have to do with partitions in networks, or finding groups of vertices within networks, might be handy -- hence the request. (As this is just an exploratory project, we're just interested in using this project to play with and get insights about lexical networks.)

I know you must be very busy and so I apologize for making the request, since I should have noted down the references myself at the time of the lecture. In case you have the references lying about, they would be very helpful for us. (And in case you have any thoughts/suggestions about the project based on what I wrote above, that would be very helpful too!)

Thanks very much for your help,

-Sayan.