

Quick: contact, (413) 575-4995.

I'm a PhD student in Computer Science at the University of Massachusetts Amherst. I work on data mining / machine learning -- i.e., given some large body of data, how do we analyze it, find dependencies, identify interesting patterns, and make predictions? In particular, my lab works with data sets that have a network structure, such as movies and their actors, papers and their writers, cellular proteins and their interactions, or stockbrokers and their firms. As for the relation to complex systems research, I'm most interested in models of graph structure.

In a paper that I'm hoping will be the beginning of a dissertation, I took a data set of stockbrokers and their employment histories and found pairs of people about whom you could say, "there's no way you two could have shown up at all those different jobs at the same time and not have been working together on purpose." It turned out that people in this situation were, as we'd suspected, more likely to have high "fraud risk" scores.
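As a rough illustration of the first step (not the method from the paper, which isn't described here), one could count, for every pair of people, the number of distinct firms where their employment periods overlapped. The record layout `(person, firm, start_year, end_year)` and the function name `shared_job_counts` are assumptions made up for this sketch; deciding which counts are too high to be chance would need a separate statistical model.

```python
from collections import defaultdict
from itertools import combinations


def shared_job_counts(records):
    """Count distinct firms at which each pair of people overlapped in time.

    records: iterable of (person, firm, start_year, end_year) tuples.
    This schema is hypothetical, chosen only for illustration.
    Returns {(person_a, person_b): count}, with each pair sorted by name.
    """
    # Group employment stints by firm so only co-workers are compared.
    by_firm = defaultdict(list)
    for person, firm, start, end in records:
        by_firm[firm].append((person, start, end))

    shared_firms = defaultdict(set)
    for firm, stints in by_firm.items():
        for (p1, s1, e1), (p2, s2, e2) in combinations(stints, 2):
            # Two closed intervals overlap when each starts before the other ends.
            if p1 != p2 and s1 <= e2 and s2 <= e1:
                shared_firms[tuple(sorted((p1, p2)))].add(firm)

    return {pair: len(firms) for pair, firms in shared_firms.items()}


toy_records = [
    ("ann", "A", 2000, 2002),
    ("bob", "A", 2001, 2003),
    ("ann", "B", 2003, 2005),
    ("bob", "B", 2004, 2006),
    ("cat", "A", 2005, 2006),
]
counts = shared_job_counts(toy_records)
```

In this toy data only ann and bob overlap, and they do so at two firms. A pair whose count is far above what independent job choices would produce is the kind of pair the paragraph above describes.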

Instead of stockbrokers, one could look for animals that are closely tied--"there's no chance you would have both appeared in that herd at that time, and in that herd, and in that herd, unless you two are traveling together." Or people with similar interests--"no one else on Netflix likes that movie, and also that movie, and also that movie; you two are made for each other! Or else you're already renting movies together." Right now I'm looking at models that can describe how people in these data sets pick a set of jobs to work at, or a set of movies to like--models that capture the idea, "if you liked this movie, you have a 75% chance of liking this one too."
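The simplest version of the "75% chance" idea is an empirical conditional probability. The sketch below is only that baseline, not one of the models under study; the `likes` dictionary and the function name are hypothetical.

```python
def conditional_like_probability(likes, movie_a, movie_b):
    """Estimate P(likes movie_b | likes movie_a) from observed data.

    likes: dict mapping person -> set of liked movies (hypothetical toy format).
    Returns None when nobody in the data likes movie_a.
    """
    fans_of_a = [movies for movies in likes.values() if movie_a in movies]
    if not fans_of_a:
        return None
    # Fraction of movie_a's fans who also like movie_b.
    return sum(movie_b in movies for movies in fans_of_a) / len(fans_of_a)


toy_likes = {
    "p1": {"Alien", "Aliens"},
    "p2": {"Alien", "Aliens"},
    "p3": {"Alien", "Aliens", "Heat"},
    "p4": {"Alien"},
    "p5": {"Heat"},
}
prob = conditional_like_probability(toy_likes, "Alien", "Aliens")
```

Here three of the four people who like "Alien" also like "Aliens", so `prob` is 0.75. A fuller model would have to handle whole sets of choices at once rather than one pair of movies at a time.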

I've been intrigued by SFI ever since I read the book Complexity about 9 years ago, so I'm excited to have the opportunity to attend the summer school!


As mentioned above: in general, models of graph structure, including for dynamic graphs or for graphs with attributes (other data attached to the nodes and edges).


From computer science: writing algorithms to be efficient. (And, more of a background in discrete/combinatorial problems than in continuous systems like differential equations.) I can build you a MySQL database or a solid Java program, but mostly I prefer writing quick scripts in Perl (or Python, or R).

From machine learning: exposure (maybe "expertise" is pushing it) to statistical methods, such as "graphical models" describing how systems of variables depend on each other, and ideas like likelihood ratio tests, estimating/learning the parameters of a distribution, Bayesian methods, bias/variance tradeoff, avoiding overfitting through separate training & test data, permutation tests, etc.

From data mining: noticing that this problem in this one field has the same structure as that problem in that other field.

Hopes for CSSS

Colleagues to have fun with and to be sounding boards for each other's ideas. Possible collaborations. (I'd like to do interdisciplinary work if I stay in academia; if one is researching methods designed to be applied to real problems, then one should be working on real problems together with experts in those areas.)

Project ideas

I'm not sure if others would find this interesting, but one idea I've been wanting to explore for a while is how common properties of graphs might be explained as arising from common properties of the underlying data. That is, when a graph appears in front of us (and we say, "wow, the clustering coefficient is high, and the density is low, and there are hubs and clusters"), there was originally just a non-graph data set of objects and their properties. Is there something about how we decide what becomes an object, and what becomes a link, that would make the surprising(?) graph properties inevitable?
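One small, well-known instance of this question can be sketched in code: if the underlying data is "objects and the groups they belong to" and we link any two objects that share a group, every group becomes a fully connected clique, so high clustering follows from the construction alone. The function names and the toy `memberships` data below are made up for illustration; this is a sketch of one possible starting point, not an answer to the question.

```python
from collections import defaultdict
from itertools import combinations


def project(memberships):
    """Link any two members that share at least one group.

    memberships: dict mapping group -> iterable of members (hypothetical data).
    Returns an adjacency dict: member -> set of neighbors.
    """
    adj = defaultdict(set)
    for members in memberships.values():
        members = set(members)
        for m in members:
            adj[m]  # touch the key so members with no links still appear
        for u, v in combinations(members, 2):
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with < 2 neighbors count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count linked pairs among this node's neighbors.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += links / (k * (k - 1) / 2)
    return total / len(adj) if adj else 0.0


toy_memberships = {"g1": ["a", "b", "c"], "g2": ["c", "d"]}
graph = project(toy_memberships)
clustering = average_clustering(graph)
```

With these two groups the graph has edges a-b, a-c, b-c, and c-d, and the average clustering coefficient is 7/12, even though nothing in the raw data "chose" to form triangles.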