Difference between revisions of "Complex Systems Summer School 2016-Projects & Working Groups"

From Santa Fe Institute Events Wiki

Revision as of 01:34, 15 June 2016

Complex Systems Summer School 2016


Projects

Quantifying Trust

Keywords

Agent Based Modeling, Information Theory, Game Theory

Summary

To me there are some simple rules of trust:

  1. The more we trust someone, the fewer questions we ask about their activity.
  2. By trusting someone, a person reduces the amount of information he needs to process. E.g., if a pedestrian does not have a clear view of the road, he might follow the pedestrians in front of him. By following them he reduces the amount of information he needs to process.
  3. If there is "perfect trust" in a social system, the system seems to be more ordered. For example, in a military system a soldier blindly obeys the orders of his officers. This makes the system more ordered.
  4. On the other hand, the less trust there is, the more disorder there is in the social system. So quantifying trust could be a way to quantify the disorder, or entropy, of a social system.
  5. But there is a catch: it is easier to propagate misinformation in a system where everyone blindly trusts each other.

My intuition is that we can use both information theory and game theory to quantify trust this way. The problem is I don't have a dataset yet that can validate the rules above. But we can use the above observations to formulate an agent based model where we can tune the parameters to see if we can get any interesting observation.
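As a starting point for discussion, here is one possible toy formalization of rules 3 to 5. It is only a sketch under assumptions that are not part of the idea above: agents decide in sequence, each either copies the previous agent (with probability trust) or acts on a private noisy signal, and "disorder" is measured as the Shannon entropy of the resulting actions. The names simulate, entropy_bits, trust and accuracy are made up for illustration.

```python
import math
import random


def simulate(n_agents, trust, accuracy, rng):
    """Return the binary actions of agents who decide one after another.

    With probability `trust` an agent copies the previous agent;
    otherwise it follows a private signal that is correct with
    probability `accuracy`. Both parameters are illustrative assumptions.
    """
    truth = 1
    actions = []
    for i in range(n_agents):
        if i > 0 and rng.random() < trust:
            actions.append(actions[-1])  # trust: no own information processed
        else:
            correct = rng.random() < accuracy
            actions.append(truth if correct else 1 - truth)
    return actions


def entropy_bits(actions):
    """Shannon entropy (in bits) of the empirical action distribution."""
    n = len(actions)
    probs = [actions.count(v) / n for v in set(actions)]
    return -sum(p * math.log2(p) for p in probs)


rng = random.Random(0)
blind = simulate(200, trust=1.0, accuracy=0.7, rng=rng)
independent = simulate(200, trust=0.0, accuracy=0.7, rng=rng)
print(entropy_bits(blind), entropy_bits(independent))
```

With trust=1.0 everyone copies the first agent, so the entropy is zero (rule 3), but if that first agent happened to be wrong, everyone is wrong (rule 5). Intermediate values of trust are the interesting regime to explore.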

Interesting addition and comments to the idea

Please add your comments and ideas about the project here.

Group Contact

Syed Arefinul Haque, haque.s@husky.neu.edu

Interested Participants

Modeling cultural evolution through 150 years of language change

Keywords

cultural evolution, language, multilingual

Summary

I have a clean dataset of word evolution for 3 languages (English, French, German); this data is built from historical text (books mostly) and includes data for all words that are reasonably frequent/popular in that decade (more than 10000 words per decade). For every decade from 1850-2000 I have:

  1. word frequency/popularity
  2. historical "word vectors" (capture word meaning/semantics)
  3. word sentiment scores (whether a word has a positive or negative connotation; English only)
  4. word part of speech tags (e.g., is a word used more as a noun or a verb in a given year).

I am totally willing to give all this data away freely (even if I am not actively involved in the project). The project website with the data is: nlp.stanford.edu/projects/histwords/

Group Contact

Will Hamilton (wleif@stanford.edu)

Interested participants, please sign up below

Matteo Morini I've always wanted to work on evolutionary linguistics. You have the (coolest) dataset; now we need to devise a model. Perhaps an ABM (w/geographical, metaphorical, network-based space) where people settle, meet, migrate, adjust to new words/morphemes. The (ambitious) goal would be to trim the model as to mimic the empirical data. Sounds feasible? Btw, we will need to figure out (a.k.a. fake ^_^) a way to deal with spatializing the non-geographically referenced dataset. Who's in?

Danilo Liuzzi

Semantic word game

Keywords

cognitive science, language, games, network routing, free association

Summary

See http://pybossa.socientize.eu/pybossa/app/Semantics/ to play the game. The goal is, given a starting word, to navigate through forced choices to a known target word. The global idea is to understand the effect of context and how individuals can manipulate semantic space online (like we constantly have to do in natural conversation). A project idea is to construct random models where we can see the influence of the target word. For example, if we have a directed search path from the start node to the end node as well as a directed search path from goal to start, can we combine these paths to capture human performance?

Data is available for download at https://pybossa.socientize.eu/pybossa/app/Semantics/tasks/export and looks like this (JSON):

{"info": "ILLUSION~FANTASY~16~FANTASY~ISLAND~6~ISLAND~OCEAN~4", "user_id": 1732, "task_id": 13792, "created": "2014-04-01T11:26:06.479946", "finish_time": "2014-04-01T11:26:06.479966", "calibration": null, "app_id": 430, "user_ip": null, "timeout": null, "id": 3211605}

This user went from illusion to fantasy (16 seconds), then from fantasy to island (6 seconds), and so on.
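The "info" field of such a record could be unpacked into individual steps roughly as follows. This is a minimal sketch that assumes the layout inferred from the single example above (repeating word~word~seconds groups with whole-second timings); the helper name parse_path is made up, and the real export may contain records that do not fit this pattern.

```python
import json


def parse_path(info):
    """Split an "info" string into (source, target, seconds) steps.

    Assumes repeating groups of word~word~seconds, as in the example record.
    """
    fields = info.split("~")
    if len(fields) % 3 != 0:
        raise ValueError("expected groups of word~word~seconds")
    return [
        (fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


# Shortened copy of the example record from the text
record = json.loads(
    '{"info": "ILLUSION~FANTASY~16~FANTASY~ISLAND~6~ISLAND~OCEAN~4",'
    ' "user_id": 1732, "task_id": 13792}'
)
steps = parse_path(record["info"])
total_seconds = sum(seconds for _, _, seconds in steps)
print(steps, total_seconds)
```

Each game then becomes a short path through word space, which is the form needed to compare human paths against random or directed-search models.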

The research question is still a bit unclear, but a while back there were more than 10478 games. Additionally, something similar has been done using Wikipedia: http://snap.stanford.edu/data/wikispeedia.html

Group Contact

Nicole Beckage nbeckage at gmail

Interested participants, please sign up below

Effect of observer on their environment

Keywords

uncertainty, observation

Summary

I'm interested in looking at how an observer affects their environment or the system they are studying. I've talked to a few people about this, but the idea may need more refinement to make a good project. If anyone has more specific ideas, please add them here.

Group Contact

Jacob Hunter, jacob.hunter@pnnl.gov

Interested

Temporal social networks (and dialogues) in film

Keywords

movies, films, social network analysis, pop culture

Summary

I have two great datasets for analyzing social networks and interactions of characters in films (i.e., the social networks of the characters in the movies). One dataset has pre-processed dialogues from around 300 movies + metadata (this dataset is public). The other dataset has detailed information on the temporal scene structure (who appears on screen together etc.) for thousands of movies + metadata (e.g., rotten tomatoes scores); this data is more private/sensitive, but we can definitely use it for this project.

Things I would like to do:

  • How do structures of movie social networks predict genre, rating, box office success, country of production, etc.?
  • (Temporal) network motifs: what are the common patterns of different types of films?
  • Movie social networks vs. real life social networks.
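To make the network construction concrete, here is a small sketch of how a character network could be built from scene structure. It assumes each scene is available as a list of the characters on screen, which may not match the actual dataset format; the function names and the toy scenes are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(scenes):
    """Count how many scenes each pair of characters shares."""
    edges = Counter()
    for cast in scenes:
        # sorted(set(...)) gives one canonical (a, b) key per pair
        for a, b in combinations(sorted(set(cast)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum of edge weights per character (total shared-scene count)."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Hypothetical toy film: three scenes in screen order
scenes = [["Ann", "Bob"], ["Ann", "Bob", "Cy"], ["Cy", "Dee"]]
edges = cooccurrence_network(scenes)
degree = weighted_degree(edges)
print(edges, degree)
```

Running the same construction on successive slices of a film (e.g., per act) would give the temporal version of the network.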

Group Contact

Will Hamilton (wleif@stanford.edu)

Interested participants, please sign up below

CSSS group formation

Keywords

networks, social groups, nomorespeeddatings, communities

Summary

Many of us had the idea to make a field study of group formation in summer schools. We can use data on sex, nationality, profession and bio sketches; data on previous summer schools could be available as well.

You can fill up the space with your ideas 

There is a nice paper about collective intelligence http://science.sciencemag.org/content/330/6004/686 which can also be useful

Group Contact

Dmitry Alexeev (exappeal@gmail.com)

Interested participants, please sign up below

Emergence of project groups in CSSS 2016

Keywords

subgroups, team participation, satisficing, networks

Summary

We want to model team emergence in cross-disciplinary academic research.

- Is diversity important? 
- What makes teams productive? What makes teams fun?
- How do teams form information networks?
- How do members satisfice?
- How do members select their teams and how does that align with success of the group?
- What are the conditions for maverick rigor?

Group Contact

Ross Buhrdorf rbuhrdorf@gmail.com

Interested

Jacob Hunter, jacob.hunter@pnnl.gov

Julia Adams

NY subway microbiome

Keywords

networks, ecology, subway , city , communities

Summary

I am working with an international consortium, MetaSUB. One of its projects is the well-known NY subway map of microbes: http://d2zahwnsqpmout.cloudfront.net/map/

Chris Mason from Cornell is willing to share the data they have collected in different stations and different parts of the stations

Microbial communities reflect the history of the stations as well as the migration of people. It would be interesting to create a model of microbe spread and exchange along the subway.

Other data on NY could also be applied

Original paper from Cell Systems http://www.cell.com/pb/assets/raw/journals/research/cell-systems/do-not-delete/CELS1_FINAL.pdf

Group Contact

Dmitry Alexeev (exappeal@gmail.com)

Interested participants, please sign up below

Daniel Biro (daniel.biro@med.einstein.yu.edu)

Network and language dynamics of Reddit (Correlations between cultural and structural properties of communities)

Summary

I have a cleaned dataset of all Reddit (www.reddit.com) comments from 2009 through 2014 (Reddit is a very popular social media forum with >30 million users; it is organized into thousands of user-moderated "subreddits", which are topically-focused communities). The data includes all the text of the comments, usernames, the upvotes, and the thread structures (so networks can be constructed). There are many interesting questions that could be investigated with the data, and I would love to hear ideas! (The data is totally public, and I am willing to share even if I don't work on the project).

I have done a lot of pre-processing and preliminary analysis on a manageable and very clean subset of all 2014 comments for 1200 mid-to-large subreddits (about 50 GB). My background is in natural language processing and network mining, and I have lots of preliminary analysis/machinery for measuring linguistic signals in the data.

One high-level idea I have is to look at the relationship between sociolinguistic/cultural features of a community and the dynamic network interaction structures (e.g., clustering of comment-reply networks). I have a ton of sociolinguistic variables that could be used, e.g. how "polite" the language is or how much people use "we" vs "I".

Another cool project would be looking at conflict and/or anti-social behavior (again, I have some linguistic features that I think could be helpful for this).

Group Contact

Will Hamilton (wleif@stanford.edu)

Interested participants, please sign up below

The assembly of plant-pollinator networks

Keywords

networks, mutualisms, ecology, restoration, succession, assembly

Summary

I have a 10-year dataset of ~1500 observations of pollinators (bees, flies, butterflies, wasps etc.) visiting plants in native plant restorations (hedgerows) in the Central Valley of CA. The assembling communities are paired with unrestored field margins (controls) and mature (non-assembling) hedgerows. The goal would be to examine how and why the structure of the network is changing through time. How are the individual species changing their interaction patterns? What does this mean for the topology/resilience of the network? There is also a spatial dimension (the meta-population dynamics of the networks?) that could be explored.

For more information on the dataset, please see https://nature.berkeley.edu/~lponisio/wp-content/uploads/2014/12/ponisio-2016-704.pdf

Group Contact

Lauren Ponisio (lponisio@gmail.com)

Interested participants, please sign up below

Ryan McGee (ryansmcgee@gmail.com)
Daniel Biro (daniel.biro@med.einstein.yu.edu)
Dima Alexeev exappeal@gmail.com
Lindsay Todman (Lindsay.todman@rothamsted.ac.uk)
Chris Revell (cr395@cam.ac.uk)
Julia Adams (jadams@wellesley.edu)
Marilia P. Gaiarsa (gaiarsa.mp@gmail.com)

Modeling prestige good economies/power structures

Summary

Prestige goods, in a nutshell, are goods whose value is in conveying social status/ranking, and there is a great deal of speculation about how they contribute to increasing social complexity. It would be interesting to explore these dynamics via an ABM.
A couple of ideas:

1) Kantner from the Santa Fe School of Advanced Studies has an interesting game-theory model on how people decide to invest in prestige goods. He includes some data on turquoise in archeological contexts in the Chacoan culture (http://www.jstor.org/stable/27820883). It would be interesting to a) implement this as an agent-based model, and b) explore some additional dynamics.
2) How does a prestige economy respond when the prestige goods all come from outside and the supply dries up?
An interesting case study (and what inspired the question) is the Late Yayoi-period Japanese islands (roughly 100-300 CE), a network of chiefdoms closely connected to the Korean peninsula (as a source of ore) and the Chinese mainland (as a "tributary state" of the Han Dynasty). The chief status markers in Western Yayoi society were Chinese-produced bronze mirrors and swords -- so that when the Han Dynasty fell apart and the supplies of these goods dried up, we see evidence in the archeological record of upheaval (including an attempt to make homegrown replicas of the prized items). This may also have contributed to the subsequent formation of more complex chiefdoms. It would be interesting to put together an exploratory model of this (don't have data, at least not for the Japan example).
I know some folks at SFI have been working on the dynamics of prestige.

Since prestige goods indicate power (in the form of social status/ranking, or however else we may define power), this project will also explore modeling power in social groups.

Group Contact

Ellen Badgley (flyingrat42@gmail.com)
Lula Chen (nchen3@illinois.edu)

Interested participants, please sign up below

  • Simon Carrignon: I wrote already a general ABM to study that kind of things, but would be really happy to start from scratch something new in python or whatever!
    • Awesome! I started reading your paper and it looks like a great starting point - maybe we could extend the model to distinguish between common goods (subsistence-level, go away after each season) and permanent prestige goods that convey social value, as well as looking into the implementations of prestige on agent actions. If you would like to continue working on this for CSSS let's talk. - Ellen

Group Meetings

Anybody want to meet after dinner on Wednesday 15 June?

Tracking the migrations of urban hipsters (aka spatiotemporal analysis and scaling of labor)

Summary

This is linked to the "Viscosity of Labor" question on the MITRE Challenge Questions list, which is copied below:
Given US Census collected data can we find relationships between labor and urban scale?

"Using US Census data (and other sources) a number of interesting scaling laws have been discovered that relate to the dynamics of urban human social systems. These scaling laws relate to such things as the generation of intellectual property, income, tax revenue, crime, and so on. What about labor? Does the scaling seen in income come from new or shifting categories of labor or simply increasing the income within an existing (static?) distribution of labor categories? Is there a spatial component? Does the spatial distribution change with the scale of the urban area?"

The above is the complete question from MITRE, but we can adjust/focus as needed depending on interest and ideas.

There is plenty of US Census data available for this:

We would probably start at the county level and go finer (census tract/block group) if time allows.
MITRE and SFI are actively working together to improve the 2030 Census, so we can reach out to SFI directly on this.
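Urban scaling laws of the kind mentioned in the question are commonly written as Y = Y0 * N^beta, where N is population and Y is some urban total. The sketch below estimates beta by ordinary least squares on log-transformed values. The power-law form, the function name fit_scaling, and the synthetic numbers are assumptions for illustration; no Census data is used here.

```python
import math


def fit_scaling(populations, totals):
    """Fit totals = y0 * populations**beta by least squares in log-log space."""
    xs = [math.log(n) for n in populations]
    ys = [math.log(y) for y in totals]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    y0 = math.exp(mean_y - beta * mean_x)
    return beta, y0


# Synthetic, noise-free example with a known exponent of 1.15
populations = [1e4, 1e5, 1e6, 1e7]
totals = [3.0 * n ** 1.15 for n in populations]
beta, y0 = fit_scaling(populations, totals)
print(beta, y0)
```

For the labor question, one could fit a separate exponent per labor category and compare: beta > 1 would indicate that a category grows faster than population, beta < 1 slower.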


Group Contact

Ellen Badgley (ebadgley@mitre.org, flyingrat42@gmail.com)

Interested participants, please sign up below

Group Meetings

Anybody want to meet after dinner on Wednesday 15 June?

A preliminary model of the coupled human-natural system of swidden agriculture

Summary

Swidden agriculture, also known as slash-and-burn, is about as old as agriculture itself. It exists in diverse variants practiced by 200 to 500 million people in different regions of the world; all of these forms involve the slashing and burning of portions of forest, hence its name. There has been a historic controversy regarding swidden agriculture, with some publications presenting it as a destructive force that contributes to global deforestation and other publications highlighting its sustainability and ecological benefits when practiced as a means of subsistence. This controversy shows the need for tools to further research and assess the benefits and costs of swidden agriculture.

This project idea will address the following research question: How do human activities interact with the ecological landscape and the sustainability of swidden agriculture?

This project idea aims to produce a simple, preliminary model with the following components:

  • A simplified social network of swidden farmers exchanging agricultural labor, inspired by Downey (2010)
  • A landscape in the form of a 2-D grid in which cells are patches of forest / crops

Possible model outputs include:

  • Yearly harvest
  • Biodiversity
  • Yearly biomass production in forest and fallows
  • A representative indicator of the "net production" of the complete system
  • Sustainability of the complete system

Biodiversity, biomass production, and sustainability will need reasonable, simple definitions that follow modern literature on the subject.

Software and language

Every idea is worth discussing. NetLogo might be an option that fits with our limited calendar. Git and GitLab would be used to manage the source code.

Group contact

Fabio Correa <facorread@gmail.com>

Interested participants, please sign up below

  • Julia Adams
  • Ellen Badgley (I will lurk)
  • Lindsay Todman (Lindsay.todman@rothamsted.ac.uk)
  • Chris Revell (cr395@cam.ac.uk)

Modeling a City ('s Traffic?) In-Silico

Summary

I have a few datasets lying around that I cleaned for my thesis (and that I’m not using for that purpose) and a few more that I obtained but never cleaned. All of them are data for the city of Madrid (Spain) on various geographical and time scales/resolutions, most of them related to social/environmental/demographic variables.

The one I’m most interested in using (and I have not cleaned yet) is a dataset of traffic intensity in Madrid. This uses sensors placed in most traffic lights (~3600), and the dataset provides count of vehicles (and associated % capacity used) every 15 minutes, from 2013 to March 2016. Some of the data is messy (but they have a flag for unreliable data). This data is freely available at datos.madrid.es (look for “intensidad de trafico”).
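A first cleaning step for this kind of data might look like the sketch below, which rolls 15-minute counts up to hourly totals and skips readings flagged as unreliable. The tuple layout (timestamp, vehicle count, reliability flag) and the sample values are assumptions for illustration; the actual column names and flag encoding in the datos.madrid.es files would need to be checked.

```python
from collections import defaultdict
from datetime import datetime


def hourly_totals(readings):
    """Sum 15-minute vehicle counts per hour, ignoring unreliable readings."""
    totals = defaultdict(int)
    for timestamp, vehicles, reliable in readings:
        if not reliable:
            continue  # skip intervals the sensor flagged as unreliable
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += vehicles
    return dict(totals)


# Hypothetical readings for a single sensor
readings = [
    (datetime(2016, 3, 1, 8, 0), 120, True),
    (datetime(2016, 3, 1, 8, 15), 135, True),
    (datetime(2016, 3, 1, 8, 30), 0, False),
    (datetime(2016, 3, 1, 8, 45), 140, True),
    (datetime(2016, 3, 1, 9, 0), 90, True),
]
totals = hourly_totals(readings)
print(totals)
```

Note that simply dropping flagged intervals underestimates the hourly total; interpolating from neighboring intervals is one alternative worth considering.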

Not sure how to insert an image here, but here's a link to a plot created with 45 days of this data on a single intersection. You can guess when the Holy Week and Holy Thursday/Friday happened this year: here

I somehow have the sense that using this data may be a cool experience, but I have not figured out yet what to do with it.

I have cleaned data on sociodemographic/socioeconomic characteristics at the census section level (~1500 people), geocoded commercial spaces data and more stuff that we can add to this. This also includes historical data on all properties (including houses, commercial spaces, etc.)

Contact me (or talk to me anytime!) if you are interested or want to discuss some ideas! Usama Bilal <ubilal@jhmi.edu>

Interested please sign up below

  • Ellen Badgley - I'm interested but don't have a question to go with this yet - will think about it more!

Can we use metabolic networks to predict the next beneficial mutation?

ideas

Richard Lenski (who is also associated with SFI) has evolved an E. coli population to use a new sugar source, and tracked its changes for 40000 generations. This is a widely studied dataset (>200 papers published on it already...) in one of the most widely studied biological systems, but there is still a lot we do not understand about it. For example, although we know which genes evolved during the experiments, we do not know why it is these particular genes that changed.

I think it would be cool to use the genetic network (which is also very well understood) to try to understand how new mutations rising in frequency change the performance of a bacterium, and how the previous state of performance changes what the next mutation should be. The genetic interactions are pretty well mapped in E. coli (RegulonDB). Also, in the following paper they claim that the performance plateaued, but the mutations that accumulated were still beneficial. How did that happen? (Genome-wide Mutational Diversity in an Evolving Population of Escherichia coli)

From a more computer science perspective: 40000 generations for an iterative process doesn't seem like very long. Does the network structure of metabolic pathways allow rapid adaptation? What can we learn from these networks to apply to computer science problems?

These ideas are pretty raw... but I think something evolution related, which looks not just at a property of a system but also allows the system to change, would be really cool. My email is Chenling Xu <chenlingantelope@gmail.com>.

Interested please sign up below

  • Ryan McGee (ryansmcgee@gmail.com)
  • Chris Revell (cr395@cam.ac.uk)

Complexity insights into Circular Economy

Summary

The 'circular economy' is a recent approach to questions of sustainability. The current 'take, make, waste' way of producing is no longer considered sustainable. Ideas about 'closing the loop' mechanisms in production and consumption are booming in many different academic fields, but not yet from an integrated point of view. The idea of 'closing the loop' corresponds to complex systems paradigms, yet complexity science has hardly been utilized in this field. In this project we want to start brainstorming on how complexity science can contribute to the field of circular economy. We think there are many ways in which it can contribute, both in methods and in theory. Based on the competences in the group, we would like to brainstorm on suitable research questions, which can range from very practical to theoretical issues. Keywords: Complexity economics, Network optimization, Urban systems, Agent-based modeling


Insights

Please contribute here by adding your insights

Meeting scheduled for Wednesday. Time and date to follow.

People interested

- Joris Broere 
- Juste Raimbault



Insights in Scientometrics

Summary

I have multiple scientometric datasets scraped from the "Web of Science" website, which include published works from different (hard sciences only, sorry) domains. Author(s), Year, Journal, ..., and citations are available. I have some code ready to pre-process those data, e.g. for linking articles by co-citations, or bibliographic coupling. I've been working on these data for some time, and am eager to offer the data in exchange for fresh, brilliant ideas.
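For readers unfamiliar with the two linking schemes mentioned above, here is a minimal sketch of both. It assumes each paper comes with the list of works it cites; the dictionary layout, function names, and toy identifiers are invented for illustration and are not the pre-processing code referred to above.

```python
from collections import Counter
from itertools import combinations


def bibliographic_coupling(references):
    """Two citing papers are coupled by the number of references they share."""
    strength = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(set(references[a]) & set(references[b]))
        if shared:
            strength[(a, b)] = shared
    return strength


def cocitation(references):
    """Two cited works are co-cited once for every paper that cites both."""
    counts = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


# Toy data: citing paper -> works it cites
references = {
    "P1": ["A", "B", "C"],
    "P2": ["B", "C", "D"],
    "P3": ["D"],
}
coupling = bibliographic_coupling(references)
cocited = cocitation(references)
print(coupling, cocited)
```

Bibliographic coupling is fixed once a paper is published, while co-citation counts keep changing as new papers appear, so the two networks can be compared over time.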

Group Contact

Matteo Morini

Interested people please sign up below

Using DOTA 2 data as a proxy to improve cyber intrusion detection algorithms

Keywords

Adaptive strategies, co-evolutionary systems, cybersecurity, DOTA 2, Inductive Game Theory, time-series analysis, potential NLP

Key Points

Main question: how to a) detect and b) quantify the rate of change of strategies in co-evolutionary systems?

Motivation: improve cyber intrusion detection algorithms

Method: use the competitive, strategic, multiplayer online game DOTA 2 data as a proxy for a cyber attack. Extract sets of strategies from game observables and develop an algorithm to detect when strategies change. This could be from within the games or from the drafting of characters (heroes).
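One simple way to flag a change in drafting strategy, offered only as a sketch and not as the method this project will use, is to compare the distribution of hero picks in consecutive time windows using the Jensen-Shannon divergence. The windowing, the threshold value, the function names, and the toy picks below are all assumptions for illustration.

```python
import math
from collections import Counter


def distribution(picks):
    """Relative frequency of each hero in a window of picks."""
    counts = Counter(picks)
    total = sum(counts.values())
    return {hero: n / total for hero, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def change_points(windows, threshold):
    """Indices of windows whose pick distribution shifted past the threshold."""
    dists = [distribution(w) for w in windows]
    return [
        i for i in range(1, len(dists))
        if js_divergence(dists[i - 1], dists[i]) > threshold
    ]


# Toy draft data: hero picks in three consecutive time windows
windows = [
    ["axe", "axe", "lina", "lina"],
    ["axe", "lina", "axe", "lina"],
    ["pudge", "pudge", "pudge", "lina"],
]
flagged = change_points(windows, threshold=0.3)
print(flagged)
```

The size of the divergence between windows also gives a crude rate-of-change signal, which speaks to part b) of the main question.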

Summary

Most metrics for scoring how well a given cyber intrusion algorithm performed involve static determinations after an attack has occurred and usually consist of simple false positive or false negative type scores. This does not enable the granularity necessary to improve these algorithms in appreciable ways. The goal is to develop new metrics to assess in real time when and how attackers adapt to the protective mechanisms they encounter so that the security teams can adjust their own defensive tactics more quickly. This could decrease the long lead time for patches and stave off additional damage.

Usable data from within the cyber domain that may help to strengthen detection algorithm scoring is nearly nonexistent. If actual attacks are documented in the real world, this data is rarely made available and may be network specific (hence, not generalizable). Thus, a proxy for the data is necessary. An online battle arena video game called Defense of the Ancients 2 (Dota 2) is a rich data source of real-time adaptive adversaries. Dota 2 is a strategic, competitive, multiplayer game where two teams of five individuals each compete against each other to complete objectives and to destroy the other team’s base in a time frame of ~20 to ~90 minutes. The players deploy various in-game and between-game tactics and procedures to achieve a specific measurable objective. Professional players vie for tens of millions of dollars in prize pools each year, and over 2 billion games have been played. Currently, we have access to a Mongo database with 500 GB of data (~2.5 million games played over one year).

This is akin to the classic ‘Red Queen hypothesis’ in evolutionary biology, but in this case we are interested in human behaviors where a strategy can be considered most generally as a ‘meme’ of sorts. Our hypotheses include: 1) a mapping exists between game observables and a set of strategies, 2) a measurable signal can be extracted from which to ascertain the adoption, stabilization, and decay of specific strategy traits, 3) changes in strategies occur over time, and 4) strategy changes are driven (at least in part) from behaviors of the opposing team. We postulate that testing these hypotheses will require an understanding of the co-evolutionary dynamics of the overall environment. In particular, the human behavioral components underlying the adversary/defense team actions will need to be assessed. The use of data from a multiplayer online game for this purpose assumes there is a valid mapping between the ‘game-space’ to ‘cyber-space’ behaviors from which useful inferences can be made. Through this exercise we hope to understand what types of data (if any) are useful for this purpose and how one might develop proxies to characterize strategies. Ultimately, we hope that the analysis could be applied against realistic data specific to a cyber intrusion.

See DOTA 2 DATA OVERVIEW for additional details.

Technology

Parsed data is in JSON format. We should have access to two databases: one is a mongo database discussed above for aggregated information (e.g., fights, deaths (characters respawn), characters played, items acquired, positions for the first 10 minutes, league IDs, player IDs, wins), and the second is raw data for every move and every action each player took throughout the entire game.

Interesting addition and comments to the idea

Please add your comments and ideas about the project here.

Group Contact

Anne Sallaska, alsallaska@gmail.com (or asallaska@mitre.org, harder to check so slower to respond potentially)

Interested participants, please sign below

Data Sets

MITRE Data Sets

The two data sets we have access to are Defense of the Ancients 2 (DOTA 2) and Polish power grids. To access the data, please contact Juniper; she has it on a hard drive. If you have any specific questions about the data, you can contact Matt Koehler at mkoehler@mitre.org.

DOTA 2 DATA OVERVIEW

MITRE Challenge Questions and Powergrid Data Overview

MITRE Challenge Questions Overview