# Difference between revisions of "Research Experiences for Undergraduates 2015-Potential Mentors"

### From Santa Fe Institute Events Wiki

 Research Experiences for Undergraduates 2015

A complete list of resident faculty list and our postdoctoral fellows can be found here

## Cristopher Moore, SFI REU PI, Santa Fe Institute, Resident Faculty

Mentors: Cris Moore and Pan Zhang
The idea of statistical significance is tricky in networks. Do these three definitions of statistically significant community agree with each other?
1. communities stay the same when a few edges are added or removed (robustness)
2. communities help predict missing edges (if we train the algorithm on the edges we know about)
3. the statistical physics definition: communities correspond to states of low free energy (low energy and high entropy)

## Andrew Berdahl, Santa Fe Institute, Omidyar Fellow

Intro:
I work on collective animal behaviour (schooling, flocking, herding, swarming etc.) and am particularly interested in collective navigation (animals in groups being better at finding their way than solo travelers.) For example, social travelers are hypothesized to pool the many independent directional estimates of the group, and have been shown to collectively sense and respond to complex and dynamic environmental gradients.

Project 1:

Recent research suggests that salmon might benefit from collective navigation on their iconic homeward migration (larger groups find their way home more accurately than smaller groups). One (recent) challenge for migrating salmon is to find their way past dams blocking their route upstream. A toy model I made recently predicts that salmon in larger groups should be able to find their way past these dams more quickly than salmon in small groups. The army corps of engineers monitors the movement of all salmon passing dams on the Columbia River and its tributaries. This unique data set, which has the passage times and traveling densities for many millions of salmon over tens of years, would allow us to test these predictions and others we might think of once we have looked at the data. We could also expand my simple model to an agent based schooling model, incorporating river flows and neat stuff like that.

Project 2:

Soaring migratory birds (storks, vultures, etc.) rely on thermal updrafts to power their long distance migrations. Such updrafts, occurring in clear air, are mostly invisible, so tracking the thermals is a significant challenge for these birds. Researchers have hypothesized that by traveling in groups the birds might be able to collectively sense these updrafts and that this is the main reason social migration exists in these species. A collaborator has a very neat data set with every individual in several groups of White Storks tagged with a 1Hz GPS unit and accelerometer. This data, along with agent based simulations, should allow us to uncover the mechanism behind this social migration and also predict how the migrational ability of these species might fail if their numbers are reduced below a critical threshold.

Project 3:
Most models of collective motion assume each individual is identical. There are plenty of reasons why this might not be the case, but due to technological reasons, heterogeneity in real animal groups is largely unexplored. Recent advances in automated tracking, and GPS technology, have allowed us to gather complete, unique, trajectories for each individual in a group for an arbitrary length of time. Using information theoretic measures (Mutual information etc...) I hope to infer causal relationships (who is leading whom) in real animal groups. Thankfully a collaborator has just done most of the mathematical heavy lifting. What is left for us is to be creative about what variables (position, accelerations, change in heading etc...) to feed into the mathematical machinery to give us meaningful results. To valid date the approach, we would start by analyzing agent based swarming models, into which we have programmed leaders and followers, so we can see that the method detects the agents we know to be leaders. With that ground truthing done we can turn to a number of data sets of real animal groups (fish, birds, bison, baboons...), and test whether or not there are leaders and followers, and if so what traits (sex, age, boldness...) is most strongly linked to influence.

## Tanmoy Bhattacharya, Santa Fe Institute, Resident Faculty

Simon DeDeo, Santa Fe Institute, External Faculty
Beyond the Mind-Virus: the Cultural Phylogeny of Anti-Semitism
Simon DeDeo / simon@santafe.edu
(external faculty at Santa Fe; professor of Informatics and of the Cognitive Science department at Indiana University)

In any society, much of what passes for disinterested information is anything but: what individuals perceive to be {\it prima facie} reasonable facts are often complete fabrications. Their lack of veracity is compensated for by their social function.

One of the most persistent and perilous examples of this is anti-Semitism, the hatred and fear of the Jewish people. The varying yet correlated clusters of beliefs associated with anti-Semitism have been leveraged and incited by governments and mass political movements and have been a defining feature of global culture for thousands of years.

Far from being the simple mind viruses and memes described by Richard Dawkins, anti-Semitism is rather a network of prejudices and paranoias that draw on everything from envy and fear of political elites to hygenic, moral and even sexual disgust. Varying in time and place, they force us to go beyond simple population-genetic models of cultural transmission, to look at epistatic interactions among syntactic and semantic variations.

What we'll do:

We'll try to understand the cognitively complex development of anti-Semitism by analysis of multiple editions of some of its most virulent and persistent documents, including such texts as the Protocols of the Elders of Zion and Conspiracy of the Six-Pointed Star. We'll draw our data from core texts drawn found in the Hathi Trust, a large-scale library digitization project of which Indiana University is a member.

We'll learn some core tools for the modeling of natural language texts, including "topic modeling", a sophisticated statistical tool to extract large-scale regularities in word co-occurrence. Topic modeling has been used for everything from tracking signal emergence in the French Revolution to the study of Charles Darwin's reading habits. Here, it will allow us to decompose some of the themes -- and stylistic patterns -- of the texts, and thus the key ideas in play. Topic modeling should allow us to construct a geometry of anti-semitic ideas (for students with a taste for abstract mathematics, there is a sub-project here in the differential geometry of information theory with anti-semitism as a sample space).

Given successful completion of stage one, and based on these results, we will attempt an ambitious reconstruction of the evolutionary history of anti-semitism in the 19th and 20th Centuries. We will use phylogenetic reconstruction to track the complex branching, fragmentation, and synthesis of the multiple threads that define this terrifying phenomenon, and graph clustering algorithms developed by other members of the institute such as the message passing algorithms of Zhang and Moore.

Getting clear scientific answers to even the most basic of the questions addressed here may have profound importance not only for the cognitive, social, and political sciences, but for policy makers and those working to understand and manage a global world where rapid communication and idea-sharing can lead to both great good and great ill.

An ideal student would have some experience with programming (e.g., Python) and some comfort with mathematical reasoning; we can teach on the job and ahead of time. Most importantly a prospective researcher should have a keen interest in the development of culture and how it interacts with the human mind. In a Summer of intense work on a topic of profound political importance, we can expect plenty of unexpected results.

The researcher on this project will work with Simon DeDeo, who will be in residence from June 6th through August 16th and able to communicate beforehand on potential readings and places to begin. This work is in collaboration with Cassidy Sugimoto, professor of Information and Library Sciences, also at Indiana. Prof. Ken Stanley, an expert in evolutionary algorithms, will also be involved in this project, but will only be on site for the first few weeks of the REU.

## Stephanie Forrest, Santa Fe Institute, External Professor

1. Understanding and evolving software diversity:

Neutral landscapes and mutational robustness are believed to be important enablers of evolvability in biology. We apply these concepts to software, defining mutational robustness to be the fraction of random mutations to program code that leave a program’s behavior unchanged. Test cases are used to measure program behavior and mutation operators are taken from earlier work on genetic programming. Although software is often viewed as brittle, with small changes leading to catastrophic changes in behavior, our results show surprising robustness in the face of random software mutations. Depending on the student's interest, the REU project could either involve learning to use the GenProg software and running new experiments, or it could involve developing theoretical models based on earlier work in the biological literature.

2. Understanding and modeling large-scale cybersecurity threats:

Cyber security research has traditionally focused on technical solutions to specific threats such as how to protect desktops and mobile devices against the latest malware. This approach has greatly enhanced our ability to defend against specific attacks, but technical improvements will not be sufficient on their own. Today’s cyber issues involve social, economic, organizational, and political components, which are often pursued in isolation from technical reality. Forrest and Prof. Robert Axelrod (Univ. of Michigan) are collaborating on project that aims to address that gap by focusing on the problem of reducing risks arising from cyber conflicts, especially those among state actors. Our current plan is to focus on the attribution problem, that is, once a cyberattack has been discovered, what does it take to hold a particular actor responsible? Depending on the student's interest and background, this project could entail statistical modeling, developing game-theoretic or decision-theoretic models (e.g., when is it advantageous to plant false flags), to researching the current (public) state of the art for attributing cyberattacks.

3. Forrest and Moses are collaborating on a project to examine how ant colonies and immune systems form distributed information exchange networks to search, adapt and respond to their environments. The is characterizing search strategies quantitatively in terms of how information flows over networks of communicating components, in this case, ants or immune cells. It will measure how information improves performance, measured in terms of how quickly a colony finds seeds or the immune system neutralizes pathogens. By studying in detail how distributed interaction networks guide search in two distinct systems, this project aspires to formulate a general theory describing how decentralized biological networks are organized to search, respond and adapt to different environments, and how they effectively scale up to large sizes. As the research has progressed we have begun to test our theoretical understanding of distributed search strategies by implementing them as search algorithms in robotic swarms. This demonstrates practical applications as well as providing a controlled experimental system in which to test theoretical predictions.

## Mirta Galesic, Santa Fe Institute, Professor

I study spread of beliefs in social networks. As a psychologist, I am particularly interested in how cognitive processes such as perception of other’s beliefs and social learning rules affect belief dynamics. We can work on some of the many related issues by: a) building a computational model in which agents, using diverse cognitive rules, interact and adopt each other’s beliefs and/or revise their connections, or b) investigating a more abstract version of such a model analytically (e.g. using statistical physics framework), or c) collecting empirical data (e.g. on mTurk) to either c1) better understand social learning rules people use in real life (e.g. copy majority, copy best); c2) construct closed social environments with actual people but with pre-defined connections and learning rules, then track how initially rare beliefs or behaviors spread. Any of these components can be combined; e.g. empirical data can inform computational model, analytical models can be tested in agent-based simulations, etc.

## Josh Grochow, Santa Fe Institute, Omidyar Fellow

Title: Exploring the space of algorithms, or, finding novel algorithms for hard and easy problems using genetic algorithms

Mentor(s): Josh Grochow

Despite nearly half a century of algorithm development, our understanding of algorithms is still shockingly poor. For example, how long does it take to multiply two n-digit numbers? The state of the art is nearly O(n log(n)) steps, much better than the standard way we are taught in elementary school. Yet it's consistent with current knowledge that this could be done in only O(n) steps - i.e., that multiplying two numbers is just about as easy as adding them! No one knows the right answer. Such issues are even worse for the famous P versus NP problem, which essentially asks whether brute-force search can always be improved upon. Unlike multiplying numbers, virtually no progress has been made on this problem in the half-century that it's been around.

Part of the reason for our paltry understanding of these questions is that, despite thousands of new algorithms, we've actually only explored a tiny corner of the space of all possible algorithms. This was brought home with the discovery in the early 2000's of so-called holographic algorithms - classical algorithms, inspired by quantum computing, that take advantage of surprising cancellations to efficiently solve problems that no one would have thought could be efficiently solved.

In this project we propose to explore more of the space of algorithms by several methods. We'll mention one such method here, and can discuss others in person if you're interested.

We propose to use novelty-seeking genetic algorithms to help find examples of truly new classes of algorithms. Given that genetic algorithms have long been used to search for new and better algorithms, what's special here? At least three things: (1) in the last decade or so there have been great advances in using genetic algorithms to search for novel solutions, rather than to directly search for better solutions; whether or not this finds better algorithms, in looking to further explore the space of algorithms, novelty is a key aspect; (2) the methods used in these novelty-seeking algorithms, such as indirect encoding, match up well with frontier problems in computational complexity, and (3) we propose to apply genetic algorithms to problems that we think may not have efficient solutions, such as NP-hard problems. If the genetic algorithm fails to find a good algorithm, we hope to still extract some insight from the bad algorithms it finds. Our hope is that we'll either find genuinely new algorithms for these problems, or that we'll gain theoretical insight into how hard these problems truly are.

## Laurent Hebert-Dufresne, Santa Fe Institute, Postdoctoral Fellow

The first idea, and I am writing this in part for Markus (another post-doc who should be back at SFI in less than two weeks), is really cool you are interested in coming up with your own idea. We (Markus and I) have tons of data on city structure (think street networks) and human mobility (mobile phone data with almost perfect coverage). This is a great opportunity to look into different aspects of mobility and infrastructure, but also to ask cool question: can we better define a city than standard 'city limits' in terms of attraction and influence on people living there? how different are behaviors and mobility between urban and rural residents once we control for different infrastructure? I think this project is a perfect opportunity to just brainstorm and play around with lots of data. I'll be working with Markus on this one way or the other, but we would be happy to have creative people on our team.

Second, most of my work is in network theory, mainly trying to figure out the interplay between structure of networks and the dynamics that occur on networks. I have this vague idea of dynamics as the source of structure and it is much clearer since watching The Imitation Game (ok biopic about Alan Turing). I thought it could be super interesting to look at Turing patterns on clustered networks. Clustering represents geographical proximity and our tendencies to 'hang out' with a dense group of friends. On these clusters, I would expect the Turing patterns to create a hierarchy of structure, sub-groups within the clusters and groups of clusters above. This will be really interesting theoretically (fits perfectly with the tools I usually work with), but could also be linked to organization of human institutions and if we want to get crazy could maybe be tested experimentally.

Unfortunately, this was not short and not very clear, so if you are interested you can find me in my office in Pod C, around the foosball table, or by email.

## John Pepper, Santa Fe Institute, External Professor

Project Proposal:

This project is to refine and implement a new evolutionary heuristic algorithm for a classic NP-hard spatial optimization problem called the “Traveling Salesman Problem” The implementation could be done in any object-oriented programming platform with good spatial graphics, so we could decide that part together if you’re interested. -Faculty mentor: Dr. John W Pepper, external faculty

## Sid Redner, Santa Fe Institute, Resident Faculty

In case any of you are still looking for a summer project, I have several to suggest. The main prerequisite is the ability to progam. If you are interested in any of these projects, please feel free to stop by and we can talk:

1. Stochastic search processes with multiple searchers:

Consider a target that is placed at some unknown location in space and one has to find it by launching N>1 random walkers from a different starting point. First, suppose that the walkers are all launched simultaneously. How long does it take to find the target? This question is reasonably understood in one dimension (see Mortality, Redundancy, and Diversity in Stochastic Search, B. Meerson and S. Redner, Phys. Rev. Lett. 114, 198101 (2015)); however, many open questions remain. First, what is the search time in higher dimensions? Second, suppose that there is a fixed cost per unit time for each searcher. What is the optimum number of searchers that should be launched that minimizes the cost of the search. In one dimension, this optimum number turns out to be 6. What happens in higher dimensions? Finally, suppose that the searchers are not launched simultaneously, but rather, at some fixed rate. Is there an optimum rate that minimizes the search time? More generally, what other strategies could be effective if the searchers are stupid and move only by random walk motion?

2. Dynamics of Voter models

The Voter model is an idealized and appealing simple of opinion evolution in an interactive population. In this model, each person can be in one of two states—say + and -. Each voter has zero self confidence and updates his/her opinion by the following rule: pick a random neighbor and adopt the state of this neighbor. The dynamics involves picking a voter at random and updating its state according to the rule given above. This simple model is surprisingly rich and much is known about the dynamics, both numerically and analytically. My interest is to extend the Voter model to include some socially realistic generalizations, including: (i) asymmetric interactions in which the rate at which voter A influences B is different than the rate at which B influences A, and (ii) more than two states, and (iii) evolving number of states. The latter generalization is meant to account for increasing extremism, as occurred during the student demonstrations in the 60’s and the current slate of Republican Presidential candidates.

## Sam Scarpino, Santa Fe Institute, Omidyar Fellow

The dynamics of Ebola underreporting during the 2014 West Africa epidemic
Full Description of Project

Introduction
Infectious disease surveillance data can be unreliable during unfolding crises in which resources are limited and public health authorities have poor access to affected communities. In the ongoing 2014 Ebola epidemic in West Africa, surveillance efforts primarily detect cases that are treated in healthcare facility, and likely miss the sizable fraction of cases that do not seek care in a clinical setting [1, 2]. The CDC recently used a factor of 2.5 to correct for this underreporting in Sierra Leone and LIberia, but acknowledged that this quantity is essentially unknown [1]. Accurate outbreak projections and assessments of intervention strategies depend on reliable estimates of underreporting rates. However, underreporting can be a very dynamic process, potentially varying in time, space, and/or with outbreak size, and driven by intrinsic properties of the pathogen, human behavior, diagnostic practices, and the healthcare infrastructure. Mechanistic models that capture these factors is critically important for estimating, reducing, and correctly accounting for underreporting, but they do not yet exist. The primary difficulty we have faced in estimating underreporting is the limited data commonly available during an outbreak, namely confirmed and suspected cases and mortalities. From these data alone, one cannot estimate the rate of underreporting without making strong modeling assumptions and, even with such assumptions, we often lack statistical power to make precise estimates. However, new types of data are available from the ongoing EVD outbreak, including digital data - Facebook, Twitter, cell phones, HealthMap, etc - and Ebola virus genomic data [3]. These data, when coupled with an appropriate mathematical and statistical framework, provide a novel opportunity to model underreporting. For example, Facebook surveys can poll urban populations in West Africa regarding healthcare seeking behavior and, phylodynamic models can directly estimate the rate of underreporting from pathogen sequence data.

Here, we propose to construct an integrative framework, combining a mechanistic model of underreporting factors with survey-based and phylodynamic-based inference, to estimate dynamic rates of underreporting in the ongoing EVD outbreak. The results of this study will have an immediately impact, in enabling better EVD projections and analyses of intervention policies, and more broadly advance our understanding of and statistical analysis of reporting biases in emerging epidemics.

## Pan Zhang, Santa Fe Institute, Postdoctoral Fellow

Mentors: Cris Moore and Pan Zhang
The idea of statistical significance is tricky in networks. Do these three definitions of statistically significant community agree with each other?
1. communities stay the same when a few edges are added or removed (robustness)
2. communities help predict missing edges (if we train the algorithm on the edges we know about)
3. the statistical physics definition: communities correspond to states of low free energy (low energy and high entropy)