Movie Project

From Santa Fe Institute Events Wiki

Predicting Metadata from Network Structure


This is a 'meta' task: the idea is to use machine learning, or any other suitable technique, to predict properties of a movie such as its success or genre from the structure of its character network.
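As a starting point, each character network could be reduced to a small vector of structural features that a classifier or regressor then consumes. The sketch below is only an illustration under assumptions: the function name `network_features`, the choice of features, and the toy edge list are hypothetical and not taken from the project code, and the graph is treated as undirected and unweighted.

```python
from itertools import combinations


def network_features(edges):
    """Return simple structural features of an undirected, unweighted graph.

    `edges` is an iterable of (character_a, character_b) pairs. Nodes that
    never appear in an edge are not represented (an assumption of this sketch).
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n = len(neighbors)
    m = sum(len(nb) for nb in neighbors.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    # Local clustering: fraction of a node's neighbor pairs that are linked.
    local = []
    for nb in neighbors.values():
        k = len(nb)
        if k < 2:
            local.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nb, 2) if b in neighbors[a])
        local.append(2 * links / (k * (k - 1)))

    return {
        "n_nodes": n,
        "n_edges": m,
        "density": density,
        "mean_degree": 2 * m / n if n else 0.0,
        "mean_clustering": sum(local) / n if n else 0.0,
    }


# Toy network: a triangle A-B-C with one extra character D attached to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
features = network_features(toy_edges)
```

Feature dictionaries like this one, computed per movie, could then be joined with the metadata and passed to whatever learning method the group settles on. The empty-graph case returns zeros, which matters given the movies with 1 and 0 nodes.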

First Few Tasks

  • Script to download all movie galaxies (MS -- done; see Andrew's post in Slack)
  • Conversion from Gephi to a useful format (MS -- done; note that there is one broken file and two movies with 1 and 0 nodes!)
  • Network comparison (MS -- currently running; note that you need to get ORCA for graphlet counting)
  • Get DigitalSmiths data into a usable format (Will -- almost done; tons of good metadata such as Rotten Tomatoes scores)


  • Michael Schaub
  • Andrew Meller
  • Xiao (Thomas) Zhang
  • Lu Liu
  • Harrison Smith
  • Will Hamilton

Network Construction and Time Dynamics


The main goal here will be to look at the time dynamics of the movie character networks, with a particular focus on how characters are introduced to the network. We can use this analysis to see how stories develop through the network construction. This can be compared between movies to see how similar network construction and dynamics are across movies.
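One simple way to look at how characters enter the network is a growth curve: the cumulative number of characters and of distinct co-appearance edges after each scene. The sketch below assumes a movie is available as an ordered list of scenes, each listing the characters present; that input format, the function name `growth_curve`, and the toy data are hypothetical.

```python
from itertools import combinations


def growth_curve(scenes):
    """Return one (n_characters, n_edges) pair per scene, both cumulative.

    `scenes` is an ordered list of character lists. Two characters are
    linked once they have shared at least one scene.
    """
    seen_chars = set()
    seen_edges = set()
    curve = []
    for cast in scenes:
        cast = set(cast)
        seen_chars |= cast
        # Sort so that each undirected pair has a single canonical form.
        for a, b in combinations(sorted(cast), 2):
            seen_edges.add((a, b))
        curve.append((len(seen_chars), len(seen_edges)))
    return curve


toy_scenes = [["Ann", "Bob"], ["Bob", "Cat"], ["Ann", "Bob", "Cat"], ["Dan"]]
curve = growth_curve(toy_scenes)
# curve == [(2, 1), (3, 2), (3, 3), (4, 3)]
```

To compare movies of different lengths, the scene index could be rescaled to the fraction of the movie elapsed before the curves are compared.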


  • Moriah Echlin
  • Dan Biro
  • Will Hamilton

Trope network


There is another dataset, from TV Tropes, that I would be happy to bring into this project. Tropes are storytelling elements (if you go to the site and read a few entries, you will quickly get a sense of them). The dataset contains ~3,500 movies with a list of tropes for each, as well as the movie's year, IMDb rating, and box office.

I am interested in studying story archetypes (typical plots). From a network perspective, it may be possible to build a directed network of "narrative" tropes (identified in , but may need more inspection), where edge directions represent temporal order. The time sequence of tropes is not represented in the TV Tropes data, so I am wondering whether any of Will's datasets could shed some light on it. If the network construction is successful, extracting the backbone of the network would show us the most commonly used story arcs in movies, among other things.
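To make the idea concrete, here is a minimal sketch of the construction, assuming an ordered trope sequence per movie could be obtained from some other source (the TV Tropes data itself has no ordering). The trope names, the function names, and the plain count threshold used as a stand-in for proper backbone extraction are all assumptions for illustration.

```python
from collections import Counter


def build_trope_network(sequences):
    """Count directed edges (trope -> next trope) across all movies."""
    weights = Counter()
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            if src != dst:
                weights[(src, dst)] += 1
    return weights


def backbone(weights, min_count):
    """Keep only edges seen in at least `min_count` transitions.

    A naive threshold, not a statistical backbone-extraction method.
    """
    return {edge: w for edge, w in weights.items() if w >= min_count}


# Hypothetical ordered trope sequences, one list per movie.
toy_sequences = [
    ["CallToAdventure", "Mentor", "FinalBattle"],
    ["CallToAdventure", "Mentor", "Betrayal", "FinalBattle"],
    ["Mentor", "FinalBattle"],
]
weights = build_trope_network(toy_sequences)
core = backbone(weights, min_count=2)
# core keeps CallToAdventure -> Mentor and Mentor -> FinalBattle
```

Paths through the surviving edges would then be candidate story arcs. A significance-based filter would probably be a better choice than a raw count once real data is available.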

This is only a half-baked idea, and I would love to hear any ideas or comments. If anyone is interested, please let me (Elise) know.


Yizhi (Elise) Jing

Natural Language Processing of Dialogues



This subproject works with the Cornell Movie-Dialogs dataset, which is already clean.


The aim is to explore the semantic information contained in the dialogues (both dynamic and static), and ideally to complement the other subprojects (on the films that overlap between datasets) by providing new features for data mining.


Put your ideas here

  • (Juste) Use sentiment analysis to build temporal profiles of how sentiment evolves over the course of a movie. Try to find typical profiles by time-series clustering; check whether they correspond to movie classifications.
  • (Lu) Study differences between male and female characters using sentiment analysis, and how gender differences evolve over time and across genres.
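A temporal sentiment profile of the kind described above could be sketched as follows. This is only an illustration under assumptions: the tiny word lists stand in for a real sentiment lexicon or model, and the function names, the binning by line position, and the toy dialogue are hypothetical.

```python
# Placeholder lexicon; a real analysis would use a proper sentiment resource.
POSITIVE = {"love", "great", "happy", "good"}
NEGATIVE = {"hate", "dead", "sad", "bad"}


def line_score(line):
    """Positive-word count minus negative-word count for one dialogue line."""
    words = [w.strip(".,!?;:\"'").lower() for w in line.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_profile(lines, n_bins):
    """Mean line score in each of `n_bins` consecutive chunks of the movie."""
    if not lines:
        return [0.0] * n_bins
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for i, line in enumerate(lines):
        b = min(i * n_bins // len(lines), n_bins - 1)
        totals[b] += line_score(line)
        counts[b] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


toy_lines = [
    "I love this town.",
    "What a great day!",
    "He is dead.",
    "I hate you, I hate this.",
]
profile = sentiment_profile(toy_lines, n_bins=2)
# profile == [1.0, -1.5]
```

Because every movie is mapped to a vector of the same length, the profiles can be fed directly to a time-series clustering method. Restricting `lines` to those spoken by one group of characters would give per-group profiles for the gender comparison.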


  • Marius Somveille
  • Lu Liu
  • Juste Raimbault
  • Will Hamilton
  • Harrison Smith