Breaking Bad vs. Superman: Applying Data Science to Our Passion Projects
Metis Data Science Bootcamp (Flyer)
- 96% of data is from the last 2 years.
- It's projected that by 2017, there will be 150,000 data science jobs with no one qualified to fill them.
Irmak's Data Science Passion Project
- Is my reaction to a movie predictable?
- Star Wars as an example. Discusses how his own rating of the movie changed over time (e.g. at age 9 vs. 13 vs. 30).
- Suggesting that movie reactions are difficult to predict.
- 2006 - Netflix formulates the Cinematch algorithm, which draws on the more than 1 billion ratings on Netflix at that time
- Do you have a "soulmate" in taste?
- Perfect soulmate - rates everything exactly the same as you, but he's seen Book of Eli and you haven't. Since he liked it, you'll like it.
- Calculated soulmate - draws on a group of people who roughly match your tastes and formulates a composite "soulmate" to use as the basis for predictions. Each person is weighted by how closely their tastes have aligned with yours in the past, so the prediction is a weighted average.
- These were the ideas behind Cinematch. The algorithm does really badly with movies that have rarely been watched.
- Also, movies that elicit love/hate reactions are the biggest source of error. Popular but weird movies. Think Napoleon Dynamite.
- If you were to take the mean score of every movie and use that as the predicted rating, the error (a root-mean-square error, measured on the Netflix star system) was determined to be 1.0540 stars.
- The error of the Cinematch algorithm ("soulmates"), measured the same way, was determined to be 0.9525 stars.
- So Cinematch was a 9.6% increase in accuracy over the trivial mean score method: (1.0540 - 0.9525) / 1.0540 ≈ 0.096. And I think the stat about how much this increased viewership on the "top" movies was 1200%.
- Also in 2006 - Netflix announces $1M bounty for a further 10% improvement.
- 2009 - the Netflix bounty is awarded, team managed to achieve a 10.09% accuracy improvement.
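To make the "calculated soulmate" idea concrete, here's a minimal Python sketch of a similarity-weighted average. This is not Cinematch itself, whose internals weren't covered. The toy ratings, the names `similarity` and `predict`, and the choice of similarity measure (1 / (1 + mean absolute difference on shared movies)) are all my own assumptions for illustration.

```python
# Hypothetical toy data: viewer -> {movie: stars on a 1-5 scale}.
ratings = {
    "you":   {"Star Wars": 5, "Troll 2": 1, "Indiana Jones": 4},
    "alice": {"Star Wars": 5, "Troll 2": 1, "Indiana Jones": 4, "Book of Eli": 4},
    "bob":   {"Star Wars": 4, "Troll 2": 2, "Indiana Jones": 5, "Book of Eli": 3},
    "carol": {"Star Wars": 1, "Troll 2": 5, "Indiana Jones": 2, "Book of Eli": 1},
}


def similarity(a, b):
    """How closely two viewers' past ratings align, in (0, 1].

    Assumed measure: 1 / (1 + mean absolute difference) over the movies
    both have rated. Returns 0.0 when they share no movies.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(a[m] - b[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def predict(ratings, viewer, movie):
    """Similarity-weighted average of other viewers' ratings for `movie`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == viewer or movie not in their_ratings:
            continue
        weight = similarity(ratings[viewer], their_ratings)
        weighted_sum += weight * their_ratings[movie]
        total_weight += weight
    # None means nobody comparable has rated the movie yet.
    return weighted_sum / total_weight if total_weight else None


prediction = predict(ratings, "you", "Book of Eli")
print(round(prediction, 2))
```

Here "alice" is a perfect soulmate (weight 1.0), so her rating dominates, while "carol", whose taste is the opposite, barely counts. It also shows why rarely watched movies are a problem: with few or no raters, there is little or nothing to average.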
Looking at Movie Data Differently
- Before: solid assumptions. You have a certain taste, your taste dictates your ratings for unwatched content, and after you watch it this will be confirmed. This is largely wrong.
- Your taste changes over time (even day to day). Your ratings are also affected by how many ratings you've given that day and by your average rating for the day.
- Taking these time-dependent rating tendencies into account makes for a more accurate algorithm than Cinematch without even considering movie content. I thought this was significant: ignoring the movies themselves and looking only at your rating behavior surrounding them made for a more accurate algorithm.
- We cannot explicitly compare a movie with all others we've seen.
- Environmental factors play a huge role.
- Some people are followers, some people behave like "hipsters", once you kinda figure a person out you can make accurate predictions.
- Take "Music Lab", an experimental website for downloading music. When other people's ratings are invisible, you get more or less equal ratings. When other people's ratings (or the illusion of other people's ratings) are shown and quantified, things get interesting.
- Social influence plays a huge role in what will be a hit, what will be a miss.
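As a rough illustration of the "rating behavior, not movie content" point, here's a small Python sketch that predicts a rating from nothing but who is rating and on which day. It is my own simplification, not the model from the talk: the log, the function name `day_aware_prediction`, and the `shrink` parameter are all hypothetical.

```python
# Hypothetical rating log: (user, day, stars). Movie titles are left out
# on purpose, since the prediction never looks at them.
log = [
    ("you", "mon", 4), ("you", "mon", 5), ("you", "mon", 5),
    ("you", "tue", 2), ("you", "tue", 3),
    ("you", "wed", 4),
]


def day_aware_prediction(log, user, day, shrink=2.0):
    """Guess the user's next rating on `day` from rating behavior alone.

    Blends the user's overall mean with that day's ratings. `shrink` is an
    assumed constant that acts like that many extra ratings equal to the
    overall mean, so a day with few ratings stays close to the long-run mean.
    """
    all_stars = [stars for u, _, stars in log if u == user]
    if not all_stars:
        return None  # no history for this user
    user_mean = sum(all_stars) / len(all_stars)
    day_stars = [stars for u, d, stars in log if u == user and d == day]
    return (shrink * user_mean + sum(day_stars)) / (shrink + len(day_stars))


generous_day = day_aware_prediction(log, "you", "mon")
grumpy_day = day_aware_prediction(log, "you", "tue")
no_data_day = day_aware_prediction(log, "you", "thu")
print(generous_day, grumpy_day, no_data_day)
```

On a day when you've been handing out high ratings, the guess drifts up; on a grumpy day it drifts down; with no ratings that day it falls back to your overall mean.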
Diving in Further
- Degree of liking is difficult to predict consistently and accurately with a number.
- The difficulty in answering "What are your top 20 movies?" (if you really sit down and think about it) illustrates how degree of liking is sensitive and vague.
- "Enjoyment" from a movie is a very high-dimensional concept. There are movies that yield completely different flavors of reaction from you, so how is that supposed to be boiled down to a single number like 4.3?
- For the most part, it's straightforward to compare just 2 movies. Analyze all the pairwise comparisons together to see where each movie ultimately stands.
- If Star Wars > Indiana Jones and Indiana Jones > Troll 2, Star Wars > Troll 2 can be inferred. Simple statistics.
- Elo rating system (The Social Network's "Facemash")
- Bayesian ranking algorithms (Microsoft utilized it for Halo matchmaking)
- Elo and Bayesian were originally applied to chess.
- Asking about movies you're really uncertain about is better than asking about ones where your answer is already predictable - it produces more valuable data.
- Quantifying human reactions is hard.
- Many comparisons for a movie will average out environmental factors.
- Don't necessarily want to average out social influence, that's part of nature.
- Most important part of data science is design - figuring out the right questions to ask.
- SQL + Python for this movie project
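Here's a minimal Python sketch of how pairwise "which did you like more?" answers can be turned into a ranking with an Elo-style update. The constants (start at 1500, K = 32, the 400-point scale) are the conventional chess values, which I'm assuming here; the talk didn't give specifics. The `most_uncertain_pair` helper is my own guess at how to pick the next question: ask about the two movies whose current scores are closest.

```python
from itertools import combinations

K = 32  # conventional Elo step size (assumed, not from the talk)


def expected(score_a, score_b):
    """Elo's estimated probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400))


def record_preference(scores, winner, loser, k=K):
    """Update both scores in place after the viewer prefers `winner`."""
    surprise = 1.0 - expected(scores[winner], scores[loser])
    scores[winner] += k * surprise
    scores[loser] -= k * surprise


def most_uncertain_pair(scores):
    """Pair with the closest scores, i.e. the least predictable outcome."""
    return min(
        combinations(sorted(scores), 2),
        key=lambda pair: abs(scores[pair[0]] - scores[pair[1]]),
    )


scores = {movie: 1500.0 for movie in ("Star Wars", "Indiana Jones", "Troll 2")}
record_preference(scores, "Star Wars", "Indiana Jones")
record_preference(scores, "Indiana Jones", "Troll 2")

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
print(most_uncertain_pair(scores))
```

Star Wars and Troll 2 were never compared directly, yet Star Wars still ends up ranked above Troll 2, which is the inference from the Star Wars / Indiana Jones / Troll 2 bullet. With many comparisons per movie, individual noisy answers matter less.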
Hoping this isn't too hard to follow.
The Uncubed people are launching a service that seems like it will be similar to Treehouse: a pay-monthly, no-contract service for learning startup topics through video-heavy modules. It's called Uncubed Edge. Attendees were already able to sign up at a locked-in $5/month, and when the public launch happens it will still be only $20/month. If the content there is anything like today's event, it's worth every penny. At $5, signing up was a no-brainer for me - I may write a full review in the future. No modules are up currently. Update: An Uncubed Edge review