Friday, June 8, 2007

A Guide to Similarity %

We're seeing more and more avatars on movie pages -- and that's fun: the reviews have somehow changed for us, from something like "content" into something more human, more interesting. You put a photo on your reviews and they become as much about you as they are about the movie. And people now seem a lot more interested in clicking on those photos to see what else the reviewer likes and has reviewed.

For me, though, the key is the Similarity % (Sim%). It has no fixed scale; it's just a relative value. Is 50% similar good? What does it mean? So here are some comments on Sim%.

How is it calculated? Netflix uses algorithms comparable to those employed in the "Cinematch" engine, which recommends movies -- but turns them around. The computers take all the movies you've connected with -- rented carries the most weight, but rated or even just added to your queue also counts -- to get a signal about your taste. Then we compare those movies to the same set from each reviewer and generate a number. But it's not an absolute value. Sometimes there is little direct overlap of titles, but there is overlap in "similar" titles or, more importantly, an overlap of genres. You might not have seen (or rated) the same set of movies I did, but if we are interested in the same kind of movies, that makes us similar. We do this very quickly to get a general sense of similarity.
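
For the curious, here is a rough Python sketch of that idea. It is not the actual Netflix formula: the signal weights, the cosine comparison, the 50/50 blend of title and genre overlap, and every name in it (SIGNAL_WEIGHTS, similarity_percent, genre_share) are assumptions made up for illustration. The post only tells us that rentals count the most and that genre overlap matters even when titles don't match.

```python
from collections import Counter
from math import sqrt

# Hypothetical weights: the post only says that renting counts the most.
SIGNAL_WEIGHTS = {"rented": 3.0, "rated": 2.0, "queued": 1.0}


def title_profile(interactions):
    """Map each title to the weight of its strongest signal."""
    profile = {}
    for title, signal in interactions:
        weight = SIGNAL_WEIGHTS[signal]
        profile[title] = max(profile.get(title, 0.0), weight)
    return profile


def genre_profile(profile, genres_by_title):
    """Roll title weights up into per-genre totals."""
    totals = Counter()
    for title, weight in profile.items():
        for genre in genres_by_title.get(title, ()):
            totals[genre] += weight
    return dict(totals)


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similarity_percent(mine, theirs, genres_by_title, genre_share=0.5):
    """Blend title overlap and genre overlap into a 0-100 score (assumed blend)."""
    p, q = title_profile(mine), title_profile(theirs)
    title_sim = cosine(p, q)
    genre_sim = cosine(genre_profile(p, genres_by_title),
                       genre_profile(q, genres_by_title))
    return round(100 * ((1 - genre_share) * title_sim + genre_share * genre_sim))


# Two people with no titles in common but the same kinds of movies.
genres = {"Alien": ["sci-fi"], "Aliens": ["sci-fi"],
          "Heat": ["crime"], "Ronin": ["crime"]}
me = [("Alien", "rented"), ("Heat", "rated")]
you = [("Aliens", "rented"), ("Ronin", "queued")]
score = similarity_percent(me, you, genres)
```

In this toy example the two lists share no titles at all, yet the score comes out well above zero because both lean toward the same genres, which is the behavior described above.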

What is a "good" match? Like I said, it's all relative -- if you and I are 60% similar, I may not know precisely what that means, but it suggests we're more similar than I am to someone I'm 55% similar to. The wisdom around here is that if you are 70% similar to someone, that's pretty darn similar. 80% is dead on. My very best friends -- with whom I would see ANYTHING they liked, most of the time -- I'm in the high 80s with. And I'm not 90% similar to anyone I know (although I sometimes find reviewers who share that much taste with me). Below 50%, I tend to check carefully whether I agree with their Favorite movies...

With your Friends list, we add a few more passes through the algorithm to get an even subtler read on taste similarity. Here we push up the emphasis on how you and I rate movies, and on how common that kind of rating is for a movie. If you and I love a movie that the whole world loves, that doesn't really make us all that similar, but if you and I love a movie that everyone hates, well then, that's worth noting. So we do.
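
One way to picture that "worth noting" effect, as a sketch only: measure each rating against the crowd's average for that movie, and let a shared rating count for more the further both of us sit from the crowd. The function name, the made-up titles and the averages below are all hypothetical, and the real Friends passes are surely more involved than this.

```python
def rating_contributions(mine, theirs, crowd_average):
    """For each movie we both rated, multiply our deviations from the crowd.

    A large positive value means we agree with each other while
    disagreeing with everyone else; a value near zero means we simply
    rated the movie the way most people did.
    """
    contributions = {}
    for title in mine.keys() & theirs.keys():
        avg = crowd_average[title]
        contributions[title] = (mine[title] - avg) * (theirs[title] - avg)
    return contributions


# Hypothetical titles and averages on a 1-5 star scale.
crowd_average = {"Crowd Pleaser": 4.5, "Cult Oddity": 1.5}
my_ratings = {"Crowd Pleaser": 5, "Cult Oddity": 5}
your_ratings = {"Crowd Pleaser": 5, "Cult Oddity": 5}

contributions = rating_contributions(my_ratings, your_ratings, crowd_average)
```

Both of us gave both movies five stars, but the movie everyone else hated contributes far more to the result than the one everyone loved.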

One note: With Friends, the Sim% is asymmetric--that is, I can be more similar to you than you are to me. If you have seen 10 movies and I have seen 100, including all 10 of yours, some intricacies in the formula produce a (small) difference between us--you with 10 movies will be MORE similar to me than I am to you, because I've seen so many movies that you haven't. The presumption is that if you've only seen 10 and I've seen 100, I may have a far wider range of interests than you. If you watch (or rate) 90 more, and there is still good overlap in interest, that pretty much eliminates the difference, but until then there is a lot of uncertainty with your smaller dataset. (We actually don't like this asymmetry very much, and are exploring that part of the equation even as we speak.) I know I was disappointed to learn that my very best (most similar) friend--who was 89% similar to me--didn't hold me in a comparable position, and I was only 80% similar to him. That was a bit of a letdown. (I'm rating more movies and the difference is shrinking.)
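
To show how a formula can end up lopsided like this, here is a toy Python sketch. It is an assumption from top to bottom, not the formula we actually use: it blends a symmetric overlap measure with a small directional term (the share of my titles that you have also seen), and the 0.1 asymmetry knob is invented for the example.

```python
def overlap_coefficient(a, b):
    """Symmetric part: shared titles divided by the smaller history."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def coverage(a, b):
    """Directional part: the share of a's titles that b has also seen."""
    return len(a & b) / len(a) if a else 0.0


def directional_similarity(a, b, asymmetry=0.1):
    """How similar a is to b, seen from a's side (hypothetical blend)."""
    return (1 - asymmetry) * overlap_coefficient(a, b) + asymmetry * coverage(a, b)


# I have seen 100 movies; you have seen 10, all of which I have seen too.
mine = {f"movie-{i}" for i in range(100)}
yours = {f"movie-{i}" for i in range(10)}

you_to_me = directional_similarity(yours, mine)
me_to_you = directional_similarity(mine, yours)

# Later you watch the other 90 as well, and the gap closes.
caught_up = set(mine)
gap_after = abs(directional_similarity(caught_up, mine)
                - directional_similarity(mine, caught_up))
```

With these inputs, you_to_me comes out higher than me_to_you, since everything you have seen is in my history but most of mine is missing from yours. Once the two histories match, the two directions agree again.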

Like the recommendation engine at Netflix, these mathematical formulas are something we continually improve (see the Netflix Prize). The only (somewhat cryptic) thing I'd add is that we're only scratching the surface of the cool things we can do once we have calculated this Sim%, and you will be seeing more use of the tool throughout the year. Here's my question of the week: besides being able to find and save other people who are very similar to you, and sorting reviews based on (among other things) how similar the reviewer is to you, what ways can you imagine applying the Sim%?

Do you find it useful? Interesting?