Summary: a woman wrote something like this: http://bit.ly/8BLUAw (describing her sexual orientation in an "anonymous" IMDB forum). Then, based on her other reviews and the data in the Netflix contest, a couple of hackers (security/privacy researchers) put 2 + 2 together and worked out her identity (not their intended purpose, but a side effect). She blames Netflix.
I'm actually unclear as to who's right and wrong here. Clearly, it seems unjust that she unknowingly outed herself, but how responsible is she for her online personas? I also wouldn't be surprised if Netflix has something in their ToS relating to this kind of "anonymous" release of information.
I think the responsibility does rest with Netflix, because it's almost outside the control of one person to keep identities separate these days. Every bit of information released vastly increases the chance of identification.
To give an example: let's say Google, as part of a new anti-fraud service, released an MD5 hash of every e-mail address with a Google account, plus an MD5 hash of every IP that account had been successfully logged in from.
Sounds fairly anonymous - except that every website owner would now be able to match up every duplicate-but-separate account in their database to find out who had two or more identities, even if the user had been careful to use two separate machines for that specific site.
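The matching described above can be sketched in a few lines of Python. Everything here is hypothetical: the `released` set stands in for the imagined Google data dump, the addresses and IPs are made-up example values, and `link_accounts` is just an illustrative helper name. The point is only that a site owner who already knows the plain e-mail addresses can hash them and join on the hashed IPs.

```python
import hashlib
from collections import defaultdict


def md5_hex(value: str) -> str:
    """Hash a value the way the hypothetical release would (MD5, hex digest)."""
    return hashlib.md5(value.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical "anonymous" release: (md5(email), md5(login IP)) pairs.
released = {
    (md5_hex("alice.main@example.com"), md5_hex("203.0.113.7")),
    (md5_hex("alice.alt@example.com"), md5_hex("203.0.113.7")),
    (md5_hex("bob@example.com"), md5_hex("198.51.100.20")),
}

# The site owner already knows the e-mail addresses of its own accounts.
site_emails = ["alice.main@example.com", "alice.alt@example.com", "bob@example.com"]


def link_accounts(site_emails, released):
    """Group site accounts whose e-mail hashes share a hashed login IP."""
    ips_by_email_hash = defaultdict(set)
    for email_hash, ip_hash in released:
        ips_by_email_hash[email_hash].add(ip_hash)

    emails_by_ip_hash = defaultdict(set)
    for email in site_emails:
        for ip_hash in ips_by_email_hash.get(md5_hex(email), ()):
            emails_by_ip_hash[ip_hash].add(email)

    # Only groups with two or more accounts are interesting.
    return [sorted(group) for group in emails_by_ip_hash.values() if len(group) > 1]


linked = link_accounts(site_emails, released)
print(linked)
```

Note that the site owner never needs to reverse a hash: knowing the inputs on one side of the join is enough.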
I wonder how many mothers of multiple children with Netflix subscriptions and IMDB accounts from Franklin County, Ohio, will be traveling to the location of the court in which the suit is being made on the date of the trial and have also visited the offices of the lawyer representing "Jane Doe". It seems as though perhaps the lawsuit may be insufficiently anonymizing her personal data.
I don't know that it was necessarily a good or a legal decision for Netflix to release the contest data, but it doesn't help the plaintiffs' case when they quote where the privacy policy of Netflix specifically states that they may disclose the information disclosed in the contest and immediately claim that the policy states no such thing.
Thanks Scott. I just want to say that I had nothing to do with this lawsuit. Also, this comment I posted earlier might be relevant: http://news.ycombinator.com/item?id=838226
It doesn't sound like this was a technical failing of Netflix's anonymization process but rather a matter of deduction from various independently anonymous pieces of data.
The thing is that perfect anonymization implies that the dataset would be useless since by definition it would contain no information. If you can begin correlating data points with enough outside information you will be able to extract at least a shadow of the original information.
> [The researchers] identified several NetFlix users by comparing their “anonymous” reviews in the Netflix data to ones posted on the Internet Movie Database website
So why sue Netflix instead of IMDB? Additionally, is there an expectation of privacy when posting movie reviews to public websites?
I believe the suit is targeted directly at Netflix because it allegedly violated stringent privacy laws related to video rental data.
Whether Netflix violated the person's privacy or not is debatable (hence it hasn't been settled yet); however, they certainly don't appear to have the intent to keep people's privacy:
> The suit is also asking the court to stop Netflix from launching its promised second contest to improve the recommendations — this time giving out user data that includes ZIP codes, ages and gender, along with movie ratings and ID numbers substituted for user names.
I'm not certain how ZIP codes work in the US, but I know that the postal code for my childhood home in the UK could place me within a 30-house range on my street. Adding age would narrow this down to ~7 people; adding sex brought it down to 3. Being 1 of these 3 people means I have a 50/50 chance of identifying two 'anonymous' people based solely on postal code, age and gender.
Lawsuits usually come down to intent, and Netflix arguably doesn't have the intent to keep its users' privacy if it's intending to release ZIP, age and gender information.
It depends on the type of Zip codes. Basic Zip codes are only 5 digits so ~30,000+ people per Zip code. However, Zip + 4 codes dramatically reduce that.
Considering ~3% of Americans are Netflix subscribers, only ~900 people per basic Zip code might be subscribers. Gender will halve this to ~450 people. (IIRC) each 5-year population group holds on average ~5.8% of the population, peaking at ~7.8% around the median age of ~38, so let's say an average of 1.2% of the population is of any individual year of age.
This means, on average, you should still be able to place someone down to ~5 people. God forbid you're a 110-year-old using Netflix. If Netflix is releasing it in 5-year groupings, that still puts you in a group of ~30 people for a Zip code of ~30,000.
I'm unsure if any data release like this counts as anonymous.
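The back-of-the-envelope numbers in this comment are easy to re-run. This is only a sketch of that arithmetic; every constant is the commenter's rough estimate (people per ZIP, subscriber share, age shares), not measured data, and the variable names are made up for illustration.

```python
# All inputs are the rough estimates quoted in the comment, not real statistics.
people_per_zip = 30_000        # basic 5-digit ZIP
subscriber_share = 0.03        # ~3% of Americans subscribe
single_year_share = 0.012      # ~1.2% of people share a given year of age
five_year_share_avg = 0.058    # average 5-year age band
five_year_share_peak = 0.078   # band around the median age

subscribers = people_per_zip * subscriber_share
same_gender = subscribers / 2
same_year_of_age = same_gender * single_year_share
same_five_year_band_avg = same_gender * five_year_share_avg
same_five_year_band_peak = same_gender * five_year_share_peak

print(f"subscribers per ZIP:        {subscribers:.0f}")
print(f"same gender:                {same_gender:.0f}")
print(f"same year of age:           {same_year_of_age:.1f}")
print(f"same 5-year band (average): {same_five_year_band_avg:.1f}")
print(f"same 5-year band (peak):    {same_five_year_band_peak:.1f}")
```

Under these assumptions a single year of age leaves a candidate pool of about five people, and a 5-year band leaves somewhere between the mid-20s and mid-30s, which matches the "~30" figure.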
I don't think Netflix is going to release their customer list, so the first reduction is not directly possible. Also, splitting the population into equivalent-size groupings is a normal approach, so you might start with 18-21, 21-26... and end with 85+.
Also, it's normal to add / remove ~1% of your sample to remove some edge cases and muddy the waters.
It's a little off topic, but I'm intrigued by the '87% of Americans can be uniquely identified by DOB, gender and zip code'. Given the size of the US's population and the relative scarcity of zip codes, this seems an incredible claim. The link in the article is broken so I can't read the paper. Even just thinking about the big cities, where I imagine virtually no one would be identifiable, that figure sounds impossibly high. Maybe the figure actually refers to just working adults or something like that. Does anyone have any more information about this, or access to the original paper?
Actually, I was thinking about DOB as being just the day and month, not day, month and year. If the year's included then this seems less far fetched, although I'd still be surprised if it's correct.
~300 million Americans spread among ~100,000 zip codes yields an average of 3,000 people per zip code. So zip code alone narrows a person down to a set of around 3,000. Gender then to 1,500.
That's only about 4.1 years' worth of unique birthdays! Unless the zip code is dominated by a certain age group (as around a college campus), there should be a lot of unique birthdays, since those will be drawn from a range of many decades.
Do densely-packed cities change things much? A bit, but they probably also have helpful age-diversity. San Francisco has 800,000 people and 25 zip codes: 32,000 per zip code; 16,000 per gender; 16,000 unique birthdays would cover about 43.8 years.
So there are sure to be more collisions, but still many unique zip/sex/DOB combinations. And for all the zip codes with < 3,000 residents, the people uniquely addressable could be 97, 98, 99%.
If we take zip+4, it's all over. My 12-unit apartment building has 2 zip+4 codes all by itself, so zip+4 identifies a group of < 12 people.
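To put a rough number on "a lot of unique birthdays", here is a small sketch of the zip/sex/DOB estimate from this comment. It assumes birthdates are spread uniformly and independently over an 80-year span, which is a simplification introduced here (real age distributions are lumpier), and `expected_unique_fraction` is just an illustrative helper.

```python
DAYS_PER_YEAR = 365.25
AGE_SPAN_YEARS = 80  # assumption: birthdates spread evenly over 80 years


def years_of_birthdays(group_size: int) -> float:
    """How many years of distinct birthdates the group would fill if nobody collided."""
    return group_size / DAYS_PER_YEAR


def expected_unique_fraction(group_size: int, possible_birthdates: int) -> float:
    """Chance that nobody else in the group shares a given person's full date of birth,
    assuming uniform, independent birthdates."""
    return (1 - 1 / possible_birthdates) ** (group_size - 1)


possible_birthdates = round(AGE_SPAN_YEARS * DAYS_PER_YEAR)

average_zip = expected_unique_fraction(1_500, possible_birthdates)
san_francisco_zip = expected_unique_fraction(16_000, possible_birthdates)

print(f"average zip, one gender (1,500 people):  {average_zip:.1%} unique")
print(f"SF-sized zip, one gender (16,000 people): {san_francisco_zip:.1%} unique")
print(f"1,500 people fill {years_of_birthdays(1_500):.1f} years of birthdays")
```

With these assumptions, roughly 95% of people in an average-sized zip code have a unique zip/sex/DOB combination, and still well over half do in a zip code as dense as the San Francisco example, so a high overall figure is not implausible.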
“a privacy blunder that could cost millions of dollars in fines and civil damages.”
Since they considered the knowledge gained from the original contest cheap at $1 million, I'm sure the bigwigs at Netflix are wondering, "How many millions?"
They only considered the contest cheap because doing the equivalent within the company would be expensive, take forever and possibly not produce anything usable. The contest, by paying out only under certain conditions, guaranteed a usable result and thousands of people working on it. So it was much cheaper than the alternative in more ways than just financially. It was a brilliant idea IMO. Releasing the data without passing on liability of some kind, maybe not so brilliant.
I think this demonstrates why people should be more careful about what they post in public forums, under their own names. The ability to make associations like this is only going to become easier.
I actually saw this lady at a bar once kissing another woman. But I had no idea she was a lesbian until I wrote a deanonymizer on a dataset of millions of rows and then combed through her IMDB posts to find one very suggestive comment.
Give me a break.
> The lead attorney on the new suit recently reached a multimillion-dollar settlement with Facebook over its failed Beacon program
It really annoys me when attorneys try to make millions when honest people trying to improve the world make a mistake.
So Beacon was a bad idea. Netflix should have asked for permission before releasing a user's anonymized data. But I think they learned their lessons.
Why should some random attorney who builds nothing get paid millions and obstruct these companies from continually trying to innovate?
> I actually saw this lady at a bar once kissing another woman. But I had no idea she was a lesbian until I wrote a deanonymizer on a dataset of millions of rows and then combed through her IMDB posts to find one very suggestive comment.
The issue is really whether or not 'everyone' will know that she is a lesbian (really bisexual, if you read her IMDB comment which someone posted above). I highly doubt that someone wanting to conceal her identity as a lesbian/bisexual woman would go to a 'straight bar' and start making out with a woman. So the presumption here would be that you saw her at a gay bar kissing a woman. If so, not too many 'anti-gay' people go to gay bars, so I would think that she is relatively safe from discrimination in such a scenario.
> Netflix should have asked for permission before releasing a user's anonymized data. But I think they learned their lessons.
Obviously not, because Netflix's second contest will release even more information on users (like zip/postal code, age, etc). Does that sound like someone who has 'learned their lesson'?
While the first contest could be put down to good faith, the second one definitely shows them at least attempting to push the boundaries.
That is unclear since she has not been identified in the lawsuit. But several people were successfully identified, and she might have been one of them.
Whether or not she was outed then, I'm willing to bet money that at some point during the lawsuit she will be tracked down and outed in a rather public fashion. Filing lawsuits that upset tech people is not a good way to protect your privacy...
This is my assumption as well. I mean, this lady does not really want her privacy protected if she is taking Netflix to court; she wants money to compensate for what little damage was done. To be honest, people are much more accepting of gay people, and she would have been able to stay relatively closeted (depending on the size of her home town) compared to what will happen when this becomes a public spectacle.
> I'm actually unclear as to who's right and wrong here. Clearly, it seems unjust that she unknowingly outed herself, but how responsible is she of her online personas? I also wouldn't be surprised if Netflix has something in their ToS relating to this kind of "anonymous" release of information.
This is a complicated scenario.