We use semantic information inferred from comments and submissions. I think using stylometry would be a great addition, but it would be hard to google for "guy who writes fancifully with many puns" rather than "indie developer in Switzerland". I think stylometry is better suited to verification: once you have a small set of candidates, stylometry could further narrow them down and be used to make a decision.
We test different methods; in section 2 we use LLM agents to identify people agentically. We don't share any code here, but you could try it on yourself with various freely available agents.
That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5, and we try similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see whether LLMs beat Narayanan's method purely on movie reviews. LLMs are still much better (getting about 8%, though the average person only had 2.5 movies and 48% shared only one movie, so matching is very difficult).
Awesome, I saw the mention in the introduction but I haven't yet had a chance for a thorough read-through of the paper -- I've just skimmed it. Looking forward to reading it in depth!
We do advocate for stricter controls on data access on social platforms because of this. It's a bit of an unfortunate trade-off, but I think allowing mass scraping or bulk downloads of data from social sites opens the door to more and more kinds of misuse.
There is also a practical issue here: people usually don't write a lot on LinkedIn; most profiles are just structured biographical information. We use very limited stylometry in section 6, for matching Reddit users whom we synthetically split by time.
We don't use (much) stylometry, so this won't help. It's totally something you could try, but we rely on interests and clues: semantic information you reveal about yourself.
Thanks for providing the details, I've just been lazy about reading the paper :))
I'm not a fan of your proposed changes, as they further lock down platforms.
I'd like to see better tools for users to engage with. Maybe if someone is in a Firefox private (or anonymous) tab they should be warned when writing about locations, jobs, politics, etc. Even a small local LLM would be useful there; not foolproof, but an extra layer of checks. Paired with protection against stylometry :D
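As a crude stand-in for the local-model check described above, even a simple keyword heuristic could flag a draft post that mentions sensitive topics before it is submitted. The categories and keyword lists below are illustrative assumptions, not anything from the paper:

```python
# Hypothetical sensitive-topic keyword lists; a real tool would use a
# small local LLM instead of substring matching.
SENSITIVE = {
    "location": {"my street", "my neighborhood", "zip code", "my city"},
    "job": {"my employer", "my company", "my boss"},
    "politics": {"voted", "election", "my party"},
}

def warn_about(draft: str) -> list[str]:
    """Return the sensitive categories a draft post appears to touch."""
    text = draft.lower()
    return sorted(cat for cat, words in SENSITIVE.items()
                  if any(w in text for w in words))

print(warn_about("My boss made us work late, then I voted after."))
```

A browser extension could run this on every textarea in a private tab and show a warning banner listing the flagged categories.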
Mitigations are pretty difficult. I understand it's kind of cool that some websites have really open APIs where you can just read everything, and some cool apps have used HN data in the past. But there should at least be some consideration that LLMs are now going to read everything and potentially discover things. Users might have thought this was protected by obscurity: who would read their 5-year-old comments?
How much would injecting noise and red herrings into pseudonymous posts help?
It seems like it would make sense to get in the habit of distorting your posts a bit: make random gender swaps (e.g. s/my husband/my wife/), drop hints that indicate the wrong city (s/I met my friend at Blue Bottle coffee/I met my friend at Coffee Bean/), maybe even use an LLM to fire off posts indicating false interests (e.g. some total crypto bro thing).
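The sed-style swaps above can be sketched as a small substitution pass. The decoy table here is purely illustrative, and a regex approach like this only catches exact phrasings; paraphrases would slip through:

```python
import re

# Hypothetical decoy table: each entry swaps a revealing detail
# for a plausible false one before a post is published.
DECOYS = [
    (r"\bmy husband\b", "my wife"),        # gender swap
    (r"\bBlue Bottle\b", "Coffee Bean"),   # wrong-venue hint
    (r"\bSan Francisco\b", "Seattle"),     # wrong-city hint
]

def distort(post: str) -> str:
    """Apply every decoy substitution to the post text."""
    for pattern, replacement in DECOYS:
        post = re.sub(pattern, replacement, post)
    return post

print(distort("I met my friend at Blue Bottle, my husband joined later."))
```

Of course, consistent decoys applied across all your posts just create a new (false) profile; random, per-post noise would be harder to aggregate.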
This is probably a good use case for something like OpenClaw. Have it take over your accounts and inject a bunch of non-offensive noise using a variety of personas to pollute their analysis. Meanwhile, you take your real thoughts and opinions underground.
To be clear, we are making a clear concession here that the people weren't truly anonymous. But we did use an LLM to remove identifying information from the HN data, making them quasi-anonymous; this is described further in Table 2 in the appendix.
We also run a more realistic test in section 2, using the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews our agent identified 9/125 people based on clues.
Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.
Yeah my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.
It’s the same thing as theft and locks. Any motivated attacker will overcome any rudimentary obstacle. We still use locks because opportunistic attackers are the most prevalent.
Even the paper on improved phishing showed that LLMs reduce the cost of running phishing attacks, which made previously unprofitable targets (lower-income groups) profitable.
The most common deterrent is inconvenience, not impossibility.
I agree that these accounts probably still contain more information on average than the average pseudonymous account. I think we could try using the LLM to ablate increasingly more information and see how its performance decays; to be clear, we already heavily remove such information (see Table 2 in the appendix). But I don't expect that to change the basic conclusions.
I also wonder how well the LLM would do with less direction, e.g. just asking it to analyze someone's posts and "figure out what city they live in based on everything you know about how to identify someone from online posts".
Over a long enough timeframe (often a couple of years at most), almost everyone online gives away too much information about themselves. A seemingly innocuous statement can pin you to an exact city, and so on.