We use semantic information inferred from comments and submissions. I think using stylometry would be a great addition, but it would be hard to google for "guy who writes fancifully with many puns" rather than "indie developer in Switzerland". I think stylometry is better suited to verification: once you have a small set of candidates, stylometry could further narrow them down and be used to make a decision.
We test different methods; in section 2 we use LLM agents to identify people agentically. We don't share any code here, but you could try it on yourself with various freely available agents.
That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5, and we try similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see whether LLMs beat Narayanan's method purely on movie reviews. LLMs are still much better (getting about 8%, though the average person only had 2.5 movies and 48% shared only one movie, so matching is very difficult).
Awesome, I saw the mention in the introduction but I haven't yet had a chance for a thorough read-through of the paper -- I've just skimmed it. Looking forward to reading it in depth!
We do advocate for stricter controls on data access on social platforms because of this. It's a bit of an unfortunate trade-off, but I think allowing mass scraping or bulk downloads of data from social sites opens the door to more and more kinds of misuse.
There is also a practical issue here: people usually don't write a lot on LinkedIn; most profiles are just structured biographical information. We use very limited stylometry in section 6, for matching Reddit users whom we synthetically split by time.
We don't use (much) stylometry, so this won't help. It's totally something you could try, but we rely on interests and clues: semantic information you reveal about yourself.
Thanks for providing the details, I've just been lazy about reading the paper :))
I'm not a fan of your proposed changes, as they further lock down platforms.
I'd like to see better tools for users to engage with. Maybe if someone is in a Firefox private (or anonymous) tab they should be warned when writing about locations, jobs, politics, etc. Even a small local LLM would be useful there; not foolproof, but an extra layer of checks. Paired with protection against stylometry :D
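As a crude stand-in for the local-model check described above, even a simple keyword heuristic could flag a draft post that mentions sensitive topics before it is submitted. The categories and keyword lists below are illustrative assumptions, not anything from the paper:

```python
# Hypothetical sensitive-topic keyword lists; a real tool would use a
# small local LLM instead of substring matching.
SENSITIVE = {
    "location": {"my street", "my neighborhood", "zip code", "my city"},
    "job": {"my employer", "my company", "my boss"},
    "politics": {"voted", "election", "my party"},
}

def warn_about(draft: str) -> list[str]:
    """Return the sensitive categories a draft post appears to touch."""
    text = draft.lower()
    return sorted(cat for cat, words in SENSITIVE.items()
                  if any(w in text for w in words))

print(warn_about("My boss made us work late, then I voted after."))
```

A browser extension could run this on every textarea in a private tab and show a warning banner listing the flagged categories.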
Mitigations are pretty difficult. I understand it's kind of cool that some websites have really open APIs where you can just read everything, and some cool apps have used HN data in the past. But there should at least be some consideration that LLMs are now going to read everything and potentially discover things. Users might have thought this was protected by obscurity: who would read their 5-year-old comments?
How much would injecting noise and red herrings into pseudonymous posts help?
It seems like it would make sense to get in the habit of distorting your posts a bit: make random gender swaps (e.g. s/my husband/my wife/), drop hints that indicate the wrong city (s/I met my friend at Blue Bottle coffee/I met my friend at Coffee Bean/), maybe even use an LLM to fire off posts indicating false interests (e.g. some total crypto bro thing).
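The sed-style swaps above can be sketched as a small substitution pass. The decoy table here is purely illustrative, and a regex approach like this only catches exact phrasings; paraphrases would slip through:

```python
import re

# Hypothetical decoy table: each entry swaps a revealing detail
# for a plausible false one before a post is published.
DECOYS = [
    (r"\bmy husband\b", "my wife"),        # gender swap
    (r"\bBlue Bottle\b", "Coffee Bean"),   # wrong-venue hint
    (r"\bSan Francisco\b", "Seattle"),     # wrong-city hint
]

def distort(post: str) -> str:
    """Apply every decoy substitution to the post text."""
    for pattern, replacement in DECOYS:
        post = re.sub(pattern, replacement, post)
    return post

print(distort("I met my friend at Blue Bottle, my husband joined later."))
```

Of course, consistent decoys applied across all your posts just create a new (false) profile; random, per-post noise would be harder to aggregate.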
This is probably a good use case for something like OpenClaw. Have it take over your accounts and inject a bunch of non-offensive noise using a variety of personas to pollute their analysis. Meanwhile, you take your real thoughts and opinions underground.
To be clear, we are making a clear concession here that the people weren't truly anonymous. But we did use an LLM to remove identifying information from the HN data, making them quasi-anonymous; this is described further in Table 2 in the appendix.
We also run a more realistic test in section 2, using the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews our agent identified 9/125 people based on clues.
Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.
Yeah my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.
It’s the same thing as theft and locks. Any motivated attacker will overcome any rudimentary obstacle. We still use locks because opportunistic attackers are the most prevalent.
Even the paper on improved phishing showed that LLMs reduce the cost of running phishing attacks, which made previously unprofitable targets (lower-income groups) profitable.
The most common deterrent is inconvenience, not impossibility.
I agree that these accounts probably still contain more information on average than the average pseudonymous account. I think we could try using the LLM to ablate increasingly more information and see how its performance decays; to be clear, we already heavily remove such information (see Table 2 in the appendix). But I don't expect that to change the basic conclusions.
I also wonder how well the LLM would do with less direction, e.g. just asking it to analyze someone's posts and "figure out what city they live in based on everything you know about how to identify someone from online posts".
Over a long enough timeframe (often a couple of years at most), almost everyone online gives away too much information about themselves. A seemingly innocuous statement can pin you to an exact city, and so on.