
A friend of mine co-runs a semi-popular, semi-niche news site (for more than a decade now), and complains that traffic has recently risen with bots masquerading as humans.

How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): it shows bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (which is now less than 50%) as they started to fall from the front page (feeding the vicious circle). Presumably, Google does not know or care it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.

Or, perhaps, Google spots the same anomalies that my friend (an old-school sysadmin who pays attention to logs) did, such as the increase in traffic along with never-before-seen popularity among iPhone users (who are so tech savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I’m not going to list all the telltale signs, as the crowd here is too hyped on LLMs (which is our going theory so far; it is very timely), but my friend hopes Google learns them quickly.

These newcomers usually fake the UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do sign themselves as bots in the UA), ignore robots.txt, and load many pages very quickly.
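FWIW, the "load many pages very quickly" part is easy to spot mechanically. A minimal sketch (the thresholds and the pre-parsed `(ip, timestamp)` input are assumptions; tune for your own logs):

```python
from collections import defaultdict, deque

def find_bursty_ips(records, max_hits=30, window_secs=60):
    """Flag IPs exceeding max_hits requests inside any sliding window.

    `records` is an iterable of (ip, unix_timestamp) pairs, assumed to be
    pre-parsed from an access log and sorted by time per IP.
    """
    windows = defaultdict(deque)
    flagged = set()
    for ip, ts in records:
        q = windows[ip]
        q.append(ts)
        # Drop hits that have fallen out of the window.
        while ts - q[0] > window_secs:
            q.popleft()
        if len(q) > max_hits:
            flagged.add(ip)
    return flagged
```

No single heuristic is conclusive on its own, but combined with the UA and robots.txt signals it narrows things down quickly.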

I would assume the bot traffic increase also applies to feeds, since they are just as useful for LLM training purposes.

My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website with actual original content (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection, given the domination of these creepy-crawlies.

Edit: Google already downranked them, not threatened to downrank. Also, traffic rose but did not skyrocket, but relative amount of bot traffic skyrocketed. (Presumably without downranking the traffic would actually skyrocket.)



Are you saying that Google down-ranked them in search engine rankings for user behaviour in AdWords? Isn't that an abuse of monopoly? It still surprises me a little bit.


Who's going to call them on it if it is?


Yeah, but then who is going to stop them acting monopolistic?

New administration is going to be monopoly friendly.

I was honestly pleased that Gaetz was nominated for AG solely because he's big on antitrust. Or has been.


Any sentiment expressed by the party which has dedicated itself to unrestricted corporate rights in this direction is an insincere attempt to pander to whatever culture war front they are fighting that week; in this case, likely something along the lines of 'Twitter censored Trump's hydroxychloroquine post - we MUST PUNISH THEM AND REIN IN BIG TECH [for not contributing to the fascist project]'.

EDIT: Direct quote - "The internet's hall monitors out in Silicon Valley, they think they can suppress us, discourage us. Maybe if you're just a little less patriotic. Maybe if you just conform to their way of thinking a little more, then you'll be allowed to participate in the digital world,"

This isn't an attempt to ensure freedom from monopoly, this is an attempt to enforce partisan control of the message, weaponizing the idea of free speech using force.

I can assert that the 'common public square' idea central to freedom of speech is disappearing, and that this is a bad thing, but that's not what this man has been arguing or why this man has chosen this issue.


If you believe their words (and I can't blame anyone who doesn't), apparently they want to lighten regulations on everything except big tech. So there may be a chance all those Google/Amazon cases will keep going on into the Trump administration.


To be clear this isn't because they have a problem with monopoly businesses abusing consumers. It's because big tech exercised their First Amendment rights in ways he found undesirable.

https://www.bbc.com/news/world-us-canada-57754435

Note that he's still talking about breaking up tech companies but not... X? (Surely that will resume once he and Elon have a falling out)


It's not that hard to dominate bots. I do it for fun, I do it for profit. Block datacenters. Run bot motels. Poison them. Lie to them. Make them have really really bad luck. Change the cost equation so that it costs them more than it costs you.

You're thinking of it wrong, the seeds of the thinking error are here: "I wonder how soon it becomes actually infeasible to operate a website with actual original content".

Bots want original content, no? So what's the problem with giving it to them? But that's the issue, isn't it? Clearly, contextually, what you should be saying is "I wonder how soon it becomes actually infeasible to operate a website for actual organic users" or something like that. But phrased that way, I'm not sure a CDN helps (I'm not sure they don't suffer false positives that interfere with organic traffic when they intermediate; it's more security theater, because hangings and executions look good, look at the numbers of enemy dead).

Take measures that any damn fool (or at least your desired audience) can recognize.

Reading for comprehension, I think Rachel understands this.


what is a bot motel and how do you run one?


An easy way is to implement, e.g., a 4xx handler which serves content with links that generate further 4xx errors, rewriting the status code to something like 200 before sending it to the requester. Load the garbage pages up with... garbage.
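A minimal stdlib sketch of the idea (the /motel/ path and page shape are made up; a real deployment would hang this off your server's error handling). Every would-be 404 answers 200 with a deterministic junk page whose links only lead to more junk:

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

def garbage_page(path, n_links=5):
    """Deterministic junk: same path, same junk, so it's cheap to serve."""
    seed = hashlib.sha1(path.encode()).hexdigest()
    links = "".join(
        '<a href="/motel/{}">{}</a>'.format(
            hashlib.sha1((seed + str(i)).encode()).hexdigest()[:12],
            seed[i:i + 8],
        )
        for i in range(n_links)
    )
    return "<html><body><p>{}</p>{}</body></html>".format(seed * 4, links)

class MotelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Anything that would normally 404 answers 200 with junk links,
        # so a crawler keeps descending instead of backing off.
        body = garbage_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("127.0.0.1", 8080), MotelHandler).serve_forever()
```

Because the labyrinth is derived from hashes, you never store it; it's generated on the fly, so the cost asymmetry favors you.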


Since this is getting upvoted, I will put forth a suggestion I've made to the people who've paid me to help with this sort of subterfuge: turn your 404 handler into search. Then a human who goes there has a way out. But absolutely, load it up with garbage and broken links.


Thanks, and you can make money with this? Sorry I'm a total noob in this area.


Not really... You cost the bots money.

Many are trying to index the web for whatever reason. By feeding them a Library of Babel, you can clog up their storage with noise.


Once in a while people pay you to do something you enjoy doing, like making people cry and wish they had a job flipping burgers instead. But I do it on my own systems for fun, honestly.


The idea is that bots are inflexible to deviations from accepted norms and can't actually "see" rendered browser content. So your generic 404 and 403 error pages return a 200 status instead, with invisible links to other inaccessible pages. The bots will follow the links but real users will not, trapping them in a kind of isolated labyrinth of recursive links (the URLs should be slightly different, though). It's basically how a lobster trap works, if you want a visual metaphor.

The important part here is to do this chaotically. The worst sites to scrape are buggy ones. You are, in essence, deliberately following bad practices in a way real users wouldn't notice but that would still influence bots.
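A sketch of the trap entrance (the /archive/ prefix and link count are placeholders): links hidden with CSS that real users never see or click, with URLs derived from the page's own URL so every page's entrances differ slightly:

```python
import hashlib

def trap_links(page_url, n=3):
    """Hidden labyrinth entrances: invisible to rendered-browser users,
    but a naive crawler parsing raw HTML will follow them."""
    h = hashlib.sha1(page_url.encode()).hexdigest()
    return "".join(
        '<a href="/archive/{}" style="display:none">.</a>'.format(
            h[i * 4:i * 4 + 10]
        )
        for i in range(n)
    )
```

Append the output to real pages; anything that requests one of those /archive/ URLs has outed itself and can be routed into the motel.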


QQBrowser users from Dallas are more likely to be Chinese using a VPN than bots, I would guess.


That much is clear, yeah. The VPN they use may not be a service advertised to the public and featured in lists, however.

Some of the new traffic did come directly from Tencent data center IP ranges and reportedly those bots signed themselves in UA. I can’t say whether they respect robots.txt because I am told their ranges were banned along with robots.txt tightening. However, US IP bots that remain unblocked and fake UA naturally ignore robot rules.


> The VPN they use may not be a service advertised to public and featured in lists, however.

Well, of course not, since the service is illegal.


I'm seeing some address ranges in the US clearly serving what must be VPN traffic from Asia, and I'm also seeing an uptick in TOR traffic looking for feeds as well as WP infra.


At my company we have seen a massive increase in bot traffic since LLMs became mainstream. Blocking known OpenAI and Anthropic crawlers has decreased traffic somewhat, so I agree with your theory.
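For the well-behaved subset, the published crawler names can be refused in robots.txt; GPTBot, ClaudeBot, and CCBot are the documented user-agent tokens for OpenAI, Anthropic, and Common Crawl respectively (check the vendors' own docs for the current list). A minimal fragment:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

The masquerading bots discussed upthread ignore robots.txt entirely, so this only trims the honest traffic; the rest needs UA/IP filtering at the server.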


I don’t think it’s a bot thing. Traffic is down for everyone and especially smaller independent websites. This year has been really rough for some websites.


I think it's also because a lot of sites have started paywalling. So users walk away.


I too found an extremely unlikely % of iPhone users when checking access logs.


> who are so tech savvy that they apparently do not require CSS

Lmao!


Here's Crime^H^H^H^H^(ahem)Cloudflare requesting assets from one of my servers. I don't use Cloudflare; they have no business doing this.

  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon-precomposed.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /favicon.ico HTTP/1.1" 200 302 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /dubai-letters/balkanized-internet.html HTTP/1.1" 200 16370 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0"
  104.28.42.8 - - [21/Dec/2024:13:58:35 -0800] consulting.m3047.net "GET /apple-touch-icon.png HTTP/1.1" 404 980 "-" "NetworkingExtension/8620.1.16.10.11 Network/4277.60.255 iOS/18.2"

  # dig -x 104.28.42.8

  ; <<>> DiG 9.12.3-P1 <<>> -x 104.28.42.8
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 35228
  ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 1280
  ; COOKIE: 6b82e88bcaf538fc7ab9d44467685e82becd47ff4492b1be (good)
  ;; QUESTION SECTION:
  ;8.42.28.104.in-addr.arpa.      IN      PTR

  ;; AUTHORITY SECTION:
  28.104.in-addr.arpa.    3600    IN      SOA     cruz.ns.cloudflare.com. dns.cloudflare.com. 2288625504 10000 2400 604800 3600

  ;; Query time: 212 msec
  ;; SERVER: 127.0.0.1#53(127.0.0.1)
  ;; WHEN: Sun Dec 22 10:46:26 PST 2024
  ;; MSG SIZE  rcvd: 176
Further osint left as an exercise for the reader.


104.28.42.0/25 is one of the IP ranges used by Apple's Private Relay (via Cloudflare).

https://github.com/hroost/icloud-private-relay-iplist/blob/m...

(There is also a list of ranges on Apple's site, but I forget where…)

Edit: found it https://mask-api.icloud.com/egress-ip-ranges.csv


What is the issue with this request?


> What is the issue with this request?

I didn't realize this was an Apple thing, but that's fine. It changes the color of the horse and the name of the river, but the same road leads to the same destination.

1) There is a notion that Cloudflare is a content distribution network. The risk profile for a content distribution network is different from a VPN service. Now I know it's a VPN service (or is it?). Changes it from "seems weird and inappropriate" to "do I care about people relying on this? no, probably not". Cloudflare can't be arsed to provide reverse DNS for something which is clearly not part of their CDN, or is it?

1.5) Is it layer 2 or application? Cloudflare runs a CDN. Correct me if I'm wrong, but the CDN is a reverse proxy, is it not? Is Cloudflare caching my website's content? Can they observe it? (It's surprisingly hard to find a solid explanation, but they talk about "proxies" and "decrypts the name of the website you requested", and none of that adds clarity; it makes it sound more like "believe what we want you to believe".)

2) I don't block incoming SYNs from Cloudflare (yet) the way I do with Amazon, and this traffic per se isn't going to trip any mitigations here. But not all of the traffic is as benign (and it's impressive that they're so technically savvy they don't need the CSS as noted elsewhere). Presumably those exit points are shared by multiple customers. Did I mention I block all incoming SYNs from Amazon?


> and it's impressive that they're so technically savvy they don't need the CSS as noted elsewhere

With the logs you provided, they appear to be coming from within iMessage.

So when someone posts a link in iMessage, it will fetch the favicon(s) and the HTML in order to generate a “preview” of the page with the title of the page and one of the favicons. It doesn’t need to fetch any CSS files to do this.

Not saying bad actors don’t fetch css either, but the lack of it being fetched doesn’t mean that it’s a bad actor.

As for why CF don’t reverse-DNS their IPs stating it’s iCloud Private Relay: well, CF are not Apple’s only 3rd-party egress provider (Akamai is another that springs to mind). Since the number of providers can change at any time, the best source of information about valid egress providers is Apple themselves.

But Apple do also publish these changes to geo-location databases for you to query; for example, https://www.ip2location.com/demo/104.28.42.8 lists it as iCloud Private Relay.

As for “are Cloudflare caching my site when run through Private Relay?”, I’m not 100% sure. I’ll have to check my own logs and can’t be arsed right now, but I don’t think so (it’s been a while since I ran tests on it to see how it behaved, so I can’t be 100% sure right this minute).

But I think it would be silly of them if they did, as they may not be aware of what to cache and for whom. Let’s say they cached /profile without knowing what the server uses to determine who the logged-in user is; they might get a false cache hit and leak data from a previous request. When they act as your site’s CDN you explicitly tell them what to cache on, but when acting as a relay (either for Apple or their own WARP product) for a site they are not a CDN for, they are missing this info. Sure, they could guess, but why risk being wrong?


Thanks for the explanation.



