
The Internet Archive has roughly the same problem with the internet as space travel has with space: there's just so unimaginably much of it. You'd think you have some kind of grasp of how much there is, but it isn't anywhere close to tangible.


There should be a tool that runs in the background of your web browser and, for every page you visit, captures it and uploads it to some sort of archive. It would be anonymous, and it would have some way to prevent accidentally uploading bank details or other confidential information (this would have to work perfectly by default, so the best approach is probably a global whitelist of sites like Reddit, news, etc., curated by trusted people).
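The default-deny whitelist idea could be sketched like this. Everything here is hypothetical: the domain list and function names are illustrative, not from any real extension.

```python
from urllib.parse import urlparse

# Hypothetical curated whitelist of archive-safe domains.
SAFE_DOMAINS = {"reddit.com", "news.ycombinator.com", "bbc.co.uk"}

def should_capture(url: str) -> bool:
    """Capture a page only if its host is on the curated whitelist.

    Default-deny: anything not explicitly whitelisted (bank sites,
    webmail, intranets) is never uploaded.
    """
    host = urlparse(url).hostname or ""
    # Match the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in SAFE_DOMAINS)

print(should_capture("https://old.reddit.com/r/programming"))  # True
print(should_capture("https://online.mybank.example/login"))   # False
```

The point of default-deny is that a forgotten entry costs coverage, not privacy.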

Maybe this tool already exists, maybe it’s what ArchiveTeam uses, but more people should use it.

With enough people, you'd have an archive of the websites people actually visit (well, the people who use the tool). With a few people it would only cover the most popular sites plus some outliers; with more people and more time it would start reaching more niche content.

Furthermore, sacrifice some anonymity (at minimum you'd need some identity verification to prevent trivial SEO manipulation; you'd probably also want region and other broad characteristics for filtering) and you have a search engine. You can scrape the archived sites for keywords; you can determine how popular a site is by how many people visited it; and you can figure out whether two sites are related by the probability that someone who visits site A also visits site B.
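The popularity and relatedness signals above are easy to estimate from per-user visit sets. A minimal sketch, with a toy made-up log:

```python
# Toy visit log: user id -> set of sites visited (illustrative data).
visits = {
    "u1": {"a.com", "b.com"},
    "u2": {"a.com", "b.com", "c.com"},
    "u3": {"a.com"},
    "u4": {"c.com"},
}

def popularity(site: str) -> int:
    """Popularity = number of distinct users who visited the site."""
    return sum(site in seen for seen in visits.values())

def relatedness(a: str, b: str) -> float:
    """Estimate P(user visits b | user visits a) from the log."""
    visited_a = [seen for seen in visits.values() if a in seen]
    if not visited_a:
        return 0.0
    return sum(b in seen for seen in visited_a) / len(visited_a)

print(popularity("a.com"))            # 3
print(relatedness("a.com", "b.com"))  # 0.666... (2 of the 3 a.com visitors)
```

Counting distinct users rather than raw page loads is what makes the verified-identity requirement matter: without it, one bot can inflate both signals.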


I use an app on macOS to do this. It's called History Book. There are also browser extensions. And my bookmark manager also submits anything I save to the Wayback Machine.


> news etc. curated by trusted people

I don't trust anyone who is claimed to be trusted on the internet, though. I'd rather we archive everything, even the bad, as long as it's not illegal. Otherwise you wind up with my initial issue: you don't get everything, and you miss small things that might seem insignificant but hold more value than you'd ever realize.


> "The Internet Archive has roughly the same problem with the internet as space travel has with space, there's just so unimaginably much of it. You'd think you have some kind of a grasp of how much of it there is but it isn't anywhere close to tangible."

And what you describe is actually even more true than people begin to grasp, because the Internet is so much more than just the "world wide web" that most people instantly think of when they hear the word "Internet".


I don't think so. Most of the web is behind a login and/or unlinked, so you're left with the "open web". That part is much smaller, so archiving a meaningful fraction of it isn't impossible; it's pretty tangible, especially the useful parts.


The meaningful part of the open web is small, yes. Sadly there are so many junk pages, nowadays partially generated by AI, previously made just by randomly copy-pasting content from other pages, cluttering search results. It all somehow needs to be filtered out, otherwise it'll end up taking the place of something more useful. So I'd really wonder how much of the open web is original content of some kind and how much is duplicated or auto-generated junk.
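One classic way to flag the copy-paste duplicates is word shingling plus Jaccard similarity. A minimal sketch (the thresholds and sample strings are made up for illustration):

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping k-word shingles, a standard dedup primitive."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

original = "how to archive the open web with a distributed crawler and shared storage"
scraped  = "how to archive the open web with a distributed crawler and cheap shared storage"

print(jaccard(original, scraped))  # high: near-duplicate despite one inserted word
print(jaccard(original, "completely unrelated recipe for banana bread"))  # 0.0
```

Real crawlers use hashed variants (MinHash, SimHash) so this comparison scales to billions of pages, but the underlying signal is the same.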


I'm not convinced. There are various estimates of how large the internet is, with varying confidence, but most I found average around a few hundred zettabytes. The Internet Archive seems to be in the ballpark of a hundred petabytes. So unless I got it wrong, the archive currently covers on the order of 0.00005% of the whole thing. How much we'd need to cover the useful bits is a separate discussion, of course.
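The back-of-the-envelope arithmetic, using those two ballpark figures (neither is an authoritative number):

```python
# Rough public estimates, not authoritative figures.
internet_bytes = 200e21  # ~a couple hundred zettabytes (1 ZB = 1e21 bytes)
archive_bytes = 100e15   # ~a hundred petabytes (1 PB = 1e15 bytes)

coverage = archive_bytes / internet_bytes
print(f"{coverage * 100:.5f}%")  # 0.00005%
```

Since a zettabyte is a million petabytes, 100 PB against a few hundred ZB works out to roughly one part in two million.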



