On the technical rather than practical side, how does it prevent garbage data or advertising from being added to the collective crawl?

How about using this type of technology for something Google can't do (for legal reasons)? Say, for example, full-text search of the Library Genesis archive? Over Tor or some such?