
What I find most effective is to wrap `get` with a local cache, and this is the first thing I write when I start a web crawling project. From the very beginning, even while I'm just exploring and experimenting, every page gets downloaded to my machine only once. This way I don't accidentally bother the server too much, and I don't have to re-crawl if I make a mistake in my code.
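A minimal sketch of what such a wrapper can look like, using only the standard library (`urllib` instead of requests, but the idea is identical); the cache directory name and hashing scheme here are just illustrative choices, not anything prescribed by the parent:

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("crawl_cache")


def cached_get(url: str) -> bytes:
    """Fetch a URL, but hit the network at most once per URL."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Hash the URL so the cache filename is safe and fixed-length.
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / key
    if path.exists():
        # Cache hit: serve the previously downloaded body, no request made.
        return path.read_bytes()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    # Cache miss: store the body on disk for every later call.
    path.write_bytes(body)
    return body
```

During exploration you then call `cached_get(url)` everywhere you would call `get`, and re-running a buggy parsing script costs zero extra requests.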


requests-cache [0] is an easy way to do this if you're using the requests package in Python. You can patch requests with

  import requests_cache
  requests_cache.install_cache('dog_breed_scraping')
and responses will be stored in a local SQLite file.

[0] https://requests-cache.readthedocs.io/en/stable/

