Hacker News

Kind of a tangent, but Playwright's documentation (specifically, the intro https://playwright.dev/python/docs/intro ) confuses me. It asks you to write a test and then run `pytest`, instead of just letting you use the library directly (which exists, but is buried in the main text: https://playwright.dev/python/docs/library).
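For comparison, this is roughly what the library-only usage looks like (the sync API from the library page linked above) -- no pytest involved. It assumes you've run `pip install playwright` and `playwright install` to download the browser binaries:

```python
from playwright.sync_api import sync_playwright

# Plain-library usage: launch a browser, open a page, read its title.
# No test runner, no fixtures -- just a script.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://playwright.dev")
    print(page.title())
    browser.close()
```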

I understand that using Playwright in tests is probably the most common use case (it's even in their tagline), but ultimately the introduction section of a library should be about the library itself, not a specific scenario that pairs it with a third-party library (`pytest`). Especially when that can cause side effects. I wasn't exactly "bitten" by it, but it was certainly confusing: when I was learning it, I created test_example.py as instructed, in a folder that already contained a batch of other test_xxxx.py files. Running `pytest` caused all of them to run, which gave confusing output. That wasn't obvious to me at all, since I'd never used pytest before, and this isn't documentation about pytest, so no additional context was given.
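The surprise here is pytest's default test discovery: with no arguments, it collects every file matching `test_*.py` or `*_test.py` in the current directory and below, not just the file you created. A small sketch of that matching rule (file names here are made up for illustration; running `pytest test_example.py` instead would limit collection to one file):

```python
from fnmatch import fnmatch

# pytest's default collection patterns for test files.
PATTERNS = ("test_*.py", "*_test.py")

def collected(filenames):
    """Return the files a bare `pytest` run would collect, in order."""
    return [f for f in filenames
            if any(fnmatch(f, pat) for pat in PATTERNS)]

# A folder with your new file plus unrelated leftovers:
files = ["test_example.py", "test_old_stuff.py", "helpers.py", "scrape_test.py"]
print(collected(files))
# -> ['test_example.py', 'test_old_stuff.py', 'scrape_test.py']
```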

> tagline



Hah yeah that's confusing.

https://playwright.dev/python/docs/intro is actually the documentation for pytest-playwright - their pytest plugin.

https://playwright.dev/python/docs/library is the documentation for their automation library.

I just filed an issue pointing out that this is confusing. https://github.com/microsoft/playwright/issues/29579


Back in the day I used to use HTMLUnit

https://htmlunit.sourceforge.io/

to crawl JavaScript-based sites from Java. I think it was originally intended for integration tests, but it sure works well for webcrawlers.

I just wrote a Python-based webcrawler this weekend for a small set of sites. It's connected to a bookmark manager (you bookmark a page, it crawls related pages, builds database records, copies images, etc.), and I had a very easy time picking out relevant links, text and images with CSS selectors and BeautifulSoup. This time I used a database to manage the frontier because the system is interactive (you add a new link and it ought to get crawled quickly), but for a long time my habit was writing crawlers that read the frontier for pass N from a text file with one URL per line and then write the frontier for pass N+1 to another text file. That kind of crawler is not only simple to write, it also doesn't get stuck in web traps.

I have a few of these systems that do very heterogeneous processing of mostly scraped content, and I sometimes think about setting up a Celery server to break the work up into tasks.


Agreed, Playwright is great. It even has device emulation profiles built in, so you can, for instance, use an iPhone profile with the right screen size/browser/metadata automatically.
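Those profiles are exposed as a `devices` dictionary on the Playwright object; spreading a descriptor into `new_context` applies the viewport, user agent, touch support, and so on in one go. A minimal sketch (needs `pip install playwright` and `playwright install`; "iPhone 13" is one of the bundled profile names):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    iphone = p.devices["iPhone 13"]      # viewport, user agent, touch, scale...
    browser = p.webkit.launch()          # iPhone profiles default to WebKit
    context = browser.new_context(**iphone)
    page = context.new_page()
    page.goto("https://example.com")
    print(page.viewport_size)            # the iPhone-sized viewport
    browser.close()
```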



