Hacker News

Kind of a tangent, but Playwright's documentation (specifically, the intro https://playwright.dev/python/docs/intro ) confuses me. It asks you to write a test and then run `pytest`, instead of just letting you use the library directly (which exists, but is buried in the main text: https://playwright.dev/python/docs/library).
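For comparison, this is roughly what the library-only usage looks like (the sync API from the library page linked above) -- no pytest involved. It assumes you've run `pip install playwright` and `playwright install` to download the browser binaries:

```python
from playwright.sync_api import sync_playwright

# Plain-library usage: launch a browser, open a page, read its title.
# No test runner, no fixtures -- just a script.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://playwright.dev")
    print(page.title())
    browser.close()
```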

I understand that using Playwright in tests is probably the most common use case (it's even in their tagline), but ultimately the introduction section of a library should be about the library itself, not a specific scenario that pairs it with a third-party library (`pytest`). Especially when that can cause side effects. I wasn't exactly "bitten" by it, but it was certainly confusing: when I was learning it, I created test_example.py as instructed, in a folder that already contained a batch of other test_xxxx.py files. Running `pytest` caused all of them to run, which gave confusing output. That wasn't obvious to me at all, since I'd never used pytest before, and this isn't documentation about pytest, so no additional context was given.
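The surprise here is pytest's default test discovery: with no arguments, it collects every file matching `test_*.py` or `*_test.py` in the current directory and below, not just the file you created. A small sketch of that matching rule (file names here are made up for illustration; running `pytest test_example.py` instead would limit collection to one file):

```python
from fnmatch import fnmatch

# pytest's default collection patterns for test files.
PATTERNS = ("test_*.py", "*_test.py")

def collected(filenames):
    """Return the files a bare `pytest` run would collect, in order."""
    return [f for f in filenames
            if any(fnmatch(f, pat) for pat in PATTERNS)]

# A folder with your new file plus unrelated leftovers:
files = ["test_example.py", "test_old_stuff.py", "helpers.py", "scrape_test.py"]
print(collected(files))
# -> ['test_example.py', 'test_old_stuff.py', 'scrape_test.py']
```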

> tagline



Hah yeah that's confusing.

https://playwright.dev/python/docs/intro is actually the documentation for pytest-playwright - their pytest plugin.

https://playwright.dev/python/docs/library is the documentation for their automation library.

I just filed an issue pointing out that this is confusing. https://github.com/microsoft/playwright/issues/29579


Back in the day I used to use HTMLUnit

https://htmlunit.sourceforge.io/

to crawl JavaScript-based sites from Java. I think it was originally intended for integration tests, but it sure works well for webcrawlers.

I just wrote a Python-based webcrawler this weekend for a small set of sites. It's connected to a bookmark manager (you bookmark a page, it crawls related pages, builds database records, copies images, etc.), and I had a very easy time picking out relevant links, text and images with CSS selectors and BeautifulSoup. This time I used a database to manage the frontier because the system is interactive (you add a new link and it ought to get crawled quickly), but for a long time my habit was writing crawlers that read the frontier for pass N from a text file with one URL per line and then write the frontier for pass N+1 to another text file. That kind of crawler is not only simple to write, it also doesn't get stuck in web traps.

I have a few of these systems that do very heterogeneous processing of mostly scraped content, and I sometimes think about setting up a Celery server to break the work up into tasks.


Agreed, Playwright is great. It even has device emulation profiles built in, so you can, for instance, use an iPhone profile with the right screen size/browser/metadata automatically.
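Those profiles are exposed as a `devices` dictionary on the Playwright object; spreading a descriptor into `new_context` applies the viewport, user agent, touch support, and so on in one go. A minimal sketch (needs `pip install playwright` and `playwright install`; "iPhone 13" is one of the bundled profile names):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    iphone = p.devices["iPhone 13"]      # viewport, user agent, touch, scale...
    browser = p.webkit.launch()          # iPhone profiles default to WebKit
    context = browser.new_context(**iphone)
    page = context.new_page()
    page.goto("https://example.com")
    print(page.viewport_size)            # the iPhone-sized viewport
    browser.close()
```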



