Files
botlib/examples/web-scraper/README.md
2022-04-08 20:22:08 +02:00

36 lines
1.8 KiB
Markdown

# How-to web scraping
Use the `playground.py` for quick testing.
Initially, you have to set `cache_only=False` or otherwise no data is downloaded.
After the first download, re-enable `cache_only` so you don't have to download the data over and over again.
Also, when you feel ready, uncomment the `break` statement to see if it works for all entries.
## Finding a proper `select`
The hardest part is getting all regex matches right.
Open the browser devtools and choose the element picker.
Hover over the first element / row of the data you'd like to retrieve.
Pick whatever tag or class seems apropriate, also look at neighboring tags.
The `select` must match all entries but no unnecessary ones.
Although you can always filter unnecessary ones later...
## Finding the regex
The matches for the individual data fields are tricky too.
Select and right-click on the element you picked above.
Important: Either edit or copy as raw HTML.
The devtools will omit whitespace and display `'` as `"`, so you have to make sure you know what you are trying to match.
Now begins the playing around part.
The regex will match the first occurrence, so if there are two anchor tags and you need the second one, you have to get creative.
For example, this is the case in the craigslist example.
Here I can match the second anchor because it is contained in a `h3` heading.
Try to match as compact as possible, this makes it more robust against source code changes.
For example, use `<a [^>]*>` to match an opening anchor with arbitrary attributes.
Some sites will put the `href` immediatelly after `<a`, other somewhere in between.
Be creative.
Use `[\s\S]*?` to match anything (instead of just `.`), including whitespace and newlines.
And finally, have at least one matching group (`()`).
Note: whitespace will be stripped from the matching group.