Having written one or ten people scrapers this week, I’ve come to realize a few things. Posted here so I have a reference for the next time it comes up.
- Requests + BeautifulSoup is a great combo.
- You’re still going to want to keep re handy.
- You’ll want Unidecode too.
- You’re probably thinking a straight dict for a person, but go with defaultdict. Use an empty string for the default, e.g.
person = defaultdict(lambda: ''). Missing fields then read as '' instead of raising KeyError, which makes it easy to normalize across sites.
- Man, there’s a lot of broken HTML out there.
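
For future me, here is a rough sketch of the defaultdict idea using only the standard library. The helper name new_person, the field list, and the two sample "sites" are made up for illustration; the real records would come from whatever Requests + BeautifulSoup pulled out of the page.

```python
import re
from collections import defaultdict


def new_person(**fields):
    """Build a person record whose missing fields read as ''.

    new_person is a hypothetical helper, not part of any library.
    """
    person = defaultdict(str)  # str() returns '', same effect as lambda: ''
    for key, value in fields.items():
        # Collapse runs of whitespace left over from scraped HTML.
        person[key] = re.sub(r"\s+", " ", value).strip()
    return person


# Two made-up sites that expose different fields for a person.
site_a = new_person(name="  Ada \n Lovelace ", email="ada@example.com")
site_b = new_person(name="Alan Turing", phone="555-0100")

# One shared column list; any field a site lacks comes back as ''.
FIELDS = ["name", "email", "phone"]
rows = [[p[f] for f in FIELDS] for p in (site_a, site_b)]
```

One thing to remember: looking up a missing key on a defaultdict also stores the default under that key, so after building rows each person has all three fields set.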