Web Scraping with AWS Lambda
Web scraping is quite easy, but scaling it is a different story. My scraper worked perfectly well at a scale of thousands of pages. To go beyond that, we had to take serious measures to scale it up. Our requirements were rotating IPs, minimal cost and, of course, scalability.
Code Trigger — API Gateway — Lambda
So we refactored our code to fit into a Lambda function (FaaS). With the new approach we scraped around a million pages per day.
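As a rough illustration of what "fitting the scraper into a Lambda" can look like, here is a minimal sketch of a handler behind API Gateway. It is not our production code: the `url` field in the request body, the `fetch` helper and the `fetch_fn` parameter are assumptions made for this example, and it assumes the API Gateway proxy integration, which delivers the request body as a JSON string under `body`.

```python
import json
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return (status_code, body_text)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def lambda_handler(event, context, fetch_fn=fetch):
    # Lambda calls this with (event, context); fetch_fn (hypothetical) only
    # exists so the handler can be exercised locally without network access.
    payload = json.loads(event.get("body") or "{}")
    url = payload.get("url")  # "url" is an assumed field name
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing url"})}

    status, html = fetch_fn(url)
    # Return only a small summary here; real code would parse or store the HTML.
    result = {"url": url, "status": status, "length": len(html)}
    return {"statusCode": 200, "body": json.dumps(result)}
```

Keeping the download behind a small function makes it easy to try the handler on a laptop with a stub before deploying it.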
Owing to a client requirement, we tried it against a different geographic location. The extraction rate was fine initially, but then it deteriorated, and we ran into a lot of CAPTCHAs.
Below are the approaches we tried:
- Running our Lambda in the location we were trying to scrape: no improvement.
- Rotating cookies: not much improvement.
What was the exact problem?
- Warm starts. It looks like AWS changes a Lambda's IP less frequently than we require. So if I fire 100 parallel requests, there is a high possibility of all 100 coming from the same IP.
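One way to check this suspicion is to have each invocation report its outbound IP and then count how many distinct addresses show up. The sketch below is only an illustration: `egress_ip` and `summarize_ips` are hypothetical helper names, and it assumes the Lambda can reach `https://checkip.amazonaws.com`, which returns the caller's public IP as plain text.

```python
from collections import Counter
import urllib.request


def egress_ip(timeout=5):
    """Return the public IP this process appears to come from."""
    with urllib.request.urlopen("https://checkip.amazonaws.com", timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def summarize_ips(ips):
    """Summarize how concentrated a batch of requests is on a single IP."""
    if not ips:
        raise ValueError("no IPs collected")
    counts = Counter(ips)
    top_ip, top_count = counts.most_common(1)[0]
    return {
        "requests": len(ips),
        "distinct_ips": len(counts),
        "top_ip": top_ip,
        "top_share": top_count / len(ips),  # fraction of requests from the busiest IP
    }
```

If each of the 100 parallel invocations returns `egress_ip()` along with its result, feeding the collected values to `summarize_ips` shows at a glance whether they really share one or two addresses.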
Any possible solution?
- Try forcing a cold start by adding adequate sleep time between batches. There is no fixed idle time after which a cold start occurs; in our experience it can vary from 5–60 minutes. This won't solve the problem entirely, but it does let us change the IP after an interval.
- Another approach is to deploy your scraper as n different Lambdas and rotate between them.
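The second idea, rotating across n Lambdas, could be sketched roughly as follows. This is a minimal example under stated assumptions, not a definitive implementation: `LambdaRotator`, the function names and the `{"url": ...}` payload shape are hypothetical, and the functions are invoked directly with boto3's `invoke` call rather than through API Gateway. Separate functions are not guaranteed to get separate IPs, so it is worth measuring the result.

```python
import itertools
import json


class LambdaRotator:
    """Send each scrape request to the next Lambda function in a fixed cycle."""

    def __init__(self, function_names, client=None):
        if not function_names:
            raise ValueError("need at least one function name")
        if client is None:
            # Imported lazily so the class can be tried out with a stub client.
            import boto3
            client = boto3.client("lambda")
        self._client = client
        self._names = itertools.cycle(function_names)

    def scrape(self, url):
        name = next(self._names)  # round-robin choice of function
        response = self._client.invoke(
            FunctionName=name,
            InvocationType="RequestResponse",  # synchronous call, waits for the result
            Payload=json.dumps({"url": url}).encode("utf-8"),
        )
        # The "Payload" value in the response is a stream holding the function's return value.
        return name, json.loads(response["Payload"].read())
```

For example, `LambdaRotator(["scraper-a", "scraper-b", "scraper-c"])` would send the first URL to `scraper-a`, the second to `scraper-b`, and so on, wrapping around after the last one. Round-robin is the simplest policy; picking a function at random would work just as well here.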