Web Scraping with AWS Lambda

reena .m
1 min read · Oct 12, 2019

Web scraping is fairly easy, but scaling it is a different story. My scraper worked perfectly well at a scale of thousands of pages, so we set out to scale it up properly. Our requirements were rotating IPs, minimal cost and, of course, scalability.

Code Trigger — API Gateway — Lambda

So we refactored our code to run in AWS Lambda (FaaS). With the new approach we scraped around a million pages per day.
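
To make the shape of that refactor concrete, here is a minimal sketch of a scraper packaged as a Lambda handler behind an API Gateway proxy integration. It is not our production code: the `url` field in the request body, the `fetch_page` helper and the `fetch` parameter are assumptions made for illustration, and real parsing logic would replace the page-length placeholder.

```python
import json
import urllib.request


def fetch_page(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def lambda_handler(event, context, fetch=fetch_page):
    """Entry point: expects a JSON body such as {"url": "..."} (assumed shape)."""
    body = json.loads(event.get("body") or "{}")
    url = body.get("url")
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'url'"})}

    html = fetch(url)
    # Placeholder for real extraction: only the page size is reported here.
    return {"statusCode": 200, "body": json.dumps({"url": url, "length": len(html)})}
```

Lambda only ever passes `event` and `context`; the extra `fetch` argument exists so the handler can be exercised locally with a fake downloader.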

Owing to a client requirement, we tried it against a different geographic location. The extraction rate was fine at first but then deteriorated, and we found that many requests were hitting CAPTCHAs.

Below are the approaches we tried:

  • We tried running our Lambda in the region we were trying to scrape: no improvement
  • We tried rotating cookies: not much improvement

What was the exact problem?

  • Warm starts. Lambda reuses warm execution environments, and it looks like their outbound IPs change less often than we need. So if I fire 100 parallel requests, there is a high chance that all 100 come from the same IP.
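
One way to check this suspicion is to have each invocation report its outbound IP and then count how many distinct addresses a batch actually used. The sketch below assumes the IP is read from `https://checkip.amazonaws.com`, which returns the caller's address as plain text; the function names are hypothetical and any similar echo service would do.

```python
from collections import Counter
import urllib.request


def current_public_ip(timeout=5):
    """Return the outbound IP of the environment this code runs in."""
    with urllib.request.urlopen("https://checkip.amazonaws.com", timeout=timeout) as resp:
        return resp.read().decode().strip()


def summarize_ips(ips):
    """Summarize how concentrated a batch of requests was across IPs."""
    counts = Counter(ips)
    return {
        "requests": len(ips),
        "distinct_ips": len(counts),
        # (ip, hits) for the busiest address, or None for an empty batch.
        "most_common": counts.most_common(1)[0] if counts else None,
    }
```

Calling `current_public_ip()` inside the handler and collecting the values from 100 parallel invocations gives the input for `summarize_ips`. A `distinct_ips` value far below `requests` would match what we saw.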

Any possible solution?

  • Try forcing a cold start by adding adequate sleep time between batches. There is no fixed idle time after which a cold start occurs; it can vary from 5–60 minutes. This won't solve the problem entirely, but it lets us change the IP after a short interval.
  • Another approach is to deploy the scraper as n different Lambda functions and rotate between them.
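
The second idea can be sketched as a simple round-robin over function names. This is an illustration under assumptions, not a tested fix: the function names and the `{"url": ...}` payload are hypothetical, and whether separate functions really get different outbound IPs is something to measure rather than take for granted. The boto3 call used is the standard `invoke` on the Lambda client.

```python
import itertools
import json


def make_rotator(function_names):
    """Return a callable that hands out function names in round-robin order."""
    pool = itertools.cycle(function_names)
    return lambda: next(pool)


def invoke_with_boto3(function_name, payload):
    """Synchronously invoke one Lambda function and decode its JSON result."""
    import boto3  # imported lazily so the rotation logic has no AWS dependency

    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def scrape_all(urls, function_names, invoke=invoke_with_boto3):
    """Send each URL to the next function in the rotation."""
    next_name = make_rotator(function_names)
    results = []
    for url in urls:
        name = next_name()
        results.append((name, invoke(name, {"url": url})))
    return results
```

For example, `scrape_all(urls, ["scraper-1", "scraper-2", "scraper-3"])` spreads consecutive URLs across three deployments of the same code. Passing a different `invoke` callable makes the rotation easy to try out without touching AWS.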
