Distributed Web Crawler - SaaS

The scraping/crawling infrastructure for Developers

The distributed web crawler is a SaaS service living in the cloud of several large providers such as Amazon AWS and Microsoft Azure.

The service lets you control real browsers such as Chromium with a browsing fingerprint that is indistinguishable from a browser operated by a human. You define a simple scraping/crawling function, which is then replicated on many thousands of worker nodes around the world. When the workers finish, the results are joined and merged into a single, convenient file.
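To give a feeling for what such a function can look like, here is a standalone Node.js sketch. The search URL, the function signature and the item list are illustrative choices only, not the exact interface expected by the platform; the real function format is shown in the GitHub examples further down.

const puppeteer = require('puppeteer');

// Illustrative sketch: the kind of per-item scraping function you define.
// On the platform, every worker runs this for one item; here we run it
// locally over all items to mimic the "replicate and join" behaviour.
async function crawl(page, item) {
  await page.goto('https://duckduckgo.com/?q=' + encodeURIComponent(item), { waitUntil: 'networkidle2' });
  return { item, title: await page.title() };
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const results = [];
  for (const item of ['french press', 'samsung galaxy']) {
    results.push(await crawl(page, item));
  }
  await browser.close();
  console.log(JSON.stringify(results, null, 2)); // joined results
})();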

Sign Up to use Web Crawler

Use Cases

Why should you use the Distributed Crawler?

Many businesses have a huge need for publicly available information from the Internet. Often, the IT department of such a company is assigned to obtain the data.
But creating a robust, resilient and intelligent crawling infrastructure with real browsers is anything but easy. For this reason, it is often more economically viable to use a proven SaaS product.

The following use cases require a scalable and resilient distributed crawling infrastructure.

Process Automation

Any process that can be carried out with a browser can be automated. Think of boring, repetitive tasks that need to be automated via API from different geographical regions.

Market Research

In a competitive landscape, information about the marketplace and your competitors gives you a decisive edge.

Content Monitoring

A distributed crawling infrastructure makes it possible to monitor changes in websites automatically and to fire appropriate events.

Scientific Data Collection

Scientists often need data for their empirical analysis. For example, the very first machine learning translation services were trained with data that was scraped from bilingual websites.

Examples

A few examples are presented in order to illustrate the use cases for the distributed web crawler infrastructure. The corresponding API calls are also included. When you decide to sign up, you can start using the cloud crawler API.

Scraping Product Data from Amazon

This Amazon scraping function hosted on GitHub scrapes the product name, product URL and product price for an arbitrary number of product search keywords.

The Amazon scraping process can be launched with this API call:

{
  "invocation_type": "request_response",
  "worker_type": "browser",
  "function": "https://raw.githubusercontent.com/NikolaiT/scrapeulous/master/amazon.js",
  "items": ["french press", "samsung galaxy"],
  "region": "us",
}

This will execute the Amazon function on two Workers in the cloud from the US and will return the joined data as soon as all workers have succeeded.

Scraping Google SERPs

This Google SERP function hosted on GitHub scrapes the SERP results (title and link) from Google for a set of keywords.

If this scraper function is used, the API call would look like this (assuming the SERP function is published as google.js in the same repository):

{
  "invocation_type": "request_response",
  "worker_type": "browser",
  "function": "https://raw.githubusercontent.com/NikolaiT/scrapeulous/master/amazon.js",
  "items": ["SaaS platform", "saul goodman", "good movies"],
  "region": "uk",
}

This will execute the Google SERP function on three Workers in the cloud from the United Kingdom region and will return the joined data as soon as all workers have succeeded.


Extracting phone numbers and email addresses from URLs

This Lead scraping function hosted on GitHub visits the given URL and tries to extract any phone numbers and email addresses found in the page source.
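For illustration, the core of such a lead extraction could look roughly like the sketch below. This is a simplified standalone version using axios and regular expressions; the function name and the patterns are illustrative, not the exact code of leads.js.

const axios = require('axios');

// Simplified sketch: download the page source and extract
// email addresses and phone-number-like strings with regexes.
async function extractLeads(url) {
  const response = await axios.get(url);
  const html = response.data;
  const emails = html.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g) || [];
  const phones = html.match(/\+?[0-9][0-9()\/\-. ]{6,}[0-9]/g) || [];
  return { url, emails, phones };
}

extractLeads('https://scrapeulous.com/').then(console.log);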

The lead extraction function can be launched with:

{
  "invocation_type": "request_response",
  "worker_type": "http",
  "function": "https://raw.githubusercontent.com/NikolaiT/scrapeulous/master/leads.js",
  "items": ["https://scrapeulous.com/"],
  "region": "de"
}

This will execute the Lead extraction function on one Worker in the cloud from Germany and will return the joined data as soon as all workers have succeeded.

Your own Scraping function?

Web Crawler can scrape anything! Don't hesitate to contact us so that we can quickly implement your use case and start the distributed crawling!

Technical Architecture

API

The distributed cloud crawler is controlled via API. The complete configuration and logic of the crawler can be specified via a single RESTful API call.
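For example, a crawling task could be submitted from Node.js roughly as follows. The endpoint URL and the authentication header are placeholders for illustration; the real values are provided when you sign up.

const axios = require('axios');

// Placeholder endpoint and API key, for illustration only.
const API_ENDPOINT = 'https://example.com/crawler/api';
const API_KEY = 'YOUR_API_KEY';

async function submitCrawlTask() {
  const response = await axios.post(API_ENDPOINT, {
    invocation_type: 'request_response',
    worker_type: 'browser',
    function: 'https://raw.githubusercontent.com/NikolaiT/scrapeulous/master/amazon.js',
    items: ['french press', 'samsung galaxy'],
    region: 'us'
  }, {
    headers: { 'X-Api-Key': API_KEY }
  });
  return response.data; // joined results from all workers
}

submitCrawlTask().then(console.log).catch(console.error);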

Queuing

Queues are a central construct when it comes to distributed crawling. We use a sophisticated queueing infrastructure to handle crawling assignments.

Pay as you Go

Only consumed infrastructure is billed:

  • Execution time of workers
  • Consumed proxy bandwidth
  • Storage

Distributed

Distributed means that many thousands of Workers from different geographical regions are launched concurrently.

We believe that efficient and professional crawling/scraping needs to be done with a Divide and Conquer approach.

Intelligent Proxy Strategy

We use a mix of cloud workers and proxy servers to route requests through different end nodes. That way, we guarantee that all requests will be handled eventually.

Crawl with headless Chromium or HTTP requests

It's possible to scrape/crawl with plain HTTP requests using the axios library or to control fully fledged Chromium browser instances via puppeteer.
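The choice between the two worker types is a trade-off: plain HTTP requests are lighter and faster, while a real browser also renders JavaScript-heavy pages. A rough standalone sketch of the same page fetch in both styles (not the exact worker interface):

const axios = require('axios');
const puppeteer = require('puppeteer');

// worker_type "http": cheap and fast, returns the raw HTML only
async function fetchWithHttp(url) {
  const response = await axios.get(url);
  return response.data;
}

// worker_type "browser": full Chromium, executes JavaScript on the page
async function fetchWithBrowser(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  const html = await page.content();
  await browser.close();
  return html;
}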