
About scrapeulous.com

Scrapeulous.com allows you to scrape various search engines automatically and in large quantities.

The need to scrape information from search engines frequently arises in market research and in scientific projects. This service lets you scrape thousands of keywords across many different search engines. Equipped with this data, your project benefits from solid business intelligence.

Scrapeulous.com was created by the developer of GoogleScraper, a large open source project that has helped many people achieve their data extraction and analysis objectives.

As of 2019, GoogleScraper has been replaced by a modern successor named se-scraper, which builds on top of the JavaScript library puppeteer and a headless Chromium browser.
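
To illustrate the kind of browser automation that se-scraper builds on, here is a minimal plain-puppeteer sketch that opens headless Chromium, runs a search query and collects the result headings. This is not se-scraper's own API, and the result selector is only an illustrative assumption that will vary with the search engine's current markup.

    // Minimal puppeteer sketch: launch headless Chromium, run a query,
    // and collect result headings. The selector 'a h3' is an assumption
    // and changes with the search engine's markup.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://www.google.com/search?q=web+scraping', {
        waitUntil: 'networkidle2',
      });
      // Extract the text of the result headings; adjust the selector as needed.
      const titles = await page.$$eval('a h3', nodes =>
        nodes.map(n => n.textContent.trim())
      );
      console.log(titles);
      await browser.close();
    })();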

We created this service because many clients approached us and asked us to scrape data for them. We believe this service can satisfy the requirements of medium-sized to large scraping projects.

If you have special requirements for your scraping project, send an email to our contact address and ask whether we can tailor a solution to your needs.

Vision

[Image: Future vision for scrapeulous]

Current State

Currently, Scrapeulous allows its customers to scrape and crawl search engines such as Google or Bing in large quantities. In the intermediate term, Scrapeulous aims to offer a general-purpose scraping/crawling architecture to its customers. This architecture consists of automated browsers that are executed in various geographical locations in the cloud, with the aim of simulating organic, distributed human traffic.
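
The shape of such a distributed crawl job might look roughly like the sketch below. All field names here are purely hypothetical and only illustrate the idea of spreading browser sessions over regions and over time; they are not Scrapeulous' actual API.

    // Hypothetical job description illustrating browser sessions spread
    // over several cloud regions and over time. None of these field
    // names are Scrapeulous' actual API.
    const crawlJob = {
      keywords: ['coffee machines', 'espresso grinder'],
      searchEngines: ['google', 'bing'],
      regions: ['eu-west', 'us-east', 'ap-south'],  // where the browsers run
      browsersPerRegion: 2,                         // parallelism per location
      minDelayMs: 15000,                            // pause between requests
      maxDelayMs: 60000,                            // randomized to look organic
    };

    console.log(JSON.stringify(crawlJob, null, 2));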

How do we generate value?

The benefits of a managed scraping architecture are manifold. Scraping/crawling is an inherently error-prone process: several factors need to align for a scrape job to succeed. There are resource constraints, big data problems, reliability issues and many more obstacles.

For this reason, clients have a strong incentive to use a provider such as Scrapeulous to cover their scraping/crawling needs. With Scrapeulous, clients can focus exclusively on the logic of data extraction instead of the many tedious issues involved, such as getting blocked by websites, limited computing and network resources, or building reliable queuing infrastructure.
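
In practice, "focusing on the logic of data extraction" means the client only writes something like the following page handler, while the platform takes care of where and when the browser actually runs. The handler signature and selectors are assumptions for illustration; the function simply receives an already navigated puppeteer page.

    // Hypothetical extraction callback: the client only describes what to
    // pull out of an already loaded page, nothing about proxies, IP
    // rotation, retries or scheduling. The signature is an assumption.
    async function extractProductDetails(page) {
      return page.evaluate(() => ({
        title: document.querySelector('h1')?.textContent.trim(),
        price: document.querySelector('.price')?.textContent.trim(),
      }));
    }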

What problems does Scrapeulous solve?

There is huge demand for publicly accessible data published on the Internet: think of user profiles on Instagram, Google Maps location entries, Amazon product details and many other examples. Website maintainers often try to fend off attempts to enumerate and harvest this data automatically by defining criteria that distinguish humans from automated bots.

For example, websites classify a user agent as a bot if all requests originate from the same IP address or browser fingerprint, if the client does not use a real browser such as Chromium, does not move the mouse cursor the way a human would, or is unable to solve captchas.
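
The sketch below shows, in deliberately simplified form, how a site operator might apply two of these criteria: counting requests per IP address and inspecting the user agent string. Real detection systems combine many more signals (browser fingerprints, mouse movement, captchas); the threshold and the regular expression here are assumptions.

    // Simplified bot heuristic, for illustration only: flag clients that
    // send too many requests from one IP or that do not present a
    // browser-like user agent string. Real systems use far more signals.
    const express = require('express');
    const app = express();

    const hitsPerIp = new Map();
    const LIMIT_PER_MINUTE = 30;   // assumed threshold

    app.use((req, res, next) => {
      const count = (hitsPerIp.get(req.ip) || 0) + 1;
      hitsPerIp.set(req.ip, count);

      const ua = req.headers['user-agent'] || '';
      const looksLikeBrowser = /Chrome|Firefox|Safari/.test(ua);

      if (count > LIMIT_PER_MINUTE || !looksLikeBrowser) {
        return res.status(429).send('Automated traffic suspected');
      }
      next();
    });

    setInterval(() => hitsPerIp.clear(), 60 * 1000);  // reset the counting window
    app.listen(3000);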

Therefore, there is large demand for a programmable, arbitrarily large pool of human-like user agents, executed from different geographical locations, that is hard to detect or block.

How does Scrapeulous solve them?

First of all, a primary goal of Scrapeulous is not to antagonize website maintainers by generating excessive traffic.

The objective is to simulate organic traffic that is indistinguishable from real human visitors. This is currently done by using the latest version of Chromium as the browser, controlled by the JavaScript library puppeteer. Puppeteer instances run in the cloud in different geographical locations, and therefore with distinct IP addresses. Each browser is configured to resemble a real human-operated browser as closely as possible in order to prevent fingerprinting. The actual scraping/crawling of websites is spread evenly over time, so that no excessive traffic is produced and no traces are left that reveal the visits as automated.
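
A stripped-down version of that idea in puppeteer might look like the following sketch: a common desktop user agent and viewport are set, the mouse is moved a little, and visits are separated by randomized pauses. The concrete values and target URLs are assumptions; the real configuration is considerably more involved.

    // Sketch of a "human-like" browser session: realistic user agent and
    // viewport, some mouse movement, and randomized pauses between visits.
    // The concrete values are assumptions for illustration.
    const puppeteer = require('puppeteer');

    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
      );
      await page.setViewport({ width: 1366, height: 768 });

      for (const url of ['https://example.com', 'https://example.org']) {
        await page.goto(url, { waitUntil: 'networkidle2' });
        // Move the cursor to a slightly random position, like a human might.
        await page.mouse.move(200 + Math.random() * 400, 150 + Math.random() * 300);
        // Wait a randomized 10 to 40 seconds before the next visit.
        await sleep(10000 + Math.random() * 30000);
      }
      await browser.close();
    })();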

As an intermediate-term goal, artificial intelligence based on neural networks and machine learning will be used to mimic human behavior as closely as possible.