Here you can see a video of how to scrape 260 keywords on Bing in a very short time. Sorry for the bad English and the stuttering, it is my first video :D

Introduction

In the following article we will show off the speed of GoogleScraper by scraping 260 keywords in about a second and outputting the data in JSON format.

The preparation

First we need our 260 keywords. For this, I opened the Wikipedia page listing the best-selling books of all time. You can see the list yourself here.

Then I extracted the book titles with a small jQuery script that I wrote in a previous article (note to myself: very good, you actually reused a small snippet of code and saved 10 minutes) and prepended "buy" to every book title. The list then looks something like this:

buy A Tale of Two Cities
buy The Lord of the Rings
buy The Little Prince
buy Harry Potter and the Philosopher's Stone
buy And Then There Were None
buy Dream of the Red Chamber
buy The Hobbit
buy She: A History of Adventure
buy The Lion, the Witch and the Wardrobe
buy The Da Vinci Code
buy Think and Grow Rich
buy Harry Potter and the Half-Blood Prince
buy The Catcher in the Rye
buy The Alchemist
buy Harry Potter and the Chamber of Secrets
buy Harry Potter and the Prisoner of Azkaban
buy Harry Potter and the Goblet of Fire
buy Harry Potter and the Order of the Phoenix
buy Harry Potter and the Deathly Hallows
buy One Hundred Years of Solitude

(... many more books ...)

Then you need to save the keywords somewhere; I saved my list in the file /tmp/books.txt.
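If you would rather do the "prepend buy" step outside the browser, here is a minimal Python sketch of the same idea. It assumes you already have the raw titles (extracted with the jQuery snippet or by any other means); the function and file names are my own, not part of GoogleScraper:

```python
def make_keywords(titles):
    """Prefix each non-empty title with 'buy ' and strip stray whitespace."""
    return ["buy " + t.strip() for t in titles if t.strip()]

# Demo on a tiny inline sample; for the real list you would read your
# extracted titles from a file and write the result to /tmp/books.txt:
#   open("/tmp/books.txt", "w").write("\n".join(make_keywords(open("titles.txt"))))
sample = ["A Tale of Two Cities", "The Lord of the Rings", ""]
keywords = make_keywords(sample)
# keywords == ["buy A Tale of Two Cities", "buy The Lord of the Rings"]
```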

Installing GoogleScraper

Because I have already explained this step quite a few times, I will skip it here and just mention that there is a rather good description of how to install it on the GitHub page. If you are on a Linux system, it probably works out of the box if you fire this command in your shell:

pip install GoogleScraper

But note that GoogleScraper is written in Python 3.4, so use the Python 3 version of pip (usually pip3).

The scraping

Now let's get to the fun part. Switch to the directory where you saved the file with the 260 keywords and enter the following command in a shell:

GoogleScraper --version
0.1.20

The version should be at least 0.1.20. If you have the correct version, we can begin the asynchronous scrape. Note that we will scrape the search engine Bing first; in the second example we try the same with Google.

GoogleScraper -m http-async --keyword-file books.txt -s bing -o bing_results.json --verbosity 2

BAAAM! In my case, because my internet connection is really bad, I had to wait 10 seconds. But after just those 10 seconds, using only one IP address, this is what GoogleScraper spat out: the bing_results.json output file. 1.5 MB of JSON data, 260 * 10 = 2600 unique links, and a ton of metadata to analyze, gathered in almost no time. (Note that if you have a better internet connection than mine, which is probably the case, you will have the results in about a second.)
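To actually do something with those 2600 links, you can load the JSON output and deduplicate the result URLs. The structure below (a list of SERP objects with a "results" list holding "link" fields) is an assumption for illustration; check the keys in your own bing_results.json and adapt the names accordingly:

```python
import json

# A toy stand-in for bing_results.json with the assumed layout.
sample = json.loads("""
[
  {"query": "buy The Hobbit",
   "results": [{"link": "http://a.example/1"}, {"link": "http://b.example/2"}]},
  {"query": "buy The Alchemist",
   "results": [{"link": "http://b.example/2"}, {"link": "http://c.example/3"}]}
]
""")

def unique_links(serps):
    """Collect the set of distinct result links across all SERP objects."""
    return {r["link"] for page in serps for r in page.get("results", [])}

links = unique_links(sample)
# On the real file you would call: unique_links(json.load(open("bing_results.json")))
```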

Now we try the same with Google.

nikolai@nikolai:~/Projects/private/GoogleScraper$ GoogleScraper -m http-async --keyword-file /tmp/books.txt -s google -o google_results.json --verbosity 3
2015-01-24 14:27:37,986 - GoogleScraper - INFO - Continuing last scrape.
2015-01-24 14:27:37,989 - GoogleScraper - INFO - 0 cache files found in .scrapecache/
2015-01-24 14:27:37,989 - GoogleScraper - INFO - 0/257 objects have been read from the cache. 257 remain to get scraped.
2015-01-24 14:27:37,990 - GoogleScraper - INFO - Going to scrape 257 keywords with 1 proxies by using 1 threads.
2015-01-24 14:27:40,234 - GoogleScraper - INFO - [+] localhost requested keyword 'buy Jonathan Livingston Seagull' on google. Response status: 503
2015-01-24 14:27:40,234 - GoogleScraper - INFO - [i] URL: https://www.google.com/search?num=50&start=1&q=buy+Jonathan+Livingston+Seagull HEADERS: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive'}
2015-01-24 14:27:40,246 - GoogleScraper - INFO - [+] localhost requested keyword 'buy Harry Potter and the Goblet of Fire' on google. Response status: 503
2015-01-24 14:27:40,247 - GoogleScraper - INFO - [i] URL: https://www.google.com/search?num=50&start=1&q=buy+Harry+Potter+and+the+Goblet+of+Fire HEADERS: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive'}
2015-01-24 14:27:40,257 - GoogleScraper - INFO - [+] localhost requested keyword 'buy Calico Cat Holmes series' on google. Response status: 503
2015-01-24 14:27:40,258 - GoogleScraper - INFO - [i] URL: https://www.google.com/search?num=50&start=1&q=buy+Calico+Cat+Holmes+series HEADERS: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive'}
2015-01-24 14:27:40,268 - GoogleScraper - INFO - [+] localhost requested keyword 'buy Star Wars' on google. Response status: 503
2015-01-24 14:27:40,268 - GoogleScraper - INFO - [i] URL: https://www.google.com/search?num=50&start=1&q=buy+Star+Wars HEADERS: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive'}

But we see that we always get the response status 503, so Google blocks us immediately.
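The 503s in the log are Google's rate limiter kicking in. A trivial sketch of how a scraper could detect this and stop hammering the server; the status codes and the helper name here are my own choice, not GoogleScraper's internal logic:

```python
def is_blocked(status_code):
    """Heuristic: treat 429 (too many requests) and 503 as a block."""
    return status_code in (429, 503)

# Feed it the status codes seen in the log above: all 503, so stop scraping.
statuses = [503, 503, 503, 503]
blocked = all(is_blocked(s) for s in statuses)  # True -> back off or abort
```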

Conclusion

This is a serious weakness in Bing's rate limiting, because we can scrape a huge amount of data in a very short time without getting banned. Just think about it: with a good connection, you can easily process 5000 keywords in a second. That is a hell of a lot. Google, on the other hand, blocks us very quickly and we run into a 503 server error (and are asked to solve a stupid captcha).

Now what you do with it is up to you. But stay responsible and give credit where it is due :D