Table of Contents

  • Introduction
  • Scraping with GoogleScraper
  • Data Analysis
  • Conclusion

Introduction

In the following article we will demonstrate the powers of GoogleScraper by conducting a sample market analysis of the correlation between fashion brand names and the world's current top models (both male and female).

So what exactly are we going to do?

We will search two big search engines, namely Google and Bing, for fashion brand names such as

  1. Levi Strauss
  2. Coach
  3. Phillips-Van Heusen
  4. Estée Lauder
  5. Richemont
  6. Christian Dior
  7. The Gap
  8. Kering
  9. H&M
  10. LVMH

in combination with the names of the world's top 50 male and female models as listed at http://models.com/rankings/ui/Top50.

This means we will iterate over all 10 brand names and a total of 100 model names (which sums up to 1000 keywords) on two search engines (Bing and Google, as mentioned above), which amounts to 2000 search requests.

One such example keyword looks like this: "Amanda Murphy Levi Strauss"

Thus, an excerpt of the keyword file looks something like this:

(...)
Caroline Brasch Nielsen Christian Dior
Caroline Brasch Nielsen The Gap
Caroline Brasch Nielsen Kering
Caroline Brasch Nielsen H&M
Caroline Brasch Nielsen LVMH
Chiharu Okunugi Levi Strauss
Chiharu Okunugi Coach
Chiharu Okunugi Phillips-Van Heusen
Chiharu Okunugi Estée Lauder
Chiharu Okunugi Richemont
Chiharu Okunugi Christian Dior
(...)

What questions do we hope to answer with this scraping project?

I am not really knowledgeable when it comes to fashion brand market analysis, so I am not the right person to draw the correct conclusions from the data that we are going to scrape.

Additionally, it's quite debatable whether this scraping endeavour makes sense altogether. But I hope to find answers to the following questions:

  • Which brand earns the most search hits all in all?
  • Which online store is listed most on SERP pages all in all?
  • Are the results from Bing and Google fundamentally different?
  • And many more :D

The technical side: Scraping with GoogleScraper

Extracting the models to scrape from our data source

Now let's look at the technical part and the way we are going to extract the data. First of all, we need to extract the 100 model names from the site models.com. For this purpose I will write a short jQuery snippet in the script console of Firebug.

This is what I came up with to extract the model names from the site:

jQuery('.capdiv > a').each(function (index) {
  console.log( $( this ).text() );
});

We apply it to both pages, once for the Top 50 female models and once for the Top 50 male models. Once that is done, we save the list of 100 model names to a local text file.

Working with Firebug and entering the JavaScript code should look similar to this:

(Screenshot: Extracting results with jQuery)
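
If you prefer to do this step outside the browser, a rough Python equivalent with requests and BeautifulSoup might look like the sketch below. It assumes the ranking pages use the same .capdiv > a markup and render without JavaScript, so treat it as a starting point rather than a drop-in replacement:

# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# The female Top 50 page from the article; add the URL of the male Top 50 page as well.
urls = [
    'http://models.com/rankings/ui/Top50',
]

names = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # same selector as the jQuery snippet above
    for a in soup.select('.capdiv > a'):
        text = a.get_text(strip=True)
        if text:
            names.append(text)

with open('models.txt', 'wt') as outfile:
    outfile.write('\n'.join(names) + '\n')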

Creating the file with the keywords to scrape for

Now we have a file 'models.txt' with the top 50 female and top 50 male models. Next we need to generate the keywords that GoogleScraper can process (one keyword per line). For this task, I wrote the following small Python script that combines each model name with the 10 brand names listed above (there are more elegant ways, for example with itertools; see the sketch after the script).

brands = [
    'Levi Strauss',
    'Coach',
    'Phillips-Van Heusen',
    'Estée Lauder',
    'Richemont',
    'Christian Dior',
    'The Gap',
    'Kering',
    'H&M',
    'LVMH'
]

with open('keywords.txt', 'wt') as outfile:
    for model in open('models.txt', 'r'):
        for brand in brands:
            model = model.strip()
            brand = brand.strip()
            if model and brand:
                s = '{model} {brand}\n'.format(model=model, brand=brand)
                outfile.write(s)
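
As hinted at above, the same keyword file can be written a bit more compactly with itertools.product. A minimal sketch, reusing the brands list from the script above:

from itertools import product

# 'brands' is the same list as defined in the script above
with open('models.txt') as infile:
    models = [line.strip() for line in infile if line.strip()]

with open('keywords.txt', 'wt') as outfile:
    for model, brand in product(models, brands):
        outfile.write('{} {}\n'.format(model, brand))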

Then run the script with the following commands in your shell, cmd, or whatever terminal you use:

nikolai@nikolai:~/Projects/private/ScrapingProjects$ python3 create_kw_file.py 
nikolai@nikolai:~/Projects/private/ScrapingProjects$ wc -l keywords.txt 
1070 keywords.txt
nikolai@nikolai:~/Projects/private/ScrapingProjects$ head keywords.txt 
Amanda Murphy Levi Strauss
Amanda Murphy Coach
Amanda Murphy Phillips-Van Heusen
Amanda Murphy Estée Lauder
Amanda Murphy Richemont
Amanda Murphy Christian Dior
Amanda Murphy The Gap
Amanda Murphy Kering
Amanda Murphy H&M
Amanda Murphy LVMH

So we just created our keyword file with 1070 keywords in total (we got more than the expected 1000 keywords because there were slightly more than 50 female models on our source site).

Setting up GoogleScraper

As promised, we will use the tool GoogleScraper that I developed over the last year. You need to have Python 3.4 installed and a shell to work in. Then fire the following commands in a shell to install GoogleScraper:

virtualenv --python python3 env
source env/bin/activate
pip install GoogleScraper

The above commands install GoogleScraper in a virtual environment.

We are going to do the scrape with three different IP addresses: two SOCKS5 proxies that I set up on my two VPS servers, plus my own IP address. This makes it possible to scrape with 6 open browser windows without using the same IP address with the same search provider concurrently. So for each automated browser instance (steered with the Selenium framework, for the curious) we use exactly one IP address, and each of these browsers needs to request roughly 1070/3 ≈ 356 keywords in total.
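
Before starting such a long scrape it is worth checking that each proxy really gives you a distinct IP address. A quick sanity check with the requests library (installed with SOCKS support via pip install requests[socks]) could look like this; the proxy hosts below are placeholders for your own servers:

# pip install requests[socks]
import requests

# placeholder SOCKS5 endpoints - replace with your own VPS proxies
proxies_to_test = [
    None,  # no proxy: your own IP address
    {'http': 'socks5://proxy1.example.com:1080',
     'https': 'socks5://proxy1.example.com:1080'},
    {'http': 'socks5://proxy2.example.com:1080',
     'https': 'socks5://proxy2.example.com:1080'},
]

for proxy in proxies_to_test:
    # httpbin echoes back the IP address it sees the request coming from
    answer = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)
    print(proxy, '->', answer.json()['origin'])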

The scraping

The scraping process itself isn't very interesting except for one minor fact: it seems that Google is much more sensitive to automated searches than Bing. In my tests, Google would show a captcha much earlier than Bing. I am quite positive that this behaviour changed over the last few months (maybe because of this tool? With 250 stars on GitHub it is popular enough to have a negative impact and thus piss off people at Google).

When I used GoogleScraper for one of my previous customers, scraping Google with automated browsers was very easy: one could use up to 15 browser threads and request around 3 keywords per second (remember: with just one IP address). Now I cannot get anywhere near that; if I make more than one request every 5 seconds (15 times slower), Google instantly bans me. So you either need quite a lot of proxies or a good scraping strategy.

Bing, on the other hand, still has no big restrictions :D
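
If you write your own scraping loop (or extend GoogleScraper), a trivial throttling helper with a randomized delay already helps to stay under that one-request-per-5-seconds threshold mentioned above. A minimal sketch:

import random
import time

def polite_sleep(min_delay=5.0, jitter=3.0):
    """Wait at least min_delay seconds plus some random jitter,
    so the request pattern looks less machine-like."""
    time.sleep(min_delay + random.uniform(0, jitter))

# usage in a scraping loop (scrape() stands for whatever issues the actual request):
# for keyword in keywords:
#     scrape(keyword)
#     polite_sleep()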

The results

To give you a little preview of the results, let's dive straight into the data analysis.

Data Analysis - Playing with the results

Now that we have scraped the data, it's time to play with it.

You can access an interactive shell with GoogleScraper by calling:

GoogleScraper --shell

This will give you an ipython3 session with some SQLAlchemy objects available for inspecting the results. These are:

  • session: A SQLAlchemy session.
  • ScraperSearch: Represents one run of the GoogleScraper application. It stores how long the scraping took and has an attribute serps: a link to all the SERPs found in that search.
  • SERP: A representation of a search engine results page. It has a handle to its links.
  • Link: What we want. Stores the link, the snippet and the displayed URL of an entry on the SERP page.

I assume that you are now in the shell session:

To list all searches that we issued:

>>> session.query(ScraperSearch).all()

# Assuming that our search has index 1
>>> search = session.query(ScraperSearch).get(1)

# Now list all serp pages
>>> search.serps

# And to iterate through all links use
>>> for serp in search.serps:
...     for link in serp.links:
...         print(link)

# Alternatively you can get all urls like this:
>>> links = [link.link for serp in search.serps for link in serp.links]
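
As a small usage example, you can already do quick aggregations right in this shell. Every Link row also stores the domain (see the database schema further down), so counting the most frequent domains of one search is a one-liner:

# A quick aggregation example: how often does each domain appear in this search?
>>> from collections import Counter
>>> Counter(link.domain for serp in search.serps for link in serp.links).most_common(10)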

Now that we have a basic understanding of how to work with the data, let's try to answer the questions we asked ourselves at the beginning of this blog post!

We could of course work again in the SQLAlchemy shell that GoogleScraper provides, but it's also convenient to use a database administration tool such as phpMyAdmin. I will use the command line tool sqlite3, since I like working in the shell and am quite used to it. Let's get comfortable with our tool by firing up some basic queries:

# open the database in the tool
sqlite3 google_scraper.db

# show the schema of the database
sqlite> .schema
CREATE TABLE scraper_search (
    id INTEGER NOT NULL, 
    number_search_engines_used INTEGER, 
    used_search_engines VARCHAR, 
    number_proxies_used INTEGER, 
    number_search_queries INTEGER, 
    started_searching DATETIME, 
    stopped_searching DATETIME, 
    PRIMARY KEY (id)
);
CREATE TABLE serp (
    id INTEGER NOT NULL, 
    search_engine_name VARCHAR, 
    scrapemethod VARCHAR, 
    page_number INTEGER, 
    requested_at DATETIME, 
    requested_by VARCHAR, 
    num_results INTEGER, 
    "query" VARCHAR, 
    num_results_for_keyword VARCHAR, 
    PRIMARY KEY (id)
);
CREATE TABLE scraper_searches_serps (
    scraper_search_id INTEGER, 
    serp_id INTEGER, 
    FOREIGN KEY(scraper_search_id) REFERENCES scraper_search (id), 
    FOREIGN KEY(serp_id) REFERENCES serp (id)
);
CREATE TABLE link (
    id INTEGER NOT NULL, 
    title VARCHAR, 
    snippet VARCHAR, 
    link VARCHAR, 
    domain VARCHAR, 
    visible_link VARCHAR, 
    rank INTEGER, 
    link_type VARCHAR, 
    serp_id INTEGER, 
    PRIMARY KEY (id), 
    FOREIGN KEY(serp_id) REFERENCES serp (id)
);

# how many serp pages do we have using Google as a search engine?
sqlite> select count(*) from serp where search_engine_name = 'google';
1067

# more specifically: how many unique queries do we have for the search engine google? 1070 were expected...
sqlite> select count(distinct query) from serp where search_engine_name = 'google';

# which domains appeared most frequently in all scraped URLs (more than fifty times)?
sqlite> select domain, count(domain) as num_domain_appeared from link group by domain having num_domain_appeared > 50 order by num_domain_appeared desc;
|1370
www.pinterest.com|912
models.com|904
www.facebook.com|628
www.tumblr.com|602
www.fashionmodeldirectory.com|356
www.thefashionisto.com|296
twitter.com|228
instagram.com|222
www.youtube.com|222
en.wikipedia.org|180
www.polyvore.com|180
ftape.com|178
www.malemodelscene.net|174
fashionindustryarchive.com|170
forums.thefashionspot.com|168
www.linkedin.com|152
nymag.com|140
en.vogue.fr|138
www.vogue.it|124
www.vogue.com|116
www.dazeddigital.com|98
www.fashiongonerogue.com|92
article.wn.com|70
books.google.de|70
awake-smile.blogspot.com|68
www.topfash.com|66
www.wwd.com|64
www.googleadservices.com|63
www.vmagazine.com|62
fashioncopious.typepad.com|56
corpusfashion.tumblr.com|54

Now that we are familiar with SQL, let's answer the questions we asked ourselves earlier:

Which brand earns the most search hits all in all?

This is slightly complicated because we need to parse the brand name out of the initial search query, and we also need to parse the number of results that the search engine reported for that query. That's why we will use a separate script again, so we can store things in variables and solve the problem programmatically.

Note that we use some handy tools that Python provides for us:

  • The Counter class from the collections module
  • The pretty print function: pprint

This is the script I made to count the overall search hits for each brand:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import re
from GoogleScraper.database import get_session, SERP
from collections import Counter
from pprint import pprint

brands = [
    'Levi Strauss',
    'Coach',
    'Phillips-Van Heusen',
    'Estée Lauder',
    'Richemont',
    'Christian Dior',
    'The Gap',
    'Kering',
    'H&M',
    'LVMH'
]

num_res = re.compile(r'(?P<numr>[\d\.,]+) (results|Ergebnisse)')

# your path to the database that GoogleScraper saved
Session = get_session(path='/home/nikolai/Projects/private/GoogleScraper/google_scraper.db')
session = Session()

counter = Counter()

for serp in session.query(SERP).all():
    for brand in brands:
        if brand in serp.query:
            try:
                number = num_res.search(serp.num_results_for_keyword).group('numr')
                # strip thousands separators (',' and '.') before converting to int
                counter[brand] += int(number.replace(',', '').replace('.', ''))
            except AttributeError:
                # no result count was scraped for this SERP page
                pass

pprint(counter.most_common())

The output is the following:

[('The Gap', 568178074),
 ('Kering', 494447823),
 ('H&M', 371373925),
 ('Coach', 217229804),
 ('Christian Dior', 98413899),
 ('Richemont', 51080484),
 ('Estée Lauder', 51067537),
 ('LVMH', 40017347),
 ('Levi Strauss', 33992170),
 ('Phillips-Van Heusen', 5209493)]

To conclude: obviously, the brand name "The Gap" earns a lot more search hits than, for example, "Phillips-Van Heusen". This seems plausible, because we can quickly verify it by googling these queries manually:

  • Phillips-Van Heusen: 392,000
  • The Gap: 129,000,000

Which online store is listed most on SERP pages all in all?

Well, to answer this question we would need a list of online stores to filter by. Because I don't have this specific information, I cannot answer the question :)

But I am optimistic that someone who actually works in this industry could figure it out with the techniques outlined here; a sketch of one possible approach follows.
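
Assuming you had a list of store domains (the ones below are made up purely for illustration), you could count their SERP appearances directly against the link table:

import sqlite3

# purely illustrative store domains - replace with a real list
stores = ['www.zalando.de', 'www.asos.com', 'www.net-a-porter.com']

conn = sqlite3.connect('google_scraper.db')
placeholders = ', '.join('?' * len(stores))
query = ('select domain, count(*) from link '
         'where domain in ({}) '
         'group by domain order by count(*) desc'.format(placeholders))

for domain, count in conn.execute(query, stores):
    print(domain, count)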

Are the results from Bing and Google fundamentally different?

Well, to approach this question, here are the most frequent domains found in the links from Google and from Bing. I got the data with the following SQL queries in the sqlite3 shell:

# Google most frequent domains
sqlite> select link.domain, count(link.domain) as cdom from link join serp on link.serp_id = serp.id where serp.search_engine_name = 'google' group by link.domain having cdom > 30 order by cdom desc;
models.com|1818
www.tumblr.com|988
www.vogue.it|502
instagram.com|342
fashionindustryarchive.com|338
www.dazeddigital.com|322
art8amby.wordpress.com|266
www.vogue.de|220
www.vogue.com|162
www.wwd.com|144
www.vmagazine.com|126
www.style.com|124
thefashionography.com|122
books.google.de|112
www.bellazon.com|108
www.businessoffashion.com|106
fashioncopious.typepad.com|102
www.pvh.com|100
de.wikipedia.org|98
www.womenmanagement.com|92
www.topfash.com|84
www.fordmodelsblog.com|78
www.tee-vanity.com|76
www.fashionedbylove.co.uk|74
de.fashionmag.com|72
www.textilwirtschaft.de|70
www.thefashionlaw.com|70
www.googleadservices.com|67
www.anneofcarversville.com|66
www.scoop.it|64
www.zimbio.com|64
corpusfashion.tumblr.com|62
www.bloglovin.com|60
www.gettyimages.com|58
tmagazine.blogs.nytimes.com|56
www.dailymalemodels.com|54
www.harpersbazaar.com|54
rebloggy.com|52
phx.corporate-ir.net|50
www.hola.com|50
www.huffingtonpost.com|50
www.wilhelminanews.com|50
books.google.com|48
www.bellomag.com|48
www.tumblr.net|48
fabmagazineonline.com|46
www.modni-zpravodajstvi.jecool.net|46
coolspotters.com|44
www.nytimes.com|42
www.vogue.co.uk|42
igbox.co|40
www.justjared.com|40
www.amazon.de|38
www.arcstreet.com|38
www.whynotmodels.com|38
brand.haibao.com|36
designsfever.com|36
ink361.com|36
it.fashionmag.com|36
uk.linkedin.com|36
www.4-traders.com|36
www.chloe.com|36
www.luxuo.com|36
fashion.globetrottingwino.me|34
www.esteelauder.com|34
yangabin.perso.neuf.fr|32
yumilambert.tumblr.com|32

and now for Bing:

# Bing most frequent domains
sqlite> select link.domain, count(link.domain) as cdom from link join serp on link.serp_id = serp.id where serp.search_engine_name = 'bing' group by link.domain having cdom > 30 order by cdom desc;
www.pinterest.com|1572
www.facebook.com|1468
www.fashionmodeldirectory.com|1006
www.thefashionisto.com|952
www.linkedin.com|708
forums.thefashionspot.com|662
www.youtube.com|582
twitter.com|494
en.wikipedia.org|456
ftape.com|422
www.polyvore.com|400
www.malemodelscene.net|398
en.vogue.fr|332
nymag.com|300
www.fashiongonerogue.com|288
article.wn.com|242
awake-smile.blogspot.com|204
vimeo.com|202
www.models.com|194
1662080.r.msn.com|188
wn.com|188
3132320.r.msn.com|186
2080531.r.msn.com|180
2712200.r.msn.com|168
www.zoominfo.com|162
0.r.msn.com|150
fashionista.com|142
www.designscene.net|132
profashioneye.com|126
www.dailymail.co.uk|106
blog.sight-management.com|96
www.dior.com|94
www.imageamplified.com|94
i-d.vice.com|88
womenmanagement.blogspot.com|82
1669492.r.msn.com|72
3024113.r.msn.com|72
katiesrunwaysreport.wordpress.com|72
2546653.r.msn.com|68
onabbotkinney.com|68
us.fashionmag.com|68
www.elle.de|68
www.spoke.com|68
fashionopher.blogspot.com|66
www.got-blogger.com|66
www.imdb.com|66
www.donbleek.com|64
plus.google.com|58
uk.fashionmag.com|58
www.flickr.com|58
www.thefashionspot.com|58
1178864.r.msn.com|56
www.levistrauss.com|56
issuu.com|54
nowfashion.com|52
www.details.com|52
avaxhm.com|50
blog.megamodelagency.com|50
indonesia.style.com|50
www.dnamodels.com|50
www.vogue.fr|48
www.ftape.com|46
www.jolie.de|46
www.weartrends.com|46
www.modellink.com.au|42
fashionweekdaily.com|40
www.lead411.com|40
www.modelinia.com|40
soheleetahmina.wordpress.com|38
www.fashionone.com|38
www.fibre2fashion.com|38
www.levi.com|38
fashionbombdaily.com|36
www.lvmh.com|36
www.reuters.com|36
www.womenmanagement.fr|36
fashionablymale.net|34
www.15minutenews.com|34
rapax.com|32
www.fashionising.com|32
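
The two lists already look quite different at a glance: Google favours models.com, Tumblr and the various Vogue sites, while Bing surfaces Pinterest, Facebook, LinkedIn and a number of r.msn.com hosts. If you want to put a number on the difference, here is a small sketch that computes the domain overlap between the two engines from the same database:

import sqlite3

conn = sqlite3.connect('google_scraper.db')

def domains_for(engine):
    rows = conn.execute(
        'select distinct link.domain from link '
        'join serp on link.serp_id = serp.id '
        'where serp.search_engine_name = ?', (engine,))
    return {row[0] for row in rows if row[0]}

google, bing = domains_for('google'), domains_for('bing')

print('google only:', len(google - bing))
print('bing only:', len(bing - google))
print('shared:', len(google & bing))
# Jaccard index: 1.0 would mean identical domain sets, 0.0 no overlap at all
print('jaccard:', len(google & bing) / len(google | bing))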

Conclusion

We have seen that GoogleScraper allows us to process a lot of data in a short time. The more proxies you have, the more you can scrape. The above investigation isn't very deep, but if you are creative, you can dig up many hidden, valuable pieces of information :)