Scraping Data in Python Using WebDrivers*

* An interactive IPython notebook version of this post can be found on my GitHub page (here).

1. Introduction

The Internet is a treasure trove of interesting, unique, and often underutilized data. Getting that data, however, can be an arduous task. There are numerous libraries, implemented in various programming languages, that can help to ease the burden and have been a boon to data miners everywhere. Most notable are the BeautifulSoup and urllib2 libraries in Python.

1.1 The Problem

Still, there are a number of instances where these aren’t the right tools for the job. Sites where you need to enter information into forms, select boxes, or navigate by dropdown menus are especially tricky to handle with these more traditional methods. Dynamic websites, with or without static addresses, are close to impossible.

1.2 The Solution

WebDrivers can provide a (generally) user-friendly answer to these problems. Although this post will focus on using the selenium library paired with ChromeDriver in Python, there are other WebDrivers (e.g., Firefox, headless browsers) and languages (e.g., Java) that can be used for this.

1.2.1 What is a WebDriver and Why is it the Solution?

A WebDriver is simply a live instance of an Internet browser controlled by a program rather than by real-time human interaction. In essence, it looks like a regular browser (both to you and to sites’ servers), it quacks like a regular browser, but it isn’t quite a regular browser. The appeal lies in the fact that you can automate the kind of natural web navigation that would be difficult or impossible with traditional HTML and XML extractors.

A simple example is filling out a form. Say that you want to search a site for documents associated with a set of boolean strings (e.g., [“selenium NOT java”, “java NOT selenium”, …]) over a set of specific time spans. However, the addresses of those search results are dynamic – making them impossible to generate a priori. That’s going to be a problem for other tools, but with a WebDriver you can execute the search by filling out the search bar and specifying the date range (e.g., by clicking on a calendar GUI, entering the dates, or using a dropdown menu).
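To make that concrete, here is a minimal sketch of automating such a search with selenium. The element IDs and menu text are hypothetical stand-ins (a real site’s identifiers would be found with inspect element), and it assumes a driver has already been connected as shown in Section 3:

## Hypothetical sketch -- the element IDs and menu text are made up:
from selenium.webdriver.support.ui import Select

## Type a boolean string into the search bar:
search_box = driver.find_element_by_id('search-input')
search_box.send_keys('selenium NOT java')

## Pick a time span from a dropdown menu:
date_menu = Select(driver.find_element_by_id('date-range'))
date_menu.select_by_visible_text('Past Year')

## Execute the search:
search_box.submit()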


2. Setup

Assuming that you have Python 2.7+ up and running on your machine, the first thing that you will need is ChromeDriver. You’ll want to install this somewhere that’s easily accessible (I just have it in my “Desktop” folder).

Next you will need to install selenium, which can be done via pip:

pip install selenium

Standard-library modules that you may also find useful are os, random, re, time, and sys; these ship with Python, so there is nothing extra to install.


3. Getting Started

Once those are installed, you can start getting acquainted with using a WebDriver through Python.

First, we need to import the necessary libraries. The Select utility allows us to operate on dropdown (HTML <select>) menus.

import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import random
import time
import sys

Next, we’re actually going to connect to the WebDriver and open up a browser window.

## Connecting
chromedriver = "C:\\Users\\rbm166\\Desktop\\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

## Opening
driver = webdriver.Chrome(chromedriver)
# Clear cookies
driver.delete_all_cookies()
# Maximize the window
driver.maximize_window()
# Wait...
time.sleep(5)

This should open up a Google Chrome browser window that isn’t pointed at any particular webpage.


4. Example

Now that you know how to get started, let’s move on to a real example.

4.1 The Constitute Project

The Constitute Project is a collection of national constitutions from across the world. Users can read full constitutions, compare countries, and explore topics within and across documents. This database, however, would be very difficult to scrape without using a WebDriver, because the site’s main page listing the constitutions is generated dynamically with JavaScript.

Using Python and ChromeDriver, we can easily navigate this page and grab all the links to countries’ constitutions, whose pages are static HTML and can be parsed with a standard library like BeautifulSoup.

First, we’re going to point our driver to the page we want to load:

## Provide the starting page:
link0 = 'https://www.constituteproject.org/search?lang=en'

## Go to that starting page:
driver.get(link0)

## Wait for it to load:
time.sleep(5)
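As an aside, fixed time.sleep() calls get the job done, but selenium also offers implicit waits, which tell the driver to keep polling for elements for up to a set number of seconds before raising an error. A one-line alternative (the 10-second timeout is just a reasonable default):

## Poll up to 10 seconds for elements to appear before giving up:
driver.implicitly_wait(10)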

Once loaded, the driver’s browser should be displaying the site’s search page, with its full list of constitutions.

Using the inspect element option available in that window or a separate browser, we can find the identifier associated with the links that we want to grab. In this case, we can use the link text “View HTML”.

Now we can tell our driver to find all of the page’s elements associated with that link text:

## Pull all "View HTML" related objects:
objects = driver.find_elements_by_link_text('View HTML')
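Link text is only one of the locator strategies selenium supports. If the links were marked up differently, a partial link-text match or an XPath query could do the same job (the selectors below are illustrative, not taken from the site):

## Equivalent lookups with other locator strategies:
objects = driver.find_elements_by_partial_link_text('View')
objects = driver.find_elements_by_xpath('//a[text()="View HTML"]')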

We can perform a quick check to make sure that we’re only grabbing what we want. The number of objects should equal the number of constitutions, n = 194, listed at the top of the page:

## Sanity Check:
len(objects)
194

We just have to get the links out of these objects now. Selenium makes this easy:

## List to hold the links:
links = []

## Iterate over list and get the links:
for obj in objects:
    links.append(obj.get_attribute('href'))

## Inspect the links:
links[0:6]
[u'https://www.constituteproject.org/constitution/Afghanistan_2004?lang=en',
 u'https://www.constituteproject.org/constitution/Albania_2012?lang=en',
 u'https://www.constituteproject.org/constitution/Algeria_2008?lang=en',
 u'https://www.constituteproject.org/constitution/Andorra_1993?lang=en',
 u'https://www.constituteproject.org/constitution/Angola_2010?lang=en',
 u'https://www.constituteproject.org/constitution/Antigua_and_Barbuda_1981?lang=en']

These links can now be written out to a *.txt or *.csv file for later use, or manipulated later in the script, and the static pages they point to can be parsed using a library like BeautifulSoup, as sketched below.
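Here is a minimal sketch of that last step. The file name is arbitrary, and the BeautifulSoup snippet only pulls the first page’s title as an illustration:

## Save the links for later use:
with open('constitution_links.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')

## Fetch and parse one of the static pages with urllib2 + BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen(links[0]).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)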


## Now we can close the driver's window (driver.quit() would
## shut down the browser session entirely):
driver.close()

5. Conclusion

Traditional web scraping libraries and packages are well-developed tools that make web scraping easier. They do, however, fall short on some fronts. WebDrivers provide an elegant solution to many of the problems faced by these traditional methods. As shown above, WebDrivers can navigate dynamic websites with ease and can be adapted to most situations.

Disclaimer

Just because you can doesn’t mean you should. Be sure to check and observe sites’ policies regarding scrapers. These policies can most often be found at “[insert url here].com/robots.txt”.
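If you want to check from within a script, a robots.txt file can be fetched like any other page (the URL below is just an example):

## Fetch a site's robots.txt (illustrative URL):
import urllib2
print(urllib2.urlopen('https://www.constituteproject.org/robots.txt').read())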