“Page scraping” means extracting content from web pages. It's often used when the information you want is only available on web pages.
These notes cover the typical steps in page scraping with Selenium. They are not a single end-to-end example.
Navigate to a Web Page
driver.get("https://www.google.com")
Find an element on the page
Suppose we want to find:
<input type="text" id="input_username" name="username"
placeholder="Type your name" />
We can use:
# find by id
element = driver.find_element_by_id("input_username")
# or find it by name
element = driver.find_element_by_name("username")
# find by tag name
input_tag = driver.find_element_by_tag_name("input")
# find by XPath expression
# Careful! If the expression matches multiple elements, only the first is returned.
element = driver.find_element_by_xpath("//input[@id='input_username']")
What if nothing matches?
- The find_element_by_* methods throw selenium.common.exceptions.NoSuchElementException
- The find_elements_by_* methods (described below) return an empty list instead
Finding Many Matches
Each find_element_by_* method has a find_elements_by_* counterpart that returns a list of all matches.
WebElement May Contain Other Web Elements
Each WebElement represents a part of the DOM tree. It may contain other WebElements.
# element is a WebElement containing the entire <table>...</table> tree
element = driver.find_element_by_tag_name("table")
# now get the rows in the table
rows = element.find_elements_by_tag_name("tr")
# inside of each row, find the columns
for row_element in rows:
    columns = row_element.find_elements_by_tag_name("td")
    # each column is also a WebElement
    # Get the text in each <td>...</td>
    for column_element in columns:
        print(column_element.text)
Getting the Text
What if you want the text on a hyperlink?
# find the first hyperlink (may throw NoSuchElementException)
element = driver.find_element_by_tag_name("a")
# get the hyperlink url (None if the element has no href attribute)
url = element.get_attribute("href")
# get the text inside <a>...</a>
text = element.text
Don’t Load Images (Make the Webdriver More Efficient)
For Firefox, use:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
# 2 = block all images
profile.set_preference('permissions.default.image', 2)
# disable the Flash plugin; a boolean preference takes a Python bool, not the string 'false'
profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
driver = webdriver.Firefox(firefox_profile=profile)
Reference
- Navigating in the Selenium Python Docs
- Official Python API Docs, best for API reference
- ReadTheDocs version, which also has other info about installing and using Selenium