“Scraping” website data is a technique for gathering information from a website programmatically, without having to open the site in a browser and manually copy and paste the information.
Pandas has an easy way to scrape data represented in tables: its read_html() function, which looks for <table> tags in the HTML. (More information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)
To illustrate this, let’s start off with an example. We will scrape mortgage loan borrowing limits from Ally’s website: www.ally.com/do-it-right/home/jumbo-loan-limits/
To verify that the state data is nested inside a <table> tag, we take a look at the HTML code:
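As an alternative to reading the page source by eye, the same check can be done programmatically. The sketch below uses Python's standard-library HTML parser to count table tags. The TableCounter class and the sample_html string are hypothetical stand-ins for illustration; the real page's markup will differ.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts the opening <table> tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == "table":
            self.table_count += 1


# Hypothetical stand-in for the page source (placeholder values only).
sample_html = """
<table>
  <tr><th>State</th><th>Limit</th></tr>
  <tr><td>State A</td><td>$100,000</td></tr>
</table>
"""

counter = TableCounter()
counter.feed(sample_html)
print(counter.table_count)
```

A count of at least one suggests the page has something for read_html() to pick up.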
Having verified this, the actual code to scrape the table is simple:
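A minimal sketch of the call, assuming a parser library such as lxml is installed alongside pandas. To keep the example self-contained, it parses a small hypothetical HTML string rather than the live page; the column names and values are made up and do not come from Ally's site.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page; names and values are placeholders.
sample_html = """
<table>
  <tr><th>State</th><th>County</th><th>Limit</th></tr>
  <tr><td>State A</td><td>County 1</td><td>$100,000</td></tr>
  <tr><td>State B</td><td>County 2, County 3</td><td>$200,000</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
# For a live page, pass the URL string in place of the StringIO wrapper.
tables = pd.read_html(StringIO(sample_html))

print(len(tables))
print(tables[0])
```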
You’ll notice the result is a list. We can use [0] to access the first element of the resulting list, which is the first table on the page. If there had been multiple tables, we could’ve used [1], [2], etc. to access the additional tables.
Notice the difference: without an index, the result is simply a list in which each element is a table. With the index, the resulting output is a Pandas DataFrame.
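The difference is easy to see by inspecting types. This sketch assumes a hypothetical page containing two tables, again supplied as an inline string with placeholder values, and assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables (placeholder values only).
sample_html = """
<table>
  <tr><th>State</th><th>Limit</th></tr>
  <tr><td>State A</td><td>$100,000</td></tr>
</table>
<table>
  <tr><th>Loan type</th><th>Note</th></tr>
  <tr><td>Type X</td><td>Example</td></tr>
</table>
"""

tables = pd.read_html(StringIO(sample_html))

# Without an index: a plain Python list, one entry per table.
print(type(tables), len(tables))

# With an index: a single DataFrame.
first_table = tables[0]
second_table = tables[1]
print(type(first_table))
print(second_table)
```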
Bonus: formatting cell content
Having successfully scraped data from a website, let’s look at some data transformations that make it easier to explore. First, we notice some rows have multiple counties on one line. To assign one row to each county, we’ll split those entries so that each county ends up on its own row.
Finally, we’ll perform a couple of simple string replacements:
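A minimal sketch of both transformations, under several assumptions: the column names and values are hypothetical, counties in a cell are assumed to be comma-separated, the row-splitting step is assumed to use str.split() followed by DataFrame.explode(), and the string replacements are assumed to strip the dollar sign and thousands separators. The actual page may call for different separators or replacements.

```python
import pandas as pd

# Hypothetical data shaped like a scraped table (placeholder values only).
df = pd.DataFrame(
    {
        "State": ["State A", "State B"],
        "County": ["County 1", "County 2, County 3"],
        "Limit": ["$100,000", "$200,000"],
    }
)

# Turn each multi-county cell into a list, then give every list item its own row.
df["County"] = df["County"].str.split(",")
df = df.explode("County", ignore_index=True)

# Remove the whitespace left behind by the split.
df["County"] = df["County"].str.strip()

# Literal (non-regex) replacements, then convert the cleaned text to integers.
df["Limit"] = (
    df["Limit"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df)
```

Passing regex=False matters here because "$" is a special character in regular expressions; treating it as a literal avoids surprising matches.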
The data is now ready for analysis!