How to scrape table data from a website

“Scraping” website data is a technique for gathering information from a website programmatically, without having to copy and paste it by hand.

Pandas offers an easy way to scrape data presented in tables: its read_html() function, which looks for table tags in a page’s HTML and returns their contents. (More information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)

To illustrate this, let’s start off with an example. We will scrape mortgage loan borrowing limits from Ally’s website: www.ally.com/do-it-right/home/jumbo-loan-limits/

To verify that the state data is nested inside a table tag, we take a look at the HTML code:

The data is contained within an HTML table.

Having verified this, the actual code to scrape the table is simple:

read_html(): scraped table data
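
As a minimal sketch of that call, the snippet below parses a small inline HTML string rather than the live Ally page, so it runs offline. The sample rows and column names are invented for illustration and are not the real loan limits; in practice you would pass the page URL to read_html() instead. It also assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page's HTML; the values are made up.
SAMPLE_HTML = """
<table>
  <tr><th>State</th><th>County</th><th>Limit</th></tr>
  <tr><td>Alaska</td><td>All counties</td><td>$100</td></tr>
  <tr><td>Hawaii</td><td>All counties</td><td>$200</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
print(type(tables), len(tables))
print(tables[0])
```

Wrapping the string in StringIO keeps the example compatible with recent pandas versions, which no longer encourage passing literal HTML directly.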

You’ll notice the result is a list. read_html() returns one DataFrame per table it finds, so we use [0] to access the first table on the page. If there had been multiple tables, we could have used [1], [2], etc. to access the others.

Notice the difference: without an index, the result is a list in which each element is a table. With the index, the result is a single Pandas DataFrame.
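
The list-versus-DataFrame distinction can be sketched with a page that holds two tables. The HTML below is a made-up example, not Ally’s markup, and the match argument is shown only as one optional way to narrow the result to tables containing a given piece of text.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the contents are invented.
TWO_TABLES = """
<table>
  <tr><th>State</th><th>Limit</th></tr>
  <tr><td>Alaska</td><td>100</td></tr>
</table>
<table>
  <tr><th>Term</th><th>Rate</th></tr>
  <tr><td>30 years</td><td>3.5</td></tr>
</table>
"""

tables = pd.read_html(StringIO(TWO_TABLES))  # a list of DataFrames
first = tables[0]    # DataFrame for the first table
second = tables[1]   # DataFrame for the second table

# Optionally keep only the tables whose text matches a string or regex.
rate_tables = pd.read_html(StringIO(TWO_TABLES), match="Rate")

print(type(tables).__name__, type(first).__name__)
print(second)
```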

Bonus: formatting cell content

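Having successfully scraped data from a website, let’s look at some data transformations to make it easier to explore. First, we notice some rows have multiple counties on one line. To assign one row to each county, we’ll use the DataFrame explode() method. It expects list-like cell values, so a cell that holds several counties as a single string has to be split into a list first.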

exploded DataFrame
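
A rough sketch of that step follows, using a small invented DataFrame in place of the scraped one. The column names and the assumption that counties are separated by a comma and a space are illustrative; the real page may use a different delimiter.

```python
import pandas as pd

# Invented rows: the second one lists two counties in a single cell.
df = pd.DataFrame(
    {
        "State": ["Alaska", "Colorado"],
        "County": ["All counties", "Adams, Arapahoe"],
        "Limit": ["$100", "$200"],
    }
)

# explode() works on list-like cells, so split each string into a list.
# The ", " separator is an assumption about how the cells are written.
df["County"] = df["County"].str.split(", ")

# One row per county; the other columns are repeated for each new row.
exploded = df.explode("County").reset_index(drop=True)
print(exploded)
```

Resetting the index afterwards is optional; without it, the rows produced from one original row share that row’s index label.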

Finally, we’ll perform a couple of simple string replacements:
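
The exact replacements depend on how the scraped cells look. As one plausible sketch, assuming the limits arrive as strings with a dollar sign and thousands separators, the snippet below strips both and converts the column to integers. The DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Invented sample; the real values come from the scraped table.
df = pd.DataFrame(
    {
        "County": ["Adams", "Arapahoe"],
        "Limit": ["$1,000", "$2,500"],
    }
)

# regex=False treats "$" as a literal character, not a regex anchor.
df["Limit"] = (
    df["Limit"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(df)
```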

The data is now ready for analysis!
