Multi-Source Web Scraping

A project that scrapes data from multiple sources, including APIs and regular web pages.

Overview

This project addresses an interesting request: scraping multiple sources that have different structures and combining the results into a single data output (an Excel spreadsheet).

Implementation

The entire project was written in Python, using a few Python packages:

  • Pandas
  • Selenium
  • Requests
  • BeautifulSoup

Features

The framework was built to make adding new sources easy: each source is handled by a single script file of its own.

The main app (parent) automatically triggers the source scripts (children) and combines their outputs into a single Excel spreadsheet.
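The parent/child arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: in the project each child lived in its own script file, while here plain functions in a registry stand in for those files. The names `register_source`, `run_all`, `example_api`, and `example_page`, as well as the sample rows, are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Registry mapping a source name to the function that scrapes it.
# (Assumption: stands in for discovering one script file per source.)
SOURCES: Dict[str, Callable[[], List[Record]]] = {}


def register_source(name: str):
    """Decorator that adds a child scraper to the registry."""
    def decorator(func: Callable[[], List[Record]]):
        SOURCES[name] = func
        return func
    return decorator


@register_source("example_api")
def scrape_example_api() -> List[Record]:
    # Hypothetical child: a real one would call an API, e.g. with Requests.
    return [{"name": "Jane Doe", "city": "Austin", "state": "TX"}]


@register_source("example_page")
def scrape_example_page() -> List[Record]:
    # Hypothetical child: a real one would parse HTML, e.g. with BeautifulSoup.
    return [{"name": "John Roe", "city": "Denver", "state": "CO"}]


def run_all() -> List[Record]:
    """Parent: run every registered child and merge the rows."""
    combined: List[Record] = []
    for name, scrape in SOURCES.items():
        for row in scrape():
            # Tag each row with the source it came from.
            combined.append({"source": name, **row})
    return combined


rows = run_all()
```

With this shape, adding a source means adding one more registered function (or, in the file-per-source layout, one more script). The combined rows could then be written out with something like `pandas.DataFrame(rows).to_excel("output.xlsx", index=False)`, where the file name is a placeholder.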

The project scraped multiple publicly available websites and captured the following information:

  • Names
  • City and State
  • Date of Birth
  • Date of Death
  • Descriptions
  • Organization information
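Because the sources have different structures, one way to line them up is to map each source's fields onto a shared record type before combining. The sketch below assumes such a common schema; the `Entry` class, the `from_raw` helper, the field names, and the sample record are all hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Entry:
    """Common record shape; field names are illustrative assumptions."""
    name: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    date_of_birth: Optional[str] = None
    date_of_death: Optional[str] = None
    description: Optional[str] = None
    organization: Optional[str] = None


def from_raw(raw: dict, field_map: dict) -> Entry:
    """Build an Entry from a source-specific dict.

    field_map maps Entry field names to the keys used by that source.
    Keys missing from the raw record become None.
    """
    values = {target: raw.get(source_key) for target, source_key in field_map.items()}
    return Entry(**values)


# Hypothetical raw record from one source, and that source's mapping.
raw = {"full_name": "Jane Doe", "town": "Austin", "st": "TX", "born": "1950-01-01"}
field_map = {
    "name": "full_name",
    "city": "town",
    "state": "st",
    "date_of_birth": "born",
}

entry = from_raw(raw, field_map)
row = asdict(entry)  # plain dict, ready to become one spreadsheet row
```

Fields a source does not provide stay `None`, so every row has the same columns regardless of which source produced it.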

Execution

  • The project was scheduled to run every week on a remote machine.
  • The output Excel spreadsheet was then pushed to a shared folder (SharePoint) for the client to access.
  • The average run time of a single triggered execution was approximately 2-3 hours.
  • Doing the same task manually was estimated to take about 2 minutes per entry, with about a million entries to process.
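To put the manual estimate in perspective, the arithmetic can be worked through directly. The figures of 2 minutes per entry and one million entries come from the estimate above; the 8-hour workday used for the last step is an assumption added here for illustration.

```python
MINUTES_PER_ENTRY = 2        # from the estimate above
ENTRIES = 1_000_000          # "about a million entries"
HOURS_PER_WORKDAY = 8        # assumption, not from the project

total_minutes = MINUTES_PER_ENTRY * ENTRIES
total_hours = total_minutes / 60
working_days = total_hours / HOURS_PER_WORKDAY

print(f"{total_minutes:,} minutes")
print(f"{total_hours:,.0f} hours")
print(f"{working_days:,.0f} working days")
```

That comes to roughly 33,000 hours of manual effort, compared with an automated run of approximately 2-3 hours.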