Multi-Source Web Scraping
A project that scrapes data from multiple sources, including APIs and standard web pages.
Overview
This project addresses the request to scrape multiple sources with different structures and combine the results into a single data output (an Excel spreadsheet).
Implementation
The project was written entirely in Python, built on the following packages (a minimal scraping sketch follows the list):
- Pandas
- Selenium
- Requests
- BeautifulSoup
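For the static-page path, the sketch below shows the Requests + BeautifulSoup pattern; the URL, CSS selectors, and field names are illustrative assumptions, not the project's actual sources.

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    """Fetch one static page and extract a record per listing element."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    for entry in soup.select("div.entry"):  # hypothetical CSS selector
        name = entry.select_one("h2.name")
        summary = entry.select_one("p.summary")
        records.append({
            "name": name.get_text(strip=True) if name else None,
            "description": summary.get_text(strip=True) if summary else None,
        })
    return records
```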
Features
The framework was designed to make adding new sources easy: each source is handled by a single, self-contained script file, as in the sketch below.
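A hypothetical child source script might look like the following; the module layout, COLUMNS schema, and scrape() contract are assumptions used for illustration, not the project's actual code.

```python
import pandas as pd

# Shared column schema that every source script emits.
COLUMNS = ["name", "city", "state", "date_of_birth",
           "date_of_death", "description", "organization"]

def scrape() -> pd.DataFrame:
    """Return this source's records in the shared column schema."""
    rows = []  # filled in by source-specific Requests/Selenium logic
    return pd.DataFrame(rows, columns=COLUMNS)
```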
The main (parent) app automatically triggers the (child) source scripts and combines the outputs into a single Excel spreadsheet, as sketched below.
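A minimal sketch of such a parent runner, assuming the child scripts live as modules in a sources/ package and each exposes a scrape() function returning a DataFrame:

```python
import importlib
import pkgutil

import pandas as pd

import sources  # hypothetical package holding one module per source

def run_all(output_path: str = "combined_output.xlsx") -> None:
    """Run every source module and write the combined result to Excel."""
    frames = []
    for module_info in pkgutil.iter_modules(sources.__path__):
        module = importlib.import_module(f"sources.{module_info.name}")
        frames.append(module.scrape())  # each child returns a DataFrame
    combined = pd.concat(frames, ignore_index=True)
    combined.to_excel(output_path, index=False)  # requires openpyxl

if __name__ == "__main__":
    run_all()
```

Keeping every source behind the same scrape() contract means adding a new source is just a matter of dropping a new module into the package.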
The project scraped multiple publicly available websites and captured the following information (a sketch of one record model follows the list):
- Names
- City and State
- Date of Birth
- Date of Death
- Descriptions
- Organization information
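One way to model a captured record; the field names mirror the list above, and the types are assumptions, with dates left as None where a source omits them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Record:
    name: str
    city: str
    state: str
    date_of_birth: Optional[date]
    date_of_death: Optional[date]
    description: str
    organization: str
```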
Execution
- The project was scheduled to run weekly on a remote machine.
- The output Excel spreadsheet was then pushed to a shared SharePoint folder for the client to access (a sketch of the publish step follows this list).
- A single triggered execution took approximately 2-3 hours on average.
- Performing the same task manually was estimated at about 2 minutes per entry; with roughly a million entries to process, that works out to around 33,000 hours of manual effort.
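A minimal sketch of the publish step, assuming the SharePoint library is synced to a local folder (for example via the OneDrive client); the paths are placeholders rather than the project's actual locations.

```python
import shutil
from pathlib import Path

def publish(output_file: str = "combined_output.xlsx") -> None:
    """Copy the combined spreadsheet into the synced SharePoint folder."""
    shared = Path.home() / "SharePoint" / "client-shared"  # hypothetical path
    shared.mkdir(parents=True, exist_ok=True)
    shutil.copy2(output_file, shared / output_file)
```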