The Source step aims to collect data from three types of sources:
- Web scraping
- REST APIs
- CSV files
In scope:
- Web scraping module
- REST APIs module
Out of scope:
- CSV files module
The result of the Source step is JSON. A sample file based on the Twitter API contains the following fields:
- header: input from web scraping
- api: input from web scraping
- content: input from web scraping
- timestamp: metadata - date and time when the scraping happened
- source: metadata - the webpage that got scraped
The output does not need to be saved to disk. JSON gets passed on to the Transform step in memory.
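As a rough illustration of that record shape, the sketch below builds one result with the five fields listed above and keeps it in memory. The function name `build_record`, the placeholder values, and the choice of an ISO 8601 UTC timestamp are assumptions made for the example; the notes do not specify them.

```python
import json
from datetime import datetime, timezone


def build_record(header, api, content, source):
    """Combine collected input with the two metadata fields (hypothetical helper)."""
    return {
        "header": header,
        "api": api,
        "content": content,
        # Assumption: ISO 8601 in UTC; the notes only say "date and time".
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }


# Placeholder values, not real scraped data.
record = build_record(
    header="Example page title",
    api="",
    content="Example body text",
    source="https://example.com/page",
)

# Serialize in memory only; nothing is written to disk.
payload = json.dumps(record)
```

The `payload` string (or the `record` dict itself) could then be handed directly to whatever function implements the Transform step.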
The command line takes a parameter that either
- points to a folder (and then runs every config file in the folder) or
- points to a specific config file (and runs only that file)
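The folder-or-file behaviour could be sketched as follows. This is only one possible reading: the names `resolve_configs` and `main`, the positional argument `target`, and the decision to treat every regular file in the folder as a config file are assumptions, since the notes do not define the config file format or naming.

```python
import argparse
from pathlib import Path


def resolve_configs(target):
    """Return the config files to run for a folder or single-file argument."""
    path = Path(target)
    if path.is_dir():
        # Folder: run every regular file in it, in a stable order.
        return sorted(p for p in path.iterdir() if p.is_file())
    if path.is_file():
        # Specific file: run only that file.
        return [path]
    raise FileNotFoundError(f"No such config file or folder: {target}")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run the Source step")
    parser.add_argument("target", help="config file or folder of config files")
    args = parser.parse_args(argv)
    for config in resolve_configs(args.target):
        # Placeholder: the real step would load and execute the config here.
        print(f"would run {config}")
```

Calling `main(["configs"])` with a hypothetical `configs` folder would list every file in it, while `main(["configs/twitter.json"])` would list only that one file.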
The logic for each data source in the Source step is stored in a config file for that data source.
Out of scope