A news crawler for BBC News, Reuters and the New York Times.
- For BBC: the news comes from an archive of BBC front pages (https://www.bbc.com/news), starting from 2015/07/01. This archive is collected by @dracos. Please refer to his website: https://dracos.co.uk/made/bbc-news-archive/archive.php.
- For Reuters: the original archive website has been disabled. The new website https://www.reuters.com/news/archive has only a limited number of historical articles (starting from 2020/03/08), so I no longer update the Reuters code. I have kept it as an example, in case you want to implement your own Reuters crawler.
- python3
- configobj
- dateutil
- requests
- bs4
- goose3
pip install -r requirements.txt
- xxx_crawler: the executable script that crawls news.
- xxx.cfg: configuration for the crawler, including the API, time range, storage path, etc.
- xxx_link.py: fetches download links.
- xxx_article: extracts the content and some metadata of one news article.
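The division of labour between these files can be pictured as a small pipeline. The following is only a sketch, not the repository's code: `fetch_links`, `extract_article`, `crawl` and `FAKE_PAGES` are hypothetical names, and the network calls are replaced by in-memory stand-ins so that the example runs offline.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for downloaded pages; a real crawler would fetch these.
FAKE_PAGES = {
    "https://example.com/a": "<h1>Title A</h1><p>Body A</p>",
    "https://example.com/b": "<h1>Title B</h1><p>Body B</p>",
}


def fetch_links(day: str) -> List[str]:
    """Role of xxx_link.py: return the article URLs for one day."""
    return sorted(FAKE_PAGES)


def extract_article(url: str) -> Dict[str, str]:
    """Role of xxx_article: turn one URL into content plus metadata."""
    html = FAKE_PAGES[url]
    title = html.split("<h1>")[1].split("</h1>")[0]
    content = html.split("<p>")[1].split("</p>")[0]
    return {"url": url, "title": title, "content": content}


def crawl(
    days: Iterable[str],
    link_fetcher: Callable[[str], List[str]],
    article_extractor: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Role of xxx_crawler: drive the two steps for every day in the range."""
    articles = []
    for day in days:
        for url in link_fetcher(day):
            articles.append(article_extractor(url))
    return articles


articles = crawl(["2020-03-08"], fetch_links, extract_article)
```

Passing the two steps in as callables keeps the driver independent of any one news source, which is one possible way to read the layout described above.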
python bbc_crawler.py settings/bbc.cfg
python reuters_crawler.py settings/reuters.cfg
python nytimes_crawler.py settings/nytimes.cfg
Modify reuters.cfg, nytimes.cfg and bbc.cfg in the settings folder; the main configuration items are start_date, end_date and path.
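As an illustration of how start_date and end_date might bound a crawl, here is a minimal sketch that expands the two values into a list of days. It makes several assumptions that the configuration files may not share: dates written as YYYY-MM-DD, an inclusive end date, and a plain dict standing in for the loaded settings. The helper `iter_dates` and the path value `data/bbc` are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterator


def iter_dates(start_date: str, end_date: str, fmt: str = "%Y-%m-%d") -> Iterator[date]:
    """Yield every day from start_date to end_date, both inclusive."""
    start = datetime.strptime(start_date, fmt).date()
    end = datetime.strptime(end_date, fmt).date()
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Hypothetical settings, standing in for values read from a .cfg file.
settings = {
    "start_date": "2020-02-27",
    "end_date": "2020-03-01",
    "path": "data/bbc",
}

days = list(iter_dates(settings["start_date"], settings["end_date"]))
```

Checking the range up front means a reversed start_date and end_date fails immediately instead of producing an empty crawl.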
To add another news source, add files that follow the same architecture and extend the base class in each folder. Some methods may need to be rewritten.
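Extending a base class for a new source might look roughly like the sketch below. The class and method names (`BaseLinkFetcher`, `ExampleLinkFetcher`, `_build_url`, `_download`, `_parse_links`) are hypothetical and chosen for illustration; the repository's actual base classes may be organized differently. The download step returns a canned page so the example runs without network access.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class BaseLinkFetcher(ABC):
    """Hypothetical base class: the shared flow lives here."""

    def fetch(self, day: str) -> List[str]:
        # Common steps: build the archive URL, download it, parse out links.
        html = self._download(self._build_url(day))
        return self._parse_links(html)

    @abstractmethod
    def _build_url(self, day: str) -> str:
        """Return the archive page URL for one day."""

    @abstractmethod
    def _download(self, url: str) -> str:
        """Return the HTML of the given URL."""

    @abstractmethod
    def _parse_links(self, html: str) -> List[str]:
        """Return the article links found in the HTML."""


class ExampleLinkFetcher(BaseLinkFetcher):
    """A new source only overrides the source-specific methods."""

    def _build_url(self, day: str) -> str:
        return f"https://news.example.com/archive/{day}"

    def _download(self, url: str) -> str:
        # Canned response for illustration; a real fetcher would issue an HTTP request.
        return '<a href="/story/1">One</a><a href="/story/2">Two</a>'

    def _parse_links(self, html: str) -> List[str]:
        return re.findall(r'href="([^"]+)"', html)


links = ExampleLinkFetcher().fetch("2020-03-08")
```

With this shape, the methods that "may need to be rewritten" are exactly the abstract ones, while the overall fetch flow is inherited unchanged.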