A reference implementation for processing the content.log files found at opendata.dwd.de/weather.
Example usage:
# CONTENT_LOG_URL and PATTERN must be set beforehand; quoting guards against word splitting.
LAST_RUN_AT=$(date -ud 00:00 -Ihours)
wget "$CONTENT_LOG_URL" -O content.log.bz2
bzgrep "$PATTERN" content.log.bz2 > my_content.log
./get_updated_files.py -b "$CONTENT_LOG_URL" -u "$LAST_RUN_AT" my_content.log > updated_files.txt
wget -i updated_files.txt
Running the commands above downloads all updated files into the current working directory. The intermediate file
updated_files.txt holds hyperlinks to the files that have been updated since the given date-time, according
to each file's modification date recorded in content.log.
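The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the code of get_updated_files.py: it assumes each content.log line has the layout path|size|modification time with an ISO 8601 timestamp, and the function name updated_urls and the sample entries are made up for the example.

```python
from datetime import datetime
from urllib.parse import urljoin


def updated_urls(lines, updated_since, url_base):
    """Yield absolute URLs of entries modified after updated_since."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: <relative path>|<size in bytes>|<modification time>
        path, _size, changed_at = line.rsplit("|", 2)
        if datetime.fromisoformat(changed_at) > updated_since:
            # Resolve the relative path against the URL of content.log.bz2
            yield urljoin(url_base, path)


# Hypothetical sample entries, for illustration only
sample = [
    "./alerts/a.zip|100|2024-01-01T23:50:00",
    "./alerts/b.zip|200|2024-01-02T00:10:00",
]
urls = list(
    updated_urls(
        sample,
        datetime(2024, 1, 2),
        "https://opendata.dwd.de/weather/content.log.bz2",
    )
)
print(urls)
```

Passing the URL of content.log.bz2 as the base mirrors the advice given for --url-base: relative paths are resolved against the directory that contains the log.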
Also note that there are multiple servers behind https://opendata.dwd.de, which might not be exactly in sync with each
other regarding file modification timestamps. See the code of get_updated_files.py for a suggestion on how to
deal with that.
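One way to tolerate small timestamp differences between servers is to require a safety margin before treating a file as updated. The sketch below is one reading of the --min-delta help text, not necessarily what get_updated_files.py does; the function name is_updated and the inclusive comparison are assumptions.

```python
from datetime import datetime, timedelta


def is_updated(changed_at, updated_since, min_delta=60):
    """True if changed_at is at least min_delta seconds newer than updated_since."""
    return changed_at - updated_since >= timedelta(seconds=min_delta)


last_run = datetime(2024, 1, 2, 0, 0, 0)
# 30 seconds newer: within the margin, so not reported as updated
print(is_updated(datetime(2024, 1, 2, 0, 0, 30), last_run))
# 60 seconds newer: reaches the margin, so reported as updated
print(is_updated(datetime(2024, 1, 2, 0, 1, 0), last_run))
```

A larger margin lowers the chance of downloading a file again just because another server reports a slightly later timestamp, at the cost of possibly missing a genuine update made within the margin.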
While this program relies on the file modification timestamps dumped into content.log.bz2, it might be more feasible
to process the data reference time that is contained in the filenames instead.
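A rough sketch of that alternative follows. It assumes, purely for illustration, that the reference time appears in the filename as a standalone 10-digit YYYYMMDDHH group; actual naming schemes vary between datasets, so the pattern and the sample filename are hypothetical and need to be adapted.

```python
import re
from datetime import datetime

# Hypothetical pattern: exactly ten consecutive digits, read as YYYYMMDDHH
REFERENCE_TIME = re.compile(r"(?<!\d)(\d{10})(?!\d)")


def reference_time(filename):
    """Return the reference time encoded in filename, or None if absent."""
    match = REFERENCE_TIME.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H")
    except ValueError:
        # Ten digits that do not form a valid date and hour
        return None


print(reference_time("example_2024010200_000.grib2.bz2"))
```

Comparing reference times avoids depending on server-side modification timestamps altogether, although it only works for datasets whose filenames carry such a time.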
$ ./get_updated_files.py --help
usage: get_updated_files.py [-h] --updated-since UPDATED_SINCE [--url-base URL_BASE]
[--min-delta MIN_DELTA] [--version]
Filters paths of a DWD Open Data content.log file for entries that have been updated.
positional arguments:
CONTENT_LOG_FILE The decompressed content.log file (default: STDIN)
optional arguments:
-h, --help show this help message and exit
--updated-since UPDATED_SINCE, -u UPDATED_SINCE
last time files were checked for updates
--url-base URL_BASE, -b URL_BASE
resolve the paths taken from content.log relative to the given
base URL; put the URL of the content.log.bz2 here to end up with
correct hyperlinks to DWD's Open Data
--min-delta MIN_DELTA, -d MIN_DELTA
minimum number of seconds a file needs to be younger than
UPDATED_SINCE (default: 60)
--version show program's version number and exit