Please read this carefully as the download approach has been changed slightly.
Python library for Dewey Data Inc.
Find the release notes here: Release Notes.
Bug report: https://community.deweydata.io/c/help/python/43.
Explore data at https://www.deweydata.io/.
Underlying Amplify API tutorial: https://github.com/amplifydata/amplifydata-public/blob/main/README.md
In the system, click Connections → Add Connection to create your API key.
As the message says, please make a copy of your API key and store it somewhere safe. Also, please hit the Save button before use.
Choose your product and click Get / Subscribe → Connect to API; then you can get the API endpoint (product path). Make a copy of it.
You can install this library directly from the GitHub source as follows.
pip install deweydatapy@git+https://github.com/Dewey-Data/deweydatapy
If you use PyCharm, [Python Packages] → [Add Package] → [From Version Control] → Select [Git] and input
https://github.com/Dewey-Data/deweydatapy
# Use deweydatapy library
import deweydatapy as ddp
The `deweydatapy` package has the following functions:
- `get_meta`: gets meta information of the dataset, especially the date range, returned in a `dict`
- `get_file_list`: gets the list of files in a `DataFrame`
- `download_files`: downloads files from the file list to a destination folder
- `download_files0`: downloads files with an API key and product path to a destination folder
- `download_files1`: downloads files with an API key and product path to a destination folder (see the Examples below for the difference between `download_files0` and `download_files1`)
- `read_sample`: reads a sample of data for a file download URL
- `read_sample0`: reads a sample of data for the first file with an API key and product path
- `read_local`: reads data from a locally saved csv.gz file
I am going to use Advan Weekly Patterns as an example.
import deweydatapy as ddp
# API Key
apikey_ = "Paste your API key from step 1 here."
# Advan product path
pp_advan_wp = "Paste product path from step 2 here."
You will have only one API key, but a different product path for each product.
As a first step, check out the meta information of the dataset by
meta = ddp.get_meta(apikey_, pp_advan_wp, print_meta = True)
This will return a `DataFrame` with meta information. `print_meta = True` will print the meta information.
You can see that the data has a partition column, DATE_RANGE_START. Dewey data is usually huge, and the data is partitioned by this column into multiple files. We can also see that the minimum available date is 2018-01-01 and the maximum available date is 2024-01-08.
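Conceptually, a date-range request selects only the files whose partition key falls inside the requested window. Here is a minimal sketch of that selection logic; the partition dates below are made up for illustration and are not from the actual dataset:

```python
from datetime import date

# Made-up weekly partition keys (DATE_RANGE_START values)
partition_keys = [
    date(2023, 8, 28),
    date(2023, 9, 4),
    date(2023, 9, 11),
    date(2024, 1, 1),
]

start_date = date(2023, 9, 3)
end_date = date(2023, 12, 31)

# Keep only the partitions inside the requested window
selected = [d for d in partition_keys if start_date <= d <= end_date]
print(selected)  # only the 2023-09-04 and 2023-09-11 partitions remain
```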
After checking this, I will download data between 2023-09-03 and 2023-12-31.
Next, collect the list of files to download by
files_df = ddp.get_file_list(apikey_, pp_advan_wp,
start_date = '2023-09-03',
end_date = '2023-12-31',
print_info = True)
Be careful!! ------------------------------------------------
For a selected date range, the download server assigns a file number (0, 1, 2, ...) to each file. Thus, if you use different date ranges (different `start_date` and `end_date`), the file names will change because the numbering changes.
For example, the following `files_df1` and `files_df2` will have different file names due to their different `start_date`.
files_df1 = ddp.get_file_list(apikey_, pp_advan_wp,
start_date = '2023-09-03',
end_date = '2023-12-31',
print_info = True)
files_df2 = ddp.get_file_list(apikey_, pp_advan_wp,
start_date = '2023-07-01',
end_date = '2023-12-31',
print_info = True)
This also applies to the functions `download_files0` and `download_files1` (demonstrated below) in the same way.
------------------------------------------------------------
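To see why the caution above matters, here is a toy sketch of the server-side file numbering. The name pattern and partition dates are purely illustrative, not the server's actual format:

```python
def number_files(partitions, start, end):
    """Mimic the server-side numbering: files get 0-based numbers
    within the selected date range only."""
    selected = [p for p in partitions if start <= p <= end]
    return {p: f"data-{i}-DATE_RANGE_START-{p}.csv.gz"
            for i, p in enumerate(selected)}

# Made-up weekly partitions (ISO date strings compare correctly)
weeks = ["2023-07-03", "2023-09-04", "2023-09-11"]

names1 = number_files(weeks, "2023-09-03", "2023-12-31")
names2 = number_files(weeks, "2023-07-01", "2023-12-31")

# The same 2023-09-04 partition is file 0 in the first range
# but file 1 in the second, so its file name differs.
print(names1["2023-09-04"])
print(names2["2023-09-04"])
```

This is why you should re-run `get_file_list` rather than reuse file names saved from an earlier query with a different date range.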
If you do not specify `start_date`, all files from the minimum available date will be collected; if you do not specify `end_date`, all files up to the maximum available date will be collected. Most Dewey datasets are very large, so please specify both `start_date` and `end_date`.
`print_info = True` prints additional meta information about the files, like below:
`files_df` contains the file links (a `DataFrame`) with the following information:
- `index`: file index, ranging from 1 to the number of files
- `page`: page of the file
- `link`: file download link
- `partition_key`: for subselecting files based on date
- `file_name`
- `file_extension`
- `file_size_bytes`
- `modified_at`
Finally, you can download the data to a local destination folder by
ddp.download_files(files_df, "C:/Temp", skip_exists = True)
This will download the files to the C:/Temp directory, with the following progress messages.
If you attempt to download all the files again and want to skip files that have already been downloaded, set `skip_exists = True`. The default value is `False` (the default was `True` in versions 0.1.x).
You can also use the `filename_prefix` option to add a file name prefix to all the files. For example, the following will save all the files in the format advan_wp_xxxxxxx.csv.gz.
ddp.download_files(files_df, "C:/Temp", filename_prefix = "advan_wp_", skip_exists = True)
Alternatively, you can skip `get_file_list` and download files directly by
ddp.download_files0(apikey_, pp_advan_wp, "C:/Temp",
start_date = '2023-09-03', end_date = '2023-12-31')
or
ddp.download_files1(apikey_, pp_advan_wp, "C:/Temp",
start_date = '2023-09-03', end_date = '2023-12-31')
The difference between `download_files0` and `download_files1` is that `download_files0` collects the whole file list (links) upfront and then starts downloading. Since the links are valid for 24 hours, this may cause an interruption if the download takes more than 24 hours. `download_files1`, on the other hand, collects a small page (group) of file links, downloads them, then moves on to the next page, and so on. This keeps the collected links valid while downloading. It is therefore recommended to use `download_files1` for a large number of files that may take over 24 hours to download.
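The two strategies can be sketched as follows. This is a conceptual illustration with hypothetical `fetch_page` and `download` callbacks, not the library's actual code:

```python
def download_all_upfront(fetch_page, download, num_pages):
    """download_files0-style: collect every link first, then download.
    Links fetched early may expire before a long run finishes."""
    links = []
    for page in range(num_pages):
        links += fetch_page(page)
    for link in links:
        download(link)

def download_page_by_page(fetch_page, download, num_pages):
    """download_files1-style: fetch one page of links, download it,
    then move on to the next page, keeping links fresh."""
    for page in range(num_pages):
        for link in fetch_page(page):
            download(link)
```

In the second version, no link is ever older than the time it takes to download one page of files, which is why it is the safer choice for multi-day downloads.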
Some datasets do not have a partition column because they are time invariant (for example, SafeGraph Global Places (POI) & Geometry).
meta = ddp.get_meta(apikey_, pp_sg_poipoly, print_meta = True)
There is no partition column, and the minimum and maximum dates are not available. In that case, you can download the data without specifying a date range.
files_df = ddp.get_file_list(apikey_, pp_sg_poipoly, print_info = True)
You can quickly load/see a sample data by
sample_df = ddp.read_sample(files_df['link'][0], nrows = 100)
This will load the first 100 rows of the first file in `files_df` (`files_df['link'][0]`). You can sample any file in the list the same way.
You can also see the sample of the first file by
sample_data = ddp.read_sample0(apikey_, pp_advan_wp, nrows = 100)
This will load the first 100 rows for the first file of Advan data.
You can open a downloaded local file (a csv.gz or csv file) by
sample_local = ddp.read_local("C:/Temp/Weekly_Patterns_Foot_Traffic_Full_Historical_Data-0-DATE_RANGE_START-2023-09-04.csv.gz",
nrows = 100)
Thanks