This workshop is intended for researchers who have experience analyzing data that comfortably fits in memory and are interested in scaling up to larger-than-memory datasets. The following topics will be covered: measuring performance and memory usage; sampling and the split-apply-combine strategy; data type optimization; efficient storage with Parquet; simple parallelization; introduction to Dask. Participants interested in following along will be provided with an example dataset and instructions on setting up a programming environment. All workshop materials will be publicly available in this GitHub repository. A prerequisite exercise should give you an idea of the expected prior knowledge of Python and pandas.
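To give a flavor of the split-apply-combine strategy applied to data that does not fit in memory, here is a minimal sketch in plain Python. It is not taken from the workshop notebooks, which use pandas and Dask; the function names `iter_chunks` and `chunked_group_sum` and the sample records are hypothetical. The idea is to read the data one chunk at a time, aggregate within each chunk, and merge the partial results.

```python
from collections import defaultdict
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield successive lists of at most `chunk_size` records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_group_sum(records, chunk_size=2):
    """Sum `employees` by `state`, holding only one chunk in memory at a time."""
    totals = defaultdict(int)
    for chunk in iter_chunks(records, chunk_size):
        # Apply: aggregate within the current chunk only.
        partial = defaultdict(int)
        for row in chunk:
            partial[row["state"]] += row["employees"]
        # Combine: merge the partial result into the running totals.
        for state, employees in partial.items():
            totals[state] += employees
    return dict(totals)


# Hypothetical records standing in for rows streamed from a large file.
rows = [
    {"state": "WI", "employees": 10},
    {"state": "CT", "employees": 5},
    {"state": "WI", "employees": 7},
    {"state": "CT", "employees": 1},
    {"state": "WI", "employees": 3},
]
result = chunked_group_sum(rows, chunk_size=2)
print(result)
```

Because sums can be combined across chunks, the result does not depend on the chunk size. Aggregates such as medians do not combine this way and need a different approach.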
If you are new to Python, I recommend reading about Jupyter and pandas. This book shows how to use Jupyter notebooks for teaching and learning, and QuantEcon lectures use Python for economics and finance and are also a good resource for beginners.
This workshop has been conducted by Anton Babkin at:
- Department of Agricultural and Resource Economics, University of Connecticut, January 27 and February 3, 2021
- Department of Agricultural and Applied Economics, University of Wisconsin-Madison, February 3 and 9, 2021
- 2021 Data Science Research Bazaar, University of Wisconsin-Madison, February 10, 2021
You need a running Jupyter server to work with the workshop notebooks. The easiest way is to launch a free cloud instance in Binder. A more difficult (but potentially more reliable) alternative is to create a conda Python environment on your local computer.
Click this link to launch a new Binder instance and connect to it from your browser, then open and run the setup notebook to test the environment and download data. Normal launch time is under 30 seconds, but it might take longer if the repository has been recently updated, because Binder will need to rebuild the environment from scratch.
Note that the Binder platform provides computational resources for free, so limitations are in place and availability cannot be guaranteed. Read here about the usage policy and available resources.
Setting up a local conda environment requires some experience or a readiness to read documentation. As a reward, you will have a persistent environment under your control that does not depend on cloud service availability. This is also a typical way to set up Python for data work.
- Download and install miniconda (Python 3), following the instructions for your operating system.
- Open a terminal (Anaconda Prompt on Windows) and clone this repository into a folder of your choice. Alternatively, download and unpack the repository code as a ZIP archive.
    git clone https://github.com/antonbabkin/ds-bazaar-workshop.git
- In the terminal, navigate to the repository folder and create a new conda environment. The environment specification will be read from the environment.yml file, and all required packages will be downloaded and installed.
    cd ds-bazaar-workshop
    conda env create -f binder/environment.yml
- Activate the environment and start the JupyterLab server. This will start a new Jupyter server and open the Jupyter interface in a browser window.
    conda activate ds-bazaar-workshop
    jupyter lab
- In Jupyter, open and run the setup notebook to test the environment and download data.
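If you want a quick sanity check of the activated environment before opening the setup notebook, something like the following sketch may help. It is not part of the repository, and the package list is an assumption based on the workshop topics; the authoritative list is in binder/environment.yml. The helper name `missing_packages` is hypothetical.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of top-level package `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; check binder/environment.yml for the real requirements.
assumed = ["pandas", "dask", "pyarrow", "jupyterlab"]
missing = missing_packages(assumed)
print("Python", sys.version.split()[0])
print("Missing packages:", missing or "none")
```

Run it with the `python` executable of the activated environment. A non-empty list of missing packages usually means the environment was not activated or was not created successfully.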
Run cells of the setup notebook to download data into your environment.
The core dataset used in the examples is synthetic, generated from annual historical snapshots of InfoGroup data. InfoGroup is a proprietary database of all businesses in the US, available to University of Wisconsin researchers.
The synthetic version (SynIG) provides a subset of core variables and was generated from the original data using a combination of random fake data, modeling, random sampling, record shuffling and noise infusion to protect the confidentiality of the original data. It has the same format and some resemblance to the original (e.g., the cross-sectional distribution of establishments and employment across states and sectors) and is suitable for educational purposes or methodology development, but cannot be used for analysis of actual businesses. Generation is described and performed in this notebook.
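As a toy illustration of two of the techniques named above, record shuffling and noise infusion, consider the sketch below. It is not the procedure used to generate SynIG, which is documented in the generation notebook. The function `synthesize`, the field names, the noise level and the sample records are all hypothetical.

```python
import random


def synthesize(records, noise_share=0.1, seed=0):
    """Toy disclosure protection: shuffle one column across records, then perturb another."""
    rng = random.Random(seed)
    # Record shuffling: break the link between a name and the other attributes.
    names = [record["name"] for record in records]
    rng.shuffle(names)
    synthetic = []
    for name, record in zip(names, records):
        # Noise infusion: multiplicative noise within +/- noise_share, at least 1 employee.
        factor = 1 + rng.uniform(-noise_share, noise_share)
        employees = max(1, round(record["employees"] * factor))
        synthetic.append({"name": name, "state": record["state"], "employees": employees})
    return synthetic


# Hypothetical establishments; none of these are real businesses.
original = [
    {"name": "Alpha Bakery", "state": "WI", "employees": 12},
    {"name": "Beta Motors", "state": "WI", "employees": 100},
    {"name": "Gamma Farms", "state": "CT", "employees": 40},
]
fake = synthesize(original)
for row in fake:
    print(row)
```

In this sketch the state of each record is kept as is and employment changes only slightly, so aggregate distributions by state stay close to the original, while an individual row no longer describes a real establishment.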
Project code is licensed under the MIT license.
The content and provided data are licensed under the Creative Commons Attribution 4.0 International license.