- 1 Fundamentals (Week 1)
- 2 Data manipulation with Pandas (Week 2)
- 2.1 (Optional) Review collections
- 2.2 A very brief introduction to NumPy
- 2.3 A very brief introduction to Pandas
- 2.4 (Optional) Where are we?
- 2.5 Reading tabular data into data frames
- 2.6 Data frames are objects that can tell you about their contents
- 2.7 Subsetting Data
- 2.8 Filtering (i.e. masking) data on contents
- 2.9 Working with missing data
- 2.10 Sorting and grouping
- 2.11 Write output
- 2.12 Working with multiple tables
- 2.13 (Optional) Text processing in Pandas
- 2.14 (Optional) Adding rows to DataFrames
- 2.15 (Optional) Scientific Computing Libraries
- 2.16 (Optional) Things we didn't talk about
- 2.17 (Optional) Pandas method chaining in the wild
- 2.18 (Optional) Introspecting on the DataFrame object
- 2.19 (Carpentries version) Group By: split-apply-combine
- 3 Building Programs (Week 3)
- 4 Visualization with Matplotlib and Seaborn (Week 4)
- 5 Special Topics
- 6 Endnotes
- Use the language that your friends use (so you can ask them for help)
- Use a language that has a community of practice for your desired use case (you can find documentation, bug reports, sample code, etc.)
- Use a language that is "best" by some technical definition
- "Glue" language intended to replace shell and Perl
- Concise, readable, good for rapid prototyping
- Access to linear algebra libraries in FORTRAN/C → user-friendly numeric computing
- General purpose, not just an academic language; we will spend more time on some of the general purpose aspects.
- Blend code, documentation, and visualization
- Good for trying things, demos
- Bad for massive or long-running processes
- You can export notebooks as .py files when they outgrow the notebook format
- Method 1
  - Open Anaconda Navigator
  - Run Jupyter Lab
- Method 2: Open Terminal (MacOS/Linux) or Anaconda Prompt (Windows)
cd Desktop/data
jupyter lab
- Navigate to where you want to be before creating new notebook
- Rename your notebook to something informative
- Use drag-and-drop interface to move .ipynb file to new location
- Execute cell with CTRL-Enter
3 + 7
- Execute cell and move to new cell with Shift-Enter
# This is a comment
print("hello")
- Cells can be formatted as Code or Markdown
- Many keyboard shortcuts are available; see https://gist.github.com/discdiver/9e00618756d120a8c9fa344ac1c375ac
- (Optional) Jupyter Lab understands (some) terminal commands
ls
- (Optional) Jupyter Lab (IPython, actually) has "magic" commands that start with % (line) or %% (cell)
# Print current items in memory
%dirs
# Get environment variables
%env
# Cell magic: Run bash in a subprocess
%%bash
# Cell magic: Time cell execution
%%time
- The magic command must be in the first line of the cell (no comments)
- Some commands are not available on Windows (e.g. %%bash)
Variables are names for values.
first_name = 'Derek'
age = 42
- Can only contain letters, digits, and underscore
- Cannot start with a digit
- Are case sensitive: age, Age, and AGE are three different variables
print(first_name, 'is', age, 'years old')
- Functions are verbs
- Functions end in ()
- Functions take arguments (i.e. they do stuff with the values that you give them)
- print() is useful for tracking progress and debugging
- Python will evaluate and echo the last item
first_name
age
- If you want to see multiple items, you should explicitly print them
print(first_name)
print(age)
# Prints an informative error message; more about this later
print(last_name)
print(age)
age = age + 3
print(age)
Order of operations matters!
first = 1
second = 5 * first
first = 2
# What will this print?
print('first:', first)
print('second:', second)
Most data is text and numbers, but there are many other types.
- Integers: whole numbers (counting)
- Floats: real numbers (math)
- Strings: text
- Files
- Various collections (lists, sets, dictionaries, data frames, arrays)
- More abstract stuff (e.g., database connection)
- Example 1: Subtraction makes sense for some kinds of data but not others
print(5 - 3)
print('hello' - 'h')
- Example 2: Some things have length and some don't. Note that we can put functions inside other functions!
print(len('hello'))
print(len(5))
- Variables point to values
print(type(53), type(age))
- There are many types
print(type(3.12), type("hello"), type(True), type([]))
- Python will (mostly) refuse to convert things automatically. You can explicitly convert data to a different type.
- Can't do math with text
1 + '2'
- If you have string data, you can explicitly convert it to numeric data…
print(1 + float('2'))
print(1 + int('2'))
- …and vice-versa
text = str(3)
print(text)
print(type(text))
- The exception is mathematical operations with integers and floats.
int_sum = 3 + 4
mixed_sum = 3 + 4.0
print(type(int_sum))
print(type(mixed_sum))
- What's going on under the hood? int, float, and str are types. More precisely, they are classes. int(), float(), and str() are functions that create new instances of their respective classes. The argument to the creation function (e.g., '2') is the raw material for creating the new instance.
- This can work for more complex data types as well, e.g. Pandas data frames and NumPy arrays; see the sketch below.
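For example, a minimal sketch (the values here are arbitrary): the same constructor pattern builds more complex objects from raw material.
import numpy as np
import pandas as pd
# list() builds a list; np.array() and pd.DataFrame() follow the same pattern
my_list = list('abc')
my_array = np.array([1, 2, 3])
my_df = pd.DataFrame({'numbers': [1, 2, 3]})
print(type(my_list), type(my_array), type(my_df))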
# Floor
print('5 // 3:', 5 // 3)
# Floating point
print('5 / 3:', 5 / 3)
# Modulus (remainder)
print('5 % 3:', 5 % 3)
print('before')
print()
print('after')
# By default, we round to the nearest integer
round(3.712)
# You can optionally specify the number of decimal places
round(3.712, 1)
- View the documentation for round()
help(round)
- 1 mandatory argument
- 1 optional argument with a default value: ndigits=None
- You can provide arguments implicitly by order, or explicitly in any order
# You can optionally specify the number of decimal places
round(4.712823, ndigits=2)
- Getting more help
- Python.org tutorial
- Standard library reference (we will discuss libraries in the next section)
- References section of this document
- Stack Overflow
- Use comments to add documentation to your own programs
# This line isn't executed by Python
print("This cell has many comments")  # The rest of this line isn't executed either
- Collect the results of a function in a new variable. This is one of the ways we build complex programs.
# You can optionally specify the number of decimal places
rounded_num = round(4.712823, ndigits=2)
print(rounded_num)
result = len("hello")
print(result)
- (Optional) Some functions only have "side effects"; they return None
result = print("hello")
print(result)
# print(type(result))
- max() and min() do the intuitively correct thing with numerical and text data
print(max(1, 2, 3))
print(min('a', 'A', '0'))  # sort order is 0-9, A-Z, a-z
- Mixed numbers and text aren't meaningfully comparable
max(1, 'a')
- Python reports a syntax error when it can't understand the source of a program
name = 'Bob
age = = 54
print("Hello world"
- Python reports a runtime error when something goes wrong while a program is executing
Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc. Extra credit: What is the final value of radiance?
radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))
https://docs.python.org/3/library/index.html
import math
print(math.pi)
print(math.cos(math.pi))
- Refer to things from the module as module-name.thing-name
- Python uses "." to mean "part of" or "belongs to".
help(math) # user friendly
dir(math) # brief reminder, not user friendly
- Import specific items from a library module. You want to be careful with this. It's safer to keep the namespace.
from math import cos, pi
cos(pi)
- Create an alias for a library module when importing it
import math as m
print(m.cos(m.pi))
import this
Lists are the central data structure in Python; we will explain many things by making analogies to lists.
fruits = ["apple", "banana", "cherry", "date", "elderberry", "fig"]
print(fruits)
print(len(fruits))
# First item
print(fruits[0])
# Fifth item
print(fruits[4])
- You slice a list from the start position up to, but not including, the stop position
print(fruits[0:3])
print(fruits[2:5])
- You can omit the start position if you're starting at the beginning…
# Two ways to get the first 5 items
print(fruits[0:5])
print(fruits[:5])
- …and you must omit the end position if you're going to the end (otherwise it's up to, but not including, the end!). This is useful if you don't know how long the list is:
# Everything but the first 3 items
print(fruits[3:])
- You can add an optional step interval (every 2nd item, every 3rd item, etc.)
# First 5 items, every other item
print(fruits[0:5:2])
# Every third item
print(fruits[::3])
cf. https://stackoverflow.com/a/11364711
- Slice endpoints are complements. In both cases, the number you see represents what you want to do.
# Get the first two items
print(fruits[:2])
# Get everything except the first two items
print(fruits[2:])
- For non-negative indices, the length of a slice is the difference of the indices
len(fruits[1:3]) == 2
Try these statements. What are they doing? Can you explain the differences in their behavior?
fruits[-1]
fruits[20]
fruits[-3:]
- You can count backwards from the end with negative integers
- Indexing beyond the end of the collection is an error
- You can replace a value at a specific index location
fruits[0] = "apricot"
print(fruits)
- Add an item to a list with append(). This is a method of the list (more on this later!)
fruits.append("grape")
print(fruits)
- Add the items from one list to another with extend()
more_fruits = ["honeydew", "imbe", "jackfruit"]
# Add all of the elements of more_fruits to fruits
fruits.extend(more_fruits)
print(fruits)
# Assessing the overall productivity of our wide receivers
receiving_yards = [450, 370, 870, 150]
mean_yards = sum(receiving_yards)/len(receiving_yards)
print(mean_yards)
- Use del to remove an item at an index location
print(more_fruits)
del more_fruits[1]
print(more_fruits)
- Use pop() to remove the last item and assign it to a variable. This is useful for destructive iteration.
f = fruits.pop()
print('Last fruit in list:', f)
print(fruits)
- You can put anything in a list
ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]
- (Optional) You could use this to manage complex data, but you shouldn't
# Get first pair
print(ages[0:2])
# Get all the names
print(ages[::2])
# Get all the ages
print(ages[1::2])
- You can put lists inside other lists
# List in our list
ages.append(more_fruits)
print(ages)
# The last item is a list
print(ages[-1])
# Get an item from that list
print(ages[-1][0])
Create a new list that contains all of the items from fruits in the reverse order.
rev_fruits = fruits[len(fruits)-1::-1]
print(rev_fruits)
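A more concise alternative that gives the same result: a negative step with both endpoints omitted walks the whole list backwards.
rev_fruits = fruits[::-1]
print(rev_fruits)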
Usually you don't need to find list items by index. What you actually want to do is go through each item in the list and use it for something.
"For each thing in this group, do these operations"
for fruit in fruits:
print(fruit)
- A for loop is made up of a collection, a loop variable, and a body
- The collection, fruits, is what the loop is being run on.
- The loop variable, fruit, is what changes for each iteration of the loop (i.e. the “current thing”)
- The body, print(fruit), specifies what to do for each value in the collection.
for fruit in fruits:
print(fruit)
for bob in fruits:
print(bob)
primes = [2, 3, 5]
for p in primes:
squared = p ** 2
cubed = p ** 3
print(p, squared, cubed)
We will learn how to vectorize this when we get to Numpy and Pandas
prime_exponents = []
for p in primes:
prime_exponents.append(p**2)
print(prime_exponents)
Get the total length of all the words in the fruits list.
total = 0
for f in fruits:
total = total + len(f)
print(total)
lengths = []
for f in fruits:
lengths.append(len(f))
print(sum(lengths))
sum(len(f) for f in fruits)
- Use range() to iterate over a sequence of numbers
for number in range(0, 3):
    print(number)
- range() produces numbers on demand (a "generator" function)
- useful for tracking progress
- Use enumerate() to iterate over a sequence of items and their positions
for number, fruit in enumerate(fruits):
    print(number, ":", fruit)
- Use functional programming idioms
- Comprehensions: generator, list, dictionary
- itertools library
- Test to see if an object is iterable
# Lists, dictionaries, and strings are iterable
hasattr(fruits, "__iter__")
# Integers are not iterable
hasattr(5, "__iter__")
- Strings are indexed like lists
# Use an index to get a single character from a string
fruit = "gooseberry"
print(fruit[0])
print(fruit[0:3])
- Strings have length
len(fruit)
- Can't change a string in place
fruit[0] = 'G'
- Solution: String methods create a new string
fruit_title = fruit.capitalize()
print(fruit_title)
bad_str1 = " Hello world! "
bad_str2 = "|...goodbye cruel world|"
good_str1 = bad_str1.strip()
good_str2 = bad_str2.strip("|")
print(good_str1, "\n", good_str2)
- An object packages data together with functions that operate on that data. This is a very common organizational strategy in Python.
sentence = "Hello world!"
# Call the swapcase method on the sentence object
print(sentence.swapcase())
- You can chain methods into processing pipelines
print(sentence.isupper())  # Check whether all letters are uppercase
print(sentence.upper())    # Capitalize all the letters
# The output of upper() is a string; you can use more string methods on it
sentence.upper().isupper()
- You can view an object's attributes (i.e. methods and fields) using help() or dir(). Some attributes are "private"; you're not supposed to use these directly.
# More verbose help
help(str)
# The short, short version
dir(sentence)
You want to iterate through the fruits list in a random order. For each randomly-selected fruit, capitalize the fruit and print it.
- Which standard library module could help you? https://docs.python.org/3/library/
- Which function would you select from that module? Are there alternatives?
- Try to write a program that uses the function.
import random
random.shuffle(fruits)
for f in fruits:
print(f.title())
random_fruits = random.sample(fruits, len(fruits))
for f in random_fruits:
print(f.title())
- Given this Python code…
print('string to list:', list('tin'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))
- What does list('some string') do?
- What does '-'.join(['x', 'y', 'z']) generate?
Dictionaries are sets of key/value pairs. Instead of being indexed by position, they are indexed by key.
ages = {'Derek': 42,
'Bill': 24,
'Susan': 37}
ages["Derek"]
- Update a pre-existing key with a new value
ages["Derek"] = 44
print(ages)
- Add a new key/value pair
ages["Beth"] = 19
print(ages)
- Does a key already exist?
"Derek" in ages
- (Optional) Does a value already exist (you generally don't want to do this; keys are unique but values are not)?
24 in ages.values()
print("Original dictionary", ages)
del ages["Derek"]
print("1st deletion", ages)
susan_age = ages.pop("Susan")
print("2nd deletion", ages)
print("Returned value", susan_age)
As with lists, you can put anything in a dictionary.
location = {'latitude': [37.28306, 'N'],
'longitude': [-120.50778, 'W']}
print(location['longitude'][0])
- Iterate over key/value pairs
for key, val in ages.items():
    print(key, ":", val)
- (Optional) You can iterate over keys and values separately
# Iterate over keys; you can also explicitly call .keys()
for key in ages:
    print(key)
# Iterate over values
for val in ages.values():
    print(val)
- (Optional) Iteration can be useful for unpacking complex dictionaries
for key, val in location.items():
    print(key, 'is', val[0], val[1])
- You have the following key/value pairs:
  - 'Derek': 42
  - 'Bill': 24
  - 'Susan': 37
- Create a dictionary that contains all of them. You may find the following useful:
help(dict)
help(zip)
names = ["Derek", "Bill", "Susan"]
ages = [42, 24, 37]
ages_dict = dict(zip(names, ages))
How can you convert our list of names and ages into a dictionary? Hint: You will need to populate the dictionary with a list of keys and a list of values.
# Starting data
ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]
# Get dictionary help
help({})
ages_dict = dict(zip(ages[::2], ages[1::2]))
- Tuples
- Sets
- Reference item by index/key
- Insert item by index/key
- Indices/keys must be unique
- Similar to lists: Reference item by index, have length
- Immutable, so need to use string methods
- '/'.join() is a very useful method
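For example, a small sketch (the path fragments are made up): join() glues a list of strings together with the separator string you call it on.
path = '/'.join(['data', 'processed', 'results.csv'])
print(path)  # data/processed/results.csv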
Introductory documentation: https://numpy.org/doc/stable/user/quickstart.html
- NumPy is the linear algebra library for Python
import numpy as np
# Create an array of random numbers
m_rand = np.random.rand(3, 4)
print(m_rand)
- Arrays are indexed like lists
print(m_rand[0,0])
- Arrays have attributes
print(m_rand.shape)
print(m_rand.size)
print(m_rand.ndim)
- Arrays are fast but inflexible: the entire array must be of a single type.
Don't use for loops with DataFrames or NumPy arrays. There is almost always a faster vectorized function that does what you want, as the sketch below shows.
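As a quick sketch of the difference (the array values are arbitrary):
import numpy as np
v = np.arange(5)
# Loop version: one Python-level operation per element (slow)
squares = []
for item in v:
    squares.append(item * item)
# Vectorized version: one NumPy operation on the whole array (fast)
print(v * v)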
x = np.arange(9)
y = np.arange(9)
print(x)
print(y)
- Operations are element-wise by default
print(x * y)
- Matrix-wise operations (e.g. dot product) use NumPy functions
# Use a special operator if it exists
print(x @ y)
# Otherwise, use a NumPy function
print(np.dot(x, y))
- You can rearrange the same array into different configurations
x1 = x.reshape(3,3)
x2 = x.reshape(9,1)
print(x1)
print(x2)
- (Optional) Matlab gotcha: 1-D arrays have no transpose
print(x)
print(x.T)
print(x.reshape(-1,1))
- Create a 3x3 matrix containing the numbers 0-8. Hint: Consult the NumPy Quickstart documentation here: https://numpy.org/doc/stable/user/quickstart.html
- Multiply the matrix by itself (element-wise).
- Multiply the matrix by its transpose.
- Divide the matrix by itself. What happens?
# Use method chaining to link actions together
x = np.arange(9).reshape(3,3)
print(x * x)
print(x * x.T)
print(x / x)
- Pandas is a library for working with spreadsheet-like data ("DataFrames")
- A DataFrame is a collection (dict) of Series columns
- Each Series is a 1-dimensional NumPy array with optional row labels (dict-like, similar to R vectors)
- Therefore, each Series inherits many of the abilities (linear algebra) and limitations (single data type) of NumPy; see the sketch below
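A minimal sketch of this relationship (the column names and values are made up):
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(type(df['a']))       # each column is a Series
print(df['a'].to_numpy())  # backed by a NumPy array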
import os
# print current directory
print("Current working directory:", os.getcwd())
# print all of the files and directories
print("Working directory contents:", os.listdir())
# Get 1 level of subdirectories
print("Just print the sub-directories:", sorted(next(os.walk('.'))[1]))
# Move down one directory
os.chdir("data")
print(os.getcwd())
# Move up one directory
os.chdir("..")
print(os.getcwd())
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
# Jupyter Lab will give you nice formatting if you echo
data
- File and directory names are strings
- You can use relative or absolute file paths
Rows are indexed by number by default (0, 1, 2, …). For convenience, we want to index by country:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
- By default, rows are indexed by position, like lists.
- Setting the index_col parameter lets us index rows by label, like dictionaries. For this to work, the index column needs to have unique values for every row.
- You can verify the contents of the CSV by double-clicking on the file in Jupyter Lab
- Main documentation link: https://pandas.pydata.org/docs/user_guide/index.html
- Pandas can read many different data formats: https://pandas.pydata.org/docs/user_guide/io.html
Data frames have methods (i.e. functions) that perform operations using the data frame's contents as input
- Use .info() to find out more about a data frame
data.info()
- Use .describe() to get summary statistics about data
data.describe()
- (Optional) Look at the first few rows
data.head(1)
A "field" is a variable that belongs to an object.
- The .index field stores the row Index (list of row labels)
print(data.index)
- The .columns field stores the column Index (list of column labels)
print(data.columns)
- The .shape field stores the matrix shape
print(data.shape)
- Use DataFrame.T to transpose a DataFrame. This doesn't copy or modify the data, it just changes the caller's view of it.
print(data.T)
print(data.T.shape)
# DataFrame type
type(data)
type(data.T)
# Series type
type(data['gdpPercap_1952'])
# Index type
type(data.columns)
- You can convert data between NumPy arrays, Series, and DataFrames
- You can read data into any of the data structures from files or from standard Python containers (a short sketch follows)
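A short sketch of these conversions (the labels and values are made up):
import numpy as np
import pandas as pd
arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr, columns=['x', 'y'])  # NumPy array -> DataFrame
back = df.to_numpy()                        # DataFrame -> NumPy array
s = pd.Series([10, 20], index=['a', 'b'])   # Python list -> Series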
- Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
- After reading the data for the Americas, use help(americas.head) and help(americas.tail) to find out what DataFrame.head and DataFrame.tail do.
  - How can you display the first three rows of this data?
  - How can you display the last three columns of this data? (Hint: You may need to change your view of the data.)
- As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write DataFrames to files. Applying what you've learned about reading from files, write one of your DataFrames to a file called processed.csv. You can use help to get information on how to use to_csv.
americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')
americas.describe()
americas.head(3)
americas.T.tail(3)
americas.to_csv('processed.csv')
Use DataFrame.iloc[..., ...] to select values by their (entry) position. The i in iloc stands for "integer location".
#import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.iloc[0,0]
This is the most common way to get data:
- Use DataFrame.loc[..., ...] to select values by their label
# This returns a value
data.loc["Albania", "gdpPercap_1952"]
- Standard Python has string methods
big_hello = "hello".title()
print(big_hello)
help("hello".title)
print(dir("hello"))
- Pandas data frames are complex objects
print(data.columns)
print(dir(data.columns.str))
- Use built-in methods to transform the entire data frame
# The columns index can update all of its values in a single operation
data.columns = data.columns.str.strip("gdpPercap_")
print(data.columns)
- Select multiple columns or rows using .loc and a named slice. This generalizes the concept of a slice to include labeled indexes.
# This returns a DataFrame
data.loc['Italy':'Poland', '1962':'1972']
- Use : on its own to mean all columns or all rows. This is Python's usual slicing notation, which allows you to treat data frames as multi-dimensional lists.
# This returns a DataFrame
data.loc['Italy':'Poland', :]
- (Optional) If you want specific rows or columns, pass in a list
data.loc[['Italy','Poland'], :]
- (Optional) .iloc follows list index conventions ("up to, but not including"), but .loc does the intuitively right thing ("A through B")
index_subset = data.iloc[0:2, 0:2]
label_subset = data.loc["Albania":"Belgium", "1952":"1962"]
print(index_subset)
print(label_subset)
- Result of slicing can be used in further operations
subset = data.loc['Italy':'Poland', '1962':'1972']
print(subset.describe())
print(subset.max())
- (Optional) Insert new values using .at (for label indexing) or .iat (for numerical indexing)
subset.at["Italy", "1962"] = 2000
print(subset)
- Calculate subset.max() and assign the result to a variable. What kind of thing is it? What are its properties?
- What is the maximum value of the new variable? Can you determine this without creating an intermediate variable?
- Pandas always drills down to the most parsimonious representation. On one hand, this is convenient; on the other, it violates the Pythonic expectation for strong types.

| Shape of data selection | Pandas return type |
| --- | --- |
| 2D | DataFrame |
| 1D | Series |
| 0D | single value |

- Use method chaining
print(subset.max().max())
# Alternatively
print(subset.max(axis=None))
- .filter() always returns the same type as the original item, whereas .loc and .iloc might return a data frame or a series
italy = data.filter(items=["Italy"], axis="index")
print(italy)
print(type(italy))
- .filter() is a general-purpose, flexible method
help(data.filter)
data.filter(like="200", axis="columns")
data.filter(like="200", axis="columns").filter(items=["Italy"], axis="index")
- Show which data frame elements match a criterion
# Which GDPs are greater than 10,000?
subset > 10000
- Use the criterion match to filter the data frame's contents. This uses index notation:
df = subset[subset > 10000]
print(df)
- subset > 10000 returns a data frame of True/False values
- subset[subset > 10000] filters its contents based on that True/False data frame. All True values are returned, element-wise.
- This section is more properly called "Masking Data," because it involves operations for overlaying a data frame's values without changing the data frame's shape. We don't drop anything from the data frame, we just replace it with NaN.
- (Optional) Use the .where() method to find elements that match the criterion:
df = subset.where(subset > 10000)
print(df)
For example, get the GDP for all countries greater than the median.
# Get the overall median
subset.median() # Returns Series
subset.median(axis=None) # Returns a single value
# Which data points are above the median
subset > subset.median(axis=None)
# Return the masked data set
subset[subset > subset.median(axis=None)]
# The .rank() method turns numerical scores into ranks
data.rank()
# Get mean rank over time and sort the output
mean_rank = data.rank().mean(axis=1).sort_values()
print(mean_rank)
Examples include min, max, mean, std, etc.
- Missing values are ignored by default
print("Column means")
print(df.mean())
print("Row means")
print(df.mean(axis=1))
- Force their inclusion with the skipna argument
print("Column means")
print(df.mean(skipna=False))
print("Row means")
print(df.mean(axis=1, skipna=False))
- Show which items are missing. "NA" includes NaN and None. It doesn't include empty strings or numpy.inf.
# Show which items are NA
df.isna()
- Count missing values
# Missing in each column
print(df.isna().sum())
# Missing in each row
print(df.isna().sum(axis=1))
# Aggregate sum
df.isna().sum().sum()
- Are any values missing?
df.isna().any(axis=None)
- (Optional) Are all of the values missing?
df.isna().all(axis=None)
- Replace with a fixed value
df_fixed = df.fillna(99)
print(df_fixed)
- Replace values that don't meet a criterion with an alternate value
subset_fixed = subset.where(subset > 10000, 99)
print(subset_fixed)
- (Optional) Impute missing values. Read the docs; this may or may not be sufficient for your needs.
df_imputed = df.interpolate()
Drop all rows with missing values
df_drop = df.dropna()
- Create an array of random numbers matching the data data frame
random_filter = np.random.rand(30, 12) * data.max(axis=None)
- Create a new data frame that filters out all numbers lower than the random numbers
- Interpolate new values for the missing values in the new data frame. How accurate do you think they are?
new_data = data[data > random_filter]
# Data is not missing randomly
print(new_data)
new_data.interpolate()
new_data.interpolate().mean(axis=None)
A DataFrame is a dictionary of Series columns. With this in mind, experiment with the following code and try to explain what each line is doing. What operation is it performing, and what is being returned?
Feel free to use print(), help(), type(), etc. as you investigate.
df["1962"]
df["1962"].notna()
df[df["1962"].notna()]
- Line 1 returns the column as a Series vector
- Line 2 returns a boolean Series vector (True/False)
- Line 3 performs boolean indexing on the DataFrame using the Series vector. It only returns the rows that are True (i.e. it performs true filtering).
# Calculate z scores for all elements
# z = (data - data.mean(axis=None))/data.std()
# As of July 2024, pandas dataframe.std(axis=None) doesn't work. We are dropping down to
# Numpy to use the .std() method on the underlying values array.
z = (data - data.mean(axis=None))/data.values.std(ddof=1)
# Get the mean z score for each country (i.e. across all columns)
mean_z = z.mean(axis=1)
# Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
z_bool = mean_z > 0
print(mean_z)
print(z_bool)
Data frames are dictionaries of Series:
data["mean_z"] = mean_z
data["wealthy"] = z_bool
data.sort_values(by="mean_z")
# Get descriptive statistics for the group
data.groupby("wealthy").mean()
data.groupby("wealthy").describe()
Capture the results of your filter in a new file, rather than overwriting your original data.
# Save to a new CSV, preserving your original data
data.to_csv('gapminder_gdp_europe_normed.csv')
# If you don't want to preserve row names:
#data.to_csv('gapminder_gdp_europe_normed.csv', index=False)
surveys = pd.read_csv('data/surveys.csv', index_col="record_id")
print(surveys.shape)
df1 = surveys.head(10)
df2 = surveys.tail(10)
df3 = pd.concat([df1, df2])
print(df3.shape)
- Import species data
species = pd.read_csv('data/species.csv', index_col="species_id")
print(species.shape)
- Join tables on a common column. The "left" join is a strategy for augmenting the first table (surveys) with information from the second table (species).
df_join = surveys.merge(species, on="species_id", how="left")
print(df_join.head())
print(df_join.shape)
- The resulting table loses its index because surveys.record_id is not being used in the join. To keep record_id as the index for the final table, we need to retain it as an explicit column.
# Don't set record_id as index during initial import
surveys = pd.read_csv('data/surveys.csv')
df_join = surveys.merge(species, on="species_id", how="left").set_index("record_id")
df_join.index
- Get the subset of species that match a criterion, and join on that subset. The "inner" join only includes rows where both tables match on the key column; it's a strategy for filtering the first table by the second table.
# Get the taxa column, masking the rows based on which values match "Bird"
birds = species[species["taxa"] == "Bird"]
df_birds = surveys.join(birds, on="species_id", how="inner").set_index("record_id")
print(df_birds.head())
print(df_birds.shape)
cf. https://pandas.pydata.org/docs/user_guide/text.html
- Import tabular data that contains strings
species = pd.read_csv('data/species.csv', index_col='species_id')
# You can explicitly set all of the columns to type string
# species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string')
# ...or specify the type of individual columns
# species = pd.read_csv('data/species.csv', index_col='species_id',
#                       dtype = {"genus": "string",
#                                "species": "string",
#                                "taxa": "string"})
print(species.head())
print(species.info())
print(species.describe())
- A Pandas Series has string methods that operate on the entire Series at once
# Two ways of getting an individual column
print(type(species.genus))
print(type(species["genus"]))
# Inspect the available string methods
print(dir(species["genus"].str))
- Use string methods for filtering
# Which species are in the taxa "Bird"?
print(species["taxa"].str.startswith("Bird"))
# Filter the dataset to only look at Birds
print(species[species["taxa"].str.startswith("Bird")])
- Use string methods to transform and combine data
binomial_name = species["genus"].str.cat(species["species"].str.title(), " ")
species["binomial"] = binomial_name
print(species.head())
A row is a view onto the nth item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate data frames instead.
- Create a single row as a data frame and concatenate it
row = pd.DataFrame({"1962": 5000, "1967": 5000, "1972": 5000}, index=["Latveria"])
pd.concat([subset, row])
- If you have individual rows as Series, pd.concat() will produce a data frame
# Get each row as a Series
italy = data.loc["Italy", :]
poland = data.loc["Poland", :]
# Omitting the axis argument (or axis=0) concatenates the 2 series end-to-end
# axis=1 creates a 2D data frame
# Transpose recovers the original orientation
# Column labels come from the Series index
# Row labels come from the Series name
pd.concat([italy, poland], axis=1).T
- SciPy projects
- Numpy: Linear algebra
- Pandas
- Scipy.stats: Probability distributions and basic tests
- Statsmodels: Statistical models and formulae built on Scipy.stats
- Scikit-Learn: Machine learning tools built on NumPy
- Tensorflow/PyTorch: Deep learning and other voodoo
Scikit-Learn documentation: https://scikit-learn.org/stable/
- Motivating example: Ordinary least squares regression
from sklearn import linear_model
# Create some random data
x_train = np.random.rand(20)
y = np.random.rand(20)
# Fit a linear model
reg = linear_model.LinearRegression()
reg.fit(x_train.reshape(-1,1), y)
print("Regression slope:", reg.coef_)
- Estimate model fit
from sklearn.metrics import mean_squared_error, r2_score
# Test model fit with new data
x_test = np.random.rand(20)
y_prediction = reg.predict(x_test.reshape(-1,1))
# Get model stats
mse = mean_squared_error(y, y_prediction)
r2 = r2_score(y, y_prediction)
print("R squared:", "{:.3f}".format(r2))
- (Optional) Inspect our prediction
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(x_train, y, color="black")
ax.plot(x_test, y_prediction, color="blue")
# Echo `fig` in Jupyter Lab
fig.show()
- (Optional) Compare with Statsmodels
# Load modules and data
import statsmodels.api as sm
# Fit and summarize OLS model (center data to get an accurate model fit)
mod = sm.OLS(y - y.mean(), x_train - x_train.mean())
res = mod.fit()
print(res.summary())
- Import data
data = pd.read_csv('surveys.csv')
# Check for NaN
print("Valid weights:", data['weight'].count())
print("NaN weights:", data['weight'].isna().sum())
print("Valid lengths:", data['hindfoot_length'].count())
print("NaN lengths:", data['hindfoot_length'].isna().sum())
- Fit OLS regression model
from statsmodels.formula.api import ols
model = ols("weight ~ hindfoot_length", data, missing='drop').fit()
print(model.summary())
- Generic parameters for all models
import statsmodels
help(statsmodels.base.model.Model)
- pipe
- map/applymap/apply (in general you should prefer vectorized functions; see the sketch below)
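A hedged sketch of why vectorized functions are preferred (the DataFrame here is a stand-in, not the wine data used below):
import pandas as pd
df = pd.DataFrame({'alcohol': [13.2, 14.1, 12.8]})
# apply() calls a Python function once per element (flexible but slow)
print(df['alcohol'].apply(lambda v: v * 2))
# The vectorized equivalent operates on the whole Series at once (preferred)
print(df['alcohol'] * 2)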
(wine
 .rename(columns={"color_intensity": "ci"})
 .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
 .query("alcohol > 14 and color_filter == 1")
 .sort_values("alcohol", ascending=False)
 .reset_index(drop=True)
 .loc[:, ["alcohol", "ci", "hue"]]
)
- DataFrames have a huge number of fields and methods, so dir() is not very useful
print(dir(data))
- Create a new list that filters out internal attributes
df_public = [item for item in dir(data) if not item.startswith('_')]
print(df_public)
- (Optional) Pretty-print the new list
import pprint
pp = pprint.PrettyPrinter(width=100, compact=True, indent=2)
pp.pprint(df_public)
- Objects have fields (i.e. data/variables) and methods (i.e. functions/procedures). The difference between a method and a function is that methods are attached to objects, whereas functions are free-floating ("first-class citizens"). Methods and functions are "callable":
# Generate a list of public methods and a list of public fields. We do this
# by testing each attribute to determine whether it is "callable".
# NB: Because Python allows you to override any attribute at runtime,
# testing with `callable` is not always reliable.
# List of methods (callable attributes)
df_methods = [item for item in dir(data)
              if not item.startswith('_') and callable(getattr(data, item))]
# List of fields (non-callable attributes)
df_attr = [item for item in dir(data)
           if not item.startswith('_') and not callable(getattr(data, item))]
pp.pprint(df_methods)
pp.pprint(df_attr)
- Split data according to a criterion, do numeric transformations, then recombine
# Get all GDPs greater than the mean
mask_higher = data > data.mean()
# Count the number of time periods in which each country exceeds the mean
higher_count = mask_higher.aggregate('sum', axis=1)
# Create a normalized wealth-over-time score
wealth_score = higher_count / len(data.columns)
wealth_score
- A DataFrame is a spreadsheet, but it is also a dictionary of columns.
data['gdpPercap_1962']
- Add a column to the data frame
# Warning: wealth_score is a Series
type(wealth_score)
data['normalized_wealth'] = wealth_score
- Export notebook to .py file
- Move .py file into data directory
- Compare files in TextEdit/Notepad
Broadly, a trade-off between managing big code bases and making it easy to experiment. See: https://github.com/elliewix/Ways-Of-Installing-Python/blob/master/ways-of-installing.md#why-do-you-need-a-specific-tool
- Interactive testing and debugging
- Graphics integration
- Version control
- Remote scripts
- Python is an interactive interpreter (REPL)
python
- Python is a command line program
# hello.py
print("Hello!")
python hello.py
- (Optional) Python programs can accept command line arguments as inputs
  - List of command line inputs: sys.argv (https://docs.python.org/3/library/sys.html#sys.argv)
  - Utility for working with arguments: argparse (https://docs.python.org/3/library/argparse.html)
- File paths as literal strings
- File paths as string patterns
- File paths as abstract Path objects
import pandas as pd
file_list = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']
for filename in file_list:
data = pd.read_csv(filename, index_col='country')
print(filename)
print(data.head(1))
- Get a list of all the CSV files
import glob
glob.glob('data/*.csv')
- In Unix, the term “globbing” means “matching a set of files with a pattern”. It uses shell expansion rules, not regular expressions, so there's an upper limit to how flexible it can be. The most common patterns are:
- `*` meaning “match zero or more characters”
- `?` meaning “match exactly one character”
- (Optional) Get a list of all CSV or TSV files
glob.glob('data/*.?sv')
- Get a list of all the Gapminder CSV files
glob.glob('data/gapminder_gdp_*.csv')
- (Optional) Exclude the "all" CSV file
glob.glob('data/gapminder_[!all]*.csv')
data_frames = []
for filename in glob.glob('data/gapminder_gdp_*.csv'):
print(filename)
data = pd.read_csv(filename)
data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
- Does a file end in "all.csv"?
for filename in glob.glob('data/gapminder*.csv'):
    print("Current file:", filename)
    print(filename.endswith("all.csv"))
- Value of a variable
mass = 3
print(mass == 3)
print(mass > 5)
print(mass < 4)
- Membership in a collection
primes = [2, 3, 5]
print(2 in primes)
print(7 in primes)
- Missing values
# Recreate data frame with missing data if necessary
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
subset = data.loc['Italy':'Poland', '1962':'1972']
df = subset[subset > 10000]
# Which values are missing?
print(df.isna())
# Are any values missing?
print(df.isna().any(axis=None))
- (Optional) Truth of a collection. Note that any() and all() evaluate each item using .__bool__() or .__len__(), which tells you whether an item is "truthy" or "falsey" (i.e. interpreted as being true or false).
my_list = [2.75, "green", 0]
print(any(my_list))
print(all(my_list))
- (Optional) Understanding "truthy" and "falsey" values in Python (cf. https://stackoverflow.com/a/53198991). Every value in Python, regardless of type, is interpreted as being True except for the following values (which are interpreted as False). "Truthy" values satisfy if or while statements; "falsey" values do not.
  - Constants defined to be false: None and False
  - Zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
  - Empty sequences and collections: '', (), [], {}, set(), range(0)
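You can test this directly with bool():
# Falsey values convert to False; everything else converts to True
print(bool(0), bool(''), bool([]), bool(None))
print(bool(42), bool('text'), bool([0]))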
- An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
mass = 3.5
if mass > 3.0:
    print(mass, 'is large')
mass = 2.0
if mass > 3.0:
    print(mass, 'is large')
- Structure is similar to a for statement:
  - First line opens with if and ends with a colon
  - Body containing one or more statements is indented (usually by 4 spaces)
- Use conditionals to decide which files to process
data_frames = []
for filename in glob.glob('data/gapminder*.csv'):
    print("Current file:", filename)
    if not filename.endswith("all.csv"):
        print("Passes test:", filename)
        data = pd.read_csv(filename)
        data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
- else can be used following an if. This allows us to specify an alternative to execute when the if branch isn't taken.
if mass > 3.0:
    print(mass, 'is large')
else:
    print(mass, 'is small')
- This lets you explicitly handle the base case
data_frames = []
for filename in glob.glob('data/gapminder*.csv'):
    if filename.endswith("all.csv"):
        print("I don't want any of that")
    else:
        print("Passes test:", filename)
        data = pd.read_csv(filename)
        data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
Iterate through all of the CSV files in the data directory. Print the name and length (number of lines) of any file that is less than 30 lines long.
Note that the data frame will report the number of data rows, which doesn't include the column headers (the actual file has a leading row with the header names).
for filename in glob.glob('data/*.csv'):
data = pd.read_csv(filename)
if len(data) < 30:
print(filename, len(data))
Iterate through all of the CSV files in the data directory. Print the file name that includes "europe", then print the column names for the file.
for filename in glob.glob('data/*.csv'):
if "europe" in filename.lower():
print(filename)
data = pd.read_csv(filename)
print(data.columns)
You may want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.
if m > 9.0:
print(m, 'is huge')
elif m > 3.0:
print(m, 'is large')
else:
print(m, 'is small')
- elif is always associated with an if.
- It must come before the else (which is the “catch all”).
Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
grade = 85
if grade >= 70:
print('grade is C')
elif grade >= 80:
print('grade is B')
elif grade >= 90:
print('grade is A')
Often, you want some combination of things to be true. You can combine relations within a conditional using and and or.
mass = [1, 2, 3, 4, 5]
velocity = [5, 4, 3, 2, 5]
for m, v in zip(mass, velocity):
if m <= 3 and v <= 3:
print("Small and slow")
elif m <= 3 and v > 3:
print("Small and fast")
elif m > 3 and v <= 3:
print("Large and slow")
else:
print("Check data")
- Use () to group subsets of conditions
- Aside: For a more natural way of working with many lists, look at zip()
for count, filename in enumerate(glob.glob('data/gapminder_*.csv')):
# Print every other filename
if count % 2 == 0:
print(count, filename)
- Pathlib provides cross-platform path objects
from pathlib import Path
# Create Path objects
raw_path = Path("data")
processed_path = Path("data/processed")
print("Relative path:", processed_path)
print("Absolute path:", processed_path.absolute())
- Path objects have methods that provide much better information about files and directories.
# Note the careful testing at each level of the code
data_frames = []
if raw_path.exists():
    for filename in raw_path.glob('gapminder_gdp_*.csv'):
        if filename.is_file():
            data = pd.read_csv(filename)
            print(filename)
            data_frames.append(data)
all_data = pd.concat(data_frames)
# Check for the destination folder and create it if it doesn't exist
if not processed_path.exists():
    processed_path.mkdir()
all_data.to_csv(processed_path.joinpath("combined_data.csv"))
Pandas understands specific file types, but what if you need to work with a generic file?
with open("data/bouldercreek_09_2013.txt", "r") as infile:
lines = infile.readlines()
- The context manager closes the file when you're done reading it
- "data/bouldercreek_09_2013.txt" is the name of the file
- infile is a variable that refers to the file on disk
- .readlines() produces the file contents as a list of lines; each line is a string
print(len(lines))
print(type(lines))
# View the first 10 lines
print(lines[:10])
Compare the following:
# This displays the nicely-formatted document
print(lines[0])
# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
lines[0]
- The file contains front matter that we can discard
tabular_lines = []
for line in lines:
    if not line.startswith("#"):
        tabular_lines.append(line)
- Now the first line is tab-separated data. Note that the print statement prints the tabs instead of showing us the \t character.
print(tabular_lines[0])
outfile_name = "data/tabular_data.txt"
with open(outfile_name, "w") as outfile:
outfile.writelines(tabular_lines)
- Strip trailing whitespace
stripped_line = tabular_lines[0].strip()
stripped_line
- Split each line into a list using the tabs
split_line = stripped_line.split("\t")
split_line
- Use a special-purpose library to create a correctly-formatted CSV file
import csv
outfile_name = "data/csv_data.csv"
with open(outfile_name, "w") as outfile:
    writer = csv.writer(outfile)
    for line in tabular_lines:
        csv_line = line.strip().split("\t")
        writer.writerow(csv_line)
- You can initialize csv.reader and csv.writer with different "dialects" or with custom delimiters and quotechars; see https://docs.python.org/3/library/csv.html
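For example, a minimal sketch (the output file name is hypothetical): pass delimiter to write tab-separated output instead of commas.
with open("data/tsv_data.tsv", "w") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["a", "b", "c"])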
infile_name = "data/bouldercreek_09_2013.txt"
outfile_name = "data/csv_data.csv"
with open(infile_name, "r") as infile, open(outfile_name, "w") as outfile:
writer = csv.writer(outfile)
for line in infile:
if not line.startswith("#"):
writer.writerow(line.strip().split("\t"))
- Pandas has utilities for reading fixed-width files: https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
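A minimal sketch of read_fwf (the file name here is hypothetical; by default it infers the column boundaries, or you can pass explicit colspecs or widths):
import pandas as pd
df = pd.read_fwf("data/some_fixed_width_file.txt")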
- Saving datasets with new-style string formatting
for i in datasets_list:
    do_something(f'{i}.png')
- Human beings can only keep a few items in working memory at a time.
- Understand larger/more complicated ideas by understanding and combining pieces
- Functions serve the same purpose in programs:
  - Encapsulate complexity so that we can treat it as a single “thing”
  - Remove complexity from remaining code, making it easier to test
  - Enable re-use: Write one time, use many times
def print_greeting():
print('Hello!')
- Begin the definition of a new function with def, followed by the name of the function.
- Function names must obey the same rules as variable names.
- Parameters in parentheses; empty parentheses if the function doesn’t take any inputs.
- Indent function body
print_greeting()
- Like assigning a value to a variable
- Must call the function to execute the code it contains.
- Positional arguments
def print_date(year, month, day):
    joined = '/'.join([str(year), str(month), str(day)])
    print(joined)
print_date(1871, 3, 19)
- (Optional) Keyword arguments
print_date(month=3, day=19, year=1871)
- Use return ... to give a value back to the caller. return ends the function's execution and returns you to the code that originally called the function.
def average(values):
    """Return average of values, or None if no values are supplied."""
    if len(values) == 0:
        return None
    else:
        return sum(values) / len(values)
a = average([1, 3, 4])
print(a)
- You should explicitly handle common problems:
print(average([]))
- Notes:
  - return can occur anywhere in the function, but functions are easier to understand if return occurs:
    - At the start to handle special cases
    - At the very end, with a final result
  - Docstring provides function help. Use triple quotes if you need the docstring to span multiple lines.
Write a function that takes line as an input and returns the information required by writer.writerow().
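One possible solution (a sketch that mirrors the strip-and-split logic used above):
def parse_line(line):
    """Convert one tab-separated line into the list writer.writerow() expects."""
    return line.strip().split("\t")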
Write a function that encapsulates the Z-score calculations from the Pandas workshop. Your function needs to do the following:
- Read a CSV file into a data frame
- Calculate the Z score for each item
- Calculate the mean Z score for each country
- Append the mean Z scores as a new column
- Return the data frame
def norm_data(filename):
    """Add a mean Z score column to a data frame."""
    df = pd.read_csv(filename, index_col="country")
    # If you need to drop the continent column
    if "continent" in df.columns:
        df = df.drop("continent", axis=1)
    # Calculate individual Z scores
    z = (df - df.mean(axis=None)) / df.values.std(ddof=1)
    # Get the mean z score for each country
    mean_z = z.mean(axis=1)
    df["mean_z"] = mean_z
    return df
df = norm_data("data/gapminder_gdp_europe.csv")
for filename in glob.glob('data/gapminder_*.csv'):
    # Print a status message
    print("Current file:", filename)
    # Read the data into a DataFrame and append the mean Z score column
    data = norm_data(filename)
    # Generate an output file name
    parts = filename.split(".csv")
    newfile = ''.join([parts[0], "_normed.csv"])
    data.to_csv(newfile)
https://matplotlib.org/stable/gallery/mplot3d/lorenz_attractor.html
An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
mass = 3.54
if mass > 3.0:
print(mass, 'is large')
mass = 2.07
if mass > 3.0:
    print(mass, 'is large')
Structure is similar to a for statement:
- First line opens with if and ends with a colon
- Body containing one or more statements is indented (usually by 4 spaces)
Not much point using a conditional when we know the value (as above), but useful when we have a collection to process.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
if m > 3.0:
print(m, 'is large')
else can be used following an if. This allows us to specify an alternative to execute when the if branch isn’t taken.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
if m > 3.0:
print(m, 'is large')
else:
print(m, 'is small')
You may want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
if m > 9.0:
print(m, 'is HUGE')
elif m > 3.0:
print(m, 'is large')
else:
print(m, 'is small')
- elif is always associated with an if.
- It must come before the else (which is the “catch all”).
Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
grade = 85
if grade >= 70:
print('grade is C')
elif grade >= 80:
print('grade is B')
elif grade >= 90:
print('grade is A')
velocity = 10.0
for i in range(5): # execute the loop 5 times
print(i, ':', velocity)
if velocity > 20.0:
velocity = velocity - 5.0
else:
velocity = velocity + 10.0
print('final velocity:', velocity)
- This is how dynamical systems simulations work
Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have:
mass = [ 3.54, 2.07, 9.22, 1.86, 1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]
i = 0
for i in range(5):
if mass[i] > 5 and velocity[i] > 20:
print("Fast heavy object. Duck!")
elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
print("Normal traffic")
elif mass[i] <= 2 and velocity[i] <= 20:
print("Slow light object. Ignore it")
else:
print("Whoa! Something is up with the data. Check it")
- Use () to group subsets of conditions
- Aside: For a more natural way of working with many lists, look at zip(), as shown below
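For example, zip() pairs the two lists so you don't need an index variable:
for m, v in zip(mass, velocity):
    print(m, v)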
- Python orientation
- Jupyter orientation
- Multiple interfaces
- Local graphs and global settings
- Matplotlib is the substrate for higher-level libraries
- Drawing things is verbose in any language
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
ax.plot(time, position)
type(fig)
print(type(fig))
print(type(ax))
- Figure objects handle display, printing, saving, etc.
- Axes objects contain graph information
- Show figure inline (Jupyter Lab default)
fig
- Show figure in a separate window (command line default)
fig.show()
- Show figure in a separate window from Jupyter Lab. You may need to specify a different "backend" parameter for matplotlib.use() depending on your exact setup: https://matplotlib.org/stable/tutorials/introductory/usage.html#the-builtin-backends
import matplotlib
matplotlib.use('TkAgg')
fig.show()
- Create mock data
import numpy as np
y = np.random.random(10)      # an array of 10 random numbers between 0 and 1
x = np.arange(1980, 1990, 1)  # an ordered array of numbers from 1980 to 1989
# Check that x and y contain the same number of values
assert len(x) == len(y)
- Inspect our data
print("x:", x)
print("y:", y)
- Create the basic plot
# Convert y axis into a percentage
y = y * 100
# Draw plot
fig, ax = plt.subplots()
ax.plot(x, y)
- Show available styles
# What are the global styles?
plt.style.available
# Set a global figure style
plt.style.use("dark_background")
# The style is only applied to new figures, not pre-existing figures
fig
# Re-creating the figure applies the new style
fig, ax = plt.subplots()
ax.plot(x, y)
- Customize the graph. In principle, nearly every element on a Matplotlib figure is independently modifiable.
# Set figure size
fig, ax = plt.subplots(figsize=(8,6))
# Set line attributes
ax.plot(x, y, color='darkorange', linewidth=2, marker='o')
# Add title and labels
ax.set_title("Percent Change in Stock X", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("% change", fontsize=20, fontweight='bold')
# Adjust the tick labels
ax.tick_params(axis='both', which='major', labelsize=18)
# Add a grid
ax.grid(True)
- Save your figure
fig.savefig("mygraph_dark.png", dpi=300)
In this example, plot GDP over time for multiple countries.
- Import data
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
# Inspect our data
data.head(3)
- Transform column headers into an ordinal scale
  - (Optional) Original column names are object (i.e. string) data
data.columns
  - Strip off the non-numeric portion of each column title
years = data.columns.str.strip('gdpPercap_')
years
  - Convert the year strings into integers and replace the original data frame column headers
data.columns = years.astype(int)
- Extract rows from the DataFrame
x_years = data.columns
y_austria = data.loc['Austria']
y_bulgaria = data.loc['Bulgaria']
- Create the plot object
# Change global background back to default
plt.style.use("default")
# Create GDP figure
fig, ax = plt.subplots(figsize=(8,6))
# Create GDP plot
ax.plot(x_years, y_austria, label='Austria', color='darkgreen', linewidth=2, marker='x')
ax.plot(x_years, y_bulgaria, label='Bulgaria', color='maroon', linewidth=2, marker='o')
# Decorate the plot
ax.legend(fontsize=16, loc='upper center')  # automatically uses labels
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("GDP", fontsize=20, fontweight='bold')
Don't do this.
- The basic plot syntax
ax = data.loc['Austria'].plot()
fig = ax.get_figure()
fig
- Decorate your Pandas plot
ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
ax.set_title("GDP of Austria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("GDP", fontsize=20, fontweight='bold')
fig = ax.get_figure()
fig
- Overlaying multiple plots on the same figure with Pandas. This is super unintuitive.
# Create an Axes object with the Austria data
ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
print("Austria graph", id(ax))
# Overlay the Bulgaria data on the same Axes object
ax = data.loc['Bulgaria'].plot(color='maroon', linewidth=2, marker='o')
print("Bulgaria graph", id(ax))
- The equivalent Matplotlib plot (optional)
# Extract the x and y values from the dataframe
x_years = data.columns
y_gdp = data.loc['Austria']
# Create the plot
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x_years, y_gdp, color='darkgreen', linewidth=2, marker='x')
# etc.
# Visualize the same data using a scatterplot
plt.style.use('ggplot')
# Create a scatter plot
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(y_austria, y_bulgaria, color='blue', linewidth=2, marker='o')
# Decorate the plot
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("GDP of Austria",fontsize=20, fontweight='bold' )
ax.set_ylabel("GDP of Bulgaria",fontsize=20, fontweight='bold' )
- Matplotlib gallery: https://matplotlib.org/stable/gallery/index.html
- "Plotting categorical variables" example of multiple subplots
- Download code examples
- .py vs .ipynb
- Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html
- Seaborn gallery: https://seaborn.pydata.org/examples/index.html
- Seaborn tutorials: https://seaborn.pydata.org/tutorial.html
- Get in the ballpark
- Look at lots of data
- Try lots of presets
- Customize judiciously
- Build a collection of interactive and publication-ready code snippets
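For instance, a quick sketch of trying different style presets, using only built-in Matplotlib calls:
```python
import matplotlib.pyplot as plt

# List the built-in style presets
print(plt.style.available)

# Try one temporarily, without changing the global style
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])
```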
Seaborn is a set of high-level presets for Matplotlib.
```python
# Import the Seaborn library
import seaborn as sns

ax = sns.lineplot(data=data.T, legend=False, dashes=False)
```
- Doing more with this data set requires transforming the data from wide form to long form; see https://seaborn.pydata.org/tutorial/data_structure.html
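For example, a minimal sketch of the wide-to-long transform with `melt`; the column names `year` and `gdp` are illustrative choices, not Seaborn requirements:
```python
# Move the country index back into a regular column, then melt to long form
long_data = data.reset_index().melt(id_vars='country',
                                    var_name='year',
                                    value_name='gdp')

# Each row is now one (country, year, gdp) observation
ax = sns.lineplot(data=long_data, x='year', y='gdp', hue='country', legend=False)
```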
Let's make a poster!
- Import the Iris data set https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv
```python
iris = pd.read_csv("../data/iris.csv")
iris.head()
```
- Create a basic scatter plot
```python
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length')
```
- Change the plotting theme
```python
plt.style.use("dark_background")

# Fix grid if necessary
#plt.rcParams["axes.grid"] = False

# Make everything visible at a distance
sns.set_context('poster')

# Color by species, size by petal width
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length',
                     hue='species', palette='colorblind', size='petal_width')

# Place the legend
ax.legend(bbox_to_anchor=(2,1))
```
- Read more about `loc` vs. `bbox_to_anchor` in the legend documentation: https://matplotlib.org/stable/api/legend_api.html
- The Seaborn plot uses Matplotlib under the hood
```python
# Set the figure size
fig = ax.get_figure()
fig.set_size_inches(8,6)
fig
```
- Add styling to individual points
```python
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length',
                     hue='species', palette='colorblind', style='species')
```
- Prettify the column names. (Note: after this rename the columns are called e.g. 'sepal length'; the later examples still use the underscore names, so they assume you have not run this step.)
```python
words = [' '.join(i) for i in iris.columns.str.split('_')]
iris.columns = words
```
- Make a regression plot
```python
# Fit and draw a regression line over the scatter points
# (white points so they show up on the dark background)
ax = sns.regplot(data=iris, x='sepal_length', y='petal_length',
                 scatter=True, scatter_kws={'color':'white'})
```
- Bar plot
```python
ax = sns.barplot(data=iris, x='species', y='sepal_width', palette='colorblind')
```
- The default summary statistic is the mean, and the default error bars show the 95% confidence interval.
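For instance, a quick sketch of overriding the default summary statistic (here the median, via NumPy):
```python
import numpy as np

# Summarize each species by its median sepal width instead of its mean
ax = sns.barplot(data=iris, x='species', y='sepal_width',
                 estimator=np.median, palette='colorblind')
```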
- Add custom parameters
```python
# Error bars show standard deviation
ax = sns.barplot(data=iris, x='species', y='sepal_width', ci='sd', edgecolor='black')
```
- (Optional) A count plot counts the records in each category
```python
ax = sns.countplot(data=iris, x='species', palette='colorblind')
```
- Histogram of the overall data set
```python
ax = sns.histplot(data=iris, x='petal_length', kde=True)
```
- KDE: If True, compute a kernel density estimate to smooth the distribution, shown on the plot as one or more lines.
- There seems to be a bimodal distribution of petal length. What factors underlie this distribution?
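To see the smoothed estimate on its own, a minimal sketch with `sns.kdeplot`:
```python
# Kernel density estimate alone, without the histogram bars
ax = sns.kdeplot(data=iris, x='petal_length', fill=True)
```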
- Histogram of the data decomposed by category
```python
ax = sns.histplot(data=iris, x='petal_length', hue='species', palette='Set2')
```
- Create multiple subplots to compare binning strategies
```python
# This generates 3 subplots (ncols=3) on the same figure
fig, axes = plt.subplots(figsize=(12,4), nrows=1, ncols=3)

# Note that we can use Seaborn to draw on our Matplotlib figure
sns.histplot(data=iris, x='petal_length', bins=5, ax=axes[0], color='#f5a142')
sns.histplot(data=iris, x='petal_length', bins=10, ax=axes[1], color='maroon')
sns.histplot(data=iris, x='petal_length', bins=15, ax=axes[2], color='darkmagenta')
```
- Box plot
```python
ax = sns.boxplot(data=iris, x='species', y='petal_length')
```
- Swarm plot
```python
ax = sns.swarmplot(data=iris, x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation=45)
```
This gives us a format warning.
- Strip plot
```python
ax = sns.stripplot(data=iris, x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation=45)
```
- Overlapping plots
```python
ax = sns.boxplot(data=iris, x='species', y='petal_length')
sns.stripplot(data=iris, x='species', y='petal_length', ax=ax, palette='Set1')
```
- Everything is an Artist (object)
- Multiple levels of specificity: `plt` vs. `axes`
- rcParams vs temporary stylings
- Simplified high-level interfaces, aka "syntactic sugar": `legend()` vs. getting the legend handles and patches yourself
- The `object.set_field(value)` usage is taken from Java, which was popular in 2003 when Matplotlib was developing its object-oriented syntax. You get values back out with `object.get_field()`.
- The Pythonic way to set a value would be `object.field = value`. However, the Matplotlib getters and setters do a lot of internal bookkeeping, so if you try to set field values directly you will get errors. For example, compare `ax.get_ylabel()` with `ax.yaxis.label`.
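A small illustration of the getter/setter layer versus the underlying Artist (a sketch; run after creating an Axes):
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_ylabel("GDP")      # the setter does the internal bookkeeping

print(ax.get_ylabel())    # 'GDP' -- the plain string
print(ax.yaxis.label)     # Text(...) -- the underlying Text Artist
```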
. - Read "The Lifecycle of a Plot": https://matplotlib.org/stable/tutorials/introductory/lifecycle.html
- Read "Why you hate Matplotlib": https://ryxcommar.com/2020/04/11/why-you-hate-matplotlib/
Read the entire file in as a single string:
```python
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.read()

print(len(text))
```
Compare the following:
```python
# This displays the nicely-formatted document
print(text[:300])
```
```python
# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
text[:300]
```
```python
type(text)
```
Alternatively, read the file in as a list of lines:
```python
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.readlines()

print(len(text))
print(type(text))

# View the first 10 lines
text[:10]
```
The file has three parts:
- Intro material
- Manifest of letters
- Individual letters

We want to:
- Check whether all the letters reported in the manifest appear in the actual file
- Check whether all the letters in the file are reported in the manifest
- Therefore, construct two variables: (1) a list of every location line from the manifest, and (2) a list of every location line within the file proper
```python
manifest_list = text[14:159]
```
Demonstrate string tests with manifest_list:
```python
# Raw text
for location in manifest_list[:10]:
    print(location)
```
```python
# Remove extra whitespace
for location in manifest_list[:10]:
    print(location.strip())
```
```python
# Test whether the cleaned line starts with 'Box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('Box '))
```
```python
# Test whether the cleaned line starts with 'box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('box '))
```
```python
letters = text[162:]

for line in letters[:25]:
    # Create variables to hold the current line and the truth value of is_box
    stripped_line = line.strip()
    is_box = stripped_line.startswith('Box ')
    if is_box == True:
        print(stripped_line)
    # If the line is empty, don't print anything
    # (strip() has already removed the newline, so compare with '')
    elif stripped_line == '':
        continue
    # Indent non-Box lines
    else:
        print('---', stripped_line)
```
- Before automating everything, we run the code with lots of `print()` statements so that we can see what's happening
```python
letter_locations = []

for line in letters:
    stripped_line = line.strip()
    is_box = stripped_line.startswith("Box ")
    if is_box == True:
        letter_locations.append(stripped_line)

print('Items in manifest:', len(manifest_list))
print('Letters:', len(letter_locations))
```
- Which items are in one list but not the other?
- Are there other structural regularities you could use to parse the data? (Note that in the letters, sometimes there are multiple letters under a single box header)
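One way to approach the first question is with set differences; a sketch, assuming the two lists built above:
```python
# Compare the cleaned manifest entries against the letter headers we found
manifest_set = {line.strip() for line in manifest_list}
letters_set = set(letter_locations)

print('In manifest but not in file:', manifest_set - letters_set)
print('In file but not in manifest:', letters_set - manifest_set)
```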
Explicitly handle common errors, rather than waiting for your code to blow up.
```python
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    return sum(values) / len(values)

print(average([3, 4, 5]))  # Prints expected output
print(average([]))         # Explicitly handles possible divide-by-zero error
print(average(4))          # Unhandled exception
```
```python
def average(values):
    "Return average of values, or an informative error if bad values are supplied."
    try:
        return sum(values) / len(values)
    except ZeroDivisionError as err:
        return err
    except TypeError as err:
        return err

print(average([3, 4, 5]))
print(average(4))
print(average([]))
```
- Use judiciously, and be as specific as possible. When in doubt, allow your code to blow up rather than silently commit errors.
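For example, a sketch of catching only the failure you anticipate (the filename is the one used above):
```python
try:
    with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
        text = file_in.read()
except FileNotFoundError:
    # Handle the one error we expect; anything else still blows up loudly
    print('Input file is missing; check your working directory')
```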
Time and profile your code:
```python
import time
import cProfile
import pstats

def my_fun(val):
    # Get 1st timestamp
    t1 = time.time()
    # do work
    # Get 2nd timestamp
    t2 = time.time()
    print(round(t2 - t1, 3))

# Run the function with the profiler and collect stats
# (0 is just an illustrative argument)
cProfile.run('my_fun(0)', 'dumpstats')
s = pstats.Stats('dumpstats')
s.sort_stats('cumulative').print_stats()
```
Process large files one line at a time, instead of reading the whole file into memory:
```python
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    for line in file_in:
        # Do stuff to current line
        pass
```
Query a SQLite database with the standard library:
```python
import sqlite3

conn = sqlite3.connect('my_database_name.db')
with conn:
    c = conn.execute("SELECT column_name FROM table_name WHERE criterion")
    results = c.fetchall()
    c.close()

# Do stuff with `results`
conn.close()
```
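If the query depends on a value from your program, use parameter substitution rather than string formatting; a sketch with the same placeholder table and column names:
```python
some_value = 42  # hypothetical value to filter on

# The ? placeholder lets sqlite3 quote the value safely
c = conn.execute("SELECT column_name FROM table_name WHERE other_column = ?",
                 (some_value,))
```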
- Checking performance
- List comprehensions
- Defensive programming
- Debugging and Testing
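As a taste of list comprehensions, here is the letter-locations loop from above rewritten as a single expression:
```python
# Equivalent to the for-loop that built letter_locations
letter_locations = [line.strip() for line in letters
                    if line.strip().startswith('Box ')]
```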
- Plotting and Programming in Python (Pandas-oriented): http://swcarpentry.github.io/python-novice-gapminder/
- Programming with Python (NumPy-oriented): https://swcarpentry.github.io/python-novice-inflammation/index.html
- Python for Ecology: https://datacarpentry.org/python-ecology-lesson/
- Humanities Python Tour (file and text processing): https://github.com/elliewix/humanities-python-tour/blob/master/Two-Hour-Beginner-Tour.ipynb
- Introduction to Cultural Analytics & Python: https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html
- Rhondene Wint: Matplotlib and Seaborn notes
- Fruit Alphabet: https://en.wikibooks.org/wiki/Wikijunior:Fruit_Alphabet
- Python tutorial: https://docs.python.org/3/tutorial/index.html
- Python standard library: https://docs.python.org/3/library/
- String formatting: https://pyformat.info/
- True and False in Python: https://docs.python.org/3/library/stdtypes.html#truth-value-testing
- NumPy documentation: https://numpy.org/doc/stable/user/index.html
- Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/
- Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- SciPy user guide: https://docs.scipy.org/doc/scipy/tutorial/index.html
- Statsmodels library: https://www.statsmodels.org/stable/index.html
- Scikit-Learn documentation: https://scikit-learn.org/stable/
- Statistics in Python tutorial: https://scipy-lectures.org/packages/statistics/
- Matplotlib gallery of examples: https://matplotlib.org/gallery/index.html
- Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html
- Seaborn gallery of examples: https://seaborn.pydata.org/examples/index.html
- Seaborn tutorials: https://seaborn.pydata.org/tutorial.html
- How to choose a code editor: https://github.com/elliewix/Ways-Of-Installing-Python/blob/master/ways-of-installing.md#why-do-you-need-a-specific-tool
- IPython magic commands: https://ipython.readthedocs.io/en/stable/interactive/magics.html
- Writing documentation in Markdown: https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax
- Gapminder data: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
- Ecology data (field surveys): https://datacarpentry.org/python-ecology-lesson/data/portal-teachingdb-master.zip
- Social Science data (SAFI): https://datacarpentry.org/socialsci-workshop/data/
- Humanities data (Pettigrew letters): http://dx.doi.org/10.5334/data.1335350291