Skip to content

This project uses public data about the stock trades made by members of the US House of Representatives.

Notifications You must be signed in to change notification settings

noodleslove/House-of-Representative-Analysis-I

Repository files navigation

Stock Trades by Members of the US House of Representatives

This project uses public data about the stock trades made by members of the US House of Representatives. This data is collected and maintained by Timothy Carambat as part of the House Stock Watcher project. The project describes itself as follows:

With recent and ongoing investigations of incumbent congressional members being investigated for potentially violating the STOCK act. This website compiles this publicly available information in a format that is easier to digest then the original PDF source.

Members of Congress must report periodic reports of their asset transactions. This website is purely for an informative purpose and aid in transparency.

This site does not manipluate or censor any of the information from the original source. All data is transcribed by our community of contributors, which you can join for free by going to our transcription tool. Our moderation team takes great care in ensuring the accuracy of the information.

This site is built and maintained by Timothy Carambat and supported with our contributors.

Some interesting questions to consider for this data set include:

  • Is there a difference in stock trading behavior between political parties? For example:
    • does one party trade more often?
    • does one party make larger trades?
    • do the two parties invest in different stocks or sectors? For instance, do Democrats invest in Tesla more than Republicans?
  • What congresspeople have made the most trades?
  • What companies are most traded by congresspeople?
  • Is there evidence of insider trading? For example, Boeing stock dropped sharply in February 2020. Were there a suspiciously-high number of sales of Boeing before the drop?
  • When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?

Getting the Data

The full data set of stock trade disclosures is available as a CSV or as JSON at https://housestockwatcher.com/api.

This data set does not, however, contain the political affiliation of the congresspeople. If you wish to investigate a question that relies on having this information, you'll need to find another dataset that contains it and perform a merge. Hint: Kaggle is a useful source of data sets.

Cleaning and EDA

  • Clean the data.
    • Certain fields have "missing" data that isn't labeled as missing. For example, there are fields with the value "--." Do some exploration to find those values and convert them to null values.
    • You may also want to clean up the date columns to enable time-series exploration.
  • Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.

Assessment of Missingness

  • Assess the missingness per the requirements in project03.ipynb

Hypothesis Test / Permutation Test

Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.

Summary of Findings

Introduction

The following notebook analyzes publicly available information regarding stock trades made by US House of Representatives members to reach inferences. We first cleaned up the dataset and combined The 116th U.S. House of Representatives at https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116 which contains information regarding political affiliation of congresspeople. Next, we evaluate the dataset's owner column's missingness association.

After that, we began processing the dataset for insights, and try to explore answers to the following questions:

  • Does one party trade more often?
  • Does one party make larger trades?
  • What congresspeople have made the most trades?
  • What companies are most traded by congresspeople?
  • When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?

Cleaning and EDA

After obtaining the complete dataset of stock trade disclosures from https://housestockwatcher.com/api, we discover that the data need to be cleaned up since they are fairly untidy. To clean it, we did what is shown below:

  1. Change disclosure_date and transaction_date column to datetime type.
  2. Replace '--' value in ticker column with np.NaN.
  3. Replace '--' value in owner column with np.NaN.
  4. Convert amount to a pd.Categorical series.

The political affiliation of congressmen is missing from the dataset after it has been cleaned up, therefore we choose to utilize one from Kaggle at https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116. Due of the distinctions in the names between the two datasets, we combined them using the first and last names of each participant. Then we examine a few rare occurrences and manually resolve them. We were able to successfully integrate stock trade activity with representative political allegiance as an outcome.

Moving on, we process to EDA and find out that:

  • owner, ticker, transaction_date, and asset_description are 4 columns that contain missing data, some of the missingness in transaction_date is because of the incorrect value. For example, there are a few cell with value 0009-06-09 which is clearly not a valid date.
  • transaction_date range between 2012-06-19 and 2022-10-21.
  • Most of the congresspeople are either from Democrat or Republican, there is only one house member who is listed as Independent regrading their political affiliation.

What congresspeople have made the most trades?

  • By plotting the value counts of representative column, we have discovered that the representative Josh Gottheimer has made the most trades.

What companies are most traded by congresspeople?

  • By plotting the value counts of ticker column, we have discovered that the ticker MSFT, which is Microsoft Corp., has the most trade transactions.

When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?

  • By grouping the dateset by weekday of transaction_date, such that most of the transactions happened during weekdays, while only a tiny amount of transactions are done in weekend. Among weekdays, Thurday seems to have a slightly higher transaction volume.
  • By grouping the dataset by month of transation_date, we discover that February is the month has largest volume of transactions.

Assessment of Missingness

In this section we decided to evaluate the missingness of owner column as it has the most missing values across all columns. It has values like self, joint, dependent, and np.NaN. We think the missingness of owner column could be associated with type column. This concept arises from the fact that type describes the sort of transaction that is performed; if the type of transaction is not a stock exchange, it is less likely to fall into the self, joint, or dependant categories and end up as an empty value.

In order to validate this assumption, we have to perform a permutation test. To begin with, we determine the test statistic to be Total Variation Distance (TVD), as type is a categorical data. Then, we calculate the observed statistic for the original dataset, which is 0.07390. Afterwards, we shuffle the owner column and calculate the simulate statistics. By repeating this prcoess for 5,000 times, we then calculate the p-value for this permutation. As a result, we get a p-value of 0.0 which indicates that none of the simulate statistics has a more extreme result than the observed statistics. In conclusion, we conclude that the missingness of owner is Missing at Random (MAR), and it's dependent on type column the most.

Hypothesis Test

Which party trade more often?

  • Null hypothesis: the distribution of trading frequency among congresspeople from different party is the same. The difference between the two observed sample is due to chance.

  • Alternative hypothesis: the distribution of trading frequency among congresspeople from different party are different.

For the test statistics, we calculate the average trading transactions per month of each party and take the absolute difference between them. The observed statistics is 58.5469, and we shuffle the party column and run the permutation test for 5,000 times. At the end, we get a p-value of 0.8756, which indicates that majority of the permutation test cases have more extreme result than the observed statistics. Therefore, we fail to reject the null hypothesis, the distribution of trading frequency among congresspeople from various parties is probrably the same.

Code

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

Load transaction dataset

transactions = pd.read_csv('data/all_transactions.csv')
transactions.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
disclosure_year disclosure_date transaction_date owner ticker asset_description type amount representative district ptr_link cap_gains_over_200_usd
0 2021 10/04/2021 2021-09-27 joint BP BP plc purchase $1,001 - $15,000 Hon. Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False
1 2021 10/04/2021 2021-09-13 joint XOM Exxon Mobil Corporation purchase $1,001 - $15,000 Hon. Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False
2 2021 10/04/2021 2021-09-10 joint ILPT Industrial Logistics Properties Trust - Common... purchase $15,001 - $50,000 Hon. Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False
3 2021 10/04/2021 2021-09-28 joint PM Phillip Morris International Inc purchase $15,001 - $50,000 Hon. Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False
4 2021 10/04/2021 2021-09-17 self BLK BlackRock Inc sale_partial $1,001 - $15,000 Hon. Alan S. Lowenthal CA47 https://disclosures-clerk.house.gov/public_dis... False

Cleaning and EDA

cleaned = transactions.copy()

# convert `disclosure_date`, `transaction_date` to datetime type
cleaned['disclosure_date'] = pd.to_datetime(cleaned['disclosure_date'])
cleaned['transaction_date'] = pd.to_datetime(cleaned['transaction_date'], errors='coerce')

# change `ticker` null values
cleaned['ticker'] = cleaned['ticker'].replace('--', np.NaN)

# cahnge `owner` null values
cleaned['owner'] = cleaned['owner'].replace('--', np.NaN)

# convert `amount` to categorical type
cleaned['amount'] = pd.Categorical(cleaned['amount'])

cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15674 entries, 0 to 15673
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   disclosure_year         15674 non-null  int64         
 1   disclosure_date         15674 non-null  datetime64[ns]
 2   transaction_date        15667 non-null  datetime64[ns]
 3   owner                   8346 non-null   object        
 4   ticker                  14378 non-null  object        
 5   asset_description       15670 non-null  object        
 6   type                    15674 non-null  object        
 7   amount                  15674 non-null  category      
 8   representative          15674 non-null  object        
 9   district                15674 non-null  object        
 10  ptr_link                15674 non-null  object        
 11  cap_gains_over_200_usd  15674 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), int64(1), object(7)
memory usage: 1.2+ MB
cleaned.isna().sum()
disclosure_year              0
disclosure_date              0
transaction_date             7
owner                     7328
ticker                    1296
asset_description            4
type                         0
amount                       0
representative               0
district                     0
ptr_link                     0
cap_gains_over_200_usd       0
dtype: int64

Combine with political affliation dataset

# remove unwanted name suffixs
suffixs = ['Hon\\.', 'Mr\\.', 'Mrs\\.', 'None', 'Aston', 'S\\.', 'W\\.']
cleaned['representative'] = (cleaned['representative']
                             .str.replace('|'.join(suffixs), '', regex=True)
                             .str.strip())

cleaned['representative'].head()
0      Virginia Foxx
1      Virginia Foxx
2      Virginia Foxx
3      Virginia Foxx
4    Alan  Lowenthal
Name: representative, dtype: object
# split representative name into `first_name` and `last_name` for later merge 
cleaned['first_name'] = cleaned['representative'].apply(lambda x: x.split()[0].lower())
cleaned['last_name'] = cleaned['representative'].apply(lambda x: x.split()[-1].lower())

# fix special cases
cleaned.loc[cleaned['representative'] == 'Neal Patrick Dunn MD, FACS', 'last_name'] = 'dunn'

cleaned['first_name'].head()
0    virginia
1    virginia
2    virginia
3    virginia
4        alan
Name: first_name, dtype: object
# import member table 1
members1 = pd.read_csv('data/us-house.csv')
members1 = members1[['party', 'first_name', 'last_name']]
members1['first_name'] = members1['first_name'].str.lower()
members1['last_name'] = members1['last_name'].str.lower()
members1['party'] = members1['party'].str.capitalize()

members1.head(10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
party first_name last_name
0 Republican don young
1 Republican jerry carl
2 Republican felix moore
3 Republican mike rogers
4 Republican robert aderholt
5 Republican mo brooks
6 Republican gary palmer
7 Democrat terri sewell
8 Republican rick crawford
9 Republican french hill
# import member table 2
members2 = pd.read_csv('data/house_members_116.csv')
members2['first_name'] = members2['name'].apply(
    lambda x: x.split('-')[0].lower())
members2['last_name'] = members2['name'].apply(
    lambda x: x.split('-')[-1].lower())
members2 = members2.rename(columns={'current_party': 'party'})[
    ['first_name', 'last_name', 'party']]

# unify party values
members2.loc[members2['party'] == 'Democratic', 'party'] = 'Democrat'

members2.head(10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
first_name last_name party
0 ralph abraham Republican
1 alma adams Democrat
2 robert aderholt Republican
3 pete aguilar Democrat
4 rick allen Republican
5 colin allred Democrat
6 justin amash Independent
7 mark amodei Republican
8 kelly armstrong Republican
9 jodey arrington Republican
# combine 2 member tables
members = (pd.concat([members1, members2])
           .sort_values(['first_name', 'last_name'])
           .drop_duplicates(subset=['first_name', 'last_name'])
           .reset_index(drop=True))

# fix mismatch names
members.loc[members['first_name'] == 'k', 'first_name'] = 'k.'
members.loc[members['first_name'] == 'raul', 'first_name'] = 'raĂşl'
members.loc[members['first_name'] == 'wm', 'first_name'] = 'wm.'
members.loc[members['first_name'] == 'ro', 'first_name'] = 'rohit'
members.loc[members['first_name'] == 'cynthia', 'first_name'] = 'cindy'
members.loc[members['last_name'] == 'allen', 'first_name'] = 'richard'
members.loc[members['last_name'] == 'steube', 'first_name'] = 'greg'
members.loc[members['last_name'] == 'banks', 'first_name'] = 'james'
members.loc[(members['first_name'] == 'j') & (
    members['last_name'] == 'hill'), 'first_name'] = 'james'
members.loc[(members['first_name'] == 'mike') & (
    members['last_name'] == 'garcia'), 'first_name'] = 'michael'
members.loc[members['last_name'] == 'crenshaw', 'first_name'] = 'daniel'
members.loc[members['last_name'] == 'taylor', 'first_name'] = 'nicholas'
members.loc[members['last_name'] == 'gallagher', 'first_name'] = 'michael'
members.loc[(members['first_name'] == 'gregory') & (
    members['last_name'] == 'murphy'), 'first_name'] = 'greg'
members.loc[members['first_name'] == 'ashley', 'last_name'] = 'arenholz'
members.loc[members['last_name'] == 'buck', 'first_name'] = 'kenneth'
members.loc[members['last_name'] == 'costa', 'first_name'] = 'james'
members.loc[members['last_name'] == 'hagedorn', 'first_name'] = 'james'

# drop duplicate rows
members = members.drop_duplicates(subset=['first_name', 'last_name'])

# output cleaned representative table
members.to_csv('data/cleaned_members.csv', index=False)

members.shape
(547, 3)
# transaction table with member info table
combined = cleaned.merge(members, how='left', on=['first_name', 'last_name'])

combined.loc[combined['party'].isna(), 'representative'].unique()
array([], dtype=object)
combined.shape
(15674, 15)
transactions.shape
(15674, 12)
combined.to_csv('data/combined_transactions.csv', index=False)
combined['disclosure_year'].plot(kind='hist', density=True, bins=range(2019,2025,1));

png

combined['type'].value_counts().plot(kind='bar')
plt.xticks(rotation=0);

png

combined['amount'].value_counts().plot(kind='barh');

png

combined['representative'].value_counts().head(10).plot(kind='barh');

png

combined['district'].value_counts().head(10).plot(kind='barh');

png

combined['ticker'].value_counts().head(10).plot(kind='barh');

png

Assessment of Missingness

combined.isna().sum()
disclosure_year              0
disclosure_date              0
transaction_date             7
owner                     7328
ticker                    1296
asset_description            4
type                         0
amount                       0
representative               0
district                     0
ptr_link                     0
cap_gains_over_200_usd       0
first_name                   0
last_name                    0
party                        0
dtype: int64
combined['owner'].unique()
array(['joint', 'self', nan, 'dependent'], dtype=object)
def make_dist(df, missing_col, col):
    dist = (
        df
        .assign(**{f'{missing_col}_null': df[missing_col].isna()})
        .pivot_table(index=col, columns=f'{missing_col}_null', aggfunc='size', fill_value=0)
    )
    dist = dist / dist.sum()
    
    dist.plot(kind='barh', legend=True, title=f"{col.capitalize()} by Missingness of {missing_col.capitalize()}")
    plt.show()
    
    return dist


def calc_tvd(df, missing_col, col):
    dist = (
        df
        .assign(**{f'{missing_col}_null': df[missing_col].isna()})
        .pivot_table(index=col, columns=f'{missing_col}_null', aggfunc='size', fill_value=0)
    )
    dist = dist / dist.sum()
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2


def missingness_perm_test(df, missing_col, col):
    shuffled = df.copy()
    shuffled[f'{missing_col}_null'] = shuffled[missing_col].isna()
    
    make_dist(df, missing_col, col)
    obs_tvd = calc_tvd(df, missing_col, col)

    n_repetitions = 1000
    tvds = []
    for _ in range(n_repetitions):

        # Shuffling genders and assigning back to the DataFrame
        shuffled[col] = np.random.permutation(shuffled[col])

        # Computing and storing TVD
        tvd = calc_tvd(shuffled, missing_col, col)
        tvds.append(tvd)

    tvds = np.array(tvds)
    pval = np.mean(tvds >= obs_tvd)
    
    # Draw the p-value graph
    pd.Series(tvds).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated TVDs')
    plt.axvline(x=obs_tvd, color='red', linewidth=4, label='Observed TVD')
    plt.legend()
    plt.show()
    
    return obs_tvd, pval

obs_tvd, pval = missingness_perm_test(combined, 'owner', 'type')

png

png

shuffled = combined.copy()
shuffled['owner_null'] = shuffled['owner'].isna()

n_repetitions = 1000
tvds = []
for _ in range(n_repetitions):
    
    # Shuffling genders and assigning back to the DataFrame
    shuffled['type'] = np.random.permutation(shuffled['type'])
    
    # Computing and storing TVD
    pivoted = (
        shuffled
        .pivot_table(index='type', columns='owner_null', aggfunc='size')
        .apply(lambda x: x / x.sum(), axis=0)
    )
    
    tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
    tvds.append(tvd)
    
tvds = np.array(tvds)
tvds[:10]
array([0.00394066, 0.01337585, 0.01542276, 0.00559226, 0.01175366,
       0.00862066, 0.00919086, 0.00751793, 0.00661405, 0.01520279])
pval = np.mean(tvds >= obs_tvd)

pd.Series(tvds).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated TVDs')
plt.axvline(x=obs_tvd, color='red', linewidth=4, label='Observed TVD')
plt.legend();

png

So we conclude that the missingness of owner is MAR, and it's dependent on type column the most.

Hypothesis Test / Permutation Test

Which party trade more often?

  • Null hypothesis: the distribution of trading frequency among congresspeople from different party is the same. The difference between the two observed sample is due to chance.

  • Alternative hypothesis: the distribution of trading frequency among congresspeople from different party are different.

combined.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
disclosure_year disclosure_date transaction_date owner ticker asset_description type amount representative district ptr_link cap_gains_over_200_usd first_name last_name party
0 2021 2021-10-04 2021-09-27 joint BP BP plc purchase $1,001 - $15,000 Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False virginia foxx Republican
1 2021 2021-10-04 2021-09-13 joint XOM Exxon Mobil Corporation purchase $1,001 - $15,000 Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False virginia foxx Republican
2 2021 2021-10-04 2021-09-10 joint ILPT Industrial Logistics Properties Trust - Common... purchase $15,001 - $50,000 Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False virginia foxx Republican
3 2021 2021-10-04 2021-09-28 joint PM Phillip Morris International Inc purchase $15,001 - $50,000 Virginia Foxx NC05 https://disclosures-clerk.house.gov/public_dis... False virginia foxx Republican
4 2021 2021-10-04 2021-09-17 self BLK BlackRock Inc sale_partial $1,001 - $15,000 Alan Lowenthal CA47 https://disclosures-clerk.house.gov/public_dis... False alan lowenthal Democrat
df = combined.assign(transaction_year=combined['transaction_date'].dt.year,
                     transaction_month=combined['transaction_date'].dt.month)
df = (
    df
    .groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
    .count()
    .reset_index()
)
democrat_stats = df.loc[df['party'] == 'Democrat', 'representative'].sum() / (df['party'] == 'Democrat').sum()
republican_stats =  df.loc[df['party'] == 'Republican', 'representative'].sum() / (df['party'] == 'Republican').sum()

obs_stats = abs(democrat_stats - republican_stats)

shuffled = combined.assign(transaction_year=combined['transaction_date'].dt.year,
                           transaction_month=combined['transaction_date'].dt.month)

n_repetitions = 5000
stats = []
for _ in range(n_repetitions):
    
    # Shuffling genders and assigning back to the DataFrame
    shuffled['party'] = np.random.permutation(shuffled['party'])
    
    # Computing and storing TVD
    pivoted = (
        shuffled
        .groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
        .count()
        .reset_index()
    )
    
    democrat_stats = pivoted.loc[pivoted['party'] == 'Democrat', 'representative'].sum() / (pivoted['party'] == 'Democrat').sum()
    republican_stats =  pivoted.loc[pivoted['party'] == 'Republican', 'representative'].sum() / (pivoted['party'] == 'Republican').sum()
    stats.append(abs(democrat_stats - republican_stats))
    
stats = np.array(stats)
pval = np.mean(stats >= obs_stats)

pd.Series(stats).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated Stats')
plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
plt.legend();

png

Conclusion

The p-value of the permutation test is 0.8722, which is way larger the the 0.05. Thus, we fail to reject the null hypothesis, which means that distribution of trading frequency among congresspeople from different party may be the same.

df = combined.assign(transaction_year=combined['transaction_date'].dt.year,
                     transaction_month=combined['transaction_date'].dt.month)
df = (
    df
    .groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
    .count()
    .reset_index()
)
democrat_stats = df.loc[df['party'] == 'Democrat', 'representative'].sum() / (df['party'] == 'Democrat').sum()
republican_stats =  df.loc[df['party'] == 'Republican', 'representative'].sum() / (df['party'] == 'Republican').sum()

democrat_stats, republican_stats
(178.2, 119.65306122448979)
obs_stats = abs(democrat_stats - republican_stats)
obs_stats
58.5469387755102
shuffled = combined.assign(transaction_year=combined['transaction_date'].dt.year,
                           transaction_month=combined['transaction_date'].dt.month)

n_repetitions = 5000
stats = []
for _ in range(n_repetitions):
    
    # Shuffling genders and assigning back to the DataFrame
    shuffled['party'] = np.random.permutation(shuffled['party'])
    
    # Computing and storing TVD
    pivoted = (
        shuffled
        .groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
        .count()
        .reset_index()
    )
    
    democrat_stats = pivoted.loc[pivoted['party'] == 'Democrat', 'representative'].sum() / (pivoted['party'] == 'Democrat').sum()
    republican_stats =  pivoted.loc[pivoted['party'] == 'Republican', 'representative'].sum() / (pivoted['party'] == 'Republican').sum()
    stats.append(abs(democrat_stats - republican_stats))
    
stats = np.array(stats)
stats[:10]
array([59.98214286, 72.09833091, 70.87735849, 65.37517483, 74.30188679,
       57.72      , 55.32653061, 68.71225071, 60.90181818, 65.33776224])
pval = np.mean(stats >= obs_stats)

pd.Series(stats).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated Stats')
plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
plt.legend();

png

ser1 = (
    combined
    .groupby('party')['transaction_date'].agg(['min', 'max'])
    .diff(axis=1)
    .iloc[:, -1]
    .apply(lambda x: x.days)
)
ser2 = combined.groupby('party')['representative'].count()

ser2 / ser1
party
Democrat       2.596398
Independent         inf
Republican     3.725079
dtype: float64
df2 = combined.groupby('party')['representative'].count()
def calc_test_statistics(df):
    counts = df.groupby('party')['representative'].count()
    ranges = (
        df
        .groupby('party')['transaction_date'].agg(['min', 'max'])
        .diff(axis=1)
        .iloc[:, -1]
        .apply(lambda x: x.days)
    )
    result = counts / ranges
    return abs(result['Democrat'] - result['Republican'])


calc_test_statistics(combined)
1.1286810599946193
def permutation_test(df, n_repetitions=1000):
    
    shuffled = df.copy()
    obs_stats = calc_test_statistics(shuffled)

    sim_stats = []
    for _ in range(n_repetitions):

        # Shuffling genders and assigning back to the DataFrame
        shuffled['party'] = np.random.permutation(shuffled['party'])

        # Computing and storing TVD
        stats = calc_test_statistics(shuffled)
        sim_stats.append(stats)
        
    sim_stats = np.array(sim_stats)
    pval = np.mean(sim_stats >= obs_stats)
    
    pd.Series(sim_stats).plot(kind='hist', density=True, ec='w', bins=20, title=f'p-value: {pval}', label='Simulated Stats')
    plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
    plt.legend()
    plt.show()
    
    return pval, sim_stats

pval, sim_stats = permutation_test(combined)
sim_stats[:10]

png

array([3.49204486, 4.5007299 , 0.74758808, 1.03989987, 4.19301549,
       0.85884316, 3.71209501, 0.81663253, 0.58560186, 0.96367451])
sim_stats.min(), sim_stats.max()
(0.4204057417960114, 4.691782988465429)
combined.sort_values('transaction_date').head(10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
disclosure_year disclosure_date transaction_date owner ticker asset_description type amount representative district ptr_link cap_gains_over_200_usd first_name last_name party
10586 2021 2021-08-26 2012-06-19 NaN BLFSD BioLife Solutions Inc purchase $1,001 - $15,000 Tom Malinowski NJ07 https://disclosures-clerk.house.gov/public_dis... False tom malinowski Democrat
11456 2022 2022-03-03 2017-09-05 NaN SUP Superior Industries International Inc Common S... purchase $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11437 2022 2022-03-03 2017-12-06 NaN CAT Caterpillar Inc purchase $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11436 2022 2022-03-03 2018-04-17 NaN BA Boeing Company purchase $15,001 - $50,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11442 2022 2022-03-03 2018-04-30 NaN CTRL Control4 Corporation purchase $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11453 2022 2022-03-03 2018-05-08 NaN GE General Electric Company sale_full $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11443 2022 2022-03-03 2018-06-27 NaN CRTL Control4 Corporation purchase $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11455 2022 2022-03-03 2018-06-27 NaN IBM International Business Machines Corporation sale_full $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
11457 2022 2022-03-03 2018-06-27 NaN VZ Verizon Communications Inc purchase $1,001 - $15,000 Thomas Suozzi NY03 https://disclosures-clerk.house.gov/public_dis... False thomas suozzi Democrat
4285 2021 2021-09-28 2018-09-08 dependent NaN iSelect Fund B St. Louis Inc purchase $1,001 - $15,000 Roger Marshall KS01 https://disclosures-clerk.house.gov/public_dis... False roger marshall Republican
def to_weekday(x):
    day = x.weekday()
    if day == 0:
        return 'Monday'
    elif day == 1:
        return 'Tuesday'
    elif day == 2:
        return 'Wednesday'
    elif day == 3:
        return 'Thursday'
    elif day == 4:
        return 'Friday'
    elif day == 5:
        return 'Saturday'
    else:
        return 'Sunday'

df = combined.assign(weekday=combined['transaction_date'].apply(to_weekday))
df.groupby('weekday').count().rename(columns={'transaction_date': 'count'})[['count']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count
weekday
Friday 3079
Monday 3059
Saturday 48
Sunday 27
Thursday 3358
Tuesday 3021
Wednesday 3075
df = combined.assign(month=combined['transaction_date'].dt.month)
df.groupby('month').count().rename(columns={'transaction_date': 'count'})[['count']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count
month
1.0 1561
2.0 2110
3.0 2081
4.0 1438
5.0 1006
6.0 1485
7.0 971
8.0 972
9.0 1056
10.0 856
11.0 1131
12.0 1000

About

This project uses public data about the stock trades made by members of the US House of Representatives.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published