Skip to content

Python & Pandas Analysis Project: Read and external CSV into DataFrame; Retrieve data from DataFrame; Data cleaning; Merge, filter, slice and sort; groupby; loc; replace; bins&cut; function; calculation; create DataFrame; format DataFrame.

License

Notifications You must be signed in to change notification settings

weihaolun/school-district-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

School District Analysis with Pandas

I. Overview

Background

We have been assisting Maria, the Chief Data Scientist in City School District to analyze the standard test results. We received two raw data sources for this project:

  • schools_complete.csv: contains information of 15 schools. School names, type, size and budget.
  • students_cmplete.csv: contains student’s name, gender, grade, school name, reading and math score of each of the 39170 students from the district.

We aim to analyze the standard test scores (math and reading) and aggregate the data to show the insight and performance trends in the school district. The result of this project will assist School Board in making decisions regarding to school budget and priorities.

Before this reported portion of the project, we have already conducted the following steps to the dataset:

  • Clean up the data.
  • Merge two .csv data files into one DataFrame.
  • Created district summary DataFrame.
  • Created per school summary Dataframe with data of each school.
  • Rankings of top 5 and bottom 5 schools based on overall passing rate (rate of students passed both math and reading test).
  • Math & Reading scores by grades from 9th to 12th.
  • Score performance grouped by budget per student.
  • Score performance grouped by school size.
  • Score performance grouped by school type.

Purpose

The School Board found out that there’s evidence of academic dishonesty in reading and math scores for Thomas High School ninth graders, specifically, the scores appear to have been altered. Therefore for this part of the project, we are going to help Maria to replace the math and reading scores for all Thomas High School ninth graders with NaNs while keeping the rest of the data intact. After conducting replacement, we will compare the data and summarize the changes.

For this portion of the project, we will conduct the following steps to the dataset:

  • Clean up the data.
  • Replace Thomas High School ninth grader scores with NaNs.
  • Repeat the school district analysis with adjusted data.
  • Adjust school data for Thomas High School.
  • Repeat per school summary with adjusted data.
  • Repeat school rankings with adjusted data.
  • Repeat Math & Reading scores by grades with adjusted data.
  • Repeat score performance grouped by budget per student with adjusted data.
  • Repeat score performance grouped by school size with adjusted data.
  • Repeat score performance grouped by school type with adjusted data.
  • Compare and analyze DataFrame before & after adjustment.

II. Results

1. How is the district summary affected?

  • The result for overall District Summary is affected by removing Thomas High School ninth graders’ scores. The student count and scores are both adjusted. The number of Thomas High School ninth graders has been deducted from total students count, and their scores have been adjusted to NaNs.

  • As shown in the tables below, Average Math Score, Average Reading Score, % Passing Math, % Passing Reading and % Overall Passing, all the metrics dropped after the adjustment. Average Reading Score dropped very slightly but we can still see the change from the unformatted version.

District Summary

  • Average Math Score changed from 79.0 to 78.9; Average Reading Score changed from 81.87784 to 81.855796; % Passing Math changed from 75.0 to 74.8; % Passing Reading changed from 85.8 to 85.7; % Overall Passing rate changed from 65.2 to 64.9.

2. How is the school summary affected?

  • School Summary DataFrame is a table for each school’s individual data. Each of the 15 schools has its own row, so that only the row for Thomas High School is affected by the adjustment.

  • Thomas High School data gets adjusted twice. The first change only adjusts Average Math Score and Average Reading Score to the correct values, while % Passing Math, % Passing Reading and % Overall Passing all drop to 60%+ which are not correct yet.

    • Average scores get adjusted correctly because all Thomas High School ninth graders’ scores are NaNs and the average calculation by the code below automatically skips student count with NaN score values. Therefore, both total scores and students counts are correctly adjusted for this part of calculation.
    per_school_math = school_data_complete_df.groupby(["school_name"]).mean()["math_score"]
    per_school_reading = school_data_complete_df.groupby(["school_name"]).mean()["reading_score"]
    
    • % data needs further adjustment because only the number of passing students is adjusted (deducted NaNs) while the student counts is unchanged and still includes every student with label “Thomas High School”.

School Summary1

  • The second change corrects % data. We wrote a script to calculate the % data separately for Thomas High School and we ensured that the total count in this calculation only includes 10th-12th graders. The final summary for Thomas High School is shown below:

School Summary2

3. How does replacing the ninth graders' math and reading scores affect Thomas High School's performance relative to the other schools?

Before replacing the ninth graders math and reading scores, Thomas High School was ranked as No.2 among the 15 schools based on the overall passing rate. After the adjustment, Thomas High School is till placed at No.2. Therefore, the adjustment does not affect Thomas High School’s performance relative to the other schools.

School Ranking

4. How does replacing the ninth-grade scores affect the following:

  • Math and reading scores by grade

    Scores By Grade DataFrame is labeled by both school name and grade, therefore only the data for 9th grade of Thomas High School is affected. For both reading and math scores table, the cell at row “Thomas High School” and column “9th” shows “nan”!

    By grade

  • Scores by school spending

    The adjustment only affects the group of $630 - $644 budget per student. The data changed very slightly after the decimal point, but remained same after formatting to one decimal for scores and whole number for percentage. (Please refer to Appendix for table comparison)

  • Scores by school size

    Thomas High School falls into medium-size school category (1000 – 2000 students) and only this group is affected by the adjustment. The data changed very slightly after the decimal point, but remained same after formatting to one decimal for scores and whole number for percentage. (Please refer to Appendix for table comparison)

  • Scores by school type

    Thomas High School is a charter school and only charter school data is affected. There are only two types of schools so that the change of just ninth graders of Thomas High School is very insignificant. Same as above metrics, the data remained same after formatting to one decimal for scores and whole number for percentage. (Please refer to Appendix for table comparison)

III. Summary

All the changes are caused by replacing Thomas High School ninth graders’ scores to NaNs, and the students count for Thomas High School used for calculations is also adjusted accordingly. In conclusion, there are mainly following changes:

  1. District Summary data changed. Average Math Score, Average Reading Score, % Math Passing, % Reading Passing and % Overall Passing are all affected by the replacing.
  2. In Per School Summary, Thomas High School's data changed. Similar to District Summary, all the metrics are affected by the replacement except for school’s basic information such as budget, type and size.
  3. Math Score by Grade changed. (Average) Math score for 9th graders at Thomas High School has been adjusted to NaNs.
  4. Reading Score by Grade changed. (Average) Reading score for 9th graders at Thomas High School has been adjusted to NaNs.
  5. As mentioned in Results section, data of Score by Spending, by Size and by Type all changed slightly but not significant enough to appear changes after formatting.

IV. Appendix

Scores by school spending

By spending

Scores by school size

By size

Scores by school type

By type

Exported Excel File

Challenge_Data

About

Python & Pandas Analysis Project: Read and external CSV into DataFrame; Retrieve data from DataFrame; Data cleaning; Merge, filter, slice and sort; groupby; loc; replace; bins&cut; function; calculation; create DataFrame; format DataFrame.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published