New York City makes a large amount of data on its school system available for analysis. I'm especially interested in this data since I was an input into the process that generated it from 2006-2008. Here I've taken some of that data and looked at trends of test scores across geography and time.
The data are available here:%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
I save the data as a .csv file in the end because parsing the excel sheet is slow, and storing the .csv avoids repeating this step. This data is from a New York State Math test from 2006-2013, grades 3-8.
rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
'All Students',
skiprows=6,
na_values=['s'])
raw2013 = pd.read_excel('data/SchoolMathResults2013.xlsx',
'All Students',
skiprows=6,
na_values=['s'])
rawdf = pd.concat([rawdf, raw2013])
rawdf.to_csv('data/SchoolMathResults20062013Public.csv')
rawdf = pd.read_csv('data/SchoolMathResults20062013Public.csv')
rawdf.head()
Here I add information back into the column names that was lost when reading in the excel sheet:
level_number_strings = [str(i) for i in range(1,5)] + ['3&4']
new_cols = ['Level ' + str(s1) + ' ' + s2
for s1 in level_number_strings
for s2 in ['#', '%']]
rawdf.columns = np.append(rawdf.columns.values[:-10], new_cols)
rawdf.head()
Here I use the converters
argument to get the ending year of the school years, which appear in "2006-2007" format in this file. This makes the year column consistent with the format of the test results data. Also I use na_values
to properly encode the string "n/a" as a missing data point.
demographic_df = pd.read_excel(
'data/DemographicSnapshot2012Public.xlsx',
'School Demographics',
converters={'School Year': lambda x: int(x[-4:])},
na_values=['n/a'])
demographic_df = demographic_df.replace(r'^\s+$', np.nan, regex=True)
demographic_df.to_csv('data/DemographicSnapshot2012Public.csv')
demographic_df.head()
The combine_first
method replaces all the NaN values in the '% Free Lunch' column with the value from the '% Free and Reduced Price Lunch' column. It seems '% Free and Reduced Price Lunch' became the statistic of choice starting in 2010.
combined_lunch = demographic_df['% Free Lunch'].combine_first(
demographic_df['% Free and Reduced Price Lunch']
)
demographic_df['% Free Lunch'] = combined_lunch.astype(float)
demographic_df = demographic_df.drop(['% Free and Reduced Price Lunch'], axis=1)
demographic_df = demographic_df.rename(columns={'School Year': 'Year'})
demographic_df.head(5)
JOIN
operation):¶mergeddf = rawdf.merge(demographic_df, on=['DBN', 'Year'])
Pandas includes vectorized string methods that allow fast, NaN safe extraction of borough and district from the 'DBN' column using regular expressions.
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
mergeddf.head()
I find it useful to write plotting functions that take in the data in question as a DataFrame
, that way the data can be easily subsetted or transformed and then easily plotted while doing exploratory analysis. The plot()
method of DataFrame
is helpful for segmenting the data between columns, which otherwise would require repeated explicit matplotlib
calls.
It's a little unintuitive to make scatter plots with the pandas plot()
method, because kind='scatter'
doesn't work. The solution is to pass in the arguments linestyle='', marker='o'
, which gets rid of the line and adds a circle at each data point.
def scatter_plot(alldatadf,
x_axis_col='% Free Lunch',
y_axis_col='Mean Scale Score',
separate_plots_by='Year',
groups_on_plots='Grade',
drop=[]):
"""Make scatter plots of data in alldatadf.
alldatadf: pandas data frame.
x_axis_col: the column of alldatadf to use as the x-axis. Should
be the same for all values of the in the groups_on_plots column,
since this will become in the index of the data frame that is
eventually plotted.
y_axis_col: the column of alldatadf to use as the y-axis.
separate_plots_by: a column name of alldatadf, each unique value
in the column be plotted in a separate figure.
groups_on_plots: a column name of alldatadf, each unique value
will be plotted as a separate color.
drop: columns of alldatadf to drop.
"""
for grp_name, df in alldatadf.groupby(separate_plots_by):
pivot_table = df.pivot(index='DBN', columns=groups_on_plots)
pivot_table = pivot_table[[x_axis_col, y_axis_col]]
pivot_table.index = pivot_table[x_axis_col].iloc[:, 0]
pivot_table.index.name = x_axis_col
pivot_table = pivot_table[y_axis_col]
pivot_table = pivot_table.drop(drop, axis=1)
pivot_table = pivot_table[~np.isnan(pivot_table.index.values)]
pivot_table = pivot_table.sort_index()
ax = pivot_table.plot(linestyle='',
marker='o',
alpha=.5,
title=grp_name,
figsize=(10,6))
ax.set_ylabel(y_axis_col)
title_value = ax.get_title()
ax.set_title(separate_plots_by + ' = ' + title_value)
Here's the plotting function in action, with its default arguments. The data is pretty dense here and specific trends within each grade are hard to distinguish, however there does seem to be an overall negative correlation between the mean score and the % free lunch.
scatter_plot(mergeddf)
As with most things, pandas makes it easy to compute the correlation coefficient (calculated here for all points of mean scale score data in the data set.
mergeddf[['% Free Lunch', 'Mean Scale Score']].corr()
Now instead of the mean score on the y-axis, the % of students reaching levels 3&4 (i.e. passing) is plotted, split by year on each figure. Although again the plotted data is very dense, I did notice that some of the years seem to have higher passing rates than others, which I decided to investigate in a bar graph later in the analysis.
scatter_plot(mergeddf,
groups_on_plots='Year',
separate_plots_by='Grade',
y_axis_col='Level 3&4 %')
To get a better view of summary statistics rather than the raw view offered by the scatter plots, I wrote a function to make bar graphs of a given slicing of the data. This function can also make line graphs for cases when there are so many bars that the graph becomes unreadable (for example when each of the 32 districts is its own bar). Unfortunately, the line graph runs into a common problem: the matplotlib default plotting options repeat quite frequently, which allows only 7 lines to be plotted before two lines look identical in the legend. I get around this by specifying a cycle of markers for the lines. Later in the analysis I also change the default color cycle to make the lines more easily distinguishable.
from itertools import cycle
def bar_graph(alldatadf,
x_axis_col='Year',
y_axis_col='Level 3&4 %',
split_bars_by='Grade',
kind='bar',
figsize=(10,10),
drop=['All Grades']):
"""Plot a bar graph of the data in alldatadf.
alldatadf: DataFrame.
x_axis_col: the column of alldatadf to use as the x-axis.
y_axis_col: the column of alldatadf to use as the y-axis.
split_bars_by: a column name of alldatadf, each unique value
in the column be plotted as one color of bar.
kind: the type of plot. Must be supported by the plot method
of pandas DataFrames.
figsize: tuple of length 2, the figure size.
drop: columns of alldatadf to drop.
"""
grouped = alldatadf.groupby([x_axis_col, split_bars_by])
aggregate_grouped = grouped.agg([np.mean, np.std], level=1)
aggregate_grouped_y = aggregate_grouped[y_axis_col]
unstacked = aggregate_grouped_y.unstack()
unstacked = unstacked.drop(drop, axis=1, level=1)
fig = plt.figure(figsize=figsize)
ax = fig.add_axes([.1, .1, .8, .5])
unstacked['mean'].plot(kind=kind, ax=ax)
ax.set_ylabel(y_axis_col)
ax.set_xticklabels(unstacked.index.values)
legend = ax.legend(loc=2, ncol=2)
legend.set_bbox_to_anchor((1.05, 1))
legend.set_title(split_bars_by)
if kind == 'line':
marker_values = 'oxDs'
markercycle = cycle(marker_values)
for line in ax.get_lines():
line.set_marker(next(markercycle))
markercycle = cycle(marker_values) # reset the cycle
for label in legend.get_lines():
label.set_marker(next(markercycle))
bar_graph(mergeddf)
bar_graph(mergeddf, split_bars_by='Borough')
bar_graph(mergeddf, split_bars_by='Borough', x_axis_col='Grade')
bar_graph(mergeddf, split_bars_by='District')
Yikes, this is too many bars! Time for a line graph:
bar_graph(mergeddf, split_bars_by='District', kind='line')
These lines are still a bit hard to tell apart. To fix this I'll change the matplotlib defaults (using plt.rc()
) to use a spectrum of colors.
colormap = plt.get_cmap('spectral')
colorlist = [colormap(z) for z in np.linspace(0, 1, num=32)]
plt.rc('axes', color_cycle=colorlist)
bar_graph(mergeddf, split_bars_by='District', kind='line')