New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.
I was able to use a neat pandas feature: vectorized string operations. Each school is identified by a “DBN”, a string encoding its district, borough, and school number. For example, “10X95” is district 10, Bronx (X), school number 95. The following code extracts the district and borough into separate columns of a DataFrame in a fast, NaN-safe way:
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
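To see the NaN-safe behavior in isolation, here is a minimal sketch on a tiny made-up DataFrame. The `sample` frame and its DBN values are hypothetical stand-ins for the real merged data, and combining both extractions into one pattern with named groups is a variation on the code above, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the merged school DataFrame.
sample = pd.DataFrame({"DBN": ["10X95", "02M475", np.nan, "31R080"]})

# A single pattern with two named groups returns a DataFrame whose
# columns are named after the groups. Missing DBNs yield NaN in both
# columns rather than raising an error.
parts = sample["DBN"].str.extract(r"^(?P<District>\d+)(?P<Borough>[A-Z])\d+$")

# Attach the extracted columns alongside the original DBN column.
sample = sample.join(parts)

print(sample)
```

Note that the extracted district is still a string (so a leading zero is preserved); if a numeric district is needed, `pd.to_numeric(sample["District"])` should convert it while keeping missing values as NaN.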
For more neat data and plotting tricks made possible by Python, check out the full notebook.