The dramatically titled Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt (disclaimer) reads much like someone reporting back on notes they took at a conference, because that’s essentially what it is. The book largely consists of summaries of talks given as part of a data science class. I wish the distinction between the content of the talks and the authors’ insertions of background information or teaching suggestions were clearer. I don’t see this book becoming a reference work for me, but it was nice to read through once to learn how different people set about solving specific data problems.
Monthly Archives: May 2014
Exploring NYC School Data
New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.
I was able to make use of a neat feature in pandas: vectorized string operations. Each school is identified by a “DBN”, a string indicating its district, borough and number. For example, “10X95” is district 10, Bronx, #95. The following code extracts the district and borough information to separate columns in a DataFrame in a fast, NaN-safe way:
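# expand=False keeps each result a Series; newer pandas versions
# return a one-column DataFrame from str.extract by default.
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+', expand=False)
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+', expand=False)
mergeddf['Borough'] = borough
mergeddf['District'] = district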
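As a rough sketch of the NaN-safe behavior, here is the same idea on a tiny made-up table. The sample DBN values and the frame name `df` are assumptions for illustration, not data from the notebook, and the single pattern with two named groups is a variation on the two separate calls above rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real notebook works on the NYC school tables.
df = pd.DataFrame({"DBN": ["10X095", "02M475", np.nan, "31R440"]})

# One pattern with two named groups returns both columns in a single pass.
parts = df["DBN"].str.extract(r"^(?P<District>\d+)(?P<Borough>[A-Z])\d+$")
df = df.join(parts)

# The missing DBN becomes NaN in both new columns instead of raising an error.
print(df)
```

Named groups give the new columns their names directly, which saves the two assignment lines at the cost of a slightly busier regular expression.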
For more neat data and plotting tricks made possible by Python, check out the full notebook.
Channel Islands National Park
We visited Channel Islands National Park last weekend. I’ve put up some pictures on the Channel Islands page. Overall it was not our favorite park: the hiking felt very similar to hiking the hot, dry, sun-scorched trails of Mt. Diablo, but with less shade. On the plus side, we did get to see a blue whale and her calf. We may try kayaking if we go again.
Book review: Python for Data Analysis
Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond pandas and covers other essential Python data tools such as IPython, NumPy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus, a diverse group of data sets is used in the examples, which makes for a more interesting read.
One useful function that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row-wise. I like apply because it’s an elegant way to perform arbitrary operations on each group of data, replacing cases where I might otherwise have used a loop like the one below:
result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result
There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.
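To give a feel for the per-group regression idea, here is a minimal sketch of my own, not the example from the book. The data, the column names and the helper `fit_line` are all made up for illustration, and the straight-line fit uses NumPy's `polyfit` rather than a full regression library.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups that each follow an exact line y = slope * x + 1.
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0] * 2,
})
df["y"] = np.where(df["group"] == "a", 2.0, -1.0) * df["x"] + 1.0


def fit_line(group):
    """Least-squares straight line through one group's (x, y) points."""
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})


# Select the data columns explicitly so the grouping key is not passed to fit_line.
fits = df.groupby("group")[["x", "y"]].apply(fit_line)
print(fits)
```

Because `fit_line` returns a Series, apply stacks the per-group results into a DataFrame with one row per group, which is the bookkeeping the loop-and-dictionary version would otherwise need.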
Thesis complete
I just had my thesis approved, and got the lollipop from the Graduate Division to prove it.