Book review: Building Machine Learning Systems with Python

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in python (and the numpy library) and is interested in using python machine learning libraries. I can’t recommend it as strongly for someone who is unfamiliar with python, because the code in the book is often unpythonic (in my opinion), and the code available on the book’s website doesn’t match the code in the book well and requires a fair bit of tweaking before the examples actually run.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing that would improve the code in the book is better variable names. For example, the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v
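The same loop reads much better with descriptive names (a sketch, assuming topics holds one gensim-style list of (topic_id, weight) pairs per document):

import numpy as np

num_topics = 100
# topics: one list of (topic_id, weight) pairs per document, e.g. from a gensim LDA model
doc_topic_weights = np.zeros((len(topics), num_topics), float)
for doc_index, doc_topics in enumerate(topics):
    for topic_id, weight in doc_topics:
        doc_topic_weights[doc_index, topic_id] = weight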

There is also a clumsy use of a list comprehension to reshape a numpy array:

x = np.array([[v] for v in x])

Using the built-in reshape method is more memory efficient, easier to read, and about 1,000 times faster:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop


Python script to rename figures

I’m currently finishing up writing my thesis. I’ve found the simplest way to name the files for the figures in a paper is in the format figure 1.jpg, figure 2.jpg…  That way all the authors know exactly which figure in the paper corresponds to which file. This works well in a paper, where the number of figures is generally fixed, but if I decide I want to insert a figure between figures 1 and 2 in my thesis, every subsequent figure’s number increases by one, and I don’t want to rename 20 files by hand. To fix this I wrote a python script that goes through a folder and increments by 1 the number of every figure from a certain figure onward (i.e. it turns figure 1.jpg, figure 2.jpg, figure 3.jpg… into figure 1.jpg, figure 3.jpg, figure 4.jpg…).

The code uses regular expressions to find the files that should be renamed and to figure out what number each one had originally. To change figure 2 and above, the command line call would be:

$ python figure_rename.py 2

Here is the script:
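(A minimal sketch, assuming the files follow the “figure N.jpg” naming pattern and the starting number is passed as the first command-line argument; the original script may differ in the details.)

import os
import re
import sys

# The figure number to start at: this figure and every later one gets bumped up by 1
start = int(sys.argv[1])

pattern = re.compile(r'^figure (\d+)(\.\w+)$')

# Collect (number, filename, extension) for every matching file in the current folder
figures = []
for filename in os.listdir('.'):
    match = pattern.match(filename)
    if match:
        figures.append((int(match.group(1)), filename, match.group(2)))

# Rename from the highest number down so no existing file gets overwritten
for number, filename, extension in sorted(figures, reverse=True):
    if number >= start:
        os.rename(filename, 'figure %d%s' % (number + 1, extension))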

Singular Value Decomposition

In the Statistical Computing class I took last fall, a matrix decomposition called the singular value decomposition came up briefly as a way to classify similar objects. I wanted to learn more about it but I couldn’t find a resource that was exactly what I wanted, so I decided to create one. I wrote up two IPython Notebooks about singular value decomposition: one is an introduction to the concept and an example application to classifying research articles, the other deals with image compression. The notebooks are viewable at the links below, where you can also download the code:

Singular Value Decomposition and Applications

Singular Value Decomposition of an Image
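As a quick taste of the image compression idea (a minimal numpy sketch, not code from the notebooks): keeping only the k largest singular values gives a low-rank approximation of the image.

import numpy as np

# Stand-in for a grayscale image; any 2-D array works
image = np.random.rand(256, 256)

# Thin SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(image, full_matrices=False)
k = 20
approx = np.dot(U[:, :k] * s[:k], Vt[:k, :])

# Storage drops from 256*256 values to roughly k*(256 + 256 + 1)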

Machine Learning Class

I recently started taking another online class, Machine Learning. Like the database class I took previously (blog entry), it is from Stanford. Unlike the database class, it is offered through Coursera, a for-profit company co-founded by the course’s professor, Andrew Ng. The for-profit aspect really came into focus when I got an email offering a “Free $25 Promotional Code for Machine Learning Tutoring!”

The first few weeks covered a lot of things I was already familiar with, but I’m still enjoying hearing the material from a different angle. Although overall the course feels a bit less polished than the Database class, I’m really excited to be taking it and I’m looking forward to the next week of lectures.

Online Introduction to Databases class

I recently finished a free online course from Stanford on Databases (link to course webpage). It’s taught by Professor Jennifer Widom, who gives excellent lectures throughout the course. The exercises are well thought out and fun to do, and the online submission and grading system is very intuitive. Another nice feature is the “progress” tab, which shows a bar graph of all the points you’ve earned on each individual assignment, as well as your current total of points for the course. The progress graph makes writing SQL and XML queries feel like playing a video game.

I really enjoyed the challenge of thinking of ways to build up different sets using nested SQL queries. I definitely feel much more capable of accessing and manipulating data after taking the class. Besides learning SQL for its own sake, the class has also helped me understand the conventions behind some operations in other environments, for example dataframe joins in pandas.
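A SQL-style inner join, for instance, maps directly onto pandas’ merge; here is a small sketch with made-up data:

import pandas as pd

# Two small made-up tables
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Ann', 'Ben', 'Cal']})
orders = pd.DataFrame({'customer_id': [1, 2, 2],
                       'amount': [250, 40, 75]})

# Roughly: SELECT * FROM customers INNER JOIN orders USING (customer_id)
joined = pd.merge(customers, orders, on='customer_id', how='inner')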

I highly recommend the class. I imagine it will be offered again.

Python templates with Jinja2

When I initially wrote the car usage tracking web app I hard-coded the html into the python script that worked with the data. This is alright for prototyping a D3.js visualization, but it’s a real mess having the page layout and text intermingled with the python code. Luckily there are a number of python templating engines that make it easy to fill in variables in an html layout with data from a python script. In this case I used Jinja2.

Here is what part of the python script looked like before refactoring with Jinja2:

# print info from current fill up and table with all fillups
print "<p>Miles driven: %s" % (float(data[-1][-4]) - float(data[-2][-4]))
print "<p>Your mpg was: <b>%.2f</b>" % float(data[-1][-1])

Jinja2 allows for passing variables from python in the form of a dictionary: each key in the dictionary is specified in the html template (surrounded by double curly braces, i.e. {{ key }}) and gets replaced with the value associated with that key. Jinja2 also supports simple logic such as for loops, which I used to produce the table of past data:

<table>
  <tr>
    {% for cell in header %}
      <td><b>{{ cell }}</b></td>
    {% endfor %}
  </tr>
  {% for row in fillups %}
    <tr>
      {% for cell in row %}
        <td>{{ cell }}</td>
      {% endfor %}
    </tr>
  {% endfor %} 
</table>

In the python script, the code to load and render the template looks like this:

import jinja2

env = jinja2.Environment(
    loader=jinja2.FileSystemLoader(searchpath='templates/')
)
template = env.get_template('mpg.html')

display_dictionary = {}
# code that fills in display_dictionary with the values to send to the template

print template.render(display_dictionary)

It’s also possible to save the rendered template as a static html file:

output = template.render(display_dictionary)
with open('../index.html', 'w') as f:
    f.write(output)

Here display_dictionary is a dictionary containing values for keys like mpg, milesdriven, etc. Those keys appear in the template, awaiting replacement by the correct values. The code ends up being much cleaner this way: the data calculation is cleanly separated from the html that defines how it is displayed. Now that I’ve gotten familiar with Jinja2 I plan to use it from the beginning in future projects. I’ll probably refactor the code further to generate a static html file each time the data is updated, since recreating the page each time it is visited isn’t necessary in this case.

All the code for the car usage tracking web app is up on github (link).

Car usage tracking

Since we bought a new car last year we’ve been keeping a detailed log of every trip to the gas station, including odometer reading, calculated mpg, and location. Over the weekend I wrote a simple web app to visualize the data and provide an interface to update the data. I’ve put a few images from the results below. The page itself is up here. The code is available on github (link).

Graphs of car fill-up data made with D3.js
The interface for adding new data.

I think it’s neat to see the effects of various road trips on the odometer reading, with the two trips to Utah showing up as especially steep segments. Those parts of the graph remind me of the profiles of many of the geologic features we went to Utah to see. Also of note is the prolonged steep section over the summer, when I was commuting to an internship. Now if anyone asks what kind of gas mileage our car gets, I can just give them a link.

I used D3.js, a JavaScript library for “data-driven documents” that is especially useful for making plots in a browser. I hadn’t written any JavaScript before, so it was a great learning experience and I’m really happy with the results. My usual approach has been to generate and save plots in python using matplotlib and then serve those static images; I’m excited to use D3.js to create other visualizations on the web.

I used python scripts to update the data and output the html table seen on the page.

Again, here are the links:

Tips for scientists using computers

This week I gave a presentation to my research group about best practices in scientific computing. It can be hard to know what’s out there, so I thought it would be good to give a brief introduction to some tools as a starting point. My main objective was to show how easy version control is with git, and how it could improve the quality of our science. I also wanted to introduce the magic of interactive data analysis using IPython and the IPython Notebook, along with pandas and other scientific libraries. Historically our primary language has been MATLAB, along with Origin for generating figures, but I think these open source alternatives offer a lot of advantages. I’ve posted the files from the presentation here on github (look in the code directory for example IPython notebooks).

Parallel processing in R

In R a lot of operations will take advantage of parallel processing, provided the BLAS being used supports it. As in python, it’s also possible to split calculations up manually between multiple cores. The code below uses the doParallel package and a foreach loop combined with %dopar% to split up a matrix multiplication task. If I limit the BLAS to using one thread, this speeds up the calculation. This sort of construction could also be used to run independent simulations.

I think this is a lot more intuitive than the parallel python implementation (see post). I’m curious to look at other parallel processing python modules to see how they compare.

require(parallel)
require(doParallel)
library(foreach)
library(iterators)

parFun <- function(A, B){
  A%*%B
}

nCores <- 2
n <- 4000 # for this toy code, make sure n is divisible by nCores
a <- rnorm(n*n)
A <- matrix(a, ncol=n)

registerDoParallel(nCores)
systime <- system.time(
  result <- foreach(i = 1:nCores, .combine = cbind) %dopar% {
    rows <-  ((i - 1)*n/nCores + 1):(i*n/nCores)
    out <- parFun(A, A[, rows])
  }
)

print(paste("Cores = ", nCores))
print(systime)

Parallel processing in python

I’m all about efficiency, and I’ve been excited to learn more about parallel processing. I really like seeing my cpu using all four cores at 100%. I paid for all those cores, and I want them computing!

If I have some calculations running in python it’s mildly annoying to check the task manager (in Windows: ctrl-shift-esc, then the “Performance” tab) and see my cpu usage at 25%. Only one of the four cores is being used. The reasons for this are somewhat complex, but this is python, so there are a number of modules that make it easy to parallelize operations (post on similar code in R).

Do I have one hard working core, and three lazy ones?

To start learning how to parallelize calculations in python I used parallel python. To test the speedup gained by using parallel processing I wrote an inefficient matrix multiplication function using for loops (code below). Using numpy for the matrix multiplication would be much faster here, but numpy operations can do some parallelization of their own. Generating four 400 x 400 matrices and multiplying each by itself took 76.2 s when run serially and 21.8 s when distributed across the four cores, almost a full four-fold speedup. Also, I see I’m getting the full value from the four cores I paid for:

Using all the cpus to speed up the calculation.

This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.

import pp
import time
import random

def randprod(n):
    """Waste time by inefficiently multiplying an n x n matrix"""
    random.seed(0)
    n = int(n)
    # Generate an n x n matrix of random numbers
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    Y = [[None]*n for _ in range(n)]  # build each row separately so rows aren't aliased
    for i in range(n):
        for j in range(n):
            Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
    return Y

if __name__ == "__main__":
    test_n = [400]*4
    # Do 4 serial calculations:
    start_time = time.time()
    series_sums = [randprod(n) for n in test_n]
    elapsed = (time.time() - start_time)
    print "Time for serial processing: %.1f" % elapsed + " s"

    time.sleep(10) # provide a break in windows CPU usage graph    

    # Do the same 4 calculations in parallel, using parallel python
    par_start_time = time.time()
    n_cpus = 4
    job_server = pp.Server(n_cpus)
    jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
    par_sums = [job() for job in jobs]
    par_elapsed = (time.time() - par_start_time)
    print "Time for parallel processing: %.1f" % par_elapsed + " s"

    print par_sums == series_sums
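The same four calculations could also be farmed out with the multiprocessing module from the standard library. Here is a minimal (untimed) sketch, reusing the randprod function defined above:

import multiprocessing

# Assumes randprod is defined at module level as in the script above,
# since worker functions must be importable by the child processes
if __name__ == "__main__":
    test_n = [400]*4
    pool = multiprocessing.Pool(processes=4)
    mp_sums = pool.map(randprod, test_n)
    pool.close()
    pool.join()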