The singular value decomposition of a matrix has many applications. Here I'll focus on an introduction to singular value decomposition and an application in clustering articles by topic. In another notebook (link) I show how singular value decomposition can be used in image compression.
Any matrix \(A\) can be decomposed into three matrices \(U\), \(\Sigma\), and \(V\) such that \(A = U \Sigma V\); this is called the singular value decomposition. The columns of \(U\) and \(V\) are orthonormal and \(\Sigma\) is diagonal. Most scientific computing packages have a function to compute the singular value decomposition, so I won't go into the details of how to find \(U\), \(\Sigma\), and \(V\) here. Some sources write the decomposition as \(A = U \Sigma V^T\), so that their \(V^T\) is our \(V\). The usage in this notebook is consistent with how numpy's singular value decomposition function returns \(V\).
If \(A = \begin{bmatrix} 1 & 0 \\ 1 & 2 \end{bmatrix}\)
\(A\) can be written as \(U \Sigma V\) where \(U\), \(\Sigma\), and \(V\) are, rounded to 2 decimal places:
\(U = \begin{bmatrix} -0.23 & -0.97 \\ -0.97 & 0.23 \end{bmatrix}\)
\(\Sigma = \begin{bmatrix} 2.29 & 0 \\ 0 & 0.87 \end{bmatrix}\)
\(V = \begin{bmatrix} -0.53 & -0.85 \\ -0.85 & 0.53 \end{bmatrix}\)
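This can be checked numerically; below is a quick sketch using numpy (the signs of corresponding columns of \(U\) and rows of \(V\) may differ between implementations, but the product is the same):
import numpy as np
A = np.array([[1, 0],
              [1, 2]])
# numpy returns U, the singular values as a 1-D array, and the matrix
# this notebook calls V (other sources would call it V^T)
U, sigma, V = np.linalg.svd(A)
print(np.round(U, 2))
print(np.round(sigma, 2))                        # approximately [2.29, 0.87]
print(np.round(V, 2))
print(np.allclose(A, U @ np.diag(sigma) @ V))    # True: A = U Sigma V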
Although the singular value decomposition has interesting properties from a linear algebra standpoint, I'm going to focus here on some of its applications and skip the derivation and geometric interpretations.
Let \(A\) be an \(m \times n\) matrix with column vectors \(\vec{a}_1, \vec{a}_2, ..., \vec{a}_n\). In the singular value decomposition of \(A\), \(U\) will be \(m \times m\), \(\Sigma\) will be \(m \times n\) and \(V\) will be \(n \times n\). As with \(A\), we denote the column vectors of \(U\) by \(\vec{u}_1, \vec{u}_2, ..., \vec{u}_m\) and those of \(V\) by \(\vec{v}_1, \vec{v}_2, ..., \vec{v}_n\). We'll denote the values along the diagonal of \(\Sigma\) by \(\sigma_1, \sigma_2, ...\).
We have that \(A = U \Sigma V\) where:
\(U = \begin{bmatrix} \\ \\ \\ \vec{u}_1 & \vec{u}_2 & \dots & \vec{u}_m \\ \\ \\ \end{bmatrix}\)
\(\Sigma = \begin{bmatrix} \sigma_1 & 0 & \dots \\ 0 & \sigma_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}\)
\(V = \begin{bmatrix} \\ \\ \\ \vec{v}_1 & \vec{v}_2 & \dots & \vec{v}_n \\ \\ \\ \end{bmatrix}\)
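One practical note: numpy's svd function returns the singular values as a 1-D array rather than as the full \(m \times n\) matrix \(\Sigma\). If the full \(\Sigma\) is needed, it can be assembled by hand; here is a small sketch using an arbitrary random matrix:
import numpy as np
A = np.random.rand(5, 3)                 # an arbitrary 5 x 3 matrix
U, sigma, V = np.linalg.svd(A)           # U is 5 x 5, sigma has length 3, V is 3 x 3
Sigma = np.zeros(A.shape)                # assemble the full 5 x 3 Sigma
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
print(np.allclose(A, U @ Sigma @ V))     # True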
Because \(\Sigma\) is diagonal, the columns of \(A\) can be written as:
\(\vec{a}_i = \vec{u}_1 * \sigma_1 * V_{1,i} + \vec{u}_2 * \sigma_2 * V_{2,i} + ... = U * \Sigma * \vec{v}_i\)
This is equivalent to creating a vector \(\vec{w}_i\), where the elements of \(\vec{w}_i\) are the elements of \(\vec{v}_i\), weighted by the \(\sigma\)'s:
\(\vec{w}_i = \begin{bmatrix} \sigma_1V_{1,i} \\ \sigma_2V_{2,i} \\ \sigma_3V_{3,i} \\ \vdots \end{bmatrix} = \Sigma * \vec{v}_i\)
Then \(\vec{a}_i = U * \vec{w}_i\). That is, every column \(\vec{a}_i\) of \(A\) is expressed as a sum over all the columns of \(U\), weighted by the values in the \(i^{th}\) column of \(V\) and the \(\sigma\)'s. By convention, the order of the columns in \(U\) and the rows in \(V\) is chosen such that the values in \(\Sigma\) obey \(\sigma_1 \geq \sigma_2 \geq \sigma_3 \geq ...\). This means that, as a whole, the first column of \(U\) and the first row of \(V\) contribute more to the final values of \(A\) than subsequent columns and rows do. This has applications in image compression (link to another notebook) and in reducing the dimensionality of data by selecting the most important components.
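As a concrete check of this expansion, here is a quick sketch reusing the \(2 \times 2\) example from above (the column index i is arbitrary):
import numpy as np
A = np.array([[1., 0.],
              [1., 2.]])
U, sigma, V = np.linalg.svd(A)
i = 1                                    # reconstruct the second column of A
w = sigma * V[:, i]                      # the weighted vector w_i = Sigma * v_i
a_i = U @ w                              # a_i = U * w_i
# the same thing written as an explicit sum over the columns of U:
a_i_sum = sum(U[:, j] * sigma[j] * V[j, i] for j in range(len(sigma)))
print(np.round(a_i, 2), np.round(a_i_sum, 2))    # both reproduce A[:, 1] = [0, 2]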
This section isn't required to understand how singular value decomposition is useful, but I've included it for completeness.
If \(A\) is \(m \times n\) (\(m\) rows and \(n\) columns), \(U\) will be \(m \times m\), \(\Sigma\) will be \(m \times n\) and \(V\) will be \(n \times n\). However, there are only \(r = rank(A)\) non-zero values in \(\Sigma\), i.e. \(\sigma_1, ..., \sigma_r \neq 0\); \(\sigma_{r+1}, ..., \sigma_n = 0\). Therefore columns of \(U\) beyond the \(r^{th}\) column and rows of \(V\) beyond the \(r^{th}\) row do not contribute to \(A\) and are usually omitted, leaving \(U\) an \(m \times r\) matrix, \(\Sigma\) an \(r \times r\) diagonal matrix and \(V\) an \(r \times n\) matrix.
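numpy can return this smaller form directly via the full_matrices=False argument, although it keeps \(\min(m, n)\) components rather than trimming to exactly \(r\); a brief sketch:
import numpy as np
A = np.random.rand(6, 3)                          # 6 x 3, so at most 3 non-zero sigma's
U, sigma, V = np.linalg.svd(A, full_matrices=False)
print(U.shape, sigma.shape, V.shape)              # (6, 3) (3,) (3, 3)
print(np.allclose(A, U @ np.diag(sigma) @ V))     # True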
Singular value decomposition can be used to classify similar objects (for example, news articles on a particular topic). Note from the discussion above that similar \(\vec{a}_i\)'s will have similar \(\vec{v}_i\)'s.
Imagine four blog posts, two about skiing and two about hockey. I've made up some data about five different words and the number of times they appear in each post:
import pandas as pd
c_names = ['post1', 'post2', 'post3', 'post4']
words = ['ice', 'snow', 'tahoe', 'goal', 'puck']
post_words = pd.DataFrame([[4, 4, 6, 2],
[6, 1, 0, 5],
[3, 0, 0, 5],
[0, 6, 5, 1],
[0, 4, 5, 0]],
index = words,
columns = c_names)
post_words.index.names = ['word:']
post_words
It looks like posts 1 and 4 pertain to skiing, while posts 2 and 3 are about hockey.
Imagine the DataFrame post_words
as the matrix \(A\), where the entries represent the number of times a given word appears in the post. The singular value decomposition of \(A\) can be calculated using numpy.
import numpy as np
U, sigma, V = np.linalg.svd(post_words)
print "V = "
print np.round(V, decimals=2)
Recall that \(\vec{a}_i = U * \Sigma * \vec{v}_i\); that is, each column \(\vec{v}_i\) of \(V\) determines the corresponding column \(\vec{a}_i\) of our data matrix \(A\). Let's label \(V\) with the identities of the posts using a DataFrame:
V_df = pd.DataFrame(V, columns=c_names)
V_df
Note how post1 and post4 agree closely in value in the first two rows of \(V\), as do post2 and post3. This indicates that posts 1 and 4 contain similar words (in this case words relating to skiing). However, the agreement is less close in the last two rows, even among related posts. This is because the weights of the last two rows, \(\sigma_3\) and \(\sigma_4\), are small compared to \(\sigma_1\) and \(\sigma_2\). Let's look at the values for the \(\sigma\)'s.
sigma
\(\sigma_1\) and \(\sigma_2\) are about an order of magnitude greater than \(\sigma_3\) and \(\sigma_4\), indicating that the values in the first two rows of \(V\) are much more important than the values in the last two. In fact, we could closely reproduce \(A\) using just the first two rows of \(V\) and the first two columns of \(U\), with an error of at most 1 word:
A_approx = U[:, :2] @ np.diag(sigma[:2]) @ V[:2, :]
print("A calculated using only the first two components:\n")
print(pd.DataFrame(A_approx, index=words, columns=c_names))
print("\nError from actual value:\n")
print(post_words - A_approx)
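As a quick sanity check of that claim, here is the largest absolute difference between the actual counts and the rank-2 reconstruction computed above:
# largest absolute error of the rank-2 approximation
print(np.abs(post_words.values - A_approx).max())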
To help visualize the similarity between posts, \(V\) can be displayed as an image. Notice how the similar posts (1 and 4, 2 and 3) have similar color values in the first two rows:
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(V, interpolation='none')
plt.xticks(range(len(c_names)))
plt.yticks(range(V.shape[0]))
ax = plt.gca()
ax.set_xticklabels(c_names)
ax.set_yticklabels(range(1, V.shape[0] + 1))
plt.title("$V$")
plt.colorbar();
Another thing the singular value decomposition tells us is what most defines the different categories of posts. The skiing posts have very different values from the hockey posts in the second row of \(V\), i.e. \(V_{2,1} \approx V_{2, 4}\) and \(V_{2,2} \approx V_{2, 3}\) but \(V_{2,1} \neq V_{2, 2}\).
Recall from above that:
\(\vec{a}_i = \vec{u}_1 * \sigma_1 * V_{1,i} + \vec{u}_2 * \sigma_2 * V_{2,i} + ...\)
Thus the two categories of posts differ mainly in how much the values in \(\vec{u}_2\) contribute to their final word counts. Here is \(\vec{u}_2\):
pd.DataFrame(U[:,1], index=words)
From this we can conclude that, at least in this small data set, the words 'snow' and 'tahoe' identify a different class of posts from the words 'goal' and 'puck'.
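Sorting the words by their weights in \(\vec{u}_2\) makes this split explicit (a small sketch; the overall sign of \(\vec{u}_2\) is arbitrary, so only the relative ordering matters):
# words sorted by their weight in the second column of U
pd.Series(U[:, 1], index=words).sort_values()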
Moving on from the simple example above, here is an application using singular value decomposition to find similar research papers.
I've collected several different papers for analysis. Unfortunately, due to the sorry state of open access for scientific papers, I can't share the full article text that was used for analysis. Cell, for example, cautions that "you may not copy, display, distribute, modify, publish, reproduce, store, transmit, post, ..." Yikes. However, I did choose articles such that you should be able to download the PDFs from the publisher for free.
Single-Molecule Protein Unfolding and Translocation by an ATP-Fueled Proteolytic Machine (clpx2)
Insights into dynein motor domain function from a 3.3-Å crystal structure (dyn-structure)
Biophysical mechanism of T-cell receptor triggering in a reconstituted system (tcell)
with open('input/stopwords.txt') as f:
    stopwords = f.read().strip().split(',')
stopwords = set(stopwords)  # use a set for fast membership testing
import collections
import os
import re
def word_count(fname):
    """Return a collections.Counter instance counting
    the words in file fname."""
    with open(fname) as f:
        file_content = f.read()
    words = re.split(r'\W+', file_content.lower())
    words = [word for word in words
             if len(word) > 3 and word not in stopwords]
    word_count = collections.Counter(words)
    return word_count
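For example, applying it to one of the text files returns the most frequent non-stopword words (the file name here is just an illustration of the input/papers/<name>.txt layout assumed below):
# hypothetical example path; adjust to wherever the article text files live
word_count('input/papers/clpx2.txt').most_common(5)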
file_list = ['input/papers/' + f for f in os.listdir('input/papers/')
             if f.endswith('.txt')]

word_df = pd.DataFrame()
for fname in file_list:
    word_counter = word_count(fname)
    file_df = pd.DataFrame.from_dict(word_counter, orient='index')
    file_df.columns = [fname.replace('input/papers/', '').replace('.txt', '')]
    # normalize the word counts by the total number of words in the file:
    file_df = file_df / file_df.values.sum()
    word_df = word_df.join(file_df, how='outer')
word_df = word_df.fillna(0)
print("Number of unique words: %s" % len(word_df))
Here are the results, sorted by the most common words in the first paper:
word_df.sort_values(by=word_df.columns[0], ascending=False).head(10)
Now to calculate the singular value decomposition of this data.
U, sigma, V = np.linalg.svd(word_df)
Here is a look at \(V\), with the column names added:
v_df = pd.DataFrame(V, columns=word_df.columns)
v_df.apply(lambda x: np.round(x, decimals=2))
Here are the values of \(V\) represented as an image:
plt.imshow(V, interpolation='none')
ax = plt.gca()
plt.xticks(range(len(v_df.columns.values)))
plt.yticks(range(len(v_df.index.values)))
plt.title("$V$")
ax.set_xticklabels(v_df.columns.values, rotation=90)
plt.colorbar();
Note how in the above image, in the first three rows, the similarities between the clpx papers are apparent, as are those between the first three dyn papers. The last dyn paper is somewhat different, but this is to be expected since it is a structure paper while the other three dyn papers involve more microscopy. In terms of comparing the papers, singular value decomposition allowed us to reduce the 5657 different words found in the papers to just 6 values per paper, pre-sorted in order of importance!
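Those 6 values per paper are just each paper's column of \(V\) weighted by the \(\sigma\)'s, i.e. \(\Sigma * \vec{v}_i\); here is a short sketch of that reduced representation:
# each column describes one paper using only 6 sigma-weighted coordinates
paper_coords = pd.DataFrame(np.diag(sigma) @ V, columns=word_df.columns)
paper_coords.round(2)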
Now we'll look in more detail at how similar each paper is to the others. I've defined a function to calculate the distance between two column vectors of \(V\), weighted by the weights in \(\Sigma\). For \(\vec{v}_i\) and \(\vec{v}_j\) the function calculates \(\|\Sigma * (\vec{v}_i - \vec{v}_j)\|\). This function is applied to every pairwise combination of \(\vec{v}_i\) and \(\vec{v}_j\), giving a metric of how similar two papers are (smaller values are more similar).
def dist(col1, col2, sigma=sigma):
    """Return the norm of (col1 - col2), where the differences in
    each dimension are weighted by the values in sigma."""
    return np.linalg.norm(np.array(col1 - col2) * sigma)
dist_df = pd.DataFrame(index=v_df.columns, columns=v_df.columns)
for cname in v_df.columns:
    dist_df[cname] = v_df.apply(lambda x: dist(v_df[cname].values, x.values))
plt.imshow(dist_df.values, interpolation='none')
ax = plt.gca()
plt.xticks(range(len(dist_df.columns.values)))
plt.yticks(range(len(dist_df.index.values)))
ax.set_xticklabels(dist_df.columns.values, rotation=90)
ax.set_yticklabels(dist_df.index.values)
plt.title("Similarity between papers\nLower value = more similar")
plt.colorbar()
dist_df
The two clpx papers and the two dyn-steps papers are most similar to each other, as expected, while all the dyn papers bear some similarity to one another. For a quicker readout, I've grouped the data into three similarity levels (in practice this could be done automatically with a clustering algorithm; a sketch of that follows the plot below).
levels = [0.06, 0.075]
binned_df = dist_df.copy()
binned_df[(dist_df <= levels[0]) & (dist_df > 0)] = 1
binned_df[(dist_df <= levels[1]) & (dist_df > levels[0])] = 2
binned_df[(dist_df < 1) & (dist_df > levels[1])] = 3
plt.imshow(binned_df.values, interpolation='none')
ax = plt.gca()
plt.xticks(range(len(binned_df.columns.values)))
plt.yticks(range(len(binned_df.index.values)))
ax.set_xticklabels(binned_df.columns.values, rotation=90)
ax.set_yticklabels(binned_df.index.values)
plt.title("Similarity between papers\nLower value = more similar")
plt.colorbar();
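The thresholds above were chosen by hand; as a sketch of the automatic alternative mentioned earlier, here is one way to cluster the papers with scipy's hierarchical clustering (the choice of three clusters is my assumption):
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# convert the symmetric distance matrix to condensed form,
# build an average-linkage tree, and cut it into three groups
condensed = squareform(dist_df.values.astype(float))
clusters = fcluster(linkage(condensed, method='average'), t=3, criterion='maxclust')
print(dict(zip(dist_df.columns, clusters)))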
Finally, let's output a list for each paper of the other papers, sorted in order of decreasing similarity:
for paper in dist_df.columns:
    sim_papers_df = dist_df.sort_values(by=paper)[paper]
    sim_papers = sim_papers_df.drop([paper]).index
    print('Papers most similar to ' + paper + ':')
    print(', '.join(sim_papers))
    print('\n')