Developing Machine Learning Software

Programming Language for Machine Learning

The most popular programming languages for machine learning (not necessarily in this order):

  • Python: scripting language; easy to wrap C/C++ libraries; excellent libraries
  • Julia: young and efficient, designed for scientific computing
  • Octave: free Matlab clone
  • C/C++: efficient, general-purpose languages
  • Java: web applications, large-scale and distributed systems
  • ...

It helps a lot to use libraries that have built-in support for linear algebra (matrices and vectors).
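As a minimal sketch of why such libraries help: summing the squares of a vector is both shorter and (for large inputs) faster with a single vectorized NumPy call than with a plain Python loop, and both give the same result.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1000)

# Pure Python: explicit loop over the values
total_loop = sum(v * v for v in x.tolist())

# NumPy: a single vectorized call, executed in compiled code
total_numpy = float(np.dot(x, x))

print(total_loop, total_numpy)
```

Beyond speed, the vectorized form reads like the mathematical expression it implements.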

Python

Python is possibly the best choice for these programming tasks today, since many excellent machine learning libraries are available, for example

  • NumPy for linear algebra
  • Matplotlib for visualization
  • not required in this lecture: SciPy, Pandas, scikit-learn, Theano, ...

In addition, you can work interactively with IPython notebooks. Examples can be found at the IPython Notebook Viewer.

Setup Python Environment

Required packages to run this notebook:

  • Python
  • IPython - interactive Python shell
  • IPython notebooks - IPython for the browser
  • NumPy - linear algebra library
  • Matplotlib - plotting library

Scientific Python Distributions (available for every platform)

Introductions to Python and Scientific Programming

IDEs

IPython

Document your code with

  • Markdown and
  • LaTeX formulas:

\[\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^N \left( x_n - \mu \right)^2}\]
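As a quick sanity check, the standard deviation \(\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^N (x_n - \mu)^2}\) can be computed by hand and compared against NumPy's built-in `np.std`, which uses the same population definition (with the \(1/N\) factor) by default:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = x.mean()

# sigma = sqrt( (1/N) * sum_n (x_n - mu)^2 )
sigma = np.sqrt(np.mean((x - mu) ** 2))

print(sigma)       # 2.0
print(np.std(x))   # same value
```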

In [1]:
# IPython magic: load numpy and matplotlib
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [2]:
%timeit np.random.randn(10)
1000000 loops, best of 3: 948 ns per loop

In [3]:
%run test.py
ERROR: File `u'test.py'` not found.

In [4]:
cat README.md
Machine Learning Tutorials
==========================

These are IPython notebooks for the tutorials of the machine learning course
at the University of Bremen.

You can view the notebooks and the corresponding slides
[here](http://alexanderfabisch.github.io/ml_tutorials/).

Requirements
------------

The following packages are required to use these notebooks:

* Python
* IPython - interactive Python shell
* IPython notebooks - IPython for the browser
* NumPy - linear algebra library
* Matplotlib - plotting library

For Windows: use the [Enthought Python Distribution](https://www.enthought.com/products/epd/).


In [5]:
plot(linspace(0, 3, 100), sin(linspace(0, 3, 100)))
Out[5]:
[<matplotlib.lines.Line2D at 0x312e650>]
In [6]:
np.random.randn(2, 2)
Out[6]:
array([[-0.89353402, -1.67727854],
       [-0.06876317,  0.94290509]])
In [7]:
from IPython.html.widgets import interactive


def iwidget(w, b):
    x = np.linspace(-1, 1, 201)
    y = w * x + b
    plt.plot(x, y)
    plt.xlim((-1, 1))
    plt.ylim((-1, 1))
    plt.title("y = %g x + %g" % (w, b))


interactive(iwidget, w=(-1, 1, 0.1), b=(-0.5, 0.5, 0.1))

NumPy

Creating ndarrays

ndarrays can have any number of dimensions, but in practice most have 1 (vector) or 2 (matrix)
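A small sketch of how the dimensionality shows up in the `ndim` and `shape` attributes, for a 1-D, a 2-D, and a 3-D array:

```python
import numpy as np

v = np.array([0, 1, 2])          # 1-D: vector
M = np.array([[1, 2], [3, 4]])   # 2-D: matrix
T = np.zeros((2, 3, 4))          # 3-D array (e.g. a stack of matrices)

print(v.ndim, v.shape)   # 1 (3,)
print(M.ndim, M.shape)   # 2 (2, 2)
print(T.ndim, T.shape)   # 3 (2, 3, 4)
```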

In [8]:
a = np.array([0, 1, 2])
a
Out[8]:
array([0, 1, 2])

Vector of ones \(\boldsymbol{1}\)

In [9]:
b = np.ones(3)
b
Out[9]:
array([ 1.,  1.,  1.])

Vector of zeros \(\boldsymbol{0}\)

In [10]:
c = np.zeros(3)
c
Out[10]:
array([ 0.,  0.,  0.])

Identity matrix \(\boldsymbol{I}\)

In [11]:
D = np.eye(2)
D
Out[11]:
array([[ 1.,  0.],
       [ 0.,  1.]])

i.i.d. Gaussian samples

In [12]:
E = np.random.randn(1000, 2)
scatter(E[:, 0], E[:, 1])
E
Out[12]:
array([[ 1.66108612,  0.17376477],
       [ 0.52094769, -0.32635803],
       [-1.66433794,  0.23631064],
       ..., 
       [ 0.22490847, -0.56416149],
       [-0.15707269, -1.02849181],
       [-1.97060547, -0.71026355]])

Matrix multiplication

In [13]:
E.dot(D)
Out[13]:
array([[ 1.66108612,  0.17376477],
       [ 0.52094769, -0.32635803],
       [-1.66433794,  0.23631064],
       ..., 
       [ 0.22490847, -0.56416149],
       [-0.15707269, -1.02849181],
       [-1.97060547, -0.71026355]])

Note: * is not matrix multiplication! It performs element-wise multiplication; use .dot() for matrix products.

In [14]:
A = np.array([[0, 1], [2, 3]])
B = np.array([[4, 5], [6, 7]])
print("A * B")
print(A * B)
print("A.dot(B)")
print(A.dot(B))
A * B
[[ 0  5]
 [12 21]]
A.dot(B)
[[ 6  7]
 [26 31]]

Linear algebra stuff

\(||A||_F\) (Frobenius norm, the default of np.linalg.norm for matrices)

In [15]:
np.linalg.norm(A)
Out[15]:
3.7416573867739413

\(B^{-1}\) (inverse)

In [16]:
np.linalg.inv(B)
Out[16]:
array([[-3.5,  2.5],
       [ 3. , -2. ]])

\(|A|\) (determinant)

In [17]:
np.linalg.det(A)
Out[17]:
-2.0

\(A x = b\)

In [18]:
b = np.array([0, 3])
x = np.linalg.solve(A, b)
x
Out[18]:
array([ 1.5,  0. ])
In [19]:
A.dot(x)
Out[19]:
array([ 0.,  3.])

Matplotlib

In [20]:
x = np.linspace(0, 10 * np.pi, 500)
y = np.sin(x)
plot(x, y)
Out[20]:
[<matplotlib.lines.Line2D at 0x31f4550>]
In [21]:
scatter(x, y)
Out[21]:
<matplotlib.collections.PathCollection at 0x3212650>
In [22]:
x, y = np.meshgrid(np.linspace(-10, 10, 21), np.linspace(-10, 10, 21))
z = x ** 2 + 2 * y
contourf(x, y, z)
colorbar()
Out[22]:
<matplotlib.colorbar.Colorbar instance at 0x3694440>
In [23]:
matshow(x)
Out[23]:
<matplotlib.image.AxesImage at 0x3b2ae90>
In [24]:
imshow(x)
Out[24]:
<matplotlib.image.AxesImage at 0x3d2a9d0>

scikit-learn

Example: Digits dataset

In [25]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
In [26]:
matshow(data[10].reshape(8, 8))
gray()

Let's reduce the 64-dimensional vectors to 2 dimensions (dimensionality reduction)

In [27]:
from sklearn.decomposition import PCA

reduced_data = PCA(n_components=2).fit_transform(data)
In [28]:
scatter(reduced_data[:, 0], reduced_data[:, 1])
Out[28]:
<matplotlib.collections.PathCollection at 0x3f410d0>

Is there any structure? Can we automatically determine clusters?

In [29]:
from sklearn.cluster import KMeans

np.random.seed(42)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))

kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)
Out[29]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=10, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
In [30]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with a white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()