Developing Machine Learning Software

Programming Language for Machine Learning

The most popular programming languages for machine learning (not necessarily in this order):

  • Python: scripting language; easy to wrap C/C++ libraries; excellent libraries
  • Julia: young and efficient, designed for scientific computing
  • Octave: free Matlab clone
  • C/C++: efficient, general-purpose languages
  • Java: web applications, large-scale and distributed systems
  • ...

It helps a lot to use libraries that have built-in support for linear algebra (matrices and vectors).
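As a minimal sketch of why such libraries help: summing the squares of a vector is both shorter and (for large inputs) faster with a single vectorized NumPy call than with a plain Python loop, and both give the same result.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1000)

# Pure Python: explicit loop over the values
total_loop = sum(v * v for v in x.tolist())

# NumPy: a single vectorized call, executed in compiled code
total_numpy = float(np.dot(x, x))

print(total_loop, total_numpy)
```

Beyond speed, the vectorized form reads like the mathematical expression it implements.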

Python

Python is possibly the best choice for these programming tasks today, since many excellent machine learning libraries are available, for example

  • NumPy for linear algebra
  • Matplotlib for visualization
  • not required in this lecture: SciPy, Pandas, scikit-learn, Theano, ...

In addition, you can work interactively with IPython notebooks. Examples can be found at the IPython Notebook Viewer.

Setup Python Environment

Required packages to run this notebook:

  • Python
  • IPython - interactive Python shell
  • IPython notebooks - IPython for the browser
  • NumPy - linear algebra library
  • Matplotlib - plotting library

Scientific Python Distributions (available for every platform)

Introductions to Python and Scientific Programming

IDEs

IPython

Document your code with

  • Markdown and
  • LaTeX formulas:

\[\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^N \left( x_n - \mu \right)^2}\]
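As a quick sanity check, the standard deviation \(\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^N (x_n - \mu)^2}\) can be computed by hand and compared against NumPy's built-in `np.std`, which uses the same population definition (with the \(1/N\) factor) by default:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = x.mean()

# sigma = sqrt( (1/N) * sum_n (x_n - mu)^2 )
sigma = np.sqrt(np.mean((x - mu) ** 2))

print(sigma)       # 2.0
print(np.std(x))   # same value
```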

In [1]:
# IPython magic: load numpy and matplotlib
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [2]:
%timeit np.random.randn(10)
1000000 loops, best of 3: 948 ns per loop

In [3]:
%run test.py
ERROR: File `u'test.py'` not found.

In [4]:
cat README.md
Machine Learning Tutorials
==========================

These are IPython notebooks for the tutorials of the machine learning course
at the University of Bremen.

You can view the notebooks and the corresponding slides
[here](http://alexanderfabisch.github.io/ml_tutorials/).

Requirements
------------

The following packages are required to use these notebooks:

* Python
* IPython - interactive Python shell
* IPython notebooks - IPython for the browser
* NumPy - linear algebra library
* Matplotlib - plotting library

For Windows: use the [Enthought Python Distribution](https://www.enthought.com/products/epd/).


In [5]:
plot(linspace(0, 3, 100), sin(linspace(0, 3, 100)))
Out[5]:
[<matplotlib.lines.Line2D at 0x312e650>]
In [6]:
np.random.randn(2, 2)
Out[6]:
array([[-0.89353402, -1.67727854],
       [-0.06876317,  0.94290509]])
In [7]:
from IPython.html.widgets import interactive


def iwidget(w, b):
    x = np.linspace(-1, 1, 201)
    y = w * x + b
    plt.plot(x, y)
    plt.xlim((-1, 1))
    plt.ylim((-1, 1))
    plt.title("y = %g x + %g" % (w, b))


interactive(iwidget, w=(-1, 1, 0.1), b=(-0.5, 0.5, 0.1))

NumPy

Creating ndarrays

ndarrays can have any number of dimensions, but in practice most have 1 (vector) or 2 (matrix)
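A small sketch of how the dimensionality shows up in the `ndim` and `shape` attributes, for a 1-D, a 2-D, and a 3-D array:

```python
import numpy as np

v = np.array([0, 1, 2])          # 1-D: vector
M = np.array([[1, 2], [3, 4]])   # 2-D: matrix
T = np.zeros((2, 3, 4))          # 3-D array (e.g. a stack of matrices)

print(v.ndim, v.shape)   # 1 (3,)
print(M.ndim, M.shape)   # 2 (2, 2)
print(T.ndim, T.shape)   # 3 (2, 3, 4)
```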

In [8]:
a = np.array([0, 1, 2])
a
Out[8]:
array([0, 1, 2])

Vector of ones \(\boldsymbol{1}\)

In [9]:
b = np.ones(3)
b
Out[9]:
array([ 1.,  1.,  1.])

Vector of zeros \(\boldsymbol{0}\)

In [10]:
c = np.zeros(3)
c
Out[10]:
array([ 0.,  0.,  0.])

Identity matrix \(\boldsymbol{I}\)

In [11]:
D = np.eye(2)
D
Out[11]:
array([[ 1.,  0.],
       [ 0.,  1.]])

i.i.d. Gaussian samples

In [12]:
E = np.random.randn(1000, 2)
scatter(E[:, 0], E[:, 1])
E
Out[12]:
array([[ 1.66108612,  0.17376477],
       [ 0.52094769, -0.32635803],
       [-1.66433794,  0.23631064],
       ..., 
       [ 0.22490847, -0.56416149],
       [-0.15707269, -1.02849181],
       [-1.97060547, -0.71026355]])

Matrix multiplication

In [13]:
E.dot(D)
Out[13]:
array([[ 1.66108612,  0.17376477],
       [ 0.52094769, -0.32635803],
       [-1.66433794,  0.23631064],
       ..., 
       [ 0.22490847, -0.56416149],
       [-0.15707269, -1.02849181],
       [-1.97060547, -0.71026355]])

Note: * is not matrix multiplication! It performs element-wise multiplication; use .dot() for matrix products.

In [14]:
A = np.array([[0, 1], [2, 3]])
B = np.array([[4, 5], [6, 7]])
print("A * B")
print(A * B)
print("A.dot(B)")
print(A.dot(B))
A * B
[[ 0  5]
 [12 21]]
A.dot(B)
[[ 6  7]
 [26 31]]

Linear algebra stuff

\(||A||_F\) (Frobenius norm, the default of np.linalg.norm for matrices)

In [15]:
np.linalg.norm(A)
Out[15]:
3.7416573867739413

\(B^{-1}\) (inverse)

In [16]:
np.linalg.inv(B)
Out[16]:
array([[-3.5,  2.5],
       [ 3. , -2. ]])

\(|A|\) (determinant)

In [17]:
np.linalg.det(A)
Out[17]:
-2.0

\(A x = b\)

In [18]:
b = np.array([0, 3])
x = np.linalg.solve(A, b)
x
Out[18]:
array([ 1.5,  0. ])
In [19]:
A.dot(x)
Out[19]:
array([ 0.,  3.])

Matplotlib

In [20]:
x = np.linspace(0, 10 * np.pi, 500)
y = np.sin(x)
plot(x, y)
Out[20]:
[<matplotlib.lines.Line2D at 0x31f4550>]
In [21]:
scatter(x, y)
Out[21]:
<matplotlib.collections.PathCollection at 0x3212650>
In [22]:
x, y = np.meshgrid(np.linspace(-10, 10, 21), np.linspace(-10, 10, 21))
z = x ** 2 + 2 * y
contourf(x, y, z)
colorbar()
Out[22]:
<matplotlib.colorbar.Colorbar instance at 0x3694440>
In [23]:
matshow(x)
Out[23]:
<matplotlib.image.AxesImage at 0x3b2ae90>
In [24]:
imshow(x)
Out[24]:
<matplotlib.image.AxesImage at 0x3d2a9d0>

scikit-learn

Example: Digits dataset

In [25]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
In [26]:
matshow(data[10].reshape(8, 8))
gray()

Let's reduce the 64-dimensional vectors to 2 dimensions (dimensionality reduction)

In [27]:
from sklearn.decomposition import PCA

reduced_data = PCA(n_components=2).fit_transform(data)
In [28]:
scatter(reduced_data[:, 0], reduced_data[:, 1])
Out[28]:
<matplotlib.collections.PathCollection at 0x3f410d0>

Is there any structure? Can we automatically determine clusters?

In [29]:
from sklearn.cluster import KMeans

np.random.seed(42)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))

kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)
Out[29]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=10, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
In [30]:
# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with a white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()