## Tutorial: Getting Started with Machine Learning with the SciPy stack

__Full Source Here__

**Sample Data Here**

## SciPy Stack

Python: Powerful scripting language

Numpy: Python package for numerical computing

SciPy: Python package for scientific computing

Matplotlib: Python package for plotting

iPython: Interactive python shell

Pandas: Python package for data analysis

SymPy: Python package for computer algebra systems

Nose: Python package for unit tests

### Installation

I will go through my Mac installation but if you are using another OS, you can find the installation instructions for SciPy on: http://www.scipy.org/install.html.

You should have Python 2.7.

#### Mac Installation

I am using a Mac on OS X 10.8.5 and used MacPorts to setup the SciPy stack on my machine.

Install macports if you haven’t already: http://www.macports.org/

Otherwise open Terminal and run: ‘sudo macports selfupdate’

Next in your Terminal run: ‘sudo port install py27-numpy py27-scipy py27-matplotlib py27-ipython +notebook py27-pandas py27-sympy py27-nose’

Run the following in terminal to select package versions.

sudo port select –set python python27

sudo port select –set ipython ipython27

## Hello World

Click New -> Notebooks -> Python 2

This should open a new tab with a newly create notebook.

Click Untitled at the top, rename the notebook to Hello World and press OK.

In the first line, change the line format from Code to Markdown and type in:

# Hello World Code

And click run (the black triangle that looks like a play button)

On the next line, in code, type:

print ‘Hello World’

and press run.

## K Means Clustering Seed Example

Suppose we are doing a study on a wheat farm to determine how much of each kind of wheat is in the field. We collect a random sample of seeds from the field and measure different attributes such as area, perimeter, length, width, etc. Using this attributes we can use k-means clustering to classify seeds into different types and determine the percentage of each type.

Sample data can be found here: http://archive.ics.uci.edu/ml/datasets/seeds

The sample data contains data that comes from real measurements. The attributes are:

1. area A,

2. perimeter P,

3. compactness C = 4*pi*A/P^2,

4. length of kernel,

5. width of kernel,

6. asymmetry coefficient

7. length of kernel groove.

Example: 15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22, 1

Download the file into the same folder as your notebook.

#### Code

Create a new notebook and name it whatever you want. We can put all the code into one cell.

First, we need to parse the data so that we can run k-means on it. We open the file using a csv reader and convert each cell to a float. We will skip rows that contain missing data.

Sample row:

['15.26', '14.84', '0.871', '5.763', '3.312', '2.221', '5.22', '1']

# Read data for row in bank_csv: missing = False float_arr = [] for cell in row: if not cell: missing = True break else: # Convert each cell to float float_arr.append(float(cell)) # Take row if row is not missing data if not missing: data.append(float_arr) data = np.array(data)

Next, we normalize the features for the k means algorithm. Since Scipy implements the k means clustering algorithm for us, all the hard work is done.

# Normalize vectors whitened = vq.whiten(data) # Perform k means on all features to classify into 3 groups centroids, _ = vq.kmeans(whitened, 3)

We then classify each data point by distance to centroid:

# Classify data by distance to centroids cls, _ = vq.vq(whitened, centroids)

Finally, we can graph the classifications of the data points by the first two features. There are seven features total, but it would be hard to visualize. You can graph by other features for similar visualizations.

# Plot first two features (area vs perimter in this case) plt.plot(data[cls==0,0], data[cls==0,6],'ob', data[cls==1,0], data[cls==1,6],'or', data[cls==2,0], data[cls==2,6],'og') plt.show()

Note: to show the plot inline in the cell, we put ‘%matplotlib inline’ at the beginning of the cell.