{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "kmeans.ipynb\n", "\n", "Discussion: This Jupyter notebook investigates clustering discrete data with kmeans.\n", "\n", "Licensing: This code is distributed under the GNU LGPL license.\n", " \n", "Modified: 28 October 2016\n", "\n", "Author: John Burkardt, Lukas Bystricky" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import necessary libraries and set plot option\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'svg'\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import math\n", "import scipy.spatial as spatial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# K Means #\n", "\n", "In this module we will investigate the problem of arranging N items of\n", "M-dimensional data into K groups or \"clusters\", in such a way that the \n", "items in each group are close.\n", "\n", "Each group will be associated with a special value called the \"mean\".\n", "We may expect this value to be the centroid of the group values, but\n", "while we are computing, sometimes this will not quite be the case.\n", "\n", "Our examples will use M=2 dimensions, but the method works for any M.\n", "\n", "We need to understand \n", "* how the \"energy\" of a group measures closeness;\n", "* the importance of the centroid/mean/generator, \n", "* how to randomly assign a data item to a cluster, \n", "* how to compute the energy of a cluster and of the collection of clusters, \n", "* how to assign items to the nearest generator or mean, \n", "* how to update the means, \n", "* how to carry out an iteration that repeatedly reduces the clustering energy." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Point energy, cluster energy, and the centroid #\n", "\n", "Suppose we have a set of points G whose typical point is x.\n", "We will call G a \"group\" or a \"cluster\", and the idea is that\n", "all the points in G ought to be close or tightly clustered.\n", "\n", "In order to evaluate how tightly clustered a set of points is, we \n", "use the idea of \"energy\". To define the energy, we have to include\n", "a special value we might call the base point, or mean, or center, or generator.\n", "Let's call it \"m\". \n", "\n", "The point energy of x (with respect to m ) is\n", " e(x,m) = ||x - m||^2\n", "that is, just the square of the Euclidean distance.\n", "Obviously, for a given x, e(x,m) is minimized by setting m=x (not exciting!)\n", "\n", "Since x is part of a set G, we can now also assign a cluster energy to G,\n", " e(G,m) = sum (all x in g ) e(x,m)\n", "Now we said that m could be any point at all. An interesting question \n", "becomes: if we can choose any value for m, is there a value that minimizes\n", "the cluster energy. The answer to this is that we should choose m to be\n", "the average of the x values. (You can show this by differentiating the\n", "expression for e(G,m) with respect to m).\n", "\n", "We would like to make a small test to verify that, for a given set of points x\n", "in a cluster G, the value of e(G,m) is minimized by the average (which we\n", "might also call the centroid or mean.)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0.62022095 0.25780003]\n", " [ 0.85224345 0.32585312]\n", " [ 0.18210333 0.57784647]\n", " [ 0.35746139 0.59488945]\n", " [ 0.49054516 0.21448136]\n", " [ 0.47745961 0.9499567 ]\n", " [ 0.22000546 0.11850909]\n", " [ 0.3525047 0.56682692]\n", " [ 0.64253934 0.47827504]\n", " [ 0.00637021 0.93105693]]\n", "(array([ 0., 0.]), 5.5659070553836907)\n", "(array([ 0.2, 0.7]), 2.1636324489619718)\n", "(array([ 0.62022095, 0.25780003]), 2.2796071996895089)\n", "(array([ 0.42014536, 0.50154951]), 1.2851666876061194)\n" ] } ], "source": [ "## Minimizing the energy of a cluster\n", "#\n", "# Create a 10x2 array G containing 10 random 2D points in the unit square.\n", "#\n", "# Define a function cluster_energy ( G, m ) which returns the cluster\n", "# energy of the points in G relative to the point m.\n", "#\n", "# Define\n", "# m1 = (0,0)\n", "# m2 = (0.2,0.7)\n", "# m3 = the first point in G = G[0,:]\n", "# m4 = average of points in G\n", "#\n", "# and evaluate the cluster energies e(G,m1), e(G,m2), e(G,m3), e(G,m4).\n", "#\n", "# You should see that e(G,m4) is lower than the other values, and indeed,\n", "# it will be lower than the cluster energy for any other choice of m.\n", "#\n", "n = 10\n", "G = np.random.rand ( n, 2 )\n", "print ( G )\n", "\n", "def cluster_energy ( G, m ):\n", " energy = 0.0\n", " rows = G.shape[0]\n", " for row in range ( 0, rows ):\n", " energy = energy + ( np.linalg.norm ( G[row,:] - m[:] ) ) ** 2\n", " return energy\n", "\n", "m1 = np.array ( [ 0.0, 0.0 ] )\n", "m2 = np.array ( [ 0.2, 0.7 ] )\n", "m3 = G[0,:]\n", "m4 = np.sum ( G, axis = 0 ) / float ( n )\n", "\n", "print ( m1, cluster_energy ( G, m1 ) )\n", "print ( m2, cluster_energy ( G, m2 ) )\n", "print ( m3, cluster_energy ( G, m3 ) )\n", "print ( m4, cluster_energy ( G, m4 ) )" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Reading data #\n", "#\n", "# For our first experiments, the items we are going to try to cluster \n", "# will be the (x,y) coordinates of 75 points.\n", "#\n", "# The data values are stored in the file \"ruspini.txt\".\n", "#\n", "# Download the data from the web page:\n", "#\n", "# http://people.sc.fsu.edu/~jburkardt/classes/urop_2016/ruspini.txt\n", "#\n", "# Read the data from the file.\n", "#\n", "# Rewrite your work so it is a function:\n", "#\n", "# def ruspini_data ( ):\n", "# ***\n", "# return xy\n", "#\n", "def ruspini_data ( ):\n", " \n", " xy = np.zeros ( [ 75, 2 ] )\n", " \n", " input = open ( 'ruspini.txt', 'r' )\n", " \n", " i = 0\n", " for line in input:\n", " j = 0\n", " for word in line.strip().split():\n", " xy[i,j] = float ( word )\n", " j = j + 1\n", " i = i + 1\n", "\n", " input.close ( )\n", "\n", " return xy\n", "#\n", "# Use the function.\n", "#\n", "# Plot the data, and see that it naturally separates into\n", "# a certain number of clusters.\n", "#\n", "xy = ruspini_data ( )\n", "plt.plot ( xy[:,0], xy[:,1], 'b*' )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " I MOD1 MOD2 RANDOM1 RANDOM2:\n", "\n", " 0 0 0 2 0\n", " 1 1 1 2 2\n", " 2 2 2 0 2\n", " 3 0 0 1 0\n", " 4 1 1 0 1\n", " 5 2 2 1 0\n", " 6 0 0 2 2\n", " 7 1 1 1 2\n", " 8 2 2 1 1\n", " 9 0 0 1 1\n", " 10 1 1 2 0\n", " 11 2 2 0 0\n", " 12 0 0 2 0\n", " 13 1 1 1 2\n", " 14 2 2 1 1\n", " 15 0 0 1 1\n", " 16 1 1 1 0\n", " 17 2 2 2 1\n", " 18 0 0 0 0\n", " 19 1 1 2 0\n", " 20 2 2 1 2\n", " 21 0 0 0 0\n", " 22 1 1 1 2\n", " 23 2 2 2 1\n", " 24 0 0 2 1\n", " 25 1 1 0 2\n", " 26 2 2 0 1\n", " 27 0 0 2 1\n", " 28 1 1 1 0\n", " 29 2 2 0 2\n", " 30 0 0 0 0\n", " 31 1 1 2 2\n", " 32 2 2 1 1\n", " 33 0 0 2 1\n", " 34 1 1 2 0\n", " 35 2 2 0 0\n", " 36 0 0 1 0\n", " 37 1 1 0 1\n", " 38 2 2 2 2\n", " 39 0 0 2 1\n", " 40 1 1 1 2\n", " 41 2 2 2 0\n", " 42 0 0 1 1\n", " 43 1 1 0 0\n", " 44 2 2 2 1\n", " 45 0 0 0 0\n", " 46 1 1 1 2\n", " 47 2 2 0 2\n", " 48 0 0 0 0\n", " 49 1 1 1 0\n", " 50 2 2 0 0\n", " 51 0 0 2 1\n", " 52 1 1 0 2\n", " 53 2 2 2 2\n", " 54 0 0 2 1\n", " 55 1 1 0 0\n", " 56 2 2 2 2\n", " 57 0 0 2 2\n", " 58 1 1 0 0\n", " 59 2 2 2 2\n", " 60 0 0 1 2\n", " 61 1 1 0 1\n", " 62 2 2 0 2\n", " 63 0 0 1 2\n", " 64 1 1 0 0\n", " 65 2 2 2 2\n", " 66 0 0 1 1\n", " 67 1 1 0 2\n", " 68 2 2 0 0\n", " 69 0 0 0 1\n", " 70 1 1 1 1\n", " 71 2 2 2 0\n", " 72 0 0 0 0\n", " 73 1 1 2 0\n", " 74 2 2 1 0\n" ] } ], "source": [ "# Initial Cluster Assignment\n", "#\n", "# If xy represents our data values, then for each xy[i],\n", "# we need an array c so that c[i] records the cluster \n", "# that the points belongs to.\n", "#\n", "# To start the K-means algorithm, we might as well just\n", "# randomly assign points to clusters.\n", "#\n", "# We don't want any cluster to be empty. So the easiest\n", "# thing to do is assign point 0 to cluster 0, point 1 to\n", "# cluster 1, ..., point K-1 to cluster K-1, and then\n", "# wrap around to point K is sent to cluster 0, point K+1\n", "# to cluster 1 and so on, which is essentially using the\n", "# mod function.\n", "#\n", "# The other simple choice is to call NUMPY's random.randint()\n", "# function to return a random list of integers between 0\n", "# and k-1. The advantage of this choice is that we will\n", "# get a different initial configuration every time we call.\n", "#\n", "# Write functions:\n", "#\n", "# def initial_clusters_mod ( n, k ):\n", "# ***\n", "# return c\n", "#\n", "# def initial_clusters_random ( n, k ):\n", "# ***\n", "# return c\n", "#\n", "# For the case N = 20 and K = 3, call each function twice, \n", "# print the results, and show that the second function gives\n", "# different results each time.\n", "#\n", "def initial_clusters_mod ( n, k ):\n", " c = np.zeros ( n, dtype = np.int32 )\n", " for i in range ( 0, n ):\n", " c[i] = ( i % k )\n", " return c\n", "\n", "def initial_clusters_random ( n, k ):\n", " c = np.random.randint ( 0, k, n )\n", " return c\n", "\n", "xy = ruspini_data ( )\n", "n = 75\n", "k = 3\n", "\n", "c1 = initial_clusters_mod ( n, k )\n", "c2 = initial_clusters_mod ( n, k )\n", "c3 = initial_clusters_random ( n, k )\n", "c4 = initial_clusters_random ( n, k )\n", "\n", "print ( '' )\n", "print ( ' I MOD1 MOD2 RANDOM1 RANDOM2:' )\n", "print ( '' )\n", "for i in range ( 0, n ):\n", " print ( '%4d %4d %4d %4d %4d' % ( i, c1[i], c2[i], c3[i],c4[i] ) ) \n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Cluster Pop Energy\n", "\n", " 0 27 90289.037037\n", " 1 23 63374.782609\n", " 2 25 85914.320000\n", "\n", " Total 75 239578.139646\n" ] } ], "source": [ "# Cluster statistics #\n", "#\n", "# Assuming we have K clusters, and that the vector J=C[I] stores the\n", "# cluster J to which point X[I] is assigned, we want to compute the \n", "# following arrays for each cluster J\n", "#\n", "# POP(J) = number of items in the cluster\n", "# MN(J,2) = the mean or average of the items in the cluster\n", "# E(J) = energy of the cluster\n", "#\n", "# Use K = 3, and N = 30, and the random cluster assigment function\n", "# initial_clusters_random(), compute these quantities.\n", "#\n", "# When you like what you have written, rewrite it as a function:\n", "#\n", "# def cluster_stats ( n, xy, c, k ):\n", "# ***\n", "# return pop, mn, e\n", "#\n", "# so we can update these statistics as we change the cluster assignments.\n", "#\n", "def cluster_stats ( n, xy, c, k ):\n", "\n", " pop = np.zeros ( k, dtype = np.int32 )\n", " dim = xy.shape[1]\n", " mn = np.zeros ( [ k, dim ] )\n", " e = np.zeros ( k )\n", "\n", " for i in range ( 0, n ):\n", " j = c[i]\n", " pop[j] = pop[j] + 1\n", " mn[j,:] = mn[j,:] + xy[i,:]\n", "\n", " for j in range ( 0, k ):\n", " mn[j,:] = mn[j,:] / float ( max ( pop[j], 1 ) )\n", "\n", " for i in range ( 0, n ):\n", " j = c[i]\n", " e[j] = e[j] + ( np.linalg.norm ( xy[i,:] - mn[j,:] ) ) ** 2\n", "\n", " return pop, mn, e\n", "#\n", "# Use the function.\n", "#\n", "n = 75\n", "k = 3\n", "xy = ruspini_data ( )\n", "c = initial_clusters_random ( n, k )\n", "pop, mn, e = cluster_stats ( n, xy, c, k )\n", "\n", "print ( '' )\n", "print ( ' Cluster Pop Energy')\n", "print ( '')\n", "\n", "for j in range ( 0, k ):\n", " print ( ' %3d %4d %f' % ( j, pop[j], e[j] ) )\n", "\n", "print ( '' )\n", "print ( ' Total %4d %f' % ( np.sum ( pop ), np.sum ( e ) ) )\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Point Current Nearest\n", "\n", " 0 1 0\n", " 1 0 2\n", " 2 0 1\n", " 3 2 2\n", " 4 0 0\n", " 5 0 2\n", " 6 0 0\n", " 7 0 1\n", " 8 1 1\n", " 9 0 0\n", " 10 0 0\n", " 11 1 1\n", " 12 2 2\n", " 13 0 0\n", " 14 1 1\n", " 15 0 0\n", " 16 0 2\n", " 17 0 0\n", " 18 2 0\n", " 19 0 0\n", " 20 2 2\n", " 21 1 2\n", " 22 0 0\n", " 23 0 2\n", " 24 1 1\n", " 25 0 2\n", " 26 1 0\n", " 27 1 0\n", " 28 1 1\n", " 29 1 0\n", " 30 0 0\n", " 31 2 0\n", " 32 1 1\n", " 33 1 2\n", " 34 0 0\n", " 35 1 1\n", " 36 1 1\n", " 37 1 2\n", " 38 2 2\n", " 39 2 1\n", " 40 0 0\n", " 41 1 0\n", " 42 1 2\n", " 43 1 1\n", " 44 1 2\n", " 45 2 1\n", " 46 1 1\n", " 47 1 1\n", " 48 2 2\n", " 49 2 0\n", " 50 1 0\n", " 51 2 2\n", " 52 2 0\n", " 53 1 2\n", " 54 1 1\n", " 55 1 2\n", " 56 0 2\n", " 57 1 2\n", " 58 1 0\n", " 59 2 2\n", " 60 1 0\n", " 61 2 0\n", " 62 1 0\n", " 63 0 1\n", " 64 1 2\n", " 65 2 1\n", " 66 2 2\n", " 67 0 0\n", " 68 1 2\n", " 69 0 1\n", " 70 2 1\n", " 71 0 0\n", " 72 1 1\n", " 73 0 2\n", " 74 2 1\n" ] } ], "source": [ "# Find Nearest Cluster\n", "#\n", "# We want the data to be clustered, and we use the energy to measure this. \n", "# Energy depends on how close you are to the mean of your cluster.\n", "# Suppose a data item is in the \"wrong\" cluster. We will take this\n", "# to mean that the data item is closer to the mean of some other cluster\n", "# than it is to the mean of its current cluster. What would happen if\n", "# we transferred that item to the cluster of the closer mean? The energy\n", "# in the old cluster would go down, and the energy in the new cluster would\n", "# go up. But the energy of the old cluster will go down more than the \n", "# energy of the new cluster will go up (because they are based on the distance\n", "# to the cluster mean) That means our total energy will go down, and that's good.\n", "# \n", "# Write a function which looks like this:\n", "#\n", "# def nearest_cluster ( n, i, xy, k, mn ):\n", "# *****\n", "# return j2\n", "#\n", "# which has as input and output:\n", "# N, the number of data items;\n", "# I, the data item that we are interested in;\n", "# XY, the list of all data item coordinates;\n", "# K, the number of clusters;\n", "# MN, the cluster means;\n", "# J2, the index of the cluster mean that is closest to data item I.\n", "#\n", "def nearest_cluster ( n, i, xy, k, mn ):\n", " d2 = np.inf\n", " j2 = -1\n", " for j in range ( 0, k ):\n", " d = ( xy[i,0] - mn[j,0] ) ** 2 + ( xy[i,1] - mn[j,1] ) ** 2\n", " if ( d < d2 ):\n", " d2 = d\n", " j2 = j\n", " return j2\n", "#\n", "# Use the function on an example.\n", "#\n", "n = 75\n", "k = 3\n", "xy = np.random.random ( [ n, 2 ] )\n", "c = initial_clusters_random ( n, k )\n", "pop, mn, e = cluster_stats ( n, xy, c, k )\n", "print ( '' )\n", "print ( 'Point Current Nearest' )\n", "print ( '' )\n", "for i in range ( 0, n ):\n", " j = c[i]\n", " j2 = nearest_cluster ( n, i, xy, k, mn )\n", " print ( '%4d %4d %4d' % ( i, j, j2 ) )" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "53 data items moved\n", "\n", " I C(old) C(new)\n", "\n", " 0 1 2\n", " 1 1 2\n", " 2 2 2\n", " 3 2 2\n", " 4 0 2\n", " 5 1 2\n", " 6 1 2\n", " 7 0 2\n", " 8 2 2\n", " 9 1 2\n", "10 1 2\n", "11 1 2\n", "12 1 2\n", "13 3 2\n", "14 1 2\n", "15 0 2\n", "16 2 2\n", "17 0 2\n", "18 2 2\n", "19 1 2\n", "20 3 1\n", "21 1 1\n", "22 1 1\n", "23 1 1\n", "24 0 1\n", "25 2 1\n", "26 0 1\n", "27 1 1\n", "28 3 1\n", "29 2 1\n", "30 0 1\n", "31 0 1\n", "32 0 1\n", "33 2 1\n", "34 1 1\n", "35 1 1\n", "36 3 1\n", "37 1 1\n", "38 0 1\n", "39 3 1\n", "40 1 1\n", "41 3 3\n", "42 0 3\n", "43 2 3\n", "44 1 3\n", "45 0 3\n", "46 0 3\n", "47 3 3\n", "48 2 3\n", "49 3 3\n", "50 3 3\n", "51 0 3\n", "52 1 3\n", "53 3 3\n", "54 1 3\n", "55 1 3\n", "56 1 3\n", "57 3 3\n", "58 1 3\n", "59 2 3\n", "60 2 2\n", "61 3 2\n", "62 3 2\n", "63 2 2\n", "64 1 2\n", "65 0 2\n", "66 1 2\n", "67 3 2\n", "68 3 2\n", "69 0 2\n", "70 2 2\n", "71 0 2\n", "72 1 2\n", "73 0 2\n", "74 1 2\n" ] } ], "source": [ "# Function to reassign data\n", "#\n", "# Now we want to allow each data item to change its cluster.\n", "# Write a function which, for a given data item X(I), computes the\n", "# distance between X(I) and the mean MN(J) of each cluster.\n", "# Assume X(I) is currently assigned to cluster J=C[I], but is\n", "# closest to the mean of cluster J2. Unless J=J2, move X(I)\n", "# to the new cluster (that is, change the value of C[I])\n", "#\n", "# Write a function\n", "#\n", "# def move_data ( n, xy, c, k, mn ):\n", "# ***\n", "# return c, swaps\n", "#\n", "# which returns the updated cluster information C, but also\n", "# in SWAPS returns the number of points that changed their cluster.\n", "#\n", "def move_data ( n, xy, c, k, mn ):\n", " \n", " swaps = 0\n", "\n", " for i in range ( 0, n ):\n", " j = c[i]\n", " j2 = nearest_cluster ( n, i, xy, k, mn )\n", " if ( j2 != j ):\n", " swaps = swaps + 1\n", " c[i] = j2\n", "\n", " return c, swaps\n", "#\n", "# Try the function on the ruspini example.\n", "# The data is sorted and will break up into 4 groups naturally.\n", "#\n", "k = 4\n", "n = 75\n", "xy = ruspini_data ( )\n", "c = initial_clusters_random ( n, k )\n", "cold = c.copy ( )\n", "pop, mn, e = cluster_stats ( n, xy, c, k )\n", "\n", "c, swaps = move_data ( n, xy, c, k, mn )\n", "\n", "if ( swaps == 0 ):\n", " print ( 'No data items moved!')\n", "else:\n", " print ( '%d data items moved' % ( swaps ) )\n", " print ( '' )\n", " print ( ' I C(old) C(new)' )\n", " print ( '' )\n", " for i in range ( 0, n ):\n", " print ( '%2d %2d %2d' % ( i, cold[i], c[i] ) )\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Trial 0\n", " 0 1672.907720\n", " 1 924.645848\n", " 2 685.233876\n", " 3 665.626837\n", " 4 661.556076\n", " 5 660.697151\n", " 6 660.449910\n", " 7 660.303590\n", " 8 660.242451\n", " 9 660.216640\n", "\n", "Trial 1\n", " 0 1672.125335\n", " 1 682.548801\n", " 2 664.353694\n", " 3 662.542911\n", " 4 661.994316\n", " 5 661.598716\n", " 6 661.174854\n", " 7 660.790794\n", " 8 660.447890\n", " 9 660.195312\n", "\n", "Trial 2\n", " 0 1673.050488\n", " 1 1080.100079\n", " 2 699.487326\n", " 3 665.575445\n", " 4 664.513566\n", " 5 664.285867\n", " 6 664.228903\n", " 7 664.197820\n", " 8 664.125623\n", " 9 664.043365\n" ] } ], "source": [ "# Now take 10000 random points, and try to cluster them into 3 groups.\n", "#\n", "# Use an initial random clustering, and then recluster 10 times:\n", "#\n", "# n = 10000\n", "# k = 3\n", "# xy = np.random.random ( [ n, 2 ] )\n", "# c = initial_clusters_random ( n, k )\n", "# for step in range ( 0, 10 ):\n", "# get cluster stats\n", "# move data\n", "# print total energy\n", "#\n", "# We said that the total energy should always go down \n", "# (or at least never go up!). Is this what you observe.\n", "# \n", "# What is the minimum energy you reach after 10 steps?\n", "# Repeat the whole calculation two more times and note the\n", "# minimum energy you reach after 10 steps. Is there any difference?\n", "# Much difference?\n", "#\n", "n = 10000\n", "k = 3\n", "xy = np.random.random ( [ n, 2 ] )\n", "\n", "for rep in range ( 0, 3 ):\n", " print ( '' )\n", " print ( 'Trial %d' % ( rep ) )\n", " c = initial_clusters_random ( n, k )\n", " for step in range ( 0, 10 ):\n", " pop, mn, e = cluster_stats ( n, xy, c, k )\n", " print ( '%2d %f' % (step, np.sum ( e ) ) )\n", " c, swaps = move_data ( n, xy, c, k, mn )\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No point swaps!\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Look at a larger, random data set.\n", "# Watch what happens to energy as the iterations proceed.\n", "#\n", "n = 10000\n", "k = 10\n", "xy = np.random.random ( [ n, 2 ] )\n", "c = initial_clusters_random ( n, k )\n", "ivec = []\n", "evec = []\n", "for i in range ( 0, 100 ):\n", " pop, mn, e = cluster_stats ( n, xy, c, k )\n", " ivec.append ( i )\n", " evec.append ( sum ( e ) )\n", " c, swaps = move_data ( n, xy, c, k, mn )\n", " if ( swaps == 0 ):\n", " print ( 'No point swaps!')\n", " break\n", " \n", "plt.plot ( ivec, evec )\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }