\section{Example: The Pizza Truck Problem} The town of Grayville has three pizza trucks, which are painted red, green, and blue. By tradition, every house in Grayville has always ordered from the same truck, which then sends the delivery by scooter. \vskip 0.1in \begin{center} \begin{tikzpicture} [ initial_style/.style={rectangle,draw=green,fill=green!10,thick }, scale = { 0.25 } ] \node (A) at ( 3,28) {\includegraphics[width=0.5in]{red_house.png} }; \node (B) at ( 9,16) {\includegraphics[width=0.5in]{red_house.png} }; \node (C) at (12,26) {\includegraphics[width=0.5in]{red_house.png} }; \node (D) at ( 4, 4) {\includegraphics[width=0.5in]{blue_house.png} }; \node (E) at (18,30) {\includegraphics[width=0.5in]{green_house.png} }; \node (F) at (16,13) {\includegraphics[width=0.5in]{green_house.png} }; \node (G) at (23,26) {\includegraphics[width=0.5in]{green_house.png} }; \node (H) at (24, 2) {\includegraphics[width=0.5in]{blue_house.png} }; \node (I) at (26,10) {\includegraphics[width=0.5in]{blue_house.png} }; \node (J) at (28,18) {\includegraphics[width=0.5in]{red_house.png} }; \node (K) at (33,30) {\includegraphics[width=0.5in]{green_house.png} }; \node (L) at (33,10) {\includegraphics[width=0.5in]{green_house.png} }; \node (M) at (40,24) {\includegraphics[width=0.5in]{green_house.png} }; \node (N) at (36, 6) {\includegraphics[width=0.5in]{blue_house.png} }; \node (O) at (15,20) {\includegraphics[width=0.5in]{blue_house.png} }; \node (P) at (11,15) {\includegraphics[width=0.75in]{red_pizza.png} }; \node (Q) at (20,25) {\includegraphics[width=0.75in]{green_pizza.png} }; \node (R) at (20, 5) {\includegraphics[width=0.75in]{blue_pizza.png} }; \draw [-,very thick,red] (A) -- (P); \draw [-,very thick,red] (B) -- (P); \draw [-,very thick,red] (C) -- (P); \draw [-,very thick,blue] (D) -- (R); \draw [-,very thick,green] (E) -- (Q); \draw [-,very thick,green] (F) -- (Q); \draw [-,very thick,green] (G) -- (Q); \draw [-,very thick,blue] (H) -- (R); \draw [-,very thick,blue] 
(I) -- (R);
\draw [-,very thick,red] (J) -- (P);
\draw [-,very thick,green] (K) -- (Q);
\draw [-,very thick,green] (L) -- (Q);
\draw [-,very thick,green] (M) -- (Q);
\draw [-,very thick,blue] (N) -- (R);
\draw [-,very thick,blue] (O) -- (R);
\end{tikzpicture}
\end{center}

The price of gas has risen, and the owner of the pizza trucks asks a consultant if there is a way to save money. The consultant's advice comes in three steps:
\begin{packed_enumerate}
\item{Each house should be served by the nearest pizza truck. {\it{(Assign each data item to the nearest centroid.)}} The owner is impressed by this change, which lowers the monthly gas bill. But is that the best we can do? It turns out that, under the new assignment, the trucks are no longer well placed.}
\item{Each truck should be moved to the center of its service area. {\it{(Replace each centroid by the average of its data items.)}} ``That has got to be it,'' says the owner. Not quite: moving the trucks may leave some houses slightly closer to a different truck than the one they were assigned to.}
\item{Unless no house changed trucks in the last pass, go back to step \#1.}
\end{packed_enumerate}

\begin{center}
\begin{tikzpicture} [ initial_style/.style={rectangle,draw=green,fill=green!10,thick }, scale = { 0.25 } ]
\node (A) at ( 3,28) {\includegraphics[width=0.5in]{red_house.png} };
\node (B) at ( 9,16) {\includegraphics[width=0.5in]{red_house.png} };
\node (C) at (12,26) {\includegraphics[width=0.5in]{red_house.png} };
\node (D) at ( 4, 4) {\includegraphics[width=0.5in]{blue_house.png} };
\node (E) at (18,30) {\includegraphics[width=0.5in]{red_house.png} };
\node (F) at (16,13) {\includegraphics[width=0.5in]{blue_house.png} };
\node (G) at (23,26) {\includegraphics[width=0.5in]{green_house.png} };
\node (H) at (24, 2) {\includegraphics[width=0.5in]{blue_house.png} };
\node (I) at (26,10) {\includegraphics[width=0.5in]{blue_house.png} };
\node (J) at (28,18) {\includegraphics[width=0.5in]{green_house.png} };
\node (K) at (33,30) {\includegraphics[width=0.5in]{green_house.png}
};
\node (L) at (33,10) {\includegraphics[width=0.5in]{blue_house.png} };
\node (M) at (40,24) {\includegraphics[width=0.5in]{green_house.png} };
\node (N) at (36, 6) {\includegraphics[width=0.5in]{blue_house.png} };
\node (O) at (15,20) {\includegraphics[width=0.5in]{red_house.png} };
\node (P) at (11.4, 24.0) {\includegraphics[width=0.75in]{red_pizza.png} };
\node (Q) at (31.0, 24.5) {\includegraphics[width=0.75in]{green_pizza.png} };
\node (R) at (23.1, 7.5) {\includegraphics[width=0.75in]{blue_pizza.png} };
\draw [-,very thick,red] (A) -- (P);
\draw [-,very thick,red] (B) -- (P);
\draw [-,very thick,red] (C) -- (P);
\draw [-,very thick,blue] (D) -- (R);
\draw [-,very thick,red] (E) -- (P);
\draw [-,very thick,blue] (F) -- (R);
\draw [-,very thick,green] (G) -- (Q);
\draw [-,very thick,blue] (H) -- (R);
\draw [-,very thick,blue] (I) -- (R);
\draw [-,very thick,green] (J) -- (Q);
\draw [-,very thick,green] (K) -- (Q);
\draw [-,very thick,blue] (L) -- (R);
\draw [-,very thick,green] (M) -- (Q);
\draw [-,very thick,blue] (N) -- (R);
\draw [-,very thick,red] (O) -- (P);
\end{tikzpicture}
\end{center}

This simple example suggests how $k$-means clustering can reorganize data so that it is grouped more tightly. In this case, the regrouping simply reduces the total distance traveled by the delivery scooters. In other cases, we expect that the grouping may reflect some meaningful fact about the data.

\section{Example: Coding the Pizza Truck Problem}

Assume that the arrays {\tt{x}} and {\tt{y}} contain the coordinates of each house, that {\tt{s}} and {\tt{t}} contain the coordinates of each truck, and that {\tt{rc}}, {\tt{gc}} and {\tt{bc}} list the houses served by the red, green, and blue trucks respectively.
\vskip 0.1in
Our first improvement is to assign each house to the nearest truck. To do this, we need to compute the distance from each house to each truck, and update the assignment vectors. We can also compute the current cost, which is the sum of the distances from each house to its assigned truck.
\begin{lstlisting}
import numpy as np
# distance from every house to the red, green, and blue trucks
rd = np.sqrt ( ( x - s[0] )**2 + ( y - t[0] )**2 )
gd = np.sqrt ( ( x - s[1] )**2 + ( y - t[1] )**2 )
bd = np.sqrt ( ( x - s[2] )**2 + ( y - t[2] )**2 )
# indices of the houses strictly nearest to each truck (exact ties stay unassigned)
rc = np.where ( ( rd < bd ) & ( rd < gd ) )
gc = np.where ( ( gd < bd ) & ( gd < rd ) )
bc = np.where ( ( bd < rd ) & ( bd < gd ) )
# total distance from each house to its assigned truck
cost = np.sum ( bd[bc] ) + np.sum ( rd[rc] ) + np.sum ( gd[gc] )
\end{lstlisting}
\vskip 0.1in
Because we have reassigned some houses, it makes sense to move each truck to the center of its set of houses. We just have to average the coordinates of the houses each truck serves:
\begin{lstlisting}
# move each truck to the mean position of its houses
s[0] = np.mean ( x[rc] )
t[0] = np.mean ( y[rc] )
s[1] = np.mean ( x[gc] )
t[1] = np.mean ( y[gc] )
s[2] = np.mean ( x[bc] )
t[2] = np.mean ( y[bc] )
\end{lstlisting}
Because the trucks have moved, we need to recompute the distances and update the cost.
\vskip 0.1in
These two steps of reassigning houses and moving trucks are repeated until no house has to be reassigned, or the cost stops changing.
\vskip 0.1in
For the example problem in the illustration, here is how the cost changes:
\begin{lstlisting}
0: 181.49  Initial
1: 146.29  Reassign houses
2: 117.08  Move trucks, reassign houses
3: 116.73  Move trucks, reassign houses
4: 116.73  Move trucks, reassign houses, NO CHANGE
\end{lstlisting}
\vskip 0.1in
A simple implementation of this procedure can be found in the file {\it{pizza\_kmeans.py}}.
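\vskip 0.1in
The two steps above can be combined into a single loop. The following is only a minimal sketch under stated assumptions, and is not taken from {\it{pizza\_kmeans.py}}: the function name {\tt{pizza\_kmeans}}, the argument {\tt{max\_iter}}, and the use of one label per house (found with {\tt{np.argmin}}) in place of the three index lists {\tt{rc}}, {\tt{gc}}, {\tt{bc}} are choices made here for illustration. The house and truck coordinates are the ones read off the first figure.

```python
import numpy as np

def pizza_kmeans(x, y, s, t, max_iter=100):
    """Alternate 'reassign houses' and 'move trucks' until no house changes truck.

    Hypothetical helper for illustration. x, y are house coordinates;
    s, t are the starting truck coordinates.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s = np.array(s, dtype=float)  # float copies, so the caller's data is untouched
    t = np.array(t, dtype=float)
    label = None
    costs = []
    for _ in range(max_iter):
        # d[i, k] = distance from house i to truck k
        d = np.sqrt((x[:, None] - s[None, :]) ** 2 + (y[:, None] - t[None, :]) ** 2)
        # nearest truck for each house; an exact tie goes to the lower truck index
        new_label = np.argmin(d, axis=1)
        # cost = sum of distances from each house to its assigned truck
        costs.append(d[np.arange(x.size), new_label].sum())
        if label is not None and np.array_equal(new_label, label):
            break  # no house was reassigned, so the trucks would not move either
        label = new_label
        for k in range(s.size):
            members = label == k
            if members.any():  # a truck with no houses stays where it is
                s[k] = x[members].mean()
                t[k] = y[members].mean()
    return label, s, t, costs

# Houses A..O and the red, green, blue trucks, as placed in the first figure.
x = [3, 9, 12, 4, 18, 16, 23, 24, 26, 28, 33, 33, 40, 36, 15]
y = [28, 16, 26, 4, 30, 13, 26, 2, 10, 18, 30, 10, 24, 6, 20]
s = [11, 20, 20]
t = [15, 25, 5]

label, s_new, t_new, costs = pizza_kmeans(x, y, s, t)
for step, cost in enumerate(costs, start=1):
    print(f"{step}: {cost:.2f}")
```

With these coordinates the loop should stop on its third pass, printing costs close to 146.29, 117.08, and 116.73, which correspond to steps 1 to 3 of the cost table. Unlike the strict {\tt{<}} tests in the listing above, {\tt{np.argmin}} always assigns a tied house to some truck, so no house is dropped from the cost.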