{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "density_discrete.ipynb\n", "\n", "Discussion: Sampling from a discrete distribution over a square\n", "\n", "Licensing: This code is distributed under the GNU LGPL license.\n", " \n", "Modified: 09 November 2016\n", "\n", "Author: John Burkardt, Lukas Bystricky" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using matplotlib backend: agg\n" ] } ], "source": [ "# Import necessary libraries and set plot option\n", "%matplotlib\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'svg'\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import math\n", "import scipy.spatial as spatial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sampling a discrete distribution #\n", "\n", "From our work on sampling and densities, we know that the sampling method\n", "is the key to whether the CVT is uniform or not. Rather than specifying\n", "a density function rho(x,y), we can modify our sampling technique, as\n", "we did in the circle example, and get a nonuniform result.\n", "\n", "We'd like to be able to model population density over a state. Populations\n", "are not defined by a function, but by chunks of data, that is, so many people\n", "live in this block, and so many in this block, and so on. Within the block,\n", "we might assume the population density is constant, but from block to block\n", "it must vary. \n", "\n", "In order to do a CVT, we would need to be able, say, to pick 10,000 people\n", "uniformly at random from the population; however, these people would be\n", "distributed nonuniformly with respect to geography. (We expect a lot\n", "of samples from Miami, but not many from the Everglades!)\n", "\n", "To understand how this can be done, we will look at an example that embodies\n", "the same problem, but with simple geometry and only a few sets of data.\n", "So we'll look at a square, which has been subdivided into squares, and for\n", "which the \"population\" of each subsquare is known.\n", "\n", "We will work out a fair way of sampling the population, plot a typical sample,\n", "and then work out how a CVT iteration could be set up for such a problem.\n", "If we understand this problem, then the only difference with Florida is \n", "a more complicated geometry and larger set of population data.\n", "\n", "We will look at:\n", "* reading the population data from a file\n", "* sampling from the population and plotting the sample\n", "* setting up a CVT for this toy problem." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Our discrete data #\n", "\n", "We suppose our \"state\" is a square in [0,5]x[0,5], divided into a 5x5 array of \n", "unit squares, which we can think of as 25 counties. We will find that indexing \n", "individual counties can be confusing, since sometimes we will want to order them \n", "by an index running 0 to 24, and other times by double indices running (0,0) to \n", "(4,4), and other times by the (X,Y) coordinates of the vertices of the corners\n", "of the county. Let us try to make sense out of this and agree on orderings.\n", " \n", "Here is how we will index the counties in linear order:\n", "\n", " 0 1 2 3 4\n", " 5 6 7 8 9\n", " 10 11 12 13 14\n", " 15 16 17 18 19\n", " 20 21 22 23 24\n", "\n", "Here is how we will index the counties in (I,J) or (row,column) order:\n", "\n", " (0,4) (1,4) (2,4) (3,4) (4,4)\n", " (0,3) (1,3) (2,3) (3,3) (4,3)\n", " (0,2) (1,2) (2,2) (3,2) (4,2)\n", " (0,1) (1,1) (2,1) (3,1) (4,1)\n", " (0,0) (1,0) (2,0) (3,0) (4,0)\n", "\n", "Notice that county 13, with (I,J) coordinates (3,2) is inside the following\n", "(X,Y) coordinate box:\n", "\n", " (3.0,3.0)----(4.0,3.0)\n", " | |\n", " | |\n", " (3.0,2.0)----(4.0,2.0)\n", " \n", "Now we supply, for each county, a population, that is, the number of people\n", "who live there. Here's the data, arranged in the same ordering. We can imagine \n", "that the data is measured in thousands, so county 0 has 5,000 inhabitants.\n", "\n", " 5 5 2 2 2\n", " 3 3 2 2 2\n", " 3 3 5 5 5\n", " 1 2 5 5 10\n", " 1 2 5 10 10\n", "\n", "These numbers add up to 100, so we can see right away that our probability\n", "table, indicating how often we should pick a person from each county is:\n", "\n", " 0.05 0.05 0.02 0.02 0.02\n", " 0.03 0.03 0.02 0.02 0.02\n", " 0.03 0.03 0.05 0.05 0.05\n", " 0.01 0.02 0.05 0.05 0.10\n", " 0.01 0.02 0.05 0.10 0.10\n", " \n", "Now, we have seen this kind of problem before when sampling a triangle of\n", "a polygon before. What we need now is a running sum of probabilities, so\n", "that we can pick a random number R, and use that to choose at random the\n", "county from which to sample. If we \"read\" the counties in the linear order,\n", "then our running sum will look like this:\n", "\n", " 0.05 0.10 0.12 0.14 0.16\n", " 0.19 0.22 0.24 0.26 0.28\n", " 0.31 0.34 0.39 0.44 0.49\n", " 0.50 0.52 0.57 0.62 0.72\n", " 0.73 0.75 0.80 0.90 1.00\n", " \n", "So if our random value R was 0.3173, we would want to pick the county whose\n", "linear index is 11, with (row,column) index (2,1), and (x,y) coordinates\n", "(2.0,1.0) for the lower left corner of the box." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Code for discrete data\n", "#\n", "# Write a function that returns the population data as a list of 25 values;\n", "```python\n", "def pop_data ( )\n", " pop = np.array ( [ ... ])\n", " return pop\n", "```\n", "#\n", "# Write a function that converts the population data into probabilities.\n", "#\n", "def prob_data ( )\n", " prob = np.array ( [ ... ] )\n", " return prob\n", "```\n", "#\n", "# Write a function that returns the running sum of the probabilities.\n", "#\n", "def cdf_data ( )\n", " cdf = np.array ( [...])\n", " return cdf\n", "```\n", "#\n", "# Write a function that returns M random samples of the numbers from 0 to 24,\n", "# using the CDF data. That is, this function simply picks a county, with\n", "# a weight based on the population. If we ask for 100 samples, then we'd expect, \n", "# for example, to see county 24 show up about 10 times in the list.\n", "#\n", "def county_sample_data ( )\n", " county = np.zeros ( m )\n", " ???\n", " return county\n", "```\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }