[Updates at bottom (including solution source code)]
I have a challenging business problem that a computer can help solve.
Along a mountainous region flows a long winding river with strong currents. Along certain parts of the river are plots of environmentally sensitive land suitable for growing a particular type of rare fruit that is in very high demand. Once field laborers harvest the fruit, the clock starts ticking to get the fruit to a processing plant. It's very costly to try and send the fruits upstream or over land or air. By far the most cost effective mechanism to ship them to the plant is downstream in containers powered solely by the river's constant current. We have the capacity to build 10 processing plants and need to locate these along the river to minimize the total time the fruits spend in transit. The fruits can take however long before reaching the nearest downstream plant but that time directly hurts the price at which they can be sold. Effectively, we want to minimize the sum of the distances to the nearest respective downstream plant. A plant can be located as little as 0 meters downstream from a fruit access point.
The question is: In order to maximize profits, how far up the river should we build the 10 processing plants if we have found 32 fruit growing regions, where the regions' distances upstream from the base of the river are (in meters): 10, 40, 90, 160, 250, 360, 490, ... (n^2)*10 ... 9000, 9610, 10320?
[It is hoped that all work going towards solving this problem and towards creating similar problems and usage scenarios can help raise awareness about and generate popular resistance towards the damaging and stifling nature of software/business method patents (to whatever degree those patents might be believed to be legal within a locality).]
UPDATES
Update1: Forgot to add: I believe this question is a special case of this one.
Update2: One algorithm I wrote gives an answer in a fraction of a second, and I believe is rather good (but it's not yet stable across sample values). I'll give more details later, but the short is as follows. Place the plants at equal spacings. Cycle over all the inner plants where at each plant you recalculate its position by testing every location between its two neighbors until the problem is solved within that space (greedy algorithm). So you optimize plant 2 holding 1 and 3 fixed. Then plant 3 holding 2 and 4 fixed... When you reach the end, you cycle back and repeat until you go a full cycle where every processing plant's recalculated position stops varying.. also at the end of each cycle, you try to move processing plants that are crowded next to each other and are all near each others' fruit dumps into a region that has fruit dumps far away. There are many ways to vary the details and hence the exact answer produced. I have other candidate algorithms, but all have glitches. [I'll post code later.] Just as Mike Dunlavey mentioned below, we likely just want "good enough".
To give an idea of what might be a "good enough" result:
10010 total length of travel from 32 locations to plants at
{10,490,1210,1960,2890,4000,5290,6760,8410,9610}
Update3: mhum gave the correct exact solution first but did not (yet) post a program or algorithm, so I wrote one up that yields the same values.
/************************************************************
This program can be compiled and run (eg, on Linux):
$ gcc -std=c99 processing-plants.c -o processing-plants
$ ./processing-plants
************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//a: Data set of values. Add extra large number at the end
int a[]={
10,40,90,160,250,360,490,640,810,1000,1210,1440,1690,1960,2250,2560,2890,3240,3610,4000,4410,4840,5290,5760,6250,6760,7290,7840,8410,9000,9610,10240,99999
};
//numofa: size of data set
int numofa=sizeof(a)/sizeof(int);
//a2: will hold (pt to) unique data from a and in sorted order.
int *a2;
//max: size of a2
int max;
//num_fixed_loc: at 10 gives the solution for 10 plants
int num_fixed_loc;
//xx: holds index values of a2 from the lowest error winner of each cycle memoized. accessed via memoized offset value. Winner is based off lowest error sum from left boundary upto right ending boundary.
//FIX: to be dynamically sized.
int xx[1000000];
//xx_last: how much of xx has been used up
int xx_last=0;
//SavedBundle: data type to "hold" memoized values needed (total traval distance and plant locations)
typedef struct _SavedBundle {
long e;
int xx_offset;
} SavedBundle;
//sb: (pts to) lookup table of all calculated values memoized
SavedBundle *sb; //holds winning values being memoized
//Sort in increasing order.
int sortfunc (const void *a, const void *b) {
return (*(int *)a - *(int *)b);
}
/****************************
Most interesting code in here
****************************/
long full_memh(int l, int n) {
long e;
long e_min=-1;
int ti;
if (sb[l*max+n].e) {
return sb[l*max+n].e; //convenience passing
}
for (int i=l+1; i<max-1; i++) {
e=0;
//sum first part
for (int j=l+1; j<i; j++) {
e+=a2[j]-a2[l];
}
//sum second part
if (n!=1) //general case, recursively
e+=full_memh(i, n-1);
else //base case, iteratively
for (int j=i+1; j<max-1; j++) {
e+=a2[j]-a2[i];
}
if (e_min==-1) {
e_min=e;
ti=i;
}
if (e<e_min) {
e_min=e;
ti=i;
}
}
sb[l*max+n].e=e_min;
sb[l*max+n].xx_offset=xx_last;
xx[xx_last]=ti; //later add a test or a realloc, etc, if approp
for (int i=0; i<n-1; i++) {
xx[xx_last+(i+1)]=xx[sb[ti*max+(n-1)].xx_offset+i];
}
xx_last+=n;
return e_min;
}
/*************************************************************
Call to calculate and print results for given number of plants
*************************************************************/
int full_memoization(int num_fixed_loc) {
char *str;
long errorsum; //for convenience
//Call recursive workhorse
errorsum=full_memh(0, num_fixed_loc-2);
//Now print
str=(char *) malloc(num_fixed_loc*20+100);
sprintf (str,"\n%4d %6d {%d,",num_fixed_loc-1,errorsum,a2[0]);
for (int i=0; i<num_fixed_loc-2; i++)
sprintf (str+strlen(str),"%d%c",a2[ xx[ sb[0*max+(num_fixed_loc-2)].xx_offset+i ] ], (i<num_fixed_loc-3)?',':'}');
printf ("%s",str);
return 0;
}
/**************************************************
Initialize and call for plant numbers of many sizes
**************************************************/
int main (int x, char **y) {
int t;
int i2;
qsort(a,numofa,sizeof(int),sortfunc);
t=1;
for (int i=1; i<numofa; i++)
if (a[i]!=a[i-1])
t++;
max=t;
i2=1;
a2=(int *)malloc(sizeof(int)*t);
a2[0]=a[0];
for (int i=1; i<numofa; i++)
if (a[i]!=a[i-1]) {
a2[i2++]=a[i];
}
sb = (SavedBundle *)calloc(sizeof(SavedBundle),max*max);
for (int i=3; i<=max; i++) {
full_memoization(i);
}
free(sb);
return 0;
}
Let me give you a simple example of a Metropolis-Hastings algorithm. Suppose you have a state vector x, and a goodness-of-fit function P(x), which can be any function you care to write.
Suppose you have a random distribution Q that you can use to modify the vector, such as
x' = x + N(0, 1) * sigma
, where N is a simple normal distribution about 0, and sigma is a standard deviation of your choosing.Usually it is done with a burn-in time of maybe 1000 cycles, after which you start collecting samples. After another maybe 10,000 cycles, the average of the samples is what you take as an answer.
It requires diagnostics and tuning. Typically the samples are plotted, and what you are looking for is a "fuzzy caterpilar" plot that is stable (doesn't move around much) and has a high acceptance rate (very fuzzy). The main parameter you can play with is sigma. If sigma is too small, the plot will be fuzzy but it will wander around. If it is too large, the plot will not be fuzzy - it will have horizontal segments. Often the starting vector x is chosen at random, and often multiple starting vectors are chosen, to see if they end up in the same place.
It is not necessary to vary all components of the state vector x at the same time. You can cycle through them, varying one at a time, or some such method.
Also, if you don't need the diagnostic plot, it may not be necessary to save the samples, but just calculate the average and variance on the fly.
In the applications I'm familiar with, P(x) is a measure of probability, and it is typically in log-space, so it can vary from 0 to negative infinity. Then to do the "maybe accept" step it is
(exp(logp' - logp))
Unless I've made an error, here are exact solutions (obtained through a dynamic programming approach):