I have a set of more than 2000 numbers gathered from measurements. I want to sample from this data set, about 10 values per test, while preserving the probability distribution both overall and, as far as approximately possible, within each test. For example, in each test I want some small values, some mid-range values, and some large values, with the mean and variance approximately close to those of the original distribution. Combining all the tests, I also want the overall mean and variance of all the samples to be approximately close to those of the original distribution.
As my dataset follows a long-tail probability distribution, the amount of data at each quantile is not the same:
Fig 1. Density plot of ~2k elements of data.
I am using Java. Right now I draw a random index from a uniform distribution over the array positions and return the data element at that position:
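public int getRandomData() {
    // Measured values, in the order they were recorded (list truncated here).
    int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432, ...};
    int length = data.length;
    Random r = new Random();
    // Pick an index uniformly at random from [0, length).
    int randomInt = r.nextInt(length);
    return data[randomInt];
}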
I don't know whether this works the way I want, because I use the data in the order it was measured, and that order has a large amount of serial correlation.
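Random sampling preserves the probability distribution.
It works as you want. The order of the data is irrelevant: because the index is drawn uniformly and independently on every call, each element is selected with probability 1/n regardless of where it sits in the array, so the serial correlation of the measurement order does not carry over into the samples.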
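If you want to check this empirically, here is a minimal sketch. The class name `EmpiricalSampler`, the fixed seed, and the synthetic data in `main` (squares stored in ascending order, standing in for a long-tailed, strongly ordered data set) are assumptions made for illustration, not your measurements. It draws uniform random indices exactly as in the question and compares the mean and variance of the draws with those of the full array.

```java
import java.util.Random;

// Illustrative helper (hypothetical name): samples with replacement from a fixed data set.
final class EmpiricalSampler {
    private final int[] data;
    private final Random rng;

    EmpiricalSampler(int[] data, long seed) {
        this.data = data.clone();      // defensive copy; storage order is kept as-is
        this.rng = new Random(seed);   // one generator reused across all draws
    }

    // One draw: uniform random index, same idea as getRandomData() in the question.
    int next() {
        return data[rng.nextInt(data.length)];
    }

    // k independent draws (with replacement).
    int[] sample(int k) {
        int[] out = new int[k];
        for (int i = 0; i < k; i++) {
            out[i] = next();
        }
        return out;
    }

    static double mean(int[] xs) {
        double sum = 0.0;
        for (int x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population variance (divides by n).
    static double variance(int[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (int x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / xs.length;
    }

    public static void main(String[] args) {
        // Assumed stand-in data: skewed values stored in strictly ascending order.
        int[] data = new int[2000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i * i;
        }
        EmpiricalSampler sampler = new EmpiricalSampler(data, 42L);
        int[] draws = sampler.sample(100_000);
        System.out.println("data  mean=" + mean(data) + " var=" + variance(data));
        System.out.println("draws mean=" + mean(draws) + " var=" + variance(draws));
    }
}
```

Over many draws the two lines should agree closely even though the array is sorted. A single test of about 10 values will still scatter around the data set's mean and variance, and more so for a long-tailed distribution; that is sampling noise, not an effect of the storage order.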