Number of peaks in histogram

2019-04-10 23:33发布

问题:

I have 1D data that represent some intensity values. I want to detect number of components in these data (clusters of points with similar intensity, or alternatively number of "peaks" in histogram created from this data).

This approach: 1D multiple peak detection? is not very useful for me, because one "peak" can contain more local maximums (see image below).

Of cause, I can use statistical approach, for example, I can try to fit data for 1,2,3,....n peaks, then calculate BIC, AIC or whatever for each fit. And finally use elbow method for number of clusters determination. However, I want to detect approximate number of peaks as fast as possible and fitting gaussian mixture is quite time consuming procedure.

My approach

So I came up with following approach (in C++). It takes histogram bins heights (y) and searches for indices in which y values start to decline. Then values lower than y tolerance (yt) are filtered. And finally, indices that are near to other using x tolerance (xt) are filtered too:

Indices StatUtils::findLocalMaximas(const Points1D &y, int xt, int yt) {

  // Result indices
  Indices indices;

  // Find all local maximas
  int imax = 0;
  double max = y[0];
  bool inc = true;
  bool dec = false;
  for (int i = 1; i < y.size(); i++) {    

    // Changed from decline to increase, reset maximum
    if (dec && y[i - 1] < y[i]) {
      max = std::numeric_limits<double>::min();
      dec = false;
      inc = true;
    }

    // Changed from increase to decline, save index of maximum
    if (inc && y[i - 1] > y[i]) {
       indices.append(imax);
       dec = true;
       inc = false;
    }

    // Update maximum
    if (y[i] > max) {
       max = y[i];
       imax = i;
    }
  }

  // If peak size is too small, ignore it
  int i = 0;
  while (indices.count() >= 1 && i < indices.count()) {
    if (y[indices.at(i)] < yt) {
      indices.removeAt(i);
    } else {
      i++;
    }
  }

  // If two peaks are near to each other, take only the largest one
  i = 1;
  while (indices.count() >= 2 && i < indices.count()) {
    int index1 = indices.at(i - 1);
    int index2 = indices.at(i);
    if (abs(index1 - index2) < xt) {
      indices.removeAt(y[index1] < y[index2] ? i-1 : i);
    } else {
      i++;
    }
  }
  return indices;
}

Problem with approach

Problem with this solution is that strongly depends on those tolerance values (xt and yt). So I have to have information about minimum allowed distance among peaks. Moreover, there are isolated outliers in my data that are higher then maximums of those smaller peaks.

Could you suggest some other approach how to determine number of peaks for data similar to those in attached figure.

回答1:

You could use my method of approximate Gaussian mixtures:

  • it is a robust statistical method

  • it does not depend on absolute thresholds; it only has two parameters that are relative (normalized) quantities, are easily controlled, and same values apply to different datasets

  • unlike the elbow method and most statistical methods, it estimates the number of modes dynamically in a single EM (expectation-maximization) run. It starts with every data point as an independent mode and deletes "overlapping" modes at every iteration.

  • it is fast because it employs approximate nearest neighbor (ANN) search at each iteration and its updates take into account only the k nearest neighbors, not all data points.

There is an online Matlab demo so you can easily experiment on a small dataset. In our C++ implementation we use FLANN for nearest neighbor search at large scale. Unfortunately this implementation is not public but I could give you some version if you're interested.