I have 1D data that represent some intensity values. I want to detect number of components in these data (clusters of points with similar intensity, or alternatively number of "peaks" in histogram created from this data).
This approach: 1D multiple peak detection? is not very useful for me, because one "peak" can contain more local maximums (see image below).
Of cause, I can use statistical approach, for example, I can try to fit data for 1,2,3,....n peaks, then calculate BIC, AIC or whatever for each fit. And finally use elbow method for number of clusters determination. However, I want to detect approximate number of peaks as fast as possible and fitting gaussian mixture is quite time consuming procedure.
My approach
So I came up with following approach (in C++). It takes histogram bins heights (y) and searches for indices in which y values start to decline. Then values lower than y tolerance (yt) are filtered. And finally, indices that are near to other using x tolerance (xt) are filtered too:
Indices StatUtils::findLocalMaximas(const Points1D &y, int xt, int yt) {
// Result indices
Indices indices;
// Find all local maximas
int imax = 0;
double max = y[0];
bool inc = true;
bool dec = false;
for (int i = 1; i < y.size(); i++) {
// Changed from decline to increase, reset maximum
if (dec && y[i - 1] < y[i]) {
max = std::numeric_limits<double>::min();
dec = false;
inc = true;
}
// Changed from increase to decline, save index of maximum
if (inc && y[i - 1] > y[i]) {
indices.append(imax);
dec = true;
inc = false;
}
// Update maximum
if (y[i] > max) {
max = y[i];
imax = i;
}
}
// If peak size is too small, ignore it
int i = 0;
while (indices.count() >= 1 && i < indices.count()) {
if (y[indices.at(i)] < yt) {
indices.removeAt(i);
} else {
i++;
}
}
// If two peaks are near to each other, take only the largest one
i = 1;
while (indices.count() >= 2 && i < indices.count()) {
int index1 = indices.at(i - 1);
int index2 = indices.at(i);
if (abs(index1 - index2) < xt) {
indices.removeAt(y[index1] < y[index2] ? i-1 : i);
} else {
i++;
}
}
return indices;
}
Problem with approach
Problem with this solution is that strongly depends on those tolerance values (xt and yt). So I have to have information about minimum allowed distance among peaks. Moreover, there are isolated outliers in my data that are higher then maximums of those smaller peaks.
Could you suggest some other approach how to determine number of peaks for data similar to those in attached figure.