Extracting boundaries of dense regions of 1s in a

I'm not sure how to word my problem. But here it is...

I have a huge list of 1s and 0s [Total length = 53820].

Example of what the list looks like: [0,1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1...........]

The visualization is given below.

x-axis: index of the element (from 0 to 53820)

y-axis: value at that index (i.e. 1 or 0)

Input Plot--> (http://i67.tinypic.com/2h5jq5e.png)

The plot clearly shows 3 dense areas where 1s occur more often. I have drawn on top of the plot (the ugly black lines) to mark the visually dense areas. I want to know the x-axis index numbers of the start and end boundaries of each dense area.

I have tried extracting the chunks of 1s and saving the start index of each in a new list named 'starts'. The function returns a list of dictionaries like this:

{'start': 0, 'count': 15, 'end': 16}, {'start': 2138, 'count': 3, 'end': 2142}, {'start': 2142, 'count': 3, 'end': 2146}, {'start': 2461, 'count': 1, 'end': 2463}, {'start': 2479, 'count': 45, 'end': 2525}, {'start': 2540, 'count': 2, 'end': 2543}

Then, after setting a threshold, I compared adjacent elements of starts, which gives the apparent boundaries of the dense areas.

THR = 2000
results = []
cues = {'start': 0, 'stop': 0}
result, starts = densest(preds)  # Function that returns the list of dictionaries shown above
cuestart = False  # Flag to check if looking for start or stop of dense boundary
for i, j in zip(range(0, len(starts)), range(1, len(starts))):
    now = starts[i]
    nextf = starts[j]

    if nextf - now > THR:
        if not cuestart:
            cues['start'] = nextf
            cues['stop'] = nextf
            cuestart = True

        else:  # cuestart is already set
            cues['stop'] = now
            cuestart = False
            cues = {'start': 0, 'stop': 0}


The output and corresponding plot looks like this.

[{'start': 2138, 'stop': 6654}, {'start': 23785, 'stop': 31553}, {'start': 38765, 'stop': 38765}]

Output Plot --> (http://i63.tinypic.com/23hom6o.png)

This method fails to get the last dense region, as seen in the plot, and it also fails on other data of a similar kind.

P.S. I have also tried 'KDE' on this data and 'distplot' using seaborn but that gives me plots directly and I am unable to extract the boundary values from that. The link for that question is here (Getting dense region boundary values from output of KDE plot)


OK, you need an answer...

First, the imports (we are going to use LineCollections)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

Next, definition of constants

N = 1001
np.random.seed(20190515)

and generation of fake data

x = np.linspace(0, 1, N)
prob = np.where(x<0.4, 0.02, np.where(x<0.7, 0.95, 0.02))
y = np.where(np.random.rand(N)<prob, 1, 0)

Here we create the line collection; sticks is an N×2×2 array containing the start and end points of our vertical lines

sticks = np.array(list(zip(zip(x, np.zeros(N)), zip(x, y))))
lc = LineCollection(sticks)
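If the nested zip is hard to read, the same array can probably be built with plain NumPy stacking. This is only a sketch of an equivalent construction: the step-shaped y below is placeholder data, and the names bottoms and tops are mine, not part of the answer.

```python
import numpy as np

N = 1001
x = np.linspace(0, 1, N)
y = (x > 0.5).astype(int)  # placeholder 0/1 data, only for illustration

# One (x, 0) point and one (x, y) point per stick
bottoms = np.column_stack([x, np.zeros(N)])
tops = np.column_stack([x, y])

# Pair them up along a new middle axis: shape (N, 2, 2)
sticks = np.stack([bottoms, tops], axis=1)
print(sticks.shape)
```

The resulting array can be passed to LineCollection in the same way as the zip-based one.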

Finally, the cumulative sum, here normalized to have the same scale as the vertical lines

cs = (y-0.5).cumsum()
csmin, csmax = cs.min(), cs.max()
cs = (cs-csmin)/(csmax-csmin) # normalized to the range 0 to 1
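Why the cumulative sum is useful: subtracting 0.5 turns every 0 into -0.5 and every 1 into +0.5, so the running total falls where 0s dominate and rises where 1s dominate. A dense region therefore begins at a local minimum and ends at a local maximum. A tiny hand-made array (my own toy data, not the question's) shows the pattern:

```python
import numpy as np

# Toy data: a run of four 1s surrounded by 0s
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])
cs = (y - 0.5).cumsum()
print(cs)

# In this toy case the global extremes coincide with the run's edges
lowest = int(cs.argmin())   # index of the last 0 before the run
highest = int(cs.argmax())  # index of the last 1 in the run
print(lowest, highest)
```

On longer data the global minimum and maximum can fall at the ends of the array instead, so there it is the local extremes that matter.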

We just have to plot our results

f, a = plt.subplots()
a.add_collection(lc)  # draw the vertical sticks
a.plot(x, cs, color='red')  # overlay the normalized cumulative sum
plt.show()

Here is the plot

and here is a detail of the stop zone.

The start of a dense region shows up as a local minimum of cs and its end as a local maximum. You can smooth the cs data and use something from scipy.optimize to locate these extrema. Should you have a problem with this last step, please ask another question.
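As one possible way to carry out that last step, here is a minimal sketch that swaps in scipy.signal.find_peaks (instead of scipy.optimize) and relies on its prominence filter rather than explicit smoothing. It regenerates the fake data so that it runs on its own. MIN_PROMINENCE is an assumed tuning value of mine that would need adjusting for real data, as might the 0.5 offset if the sparse and dense densities are not on either side of one half.

```python
import numpy as np
from scipy.signal import find_peaks

# Same fake data as above: one dense band for roughly 0.4 <= x < 0.7
N = 1001
np.random.seed(20190515)
x = np.linspace(0, 1, N)
prob = np.where(x < 0.4, 0.02, np.where(x < 0.7, 0.95, 0.02))
y = np.where(np.random.rand(N) < prob, 1, 0)

# Falls in sparse regions, rises in dense ones
cs = (y - 0.5).cumsum()

# Assumed threshold: ignore small wiggles, keep only large turning points
MIN_PROMINENCE = 50

# Local minima of cs (peaks of -cs) mark where a dense region begins,
# local maxima of cs mark where it ends
starts, _ = find_peaks(-cs, prominence=MIN_PROMINENCE)
stops, _ = find_peaks(cs, prominence=MIN_PROMINENCE)
print("start indices:", starts)
print("stop indices:", stops)
```

Each start index is the last sample before the climb begins, so the dense region itself starts one index later. With several dense regions, starts and stops should each contain one entry per region, provided every region changes cs by more than the chosen prominence.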