I need to read a long file of timestamps (in seconds) and plot the CDF using numpy or scipy. I tried with numpy, but the output is not what it is supposed to be. The code is below. Any suggestions are appreciated.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)
cumulative = np.cumsum(sorted_data)
plt.plot(cumulative)
plt.show()
As a quick answer,
plt.plot(sorted_data, np.linspace(0, 1, sorted_data.size))
should have got you what you wanted
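To see why the original plot looks wrong: `np.cumsum(sorted_data)` adds up the timestamp values themselves, whereas a CDF needs the fraction of samples at or below each value. Here is a minimal sketch of the difference, using a small made-up array in place of `Filename.txt` (the sample values are an assumption for illustration only):

```python
import numpy as np

# Hypothetical sample standing in for np.loadtxt('Filename.txt').
data = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.5])

sorted_data = np.sort(data)

# What the question computed: a running sum of the values, not a probability.
running_sum = np.cumsum(sorted_data)

# What the quick answer puts on the y-axis: evenly spaced levels from 0 to 1,
# one per sorted sample.
cdf_levels = np.linspace(0, 1, sorted_data.size)

print(running_sum[-1])  # total of all values, generally nowhere near 1
print(cdf_levels[-1])   # last level is 1.0
```

Passing `sorted_data` as x and `cdf_levels` as y to `plt.plot` then gives a curve that rises from 0 to 1.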
You have two options:
1: You can bin the data first. This can be done easily with the numpy.histogram function.
2: Rather than use numpy.cumsum, just plot the sorted_data array against the number of items smaller than each element in the array (see this answer for more details: https://stackoverflow.com/a/11692365/588071).
Here's an implementation that's a bit more efficient if there are many repeated values (since we only have to sort the unique values). It also plots the CDF as a step function, which, strictly speaking, it is.
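The original code for this answer is not shown here. As a rough sketch of the idea, one way to collapse repeated values is `np.unique(..., return_counts=True)`; this may differ from what the answer actually used. The helper name `ecdf_steps` and the sample values are hypothetical:

```python
import numpy as np

def ecdf_steps(sample):
    """Return (values, cumulative_probabilities) for the empirical CDF.

    `ecdf_steps` is a hypothetical helper name, not a NumPy function.
    """
    sample = np.asarray(sample, dtype=float).ravel()
    # np.unique returns the sorted distinct values and how often each occurs.
    values, counts = np.unique(sample, return_counts=True)
    # Fraction of samples less than or equal to each distinct value.
    cum_prob = np.cumsum(counts) / sample.size
    return values, cum_prob

# Made-up sample with repeated values.
values, cum_prob = ecdf_steps([2, 1, 2, 3, 2, 1])
print(values)
print(cum_prob)
```

To draw the staircase, something like `plt.step(values, cum_prob, where='post')` should work, since the empirical CDF jumps at each distinct value and stays flat until the next one.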
For completeness, you should also consider:
You can use numpy.histogram, setting the bin edges in such a way that each bin collects all the occurrences of only one point. You should keep density=False, because according to the documentation, with density=True the sum of the histogram values will not be equal to 1 unless bins of unit width are chosen. You can instead normalize the number of elements in each bin by dividing it by the size of your data.
As an example, with the following data:
you would get:
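The data and output from the original answer are not reproduced here. As a minimal sketch of the approach, assuming made-up sample values and one bin edge per distinct value (plus one extra edge past the maximum), it might look like this:

```python
import numpy as np

# Made-up sample values for illustration only.
data = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 5.0, 5.0])

points = np.unique(data)
# One edge per distinct value plus one extra edge past the maximum, so each
# bin holds exactly one distinct value (the last bin includes its right edge).
edges = np.append(points, points[-1] + 1.0)

# density=False returns raw counts per bin.
counts, _ = np.histogram(data, bins=edges, density=False)

# Normalize by the sample size, then accumulate to get the CDF.
cdf = np.cumsum(counts / data.size)

print(points)
print(cdf)
```

Each entry of `cdf` is the fraction of samples less than or equal to the matching entry of `points`, so the last entry is 1.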
You can also interpolate the CDF in order to get a continuous function (with either linear interpolation or a cubic spline):
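A hedged sketch of both variants, assuming made-up sample values, `np.interp` for the linear case, and `scipy.interpolate.CubicSpline` for the cubic case (the original answer may have used different functions):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Made-up sample values for illustration only.
data = np.array([0.5, 1.2, 1.9, 2.4, 3.1, 4.0, 4.6, 5.3])

# Empirical CDF on the distinct sorted values; CubicSpline needs x to be
# strictly increasing, which np.unique guarantees.
x, counts = np.unique(data, return_counts=True)
y = np.cumsum(counts) / data.size

# Evaluation grid spanning the observed range.
grid = np.linspace(x[0], x[-1], 200)

# Linear interpolation between the empirical CDF points.
cdf_linear = np.interp(grid, x, y)

# Cubic spline through the same points. A cubic spline is not guaranteed to
# be monotonic, so it can overshoot slightly between points.
cdf_cubic = CubicSpline(x, y)(grid)
```

If monotonicity matters, `scipy.interpolate.PchipInterpolator` is a shape-preserving alternative to the plain cubic spline.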
The following are the steps of my implementation:
1. Sort your data.
2. Calculate the cumulative probability of every 'x'.
Example:
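The original example code is not shown here. A minimal sketch of the two steps, assuming the common convention that the i-th smallest of n values gets probability i / n (the helper name `empirical_cdf` and the sample values are hypothetical):

```python
import numpy as np

def empirical_cdf(sample):
    """Return sorted values and the cumulative probability of each one.

    `empirical_cdf` is a hypothetical helper name used for illustration.
    """
    # Step 1: sort the data.
    x = np.sort(np.asarray(sample, dtype=float).ravel())
    # Step 2: the i-th smallest value (1-based) gets probability i / n.
    p = np.arange(1, x.size + 1) / x.size
    return x, p

# Made-up sample values.
x, p = empirical_cdf([12.0, 7.0, 3.0, 9.0])

# Probability that a sample is <= 8, found by counting sorted values <= 8.
prob_le_8 = np.searchsorted(x, 8.0, side="right") / x.size
```

Plotting `p` against `x` gives the CDF curve, and `np.searchsorted` on the sorted array is one way to read off the cumulative probability at an arbitrary threshold.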
Figure: The link of graph