I wish to know, for a given predicted commute duration in minutes, the range of actual commute times I might expect. For example, if Google Maps predicts my commute to be 20 minutes, what are the minimum and maximum commute times I should expect (perhaps a 95% range)?
Let's import my data into pandas:
%matplotlib inline
import pandas as pd
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
commutes.tail()
This gives:
I can easily create a plot that shows the scatter of raw data, a linear regression line, and the 95% confidence interval on that line:
import seaborn as sns
# Create a linear model plot
sns.lmplot(x='prediction', y='duration', data=commutes);
How do I now calculate and plot the 95% range of actual commute times versus predicted times?
Put another way, if Google Maps predicts my commute to take 20 minutes, it looks like it could actually take anywhere between something like 14 and 28 minutes. It would be great to calculate or plot this range.
Thanks in advance for any help.
The relationship between actual duration of a commute and the prediction should be linear, so I can use quantile regression:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Import data
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
# Create the quantile regression model
model = smf.quantreg('duration ~ prediction', commutes)
# Create a list of quantiles to calculate
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
# Create a list of fits
fits = [model.fit(q=q) for q in quantiles]
# Create a new figure and axes
figure, axes = plt.subplots()
# Plot the scatter of data points
x = commutes['prediction']
axes.scatter(x, commutes['duration'], alpha=0.4)
# Create an array of predictions from the minimum to maximum to create the regression line
_x = np.linspace(x.min(), x.max())
for quantile, fit in zip(quantiles, fits):
    # Plot the quantile lines
    _y = fit.params['prediction'] * _x + fit.params['Intercept']
    axes.plot(_x, _y, label=quantile)
# Plot the line of perfect prediction
axes.plot(_x, _x, 'g--', label='Perfect prediction')
axes.legend()
axes.set_xlabel('Predicted duration (minutes)')
axes.set_ylabel('Actual duration (minutes)');
This gives:
Many thanks to my colleague Philip for the quantile regression tip.
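To get the numeric range for a single prediction, such as the 20 minute example in the question, the fitted quantile lines can be evaluated directly. The following is only a rough sketch: the data is a synthetic stand-in rather than the real CSV, and `commute_range` is a hypothetical helper name. Note that the 0.05 and 0.95 quantiles bound a central 90% range, so the sketch uses 0.025 and 0.975 for a central 95% range.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the commutes data (hypothetical values, not the real CSV)
rng = np.random.default_rng(0)
prediction = rng.uniform(15, 35, size=200)
duration = prediction + rng.normal(loc=2, scale=3, size=200)
commutes = pd.DataFrame({'prediction': prediction, 'duration': duration})

model = smf.quantreg('duration ~ prediction', commutes)


def commute_range(predicted_minutes, lower_q=0.025, upper_q=0.975):
    """Return (low, high) actual durations for one predicted duration."""
    bounds = []
    for q in (lower_q, upper_q):
        # Fit one quantile line and evaluate it at the predicted duration
        params = model.fit(q=q).params
        bounds.append(params['Intercept'] + params['prediction'] * predicted_minutes)
    return tuple(bounds)


low, high = commute_range(20)
print(f'Predicted 20 min: expect roughly {low:.1f} to {high:.1f} min')
```

With the real data, replace the synthetic DataFrame with the one loaded from the CSV. Extreme quantiles are estimated from few points, so the bounds will be noisy for small data sets.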
You could fit a Gaussian distribution to your data. About 95% of values fall within 2 standard deviations (more precisely 1.96) of the mean, and about 99.7% fall within 3 standard deviations.
Look up the normal distribution.
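A minimal sketch of the normal-distribution approach, assuming the prediction error (actual minus predicted) is roughly Gaussian and does not depend on the size of the prediction. The sample values and the `normal_range` helper are hypothetical, for illustration only. A central 95% range corresponds to about 1.96 standard deviations either side of the mean.

```python
from statistics import NormalDist

# Hypothetical (predicted, actual) commute durations in minutes
samples = [(18, 20), (20, 23), (22, 21), (25, 29),
           (19, 22), (21, 26), (24, 25), (20, 19)]
errors = [actual - predicted for predicted, actual in samples]

# Fit a normal distribution to the prediction errors
error_dist = NormalDist.from_samples(errors)

# z-score that bounds the central 95% of a normal distribution (about 1.96)
z = NormalDist().inv_cdf(0.975)


def normal_range(predicted_minutes, dist=error_dist):
    """Return the (low, high) central 95% range of actual durations."""
    centre = predicted_minutes + dist.mean
    return centre - z * dist.stdev, centre + z * dist.stdev


low, high = normal_range(20)
print(f'Predicted 20 min: expect roughly {low:.1f} to {high:.1f} min')
```

Unlike quantile regression, this gives a band of constant width, so it will mislead if longer predicted commutes are also more variable.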