I have the following SQL table -
Date StoreNo Sales
23/4 34 4323.00
23/4 23 564.00
24/4 34 2345.00
etc
I am running a query that returns average sales, max sales and min sales for a certain period -
select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc
But there are some values coming through in the min and max that are really extreme - perhaps because the data entry was bad, perhaps because some anomoly had occurred on that date and store.
What I'd like is a query that returns average, max and min, but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x std devs of the true average).
Many thanks
In order to calculate the standard deviation, you need to iterate through all of the elements, so it would be impossible to do this in one query. The lazy way would be to just do it in two passes:
Another simple option that might work (fairly common in analysis of scientific data) would be to just drop the minimum and maximum x values, which works if you have a lot of data to process. You can use
ROW_NUMBER
to do this in one statement:Replace
ROW_NUMBER
withRANK
orDENSE_RANK
if you want to discard a certain number of unique values.Beyond these simple tricks you start to get into some pretty heavy stats. I have to deal with similar kinds of validation and it's far too much material for a SO post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!
Maybe what you're looking for are percentiles.
Standard deviation tends to be sensitive to outliers, since it's calculated using the square of the difference between a value and the mean.
Maybe a more robust, less sensitive measure like absolute value of difference between a value and the mean would be more appropriate in your case.
Expanding on DuffyMo's post you could do something like
This will exclude the lowest 5% and highest 95%. If you have numbers that vary wildly, you may find that the Average isn't a quality summary statistic and should consider using median. You can do that by doing something like: