I have a column of data, some of whose values are NULL, from which I want to extract the single 90th percentile value:
ColA
-----
NULL
100
200
300
NULL
400
500
600
700
800
900
1000
For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.
Any suggestions?
If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:
select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,
             count(*) over (partition by NULL) as numrows
      from t
      where colA is not null
     ) t
I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, 17th, or whatever values you want.
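To make the arithmetic behind that query concrete, here is a small Python sketch (not T-SQL) of the same selection rule applied to the sample data from the question. The function name `percentile_by_row_ratio` and the integer `percent` parameter are my own for illustration; the comparison is cross-multiplied so that it stays in exact integer arithmetic.

```python
def percentile_by_row_ratio(values, percent):
    """Model of: max(case when rownum*1.0/numrows <= percent/100 then colA end).

    `percent` is an integer such as 90 (an illustrative parameter, not part
    of the original query).
    """
    # WHERE colA IS NOT NULL, then ORDER BY colA
    non_null = sorted(v for v in values if v is not None)
    numrows = len(non_null)

    # row_number() starts at 1. The test rownum/numrows <= percent/100 is
    # written as rownum*100 <= percent*numrows to avoid floating point.
    kept = [
        v
        for rownum, v in enumerate(non_null, start=1)
        if rownum * 100 <= percent * numrows
    ]

    # MAX over no qualifying rows is NULL in SQL; modelled here as None.
    return max(kept) if kept else None


col_a = [None, 100, 200, 300, None, 400, 500, 600, 700, 800, 900, 1000]

for pct in (90, 80, 50, 17, 5):
    print(pct, percentile_by_row_ratio(col_a, pct))
```

Because only the threshold changes between percentiles, several of them can be computed in one pass, which is the reason for keeping the condition in the SELECT list.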
WITH
  percentiles AS
(
  SELECT
    NTILE(100) OVER (ORDER BY ColA) AS percentile,
    *
  FROM
    data
)
SELECT
  *
FROM
  percentiles
WHERE
  percentile = 90
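Note: If the data has fewer than 100 rows, not every percentile will have a value. NTILE(100) numbers its buckets from 1 upward, so with the 12 sample rows (NULLs are not filtered out here) the buckets stop at 12 and percentile = 90 returns nothing. Equally, if you have more than 100 rows, some percentiles will contain more than one value.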
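The bucket sizes behind that note can be sketched in a few lines of Python (again a model, not T-SQL). It assumes the usual NTILE rule that when the row count does not divide evenly, the lower-numbered buckets each take one extra row. The helper name `ntile` is hypothetical.

```python
def ntile(row_count, buckets):
    """Return the bucket number assigned to each of `row_count` ordered rows."""
    # Every bucket gets `base` rows; the first `extra` buckets get one more.
    base, extra = divmod(row_count, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        assignment.extend([bucket] * size)
    return assignment


# 12 rows (the sample data including its two NULLs): buckets 1..12 only.
small = ntile(12, 100)
print(small)
print(90 in small)

# 250 rows: early buckets hold 3 rows each, later ones hold 2.
large = ntile(250, 100)
print(large.count(1), large.count(100))
```

So for small tables a filter such as `percentile = 90` can match no rows at all, and for large tables it matches a whole group of rows rather than a single value.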
Starting with SQL Server 2012, there are now PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions. These are (so far) only available as window functions, not as aggregate functions, so the result is repeated on every row and you have to remove the redundant copies yourself, e.g. by using DISTINCT or TOP 1:
WITH t AS (
  SELECT *
  FROM (
    VALUES(NULL),(100),(200),(300),
          (NULL),(400),(500),(600),(700),
          (800),(900),(1000)
  ) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t;
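The difference between the two functions is easiest to see on the sample data. The Python sketch below models the textbook definitions, and it is an illustration rather than SQL Server's implementation. The discrete version returns the smallest value whose cumulative distribution reaches the requested fraction. The continuous version interpolates linearly at row position 1 + p * (n - 1). Both skip NULLs. The function names mirror the SQL ones, and passing `p` as a string is my own choice to keep the arithmetic exact.

```python
import math
from fractions import Fraction


def percentile_disc(values, p):
    """Smallest non-NULL value whose cumulative distribution is >= p."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    # 1-based rank of the first row with rank / n >= p
    rank = max(1, math.ceil(Fraction(p) * len(data)))
    return data[rank - 1]


def percentile_cont(values, p):
    """Linear interpolation between the two rows around the target position."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    # Zero-based fractional position, i.e. (1 + p * (n - 1)) - 1
    position = Fraction(p) * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return float(data[lower] + weight * (data[upper] - data[lower]))


col_a = [None, 100, 200, 300, None, 400, 500, 600, 700, 800, 900, 1000]

print(percentile_disc(col_a, "0.9"))  # always an actual value from the column
print(percentile_cont(col_a, "0.9"))  # may fall between two column values
```

For the question as asked (return 900 for the 90th percentile), the discrete variant is the one that matches. The continuous variant gives an interpolated figure that need not appear in the column.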
I have blogged about percentiles in more detail here.