I have a sample table like this:
CREATE TABLE #TEMP(Category VARCHAR(100), Name VARCHAR(100))
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Bucky')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
SELECT Category, Name, COUNT(Name) Total
FROM #TEMP
GROUP BY Category, Name
ORDER BY Category, Total DESC
DROP TABLE #TEMP
Gives me the following:
A John 6
A Adam 4
A Lisa 2
A Bucky 1
B Lily 5
B Tom 4
B Ross 3
Now, how do I select the TOP 5 PERCENT
records from each category assuming each category has more than 100 records (did not show in sample table here)? For instance, in my actual table, it should remove the John
record from A
and Lily
record from B
as appropriate (again, I did not show the full table here) to get:
A Adam 4
A Lisa 2
A Bucky 1
B Tom 4
B Ross 3
I have been trying to use CTE
s and PARTITION BY
clauses but cannot seem to achieve what I want. It removes the TOP 5 PERCENT from the overall result but not from each category. Any suggestions?
You could use a CTE (Common Table Expression) paired with the NTILE
windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).
;WITH SlicedData AS
(
SELECT Category, Name, COUNT(Name) Total,
NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS 'NTile'
FROM #TEMP
GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1
This basically groups your data by Category,Name
, orders by something else (not sure if COUNT(Name)
is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1
is the top 5% slice - just ignore that when selecting from the CTE.
See:
- MSDN docs on NTILE
- SQL Server 2005 ranking functions
- SQL SERVER – 2005 – Sample Example of RANKING Functions – ROW_NUMBER, RANK, DENSE_RANK, NTILE
for more info
Edit: I have added the second solution
SELECT b.Id
,b.Category
,b.Name
,b.CategoryNameCount
FROM
(
SELECT a.Id
,a.Category
,a.Name
,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
FROM #TEMP a
) b
WHERE b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name
Results:
Id Category Name CategoryNameCount
----------- -------- ---------- -----------------
7 A Adam 4
8 A Adam 4
9 A Adam 4
10 A Adam 4
11 A Lisa 2
12 A Lisa 2
13 A Bucky 1
19 B Tom 4
20 B Tom 4
21 B Tom 4
22 B Tom 4
23 B Ross 3
24 B Ross 3
25 B Ross 3
or
SELECT b.Category, b.Name, b.CategoryNameCount
FROM
(
SELECT
a.Category
,a.Name
,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
FROM #TEMP a
) b
WHERE b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
GROUP BY b.Category, b.Name, b.CategoryNameCount
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name
Results:
Category Name CategoryNameCount
-------- ---------- -----------------
A Adam 4
A Lisa 2
A Bucky 1
B Tom 4
B Ross 3
select Category,name,CountTotal,RankSeq,(50*CountTotal)/100 from (
select Category,name,COUNT(*)
over (partition by Category,name ) as CountTotal,
ROW_NUMBER()
over (partition by Category,name order by Category) RankSeq from #TEMP
--group by Category,Name
) temp
where RankSeq <= ((50*CountTotal)/100)
order by Category,Name,RankSeq
Output:
Category name CountTotal RankSeq 50*CountTotal)/100
A Adam 4 1 2
A Adam 4 2 2
A John 6 1 3
A John 6 2 3
A John 6 3 3
A Lisa 2 1 1
B Lily 5 1 2
B Lily 5 2 2
B Ross 3 1 1
B Tom 4 1 2
B Tom 4 2 2
I hope this helps :)
;WITH SlicedData AS
(
SELECT Category, Name, COUNT(Name) Total,
**PERCENT_RANK() OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) * 100** AS 'Percent'
FROM #TEMP
GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE Percent < 5
NTile will not work if number of records is less than your tile number.