Getting data for histogram plot

2019-01-07 02:36发布

Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

select total, count(total) from faults GROUP BY total;

The data that is being generated is good enough but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL?

Example:

+-------+--------------+
| total | count(total) |
+-------+--------------+
|    30 |            1 | 
|    31 |            2 | 
|    33 |            1 | 
|    34 |            3 | 
|    35 |            2 | 
|    36 |            6 | 
|    37 |            3 | 
|    38 |            2 | 
|    41 |            1 | 
|    42 |            5 | 
|    43 |            1 | 
|    44 |            7 | 
|    45 |            4 | 
|    46 |            3 | 
|    47 |            2 | 
|    49 |            3 | 
|    50 |            2 | 
|    51 |            3 | 
|    52 |            4 | 
|    53 |            2 | 
|    54 |            1 | 
|    55 |            3 | 
|    56 |            4 | 
|    57 |            4 | 
|    58 |            2 | 
|    59 |            2 | 
|    60 |            4 | 
|    61 |            1 | 
|    63 |            2 | 
|    64 |            5 | 
|    65 |            2 | 
|    66 |            3 | 
|    67 |            5 | 
|    68 |            5 | 
------------------------

What I am looking for:

+------------+---------------+
| total      | count(total)  |
+------------+---------------+
|    30 - 40 |            23 | 
|    40 - 50 |            15 | 
|    50 - 60 |            51 | 
|    60 - 70 |            45 | 
------------------------------

I guess this cannot be achieved in a straight forward manner but a reference to any related stored procedure would be fine as well.

10条回答
女痞
2楼-- · 2019-01-07 03:12

Ofri Raviv's answer is very close but incorrect. The count(*) will be 1 even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum:

SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
  LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;
查看更多
淡お忘
3楼-- · 2019-01-07 03:13

Mike DelGaudio's answer is the way I do it, but with a slight change:

select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1

The advantage? You can make the bins as large or as small as you want. Bins of size 100? floor(mycol/100)*100. Bins of size 5? floor(mycol/5)*5.

Bernardo.

查看更多
Juvenile、少年°
4楼-- · 2019-01-07 03:16

This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.

There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise. This is how you do it:

SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;

Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.

numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.

e.g. with ROUND(numeric_value,-1), numeric_value in range [0,4] (5 elements) will be placed in first bucket, while [5,14] (10 elements) in second, [15,24] in third, unless numeric_value is offset appropriately via ROUND(numeric_value - 5, -1).

This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.

+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+

Some notes: Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html

查看更多
Fickle 薄情
5楼-- · 2019-01-07 03:16
select "30-34" as TotalRange,count(total) as Count from table_name
   where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name 
   where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
   where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
   where total between 45 and 49)
etc ....

As long as there are not too many intervals, this is a pretty good solution.

查看更多
登录 后发表回答