I need a theoretical answer.
Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2-Blu).
You are using SAS and you will do some statistics by using Proc Summary
like min/max/mean values, standard deviation value and percentile-10, percentile-90 through your peer-groups.
For instance, you have 30.000 peer-groups and you have 50.000 values in each peer group (Total 1.5 billions values).
The other case you have 3 million peer-groups and also you have 50 values in each peer-group. So you have total 1.5 billion values again.
Would it go faster if you have less peer groups but more values in each peer-group? Or would it go faster with more peer-groups but less less values in each peer-group.
I could test the first case (30.000 peer-groups and 50.000 values per peer group) and it took around 16 mins. But I can't test for the second case.
Can you write an approximate prognose for run-time in case when I have 3 million peer-groups and also 50 values in each peer-group?
One more dimension for the question. Would it be faster to do those statistics if I use Proc SQL
instead?
Example code is below:
proc summary data = table_blu missing chartype;
class var1 var2; /* Var1 and var2 are toghether peer-group */
var values;
output out = stattable(rename = (_type_ = type) drop = _freq_)
n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95
;
run;
So there are a number of things to think about here.
The first point and quite possibly the largest in terms of performance is getting the data from DB2 into SAS. (I'm assuming this is not an in database instance of SAS -- correct me if it is). That's a big table and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
So assuming you've downloaded the table to the SAS server:
A table sorted by the CLASS
variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group, it can do block reads instead of random IO.
If the table is not sorted, then the larger the number of groups, then more table scans that have to occur.
The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.
From there, you get into a memory and cpu issue. PROC SUMMARY is multithreaded and SAS will read N groups at a time. If group size can fit into the memory allocated for that thread, you won't have an issue. If the group size is too large, then SAS will have to page.
I scaled down the problem to a 15M row example:
%let grps=3000;
%let pergrp=5000;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 20.88 seconds
cpu time 31.71 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.44 seconds
cpu time 11.26 seconds
=============================
%let grps=300000;
%let pergrp=50;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 19.26 seconds
cpu time 41.35 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.43 seconds
cpu time 10.09 seconds
I ran these a few times and the run times were similar. Sorted times are about equal and way faster.
The more groups / less per group was faster unsorted, but look at the total CPU usage, it is higher. My laptop has an extremely fast SSD so IO was probably not the limiting factor -- the HD was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could be different.
In the end, it depends too much on how the data is structured and the specifics of your server and DB.
Not a theoretical answer but still relevant IMO...
To speed up your proc summary
on large tables add the / groupinternal
option to your class statement. Of course, assuming you don't want the variables formatted prior to being grouped.
e.g:
class age / groupinternal;
This tells SAS that it doesn't need to apply a format to the value prior to calculating what class to group the value into. Every value will have a format applied to it even if you have not specified one explicitly. This doesn't make a large difference on small tables, but on large tables it can.
From this simple test, it reduces the time from 60 seconds on my machine to 40 seconds (YMMV):
data test;
set sashelp.class;
do i = 1 to 10000000;
output;
end;
run;
proc summary data=test noprint nway missing;
class age / groupinternal;
var height;
output out=smry mean=;
run;