I need to pull a huge amount of data, say 600-700 variables from different tables in a data warehouse. The dataset in its raw form will easily reach 150 GB (79 MM rows), and for my analysis I only need a million rows. How can I pull the data using PROC SQL directly from the warehouse, doing simple random sampling on the rows?
The code below won't work, as RANUNI is not supported by Oracle:
proc sql outobs=1000000;
select * from connection to oracle (
select * from tbl1 order by ranuni(12345)
);
quit;
How do you propose I do it?
Use the DBMS_RANDOM Package to Sort Records, Then Use a Row-Limiting Clause to Restrict to the Desired Sample Size
The dbms_random.value function generates a random number between 0 and 1 for each row in the table, and we sort in ascending order of that random value.
Here is how to produce the sample set you identified:
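A minimal sketch of the pass-through version, assuming the source table is tbl1 and using placeholder macro variables for the connection details:

proc sql;
connect to oracle (user=&user password=&pwd path=&path); /* hypothetical connection details */
create table work.sample as
select * from connection to oracle (
select *
from (
  /* dbms_random.value attaches a random number to every row; sort by it */
  select * from tbl1 order by dbms_random.value
)
where rownum <= 1000000 /* keep the first million rows of the shuffled order */
);
disconnect from oracle;
quit;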
To demonstrate with the sample schema table emp, we sample 4 records:
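A sketch of the demonstration query, assuming the standard scott.emp demo table:

select empno, ename
from (
  /* shuffle the table by sorting on a per-row random value */
  select empno, ename from emp order by dbms_random.value
)
where rownum <= 4;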
With the example above, notice that the empno values change significantly across repeated executions of the SQL*Plus command. Performance might be an issue with the row counts you are describing.
EDIT:
With a table size on the order of 150 GB / 79 MM rows, any sorting would be painful.
If the table had a surrogate key based on a sequence incremented by 1, we could take the approach of selecting every nth record based on the key.
e.g.
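A sketch, assuming the surrogate key column is named id and the sequence has no large gaps; MOD(id, 79) keeps every 79th key value, about a million of the 79 MM rows:

select *
from tbl1
where mod(id, 79) = 0; /* every 79th key value: ~1 MM of 79 MM rows */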
This approach would not use an index (unless a function-based index is created), but at least we are not performing a sort on a data set of this size.
I ran an explain plan against a table with close to 80 million records; it does perform a full table scan (the MOD condition forces this without a function-based index), but this looks tenable.
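For reference, a sketch of the function-based index mentioned above (tbl1_mod_idx is a hypothetical name, and building it for a one-off sample may not be worth the cost):

create index tbl1_mod_idx on tbl1 (mod(id, 79));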
None of the answers or comments posted helped my cause. They might have otherwise, but we have 87 MM rows.
In the end I solved it with the help of SAS. Here is what I did, and it works. Thanks all!
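The code itself was not included in the post. As a hedged sketch, one SAS-side approach matching this description is a single-pass DATA step that keeps each row with probability ~1,000,000 / 87,000,000, avoiding any sort (this assumes a libname dw already assigned to the warehouse and a table named tbl1):

data work.sample;
  set dw.tbl1;
  /* keep each row with probability ~1MM/87MM; single pass, no sort required */
  if ranuni(12345) < 1000000/87000000;
run;

This yields approximately, not exactly, a million rows; if an exact count is required, PROC SURVEYSELECT with METHOD=SRS and SAMPSIZE=1000000 can draw an exact simple random sample instead.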