simple random sampling while pulling data from war

2019-08-24 12:09发布

问题:

I need to pull humongous amount of data, say 600-700 variables from different tables in a data warehouse...now the dataset in its raw form will easily touch 150 gigs - 79 MM rows and for my analysis purpose I need only a million rows...how can I pull data using proc sql directly from warehouse by doing simple random sampling on the rows.

Below code wont work as ranuni is not supported by oracle

    proc sql outobs =1000000;
    select * from connection to oracle(
    select * from tbl1 order by ranuni(12345);
    quit;

How do you propose I do it

回答1:

Use the DBMS_RANDOM Package to Sort Records and Then Use A Row Limiting Clause to Restrict to the Desired Sample Size

The dbms_random.value function obtains a random number between 0 and 1 for all rows in the table and we sort in ascending order of the random value.

Here is how to produce the sample set you identified:

    SELECT
    *
FROM
    (
        SELECT
            *
        FROM
            tbl1
        ORDER BY dbms_random.value
    )
FETCH FIRST 1000000 ROWS ONLY;

To demonstrate with the sample schema table, emp, we sample 4 records:

   SCOTT@DEV> SELECT
  2      empno,
  3      rnd_val
  4  FROM
  5      (
  6          SELECT
  7              empno,
  8              dbms_random.value rnd_val
  9          FROM
 10              emp
 11          ORDER BY rnd_val
 12      )
 13  FETCH FIRST 4 ROWS ONLY;
EMPNO  RND_VAL
7698   0.06857749035643605682648168347885993709
7934   0.07529612360785920635181751566833986766
7902   0.13618520865865754766175030040204331697
7654   0.14056380246495282237607922497308953768


SCOTT@DEV> SELECT
  2      empno,
  3      rnd_val
  4  FROM
  5      (
  6          SELECT
  7              empno,
  8              dbms_random.value rnd_val
  9          FROM
 10              emp
 11          ORDER BY rnd_val
 12      )
 13  FETCH FIRST 4 ROWS ONLY;
EMPNO  RND_VAL
7839   0.00430658806761508024693197916281775492
7499   0.02188116061148367312927392115186317884
7782   0.10606515700372416131060633064729870016
7788   0.27865276349549877512032787966777990909

With the example above, notice that the empno changes significantly during the execution of the SQL*Plus command.

The performance might be an issue with the row counts you are describing.


EDIT:

With table sizes in the order of 150 gigs - 79 MM, any sorting would be painful.

If the table had a surrogate key based on a sequence incremented by 1, we could take the approach of selecting every nth record based on the key.

e.g.

    --scenario n = 3000

 FROM
    tbl1
WHERE
    mod(table_id, 3000) = 0;

This approach would not use an index (unless a function based index is created), but at least we are not performing a sort on a data set of this size.

I performed an explain plan with a table that has close to 80 million records and it does perform a full table scan (the condition forces this without a function based index) but this looks tenable.



回答2:

None of the answers posted or comments helped my cause, it could but we have 87 MM rows

Now I wanted the answer with the help of sas: here is what I did: and it works. Thanks all!

    libname dwh path username pwd;
    proc sql;
    create table sample as
    (select 
     <all the variables>, ranuni(any arbitrary seed)
     from dwh.<all the tables>
     <bunch of where conditions goes here>);
     quit);