What exactly is the/this data statement in SAS doi

2019-06-13 14:33发布

问题:

I'm converting a SAS script to Python for a PostgreSQL environment. In a few places I've found a data statement in SAS, which looks something like this (in multiple scripts):

data dups;
  set picc;
  by btn wtn resp_ji;
if not (first.resp_ji and last.resp_ji);
run;

Obviously datasets aren't the same in python or SQL environments, and I'm having trouble determining what this specific statement is doing. To be clear, there are a number of scripts being converted which create a dataset in this manner with this same name. So my expectation would be that most of these would be overwritten over and over.

I'm also unclear as to what the postgres equivalent to the condition in the data dups statement would be.

Is there an obvious PostgreSQL statement that would work in its place? Something like this?:

CREATE TABLE dups AS
  SELECT btn, wtn, resp_ji
  WHERE /*some condition that matches the condition in the data statement*/

Does the

by btn wtn respji;

statement mean which columns are copied over, or is that the equivalent of an ORDER BY clause in PostgreSQL?

Thanks.

回答1:

The statement is using what's called 'by group processing'. Before the step can run, it requires that the data is sorted by btn wtn resp_ji.

The first.resp_ji piece is checking to see if it's the first time it's seen the current value of resp_ji within the current btn/wtn combination. Likewise the last.resp_ji piece is checking if it's the final time that it will see the current value of resp_ji within the current btn/wtn combination.

Combining it all together the statement:

if not (first.resp_ji and last.resp_ji);

Is saying, if the current value of resp_ji occurs multiple times for the current combination of btn/wtn then keep the record, otherwise discard the record. The behaviour of the if statement when used like that implicitly keeps/discards the record.

To do the equivalent in SQL, you could do something like:

  1. Find all records to discard.
  2. Discard those records from the original dataset.

So...

create table rows_to_discard as 
select btn, wtn, resp_ji, count(*) as freq
from mytable
group by btn, wtn, resp_ji
having count(*) = 1

create table want as 
select a.*
from mytable a
left join rows_to_discard b  on b.btn = a.btn
                            and b.wtn = a.wtn
                            and b.resp_ji = a.resp_ji
where b.btn is null

EDIT : I should mention that there is no simple SQL equivalent. It may be possible by numbering rows in subqueries, and then building logic on top of that but it'd be ugh-ly. It may also depend on the specific flavour of SQL being used.



回答2:

As someone that learned SAS before postgressql, I found the following much more similar to SAS first. last. logic:

--first.

select distinct on (resp_ji) from <table> order by resp_ji 

--last.

select distinct on (resp_ji) from <table> order by resp_ji desc 


回答3:

A way to detect duplicates (when no extra differentiating field is available) is to use the ctid as tie-breaker:

CREATE TABLE dups
AS
SELECT * FROM pics p
WHERE EXISTS ( SELECT * FROM pics x
   WHERE x.btn = p.btn
     AND x.wtn = p.wtn
     AND x.resp_ji = p.resp_ji
     AND x.ctid <> p.ctid
     );