Powershell: Sorting/Removing Duplicates in a CSV f

Posted 2020-07-26 10:58

Question:

First of all, I'm very new to PowerShell, and I would like to thank all the participants of this site for helping me by providing answers across different forums! I've accomplished a lot in a short time because of this site!

Here is the issue, and I'll do my best to explain. I have a CSV file used to create student accounts. Our student management system produces a record each time a student enrolls, is altered, or exits a program. If a student "tries out" a few different programs, they will have multiple records in the CSV file. My goal is to sort the CSV file by UserID (the UserID never changes) and by CurrentStatusDate (the date the record was created), using this command:

Import-CSV "C:\students.csv" | sort UserID,CurrentStatusDate

Sample of CSV records:

"UserID","AccountStatus","PersonID","PIN","FirstName","LastName","IDEXPIRY","Term","Role","Course","SectionName","locationDescription","Location","CurrentStatusDate"
"aboggs","Add","xxxxxxx","xxxxxxx","Ashley","Baggs","5/11/2013","xxxxxx","Student","Accounting Technology","xxxxxx","xxxxxx","xxxxxx","9/12/2011"
"aboutilier","Add","xxxxxxx","xxxxxxx","Amelia","Boutilier","5/3/2012","xxxxxx","Student","Adult Education","xxxxxx","xxxxxx","xxxxxx","11/15/2011"
"abowtle","Delete","xxxxxxx","xxxxxxx","Aleisha","Bowtle","7/31/2013","xxxxxx","Student","Business Administration","xxxxxx","xxxxxx","xxxxxx","2/1/2011"
"abowtle","Add","xxxxxxx","xxxxxxx","Aleisha","Bowtle","7/31/2012","xxxxxx","Student","General Studies","xxxxxx","xxxxxx","xxxxxx","9/9/2011"
"abradley","Delete","xxxxxxx","xxxxxxx","Anna","Bradley","10/25/2011","xxxxxx","Student","Adult Education","xxxxxx","xxxxxx","xxxxxx","11/17/2011"
"abridges","Delete","xxxxxxx","xxxxxxx","Ashley","Bridges","10/5/2011","xxxxxx","Student","Adult Education","xxxxxx","xxxxxx","xxxxxx","11/15/2011"
"abrown10165","Add","xxxxxxx","xxxxxxx","Adam","Brown","10/28/2011","xxxxxx","Student","Advanced Firefighting STCW VI/3","xxxxxx","xxxxxx","xxxxxx","10/24/2011"
"abrown10165","Add","xxxxxxx","xxxxxxx","Adam","Brown","12/16/2011","xxxxxx","Student","Simulated Electronic Navigation Level 1, Part B","xxxxxx","xxxxxx","xxxxxx","11/10/2011"
"abrown8081","Add","xxxxxxx","xxxxxxx","Alex","Brown","5/25/2013","xxxxxx","Student","Culinary Arts","xxxxxx","xxxxxx","xxxxxx","9/6/2011"
"abrown8950","Delete","xxxxxxx","xxxxxxx","Ashley","Brown","9/13/2012","xxxxxx","Student","Medical Support Services","xxxxxx","xxxxxx","xxxxxx","9/14/2011"
"acameron2637","Delete","xxxxxxx","xxxxxxx","Anne","Cameron","10/14/2011","xxxxxx","Student","Adult Education","xxxxxx","xxxxxx","xxxxxx","10/14/2011"
"acameron4368","Add","xxxxxxx","xxxxxxx","Amanda","Cameron","4/20/2013","xxxxxx","Student","Applied Degree in Culinary Operations","xxxxxx","xxxxxx","xxxxxx","10/12/2011"
"acampbell10266","Add","xxxxxxx","xxxxxxx","Amanda","Campbell","5/4/2012","xxxxxx","Student","Adult Education","xxxxxx","xxxxxx","xxxxxx","11/7/2011"
"acampbell6499","Delete","xxxxxxx","xxxxxxx","Aaron","Campbell","10/31/2012","xxxxxx","Student","Retail Business Management","xxxxxx","xxxxxx","xxxxxx","11/1/2011"
"acampbell6499","Add","xxxxxxx","xxxxxxx","Aaron","Campbell","12/13/2011","xxxxxx","Student","Complete the Accounting Cycle - Part II","xxxxxx","xxxxxx","xxxxxx","9/26/2011"

This should group all the records with the same UserID, then sort them by date created. I then want to remove the duplicates and retain only the last record created. I'm familiar with -Unique, but it doesn't help with the command above, since it would only remove records that have the same UserID and the same CurrentStatusDate.

I've been "Google-ing" and banging my head for 2 days... starting to think there is no "easy" answer, but my programming-fu is weak... Just looking for a "nudge" in the right direction.

Thanks!

Chris

Answer 1:

As Andy stated, it's a little hard given we don't have a sample of the CSV format. However, I'm thinking that something like the below is what you're looking for:

Import-CSV "C:\students.csv" | Group-Object userid | foreach-object { $_.group | sort-object currentstatusdate | select -last 1}

Just as you describe: we group by UserID, then sort each group by CurrentStatusDate, then select the most recent record. I'm not sure how CurrentStatusDate is formatted, so I don't know if a straight sort-object will be good enough.
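
With the M/D/YYYY dates shown in the question's sample, a plain Sort-Object compares CurrentStatusDate as text, so "11/1/2011" would sort before "9/26/2011". A minimal sketch of the same grouping with the sort key cast to [datetime] follows. It uses a few rows from the sample, trimmed to three columns, and parses them with ConvertFrom-Csv so it runs without the real file. It assumes every date value parses as [datetime].

```powershell
# A few rows from the question's sample, trimmed to three columns.
$rows = @'
"UserID","AccountStatus","CurrentStatusDate"
"abowtle","Delete","2/1/2011"
"abowtle","Add","9/9/2011"
"acampbell6499","Delete","11/1/2011"
"acampbell6499","Add","9/26/2011"
'@ | ConvertFrom-Csv

# Group by UserID, sort each group chronologically, keep the newest record.
$latest = $rows |
    Group-Object UserID |
    ForEach-Object {
        $_.Group |
            Sort-Object { [datetime]$_.CurrentStatusDate } |
            Select-Object -Last 1
    }

$latest
```

For acampbell6499 this keeps the "Delete" record dated 11/1/2011, which is later than 9/26/2011. For the real file, replace the here-string with Import-CSV "C:\students.csv".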



Answer 2:

How about:

  • joining the fields together (http://www.johndcook.com/PowerShellCookbook.html#a19) on a separator
  • use unique
  • split
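
One possible reading of these three steps is sketched below. The separator character, the two-column sample, and the variable names are assumptions made for illustration, not something the answer specifies.

```powershell
# Hypothetical two-column sample with one exact duplicate row.
$rows = @'
"UserID","CurrentStatusDate"
"abowtle","2/1/2011"
"abowtle","2/1/2011"
"abowtle","9/9/2011"
'@ | ConvertFrom-Csv

$sep = '|'   # assumed separator; it must not occur in the data

# 1. Join the fields of each row into a single string.
$joined = $rows | ForEach-Object { ($_.UserID, $_.CurrentStatusDate) -join $sep }

# 2. Drop repeated strings.
$unique = $joined | Select-Object -Unique

# 3. Split each string back into its fields.
$result = $unique | ForEach-Object {
    $userId, $date = $_ -split [regex]::Escape($sep)
    [pscustomobject]@{ UserID = $userId; CurrentStatusDate = $date }
}

$result
```

Read this way, the technique removes only rows whose joined fields are identical. Both distinct abowtle rows survive, so it would still need a "keep the latest per UserID" step to meet the goal in the question.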


Answer 3:

Not tested:

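 $new_csv = @()
 $temp = $null
 # Sort by UserID so that each student's records are adjacent
 Import-CSV "C:\students.csv" | sort UserID |
   foreach {
     if ($null -eq $temp) { $temp = $_ }
     if ($_.UserID -ne $temp.UserID) {
       # UserID changed: $temp holds the newest record for the previous user
       $new_csv += $temp
       $temp = $_
     }
     elseif ([datetime]$_.CurrentStatusDate -gt [datetime]$temp.CurrentStatusDate) {
       # Same user, newer record: keep it
       $temp = $_
     }
   }
 # The last user never triggers a UserID change, so add that record here
 $new_csv += $temp
 $new_csv | export-csv c:\somedir\new_csv.csv -notype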

When the first record comes through, $temp gets set to that record. As each new record comes through, if it's for the same UserID, its timestamp gets checked against the record in $temp. If it's newer, it replaces the record in $temp. When the UserID changes, $temp (which should now contain the newest record for the previous user) is added to $new_csv. Then $temp is set to the current record, and the process starts over for the next UserID. Since there is no UserID change after the last account, you have to pick that one up after the loop finishes, then export the CSV.

Not sure exactly what format that timestamp is in, but I'm assuming it will parse correctly to [datetime]. But if it's coming from .csv it's going to be a string, and it seemed unlikely it would sort in datetime order on a string sort, so I didn't even bother.
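
To illustrate the difference between a string sort and a [datetime] sort, here is a small sketch using three date strings from the question's sample. The variable names are made up for the example.

```powershell
$dates = '9/12/2011', '11/15/2011', '2/1/2011'

# Plain sort: the values are compared as text, character by character.
$asText = $dates | Sort-Object

# Chronological sort: each value is cast to [datetime] before comparing.
$asDates = $dates | Sort-Object { [datetime]$_ }

"String sort:   $($asText -join ', ')"
"Datetime sort: $($asDates -join ', ')"
```

The string sort puts 11/15/2011 first because "1" sorts before "2" and "9", while the [datetime] sort puts 2/1/2011 first.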