Add an md5 hash value to a big CSV file

2019-08-14 20:01发布

问题:

I need to insert a new field containing the MD5 Hash value of the first field for each line of an 80 GB csv file.

For small projects, I have been able to do this in excel by passing the field value to

=WEBSERVICE(CONCATENATE("https://helloacm.com/api/md5/?s="&ENCODEURL(A1)))

However, with the 80 GB file, that is not an option.

Via AWK, is it possible to pull the first field of each row in this massive csv, calculate the md5 for the content of the first field, and insert that value back into the same line?

Example line:

Original:

"value001","value002","Value003","Value004","Value005","Value006","Value007"

Revised Example line with md5ofvalue001 field inserted:

"value001","MD5ofValue001","value002","Value003","Value004","Value005","Value006","Value007"

回答1:

awk is great, but for your problem, it will probably be much too slow if you have to use system() to calculate the md5. awk may also be poorly suited to the task if the first field has any embedded commas.

In any case, here is a fast (or at least much faster) solution using php, which I've found to have excellent support for CSV of various stripes and hues. You should be able to run this as a script on a Mac or Linux-like platform.

#!/usr/bin/env php
<?php

# Syntax: $0 [PATHNAME]
# A filter that expects its input to have the CSV format.
# Input is taken from STDIN if PATHNAME is - or not specified.
# Output is the same CSV but with the md5 of the first field tacked on.

$file = ($argc > 1 && $argv[1] != "" ) ? $argv[1] : 'php://stdin';
if ( $file == "-" ) { $file = 'php://stdin'; }

$handle = @fopen($file, "r");
$sep = ",";

if ($handle) {
  while (($data = fgetcsv($handle, 0, $sep)) !== FALSE) {
    $num = count($data);
    $data[] = md5($data[0]);
    fputcsv(STDOUT, $data, $sep);
  }
  fclose($handle);
} else {
  echo "{$argv[0]}: unable to fopen $argv[1]\n";
  exit(1);
}
?>

If you want to leave the input lines unaltered, then you could read in the line literally and use str_getcsv() to parse it, etc.



回答2:

awk to the rescue!

Here is a proof of concept for you

$ awk -F, -v OFS=, -v q='\"' '{
         cmd= "echo " $1 " | md5sum | cut -f1 -d\" \""; 
         (cmd | getline md5); print $1, q md5 q}' file

"value001","c36a5b774bfb2fd236331ac5ebef4266"


回答3:

Since you asked how to do it in awk, and assuming that echo val | md5sum is how to calculate an "md5sum", this is the awk script to do that:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    cmd = "echo " $1 " | md5sum"
    if ( (cmd | getline md5) > 0 ) {
        sub(/ .*/,"",md5)
    }
    else {
        printf "Warning: Failed to calculate md5sum of %s at input line %d\n", $1, NR | "cat>&2"
        md5 = "N/A"
    }
    close(cmd)
    $1 = $1 OFS "\"" md5 "\""
    print
}

$ awk -f tst.awk file
"value001","c36a5b774bfb2fd236331ac5ebef4266","value002","Value003","Value004","Value005","Value006","Value007"

As noted elsewhere, since you're jumping in and out of shell every line this will be slow compared to a tool that can do md5sum calculations internally.



标签: csv hash awk sed