How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
I copied larsks's response, but I added the close line to avoid the problem described in this post: gawk / awk: piping date to getline *sometimes* won't work
So, you don't really want to be doing this with `awk`. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.

Given input like this:
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
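A minimal sketch of an awk script along these lines, assuming whitespace-separated columns, `openssl` on the PATH, and a column value that contains no single quote. The names `tmp` and `cksum` follow the description in this answer; the wrapper name `hash_column` and the `sub()` cleanup of the `openssl` output prefix are assumptions added for illustration.

```shell
# Replace column 2 of each whitespace-separated row with its MD5 hex digest.
# hash_column is a hypothetical helper name; requires awk and openssl.
hash_column() {
  awk '{
    # Build the shell command; \047 is a single quote, so the field is quoted for the shell.
    # Assumption: the field itself contains no single quote.
    tmp = "printf %s \047" $2 "\047 | openssl md5"
    tmp | getline cksum        # read the one line of output into cksum
    close(tmp)                 # close the pipe so it is not left open for later rows
    sub(/^.*= */, "", cksum)   # drop a "(stdin)= " or "MD5(stdin)= " prefix if present
    $2 = cksum                 # assigning a field rebuilds the row with single spaces
    print
  }'
}

printf 'a b c d\n' | hash_column
```

The `close(tmp)` call matters on a large file: without it, awk keeps one pipe open per distinct command string and can run out of file descriptors, and a repeated value finds its pipe already at end of file, so `getline` reads nothing and `cksum` keeps a stale value.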
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses `openssl` to generate the md5 checksum. We first build a shell command line in `tmp` to pass the selected column to the `md5` command. Then we pipe the output into the `cksum` variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:

This might work using Bash/GNU sed:
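A rough sketch of that kind of sed command, assuming GNU sed (the `e` flag of `s` is a GNU extension), single-space-separated words, and `md5sum` from coreutils. The helper name `hash_second_word` is hypothetical, and the value is hashed without a trailing newline, which may differ from what the original command did.

```shell
# GNU sed only: the "e" flag runs the rewritten pattern space as a shell command
# and replaces the pattern space with that command's output.
# hash_second_word is a hypothetical helper name.
hash_second_word() {
  # \1, \2, \3 are the first word, the second word, and the rest of the line.
  # cut drops the "  -" file name field that md5sum prints for standard input.
  sed -E 's/^([^ ]+) ([^ ]+)(.*)$/printf "%s %s%s" "\1" "$(printf %s "\2" | md5sum | cut -d" " -f1)" "\3"/e'
}

printf 'this is a test\n' | hash_second_word
```

Because each line is pasted into a shell command, this is only reasonable for trusted input: quotes, `$`, or backticks in the data would be interpreted by the shell.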
or a mostly sed solution:
Replaces `is` from `this is a test` with its md5sum.

Explanation:
In the first: identify the columns and use back references as parameters in the Bash command, which is substituted and evaluated; then make cosmetic changes to drop the file name field (`-`, meaning standard input) that the md5sum command generates.
In the second: similar to the first, but stash the input string in the hold space; then, after evaluating the md5sum command, use `G` to append the saved string to the pattern space (the md5sum result) and use substitution to arrange the pieces to suit.

You can also do that with perl:
If you want to obfuscate a large amount of data, it might be faster than sed and awk, which need to fork an md5sum process for each line.
You might have a better time with `read` than `awk`, though I haven't done any benchmarking.

The input (scratch001.txt):
transformed using `read`:
produces the output:
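For reference, a `read` loop of this kind might look roughly like the sketch below. It assumes whitespace-separated columns, column 2 as the target, and `md5sum` from coreutils; the function name `hash_column_read` and the variable names are hypothetical.

```shell
# hash_column_read is a hypothetical helper name; requires md5sum.
hash_column_read() {
  # read splits each row into the first column, the second column, and the remainder.
  while read -r first second rest; do
    # md5sum prints "<hash>  -" for standard input; keep only the hash.
    sum=$(printf %s "$second" | md5sum | cut -d' ' -f1)
    # ${rest:+ $rest} adds the remainder only when the row has more than two columns.
    printf '%s %s%s\n' "$first" "$sum" "${rest:+ $rest}"
  done
}

printf 'a b c d\n' | hash_column_read
```

On the file from this answer it would be run as `hash_column_read < scratch001.txt`. It still starts one `md5sum` process per row, so it is worth timing against the other approaches on a sample before committing to it.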