In awk, how can I use a file containing multiple f

Posted 2020-07-09 02:07

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.

Here's a tiny example of the problem:

$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello:  world
        foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$ 

So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?

UPDATE #1:

As a further example, consider the following, using a bash here-string:

[me@here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me@here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me@here ~]$

As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.

UPDATE #2:

The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:

#!/bin/sh

while read -r fmtid fmt; do
  while read cid name addy; do
    awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
      'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
  done < /path/to/sampledata
done < /path/to/fmtstrings

Example input would be:

## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n

## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere

My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:

awk '

  NR==FNR { fmts[$1]=$2; next; }

  {
    for(fmtid in fmts) {
      outputfile=sprintf("/path/%d/%d", fmtid, custid);
      printf(fmts[fmtid], $1, $2) > outputfile;
    }
  }

' /path/to/fmtstrings /path/to/sampledata

Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)

FWIW, I'm using FreeBSD 9.2 with its built-in awk, but I'm open to using gawk if a solution can be found with that.

Tags: awk printf
10 answers
男人必须洒脱
#2 · 2020-07-09 02:28

Ed Morton shows the problem clearly (edit: and it's now complete, so just go accept it): awk's string literal processing handled the escapes, and file I/O code isn't a lexical analyzer.
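
The difference is easy to see by measuring the string awk ends up with in each case. This is a minimal sketch, assuming any POSIX awk and a POSIX shell; the variable names literal_len and input_len are only for illustration:

```shell
# In awk source code, the lexer converts \t inside a string literal
# into a single tab character, so "a\tb" is 3 characters long.
literal_len=$(awk 'BEGIN { print length("a\tb") }')

# Read as input, the same text stays as a, backslash, t, b: 4 characters.
# (printf with %s passes its argument through without touching the backslash.)
input_len=$(printf '%s\n' 'a\tb' | awk '{ print length($0) }')

echo "literal: $literal_len, input: $input_len"
```

The escape is resolved when awk parses the program text, not when printf runs, so a format string that arrives as data never gets that treatment.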

It's an easy fix: decide what escapes you want to support, and support them. Here's a one-liner form if you're doing special-purpose work that doesn't need to handle escaped backslashes:

awk '{ gsub(/\\n/,"\n"); gsub(/\\t/,"\t"); printf($0 "bar\n", "world"); }' <<\EOD
hello:\t%s\n\tfoo
EOD

but for do-it-and-forget-it peace of mind, just use the full form in the linked answer.
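
For reference, here is a rough sketch of what a version that also copes with escaped backslashes might look like. It is not the code from the linked answer; the names format_from_stdin and unescape are hypothetical, and it assumes only \n, \t and \\ need to be supported:

```shell
# Hypothetical helper: reads format strings on stdin, one per line,
# and prints each one with "world" substituted, as in the question.
format_from_stdin() {
  awk '
    # Walk the string one character at a time so that an escaped
    # backslash (\\) is consumed as a unit and cannot be misread
    # as the start of \n or \t.
    function unescape(s,    out, i, c, n) {
      out = ""
      n = length(s)
      for (i = 1; i <= n; i++) {
        c = substr(s, i, 1)
        if (c == "\\" && i < n) {
          c = substr(s, ++i, 1)
          if (c == "n")       out = out "\n"
          else if (c == "t")  out = out "\t"
          else if (c == "\\") out = out "\\"
          else                out = out "\\" c   # unknown escape: keep it verbatim
        } else {
          out = out c
        }
      }
      return out
    }
    { printf(unescape($0) "bar\n", "world") }
  '
}
```

It can be fed the same input as the one-liner, for example printf '%s\n' 'hello:\t%s\n\tfoo' | format_from_stdin. Adding further escapes means adding branches to the if chain.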

SAY GOODBYE
#3 · 2020-07-09 02:32

I had to create another answer to start clean. I believe I've come to a good solution, again with Perl:

 echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'  
        hi  :        hello

That bad boy, s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg, will translate any meta character I can think of. Let's take a look with cat -A:

echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'   | cat -A
        hi^I:^I     hello^M$

PS: I didn't create that regex; I googled "unquote meta" and found it.

仙女界的扛把子
#4 · 2020-07-09 02:33

Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.

The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.

awk 'BEGIN  {if(ARGC!=3)exit(1);
             fn=ARGV[2];ARGC=2}
     NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
             ARGV[ARGC++]="fmtid="$1;
             ARGV[ARGC++]=fn;
             next}
     {match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
      printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
     }' fmtfile sampledata

What's going on here is that the NR==FNR clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments given as command-line arguments are executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by POSIX, as is the order in which arguments are processed, which makes it possible to add arguments as you go.
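
The mechanism can be isolated in a few lines. The following is only an illustrative sketch (the function name argv_assign_demo and the variable v are made up, and mktemp is assumed to be available): it appends an assignment and a second copy of the input file to ARGV, then reports the length of v on each pass.

```shell
argv_assign_demo() {
  tmp=$(mktemp) || return 1
  echo "one line" > "$tmp"
  awk '
    BEGIN {
      # "v=a\\tb" in awk source is the 6-character text v=a\tb.
      # Because it sits in ARGV, awk applies it as a command-line
      # assignment, with escape processing, when it reaches it.
      ARGV[ARGC++] = "v=a\\tb"
      ARGV[ARGC++] = ARGV[1]     # read the same file a second time
    }
    { print "pass " NR ": length(v) = " length(v) }
  ' "$tmp"
  rm -f "$tmp"
}
```

On the first pass v is still unset (length 0); on the second pass it holds a, a real tab, and b (length 3), because the assignment was processed between the two file arguments.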

As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.

The snippet also shows two ways of concatenating trailing fields.

In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.

Processing the data file may need to be done with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/, which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all POSIX-standard behaviour.)
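
As a sketch of the two improvements mentioned (allowing tabs, and not assuming the match succeeds), the same technique could be written like this. The function name rest_from_third is hypothetical:

```shell
# Prints everything from the third field onward for each input line.
rest_from_third() {
  awk '{
    # The regex is anchored at the start of the line, so RLENGTH is the
    # number of characters taken up by the first two fields and the
    # whitespace around them.
    if (match($0, /^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/))
      print substr($0, RLENGTH + 1)
    else
      print ""    # fewer than three fields: nothing to join
  }'
}
```

For example, echo '5 Companyname 123 Somewhere Street' | rest_from_third prints 123 Somewhere Street, with the internal spacing of the address left untouched.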

Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
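
Something along these lines is what I have in mind. It is a sketch only: the function name render_all and its three parameters are invented for the example, the output layout mirrors the question's /path/$fmtid/$cid, and it assumes the format file is trusted, since its contents are used directly as a printf format.

```shell
# render_all FMTFILE DATAFILE OUTDIR
render_all() {
  fmtfile=$1
  datafile=$2
  outdir=$3
  # read -r keeps the backslashes intact; the last variable named
  # receives the rest of the line, so fmt and addy may contain spaces.
  while read -r fmtid fmt; do
    mkdir -p "$outdir/$fmtid"
    while read -r cid name addy; do
      # The shell printf interprets \n and \t in the format itself.
      printf "$fmt" "$cid" "$name" "$addy" > "$outdir/$fmtid/$cid"
    done < "$datafile"
  done < "$fmtfile"
}
```

Called as render_all /path/to/fmtstrings /path/to/sampledata /path, it writes one file per format and customer without starting awk at all.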

手持菜刀,她持情操
#5 · 2020-07-09 02:37

This looks extremely ugly, but it works for this particular problem:

s=$0;
gsub(/'/, "'\\''", s);
gsub(/\\n/, "\\\\\\\\n", s);
"printf '%b' '" s "'" | getline s;
gsub(/\\\\n/, "\n", s);
gsub(/\\n/, "\n", s);
printf(s " bar\n", "world");

  1. Replace all single quotes with shell-escaped single quotes ('\'').
  2. Replace all escaped newline sequences that appear normally as \n with the sequence that appears as \\\\n. It would suffice to use \\\\n as the actual replacement string (meaning \\n would print if you printed it), but the version of gawk I have messes things up in POSIX mode.
  3. Invoke the shell to execute printf '%b' 'escape'\''d format' and use awk's getline statement to retrieve the line.
  4. Unescape \\n to yield a newline. This step wouldn't be necessary if gawk in POSIX mode played nicely.
  5. Unescape \n to yield a newline.

Otherwise you're left to call the gsub function for each possible escape sequence, which is terrible for \001, \002, etc.
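
The piece doing the real work in step 3 is the shell's printf with %b, which expands backslash escapes in its argument while leaving % sequences alone. A minimal sketch of just that part, with a hypothetical function name expand_escapes:

```shell
# Expands backslash escapes in the argument; a %s in the argument
# is passed through untouched because it is data, not the format.
expand_escapes() {
  printf '%b' "$1"
}

expand_escapes 'hello:\t%s'; echo
```

The result still contains %s, so it can be handed to awk's printf as a format afterwards. Since getline reads a single line, any real newline produced at this stage would cut the result short, which is presumably why the awk code above hides \n before calling the shell and restores it afterwards.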
