In awk, how can I use a file containing multiple f

Posted 2020-07-09 02:07

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.

Here's a tiny example of the problem:

$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello:  world
        foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$ 

So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?

UPDATE #1:

As a further example, consider the following using a bash here-string:

[me@here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me@here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me@here ~]$

As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.

UPDATE #2:

The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:

#!/bin/sh

while read -r fmtid fmt; do
  while read cid name addy; do
    awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
      'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
  done < /path/to/sampledata
done < /path/to/fmtstrings

Example input would be:

## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n

## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere

My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:

awk '

  NR==FNR { fmts[$1]=$2; next; }

  {
    for(fmtid in fmts) {
      outputfile=sprintf("/path/%d/%d", fmtid, custid);
      printf(fmts[fmtid], $1, $2) > outputfile;
    }
  }

' /path/to/fmtstrings /path/to/sampledata

Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)

FWIW, I'm using FreeBSD 9.2 with its built-in awk, but I'm open to using gawk if a solution can be found with that.

Tags: awk printf
10 answers
The star\"
#2 · 2020-07-09 02:15

Why so lengthy and complicated an example? This demonstrates the problem:

$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a       b

$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb

In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.

In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.

You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per @EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):

$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a       b

If you don't need the result in a variable, you can just call system().
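
For instance, a minimal sketch of the system() variant. The function name print_via_system and the hard-coded argument b are only for illustration, and it assumes the input line contains no single quotes, since those would break out of the quoting:

```shell
print_via_system() {
  # Reads format strings on stdin, one per line, and hands each one to the
  # shell printf, which expands the backslash escapes and fills in the %s.
  # \047 is the octal escape for a single quote; "\\n" appends a literal \n
  # to the format so the output ends with a newline.
  awk '{ system("printf \047" $0 "\\n\047 b") }'
}
```

Running `echo 'a\t%s' | print_via_system` should print a, a tab, and b. As with the getline version, only feed this input you trust.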

If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).
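
A rough sketch of that idea, under a few assumptions: the name expand_then_format and the "foo"/"bar" arguments are made up for the example, the input contains no single quotes, and the format does not start with a dash. Doubling every % also takes care of an already-escaped %%, because it becomes %%%% and comes back out of the shell printf as %%:

```shell
expand_then_format() {
  awk '{
    fmt = $0
    gsub(/%/, "%%", fmt)                 # double every % so the shell printf leaves it alone
    cmd = "printf \047" fmt "x\047"      # the trailing x protects final newlines
    out = ""; sep = ""
    while ((cmd | getline line) > 0) {   # the expanded result may span several lines
      out = out sep line
      sep = "\n"
    }
    close(cmd)
    out = substr(out, 1, length(out) - 1)  # drop the sentinel x again
    printf out, "foo", "bar"               # awk now fills in the %s itself
  }'
}
```

The loop around getline matters: a single getline only returns the first line, so a format containing \n would otherwise be cut short.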

You could call awk instead of the shell printf if you prefer.

Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!

With help from Arnold Robbins (the maintainer of gawk), and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences. It relies on gawk's four-argument split():

$ cat tst2.awk
function expandEscapes(old,     segs, segNr, escs, idx, new) {
    split(old,segs,/\\./,escs)
    for (segNr=1; segNr in segs; segNr++) {
        if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )
            escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)
        new = new segs[segNr] escs[segNr]
    }
    return new
}

{
    s = expandEscapes($0)
    printf s, "foo", "bar"
}


$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar

Alternatively, this should be functionally equivalent but not gawk-specific:

function expandEscapes(tail,   head, esc, idx) {
    head = ""
    while ( match(tail, /\\./) ) {
        esc  = substr( tail, RSTART + 1, 1 )
        head = head substr( tail, 1, RSTART-1 )
        tail = substr( tail, RSTART + 2 )
        idx  = index( "abfnrtv", esc )
        if ( idx )
             esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
        head = head esc
    }

    return (head tail)
} 
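
To tie this back to the question, here is one way the whole job might look as a single awk call using the portable function above. This is only a sketch under assumptions: the wrapper name render_all, its three arguments, and the out/<fmtid>/<cid> layout are invented for the example, the format file has one "<id> <format>" pair per line, and the IDs are trusted (they end up in a mkdir command and in file names).

```shell
#!/bin/sh
# render_all FMTFILE DATAFILE OUTDIR   (hypothetical helper, not from the question)
render_all() {
  awk -v outdir="$3" '
    function expandEscapes(tail,   head, esc, idx) {
      head = ""
      while ( match(tail, /\\./) ) {
        esc  = substr( tail, RSTART + 1, 1 )
        head = head substr( tail, 1, RSTART-1 )
        tail = substr( tail, RSTART + 2 )
        idx  = index( "abfnrtv", esc )
        if ( idx )
          esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
        head = head esc
      }
      return (head tail)
    }

    # First file: remember each format, minus its leading ID.
    NR == FNR {
      if (NF < 2) next
      fmt = $0
      sub(/^[ \t]*[^ \t]+[ \t]+/, "", fmt)
      fmts[$1] = expandEscapes(fmt)
      next
    }

    # Second file: $1 is the customer ID, $2 the name, the rest the address.
    NF >= 3 {
      addy = $0
      sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", addy)
      for (id in fmts) {
        dir = outdir "/" id
        system("mkdir -p \047" dir "\047")
        file = dir "/" $1
        printf fmts[id], $1, $2, addy > file
        close(file)
      }
    }
  ' "$1" "$2"
}
```

Calling `render_all fmtstrings sampledata out` would then write one file per format and customer. Stripping the first two fields with sub() is one simple way to get $3..$NF as a single string without rebuilding it in a loop.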

If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to

/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/

and for a hex value after the \\:

c = sprintf("%c", strtonum("0x" rest_of_str))

and for an octal value:

c = sprintf("%c", strtonum("0" rest_of_str))
\"骚年 ilove
#3 · 2020-07-09 02:21

That's a cool question. I don't know the answer in awk, but in perl you can use eval:

echo '%10s\t:\t%-10s\n' |  perl -ne ' chomp; eval "printf (\"$_\", \"hi\", \"hello\")"'
        hi  :   hello  

PS: Be aware of the danger of code injection when you use eval in any language. And not just eval: no system call should be made blindly on untrusted input.

Example in Awk:

echo '$(whoami)' | awk '{"printf \"" $0 "\" " "b" | getline s; print s}'
tiago

What if the input was $(rm -rf /)? You can guess what would happen :)


ikegami adds:

Why would you even think of using eval to convert \n to newlines and \t to tabs?

echo '%10s\t:\t%-10s\n' | perl -e'
   my %repl = (
      n => "\n",
      t => "\t",
   );

   while (<>) {
      chomp;
      s{\\(?:(\w)|(\W))}{
         if (defined($2)) {
            $2
         }
         elsif (exists($repl{$1})) {
            $repl{$1}
         }
         else {
            warn("Unrecognized escape \\$1.\n");
            $1
         }
      }eg;

      printf($_, "hi", "hello");
   }
'

Short version:

echo '%10s\t:\t%-10s\n' | perl -nle'
   s/\\(?:(n)|(t)|(.))/$1?"\n":$2?"\t":$3/seg;
   printf($_, "hi", "hello");
'
▲ chillily
#4 · 2020-07-09 02:23

What you are trying to do is called templating. I would suggest that shell tools are not the best tools for this job. A safe way to go would be to use a templating library such as Template Toolkit for Perl, or Jinja2 for Python.

再贱就再见
#5 · 2020-07-09 02:23

Graham,

Ed Morton's solution is the best (and perhaps only) one available.

I'm including this answer for a better explanation of WHY you're seeing what you're seeing.

A string is a string. The confusing part here is WHERE awk does the translation of \t to a tab, \n to a newline, etc. It appears NOT to be the case that the backslash and t get translated when used in a printf format. Instead, the translation happens at assignment, so that awk stores the tab as part of the format rather than translating when it runs the printf.

And this is why Ed's function works. When a string is read from stdin or a file, no assignment from a string literal takes place, so nothing translates the special characters. By contrast, once you run the command s="a\tb"; in awk, you have a three-character string containing no backslash or t.

Evidence:

$ echo "a\tb\n" | awk '{ s=$0; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1       a
2       \
3       t
4       b
5       \
6       n

vs

$ awk 'BEGIN{s="a\tb\n"; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1       a
2               
3       b
4       

And there you go.

As I say, Ed's answer provides an excellent function for what you need. But if you can predict what your input will look like, you can probably get away with a simpler solution. Knowing how this stuff gets parsed, if you have a limited set of characters you need to translate, you may be able to survive with something simple like:

s=$0;
gsub(/\\t/,"\t",s);
gsub(/\\n/,"\n",s);
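
As a complete command, that might look like the sketch below. The wrapper name simple_expand and the "foo"/"bar" arguments are just for the example. One known limitation: an escaped backslash followed by n or t in the input (\\n) would be translated too, which Ed's function avoids.

```shell
simple_expand() {
  awk '{
    s = $0
    gsub(/\\t/, "\t", s)     # backslash + t  ->  real tab
    gsub(/\\n/, "\n", s)     # backslash + n  ->  real newline
    printf s, "foo", "bar"
  }'
}
```

For example, `simple_expand <<<'hello: %s\nworld: %s\n'` in bash should print the two lines from the question's UPDATE #1.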
霸刀☆藐视天下
#6 · 2020-07-09 02:25

The problem lies in echo not interpreting the special characters \t and \n: it passes them through as literal backslash sequences rather than as tabs and newlines. This behavior can be controlled with the -e flag to echo, without changing your awk script at all:

echo -e "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'

tada!! :)

EDIT: Ok, so after the point rightfully raised by Chrono, we can devise this other answer corresponding to the original request to have the pattern read from a file:

echo "hello:\t%s\n\tfoo" > myfile
awk 'BEGIN {s="'$(cat myfile)'" ; printf(s "bar\n", "world")}'

Of course in the above we have to be careful with the quoting, as the $(cat myfile) is not seen by awk but interpreted by the shell.

再贱就再见
#7 · 2020-07-09 02:27

@Ed Morton's answer explains the problem well.

A simple workaround is to:

  • pass the format-string file contents via an awk variable, using command substitution,
  • assuming that file is not too large to be read into memory in full.

Using GNU awk or mawk:

awk -v formats="$(tr '\n' '\3' <fmtStrings)" '
     # Initialize: Split the formats into array elements.
    BEGIN {n=split(formats, aFormats, "\3")}
     # For each data line, loop over all formats and print.
    { for(i=1;i<n;++i) {printf aFormats[i] "\n", $1, $2, $3} }
    ' sampleData

Note:

  • The advantage of this solution is that it works generically - you don't need to anticipate specific escape sequences and handle them specially.
  • On FreeBSD awk, this almost works, but - sadly - split() still splits by newlines, despite being given an explicit separator - this smells like a bug. Observed on versions 20070501 (OS X 10.9.4) and 20121220 (FreeBSD 10.0).
  • The above solves the core problem (for brevity, it omits stripping the ID from the front of the format strings and omits the output-file creation logic).

Explanation:

  • tr '\n' '\3' <fmtStrings replaces actual newlines in the format-strings file with \3 (0x3) characters, so as to be able to later distinguish them from the \n escape sequences embedded in the lines, which awk turns into actual newlines when assigning to variable formats (as desired).
    \3 (0x3) - the ASCII end-of-text char. - was arbitrarily chosen as an auxiliary separator that is assumed not to be present in the input file.
    Note that using \0 (NUL) is NOT an option, because awk interprets that as an empty string, causing split() to split the string into individual characters.
  • Inside the BEGIN block of the awk script, split(formats, aFormats, "\3") then splits the combined format strings back into individual format strings.
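
For what it's worth, stripping the ID could be folded in along these lines. Treat it as a sketch with assumptions: the function name print_formats is made up, each format line is "<id> <format>" with a single space after the ID, and it is meant for GNU awk or mawk only, given the FreeBSD awk split() issue noted above.

```shell
print_formats() {
  # $1 = format-strings file, $2 = data file
  awk -v formats="$(tr '\n' '\3' < "$1")" '
    BEGIN {
      n = split(formats, lines, "\3")
      count = 0
      for (i = 1; i <= n; i++) {
        p = index(lines[i], " ")
        if (p == 0) continue                    # skips the empty trailing element
        fmt[++count] = substr(lines[i], p + 1)  # keep everything after the ID
      }
    }
    # For each data line, print it once per format.
    { for (i = 1; i <= count; i++) printf fmt[i], $1, $2, $3 }
  ' "$2"
}
```

Because the -v assignment has already turned \n and \t into real characters, the ID is cut off with index() and substr() rather than a regex, so leading tabs or newlines in a format are left untouched.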