-->

Replace an entire field value in a file using awk

2019-08-23 05:03发布

问题:

I have an export from postgresql table with multiple fields, including boolean (exported by postgresql as t and f character), and I need to import it in another database (monetdb) that won't understand t/f as bool values.

(EDIT removed spaces to reflect true file aspect and avoid angry comments - previously there was spaces displayed)

id|val_str|bool_1|bool2|bool_3|bool4|
1|help|t|t|f|t|
2|test|f|t|f|f|
...

As I cannot replace all occurence of t/f I need to integrate the field separator in my pattern. I tried to use awk to replace fields t with TRUE and f with FALSE:

awk -F'|' '{gsub(/\|t\|/, "|TRUE|"); gsub(/\|f\|/, "|FALSE|"); print;}' 

This is working partially, as consecutive fields with a same value (|t|t|) will have only the first occurrence replaced (|TRUE|t| - as 2nd occurence is in fact t| and not |t|).

id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|t|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|f|
...

Table has ~450 columns so I can't really specify the list of columns to be replaced, nor work in postgres to 'transform' boolean columns (I could but ...).

I could run the gsub() twice, but I was looking for more elegant way to match the entire field content for all fields.

gsub(/^t$/, ...) is not helping either as we are in the middle of a line most of the time.

回答1:

Table has ~450 columns so I can't really specify the list of columns to be replaced, nor work in postgres to 'transform' boolean columns (I could but ...).

You can let Postgres do the work for you. Basic query to produce the SELECT list:

SELECT string_agg(CASE WHEN atttypid = 'bool'::regtype
                       THEN quote_ident(attname) || '::text'
                       ELSE quote_ident(attname) END, ', ' ORDER BY attnum)
FROM   pg_attribute
WHERE  attrelid = 'mytable'::regclass  -- provide table name here
AND    attnum > 0
AND    NOT attisdropped;

Produces a string of the form:

col1, "CoL 2", bool1::text, "Bool 2"::text

All identifiers are escaped properly. Columns are in default order. Copy and execute it. Use COPY to export to file. (Or \copy in psql.) Performance is about the same as exporting a plain table. If you don't need upper case omit upper().

Why is a simple cast to text enough?

  • Return Boolean Value as TRUE or FALSE in Select (PostgreSQL/pgAdmin)

About regclass and escaping identifiers properly:

  • Table name as a PostgreSQL function parameter

If you need a complete statement with TRUE / FALSE / NULL in upper case, standard SQL cast notation (without colons ::), still original column names and maybe a schema-qualified tablename:

SELECT 'SELECT '
     || string_agg(CASE WHEN atttypid = 'bool'::regtype
                        THEN format('upper(cast(%1$I AS text)) AS %1$I', attname)
                        ELSE quote_ident(attname) END, ', ' ORDER BY attnum)
     || ' FROM myschema.mytable;'           -- provide table name twice now
FROM   pg_attribute
WHERE  attrelid = 'myschema.mytable'::regclass
AND    attnum > 0
AND    NOT attisdropped;

Produces a complete statement of the form:

SELECT col1, "CoL 2", upper(cast(bool1 AS text) AS bool1, upper(cast("Bool 2" AS text)) AS "Bool 2" FROM myschema.mytable;


回答2:

Assuming (based on your comments) that your input file actually doesn't look like the sample you posted but instead looks like this:

$ cat file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|t|t|f|t|
2|test|f|t|f|f|

then all you need is:

$ awk '{while(gsub(/\|t\|/,"|TRUE|")); while(gsub(/\|f\|/,"|FALSE|"));}1' file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|TRUE|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|FALSE|

with the general solution for N replacement strings being:

$ awk 'BEGIN{m["f"]="FALSE"; m["t"]="TRUE"} {for (k in m) while(gsub("\\|"k"\\|","|"m[k]"|"));} 1' file
id|val_str|bool_1|bool2|bool_3|bool4|
1|help|TRUE|TRUE|FALSE|TRUE|
2|test|FALSE|TRUE|FALSE|FALSE|


回答3:

if perl is okay, you can use lookarounds:

$ cat ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |   t    |   t    |   f    |   t    |
2  |    test   |   f    |   t    |   f    |   f    | 

$ perl -pe 's/\|\K\h*t\h*(?=\|)/  TRUE  /g; s/\|\K\h*f\h*(?=\|)/  FALSE /g' ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |  TRUE  |  TRUE  |  FALSE |  TRUE  |
2  |    test   |  FALSE |  TRUE  |  FALSE |  FALSE | 
  • \|\K positive lookbehind to match |
  • \h* optional horizontal space, remove if not actually present in input
  • (?=\|) positive lookahead to match |


Can also use looping with sed. Tested on GNU sed 4.2.2, syntax may vary with other implementations

$ sed ':a s/| *t *|/|  TRUE  |/;ta; :b s/| *f *|/|  FALSE |/;tb' ip.txt 
id |  val_str  | bool_1 | bool2  | bool_3 | bool4  | 
1  |    help   |  TRUE  |  TRUE  |  FALSE |  TRUE  |
2  |    test   |  FALSE |  TRUE  |  FALSE |  FALSE | 
  • :a label
  • s/| *t *|/| TRUE |/ substitute command
  • ta branch to label a as long as substitute command succeeds
  • similarly for :b


With no spaces in input

perl -pe 's/\|\Kt(?=\|)/TRUE/g; s/\|\Kf(?=\|)/FALSE/g' ip.txt 
sed ':a s/|t|/|TRUE|/;ta; :b s/|f|/|FALSE|/;tb' ip.txt 
awk 'BEGIN{FS=OFS="|"} {for(i=1;i<=NF;i++){if($i=="t"){$i="TRUE"} if($i=="f"){$i="FALSE"}} print}' ip.txt


回答4:

Use sed, it`s standard.

sed 's/| *t */| TRUE /g;s/| *f */| FALSE /g'

This tells sed to substitute every substring that begins with a pipe character, an unknown number of spaces (maybe zero), t and a space followed by an unknown number of spaces with | TRUE; same with f.

Pipe the output through column -t if line lengths get messed up.