I am trying to use Unix tools to transform a tab-delimited file from a short/wide format to a long format, similar to the reshape function in R. I want to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I want to keep columns 1, 2, and 3 the same for each starting row, but have column 4 contain one of the values from the initial column 4. This example probably makes it clearer than I can describe verbally:
current file:
A1 A2 A3 A4,A5,A6
B1 B2 B3 B4,B5,B6
C1 C2 C3 C4,C5,C6
goal:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As someone just becoming familiar with these tools, my initial thought was to use sed to find the commas and replace them with a newline:
sed 's/,/\n/g' data.frame
I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.
sed 's/,/\n{print $1, $2, $3}/g' data.frame
Unsurprisingly, the output looked like this:
A1 A2 A3 A4
{print $1, $2, $3} A5
{print $1, $2, $3} A6
B1 B2 B3 B4
{print $1, $2, $3} B5
{print $1, $2, $3} B6
C1 C2 C3 C4
{print $1, $2, $3} C5
{print $1, $2, $3} C6
It seems like an approach might be to store the values of columns 1-3 and then insert them, but I am not really sure how to store them. I think it may involve adapting the following script, but I am having a hard time understanding all of its components.
NR==FNR{a[$1, $2, $3]=1}
Thanks in advance for your thoughts on this.
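The sed you were reaching for is possible: capture the leading columns in a group and reuse them in the replacement. A minimal sketch, assuming tab-delimited input with exactly three comma-separated values per line, and GNU sed (which understands \t and \n in the pattern and replacement):
sed 's/^\(.*\t\)\([^,]*\),\([^,]*\),\([^,]*\)$/\1\2\n\1\3\n\1\4/' input.txt
The greedy .*\t captures everything through the last tab, so the first three columns travel into each of the three output lines.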
You can write a simple read loop for this and use parameter expansion (pattern substitution) to split the comma-delimited field:
#!/bin/bash
while read -r f1 f2 f3 c1; do
    # split the comma-delimited field 'c1' into its constituents
    for c in ${c1//,/ }; do
        printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
    done
done < input.txt
Output:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
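One caveat: the unquoted expansion ${c1//,/ } undergoes pathname expansion as well as word splitting, so a field such as * would be replaced by file names. If that could matter for your data, a variant of the same loop with globbing disabled (a sketch, not needed for the sample data shown):
#!/bin/bash
set -f   # disable pathname expansion so a field like '*' stays literal
while read -r f1 f2 f3 c1; do
    for c in ${c1//,/ }; do
        printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
    done
done < input.txt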
As a solution without calling any external program:
#!/bin/bash
data_file="input.txt"
while IFS=$' \t' read -r f1 f2 f3 r   # accept spaces or tabs between fields
do
    # split the remainder on commas into its three sub-fields
    IFS="," read -r f4 f5 f6 <<<"$r"
    # printf reuses its format string until all arguments are consumed,
    # so this prints three lines of four fields each
    printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$f4" \
                           "$f1" "$f2" "$f3" "$f5" \
                           "$f1" "$f2" "$f3" "$f6"
done <"$data_file"
If you don't need the fourth-column values to appear in any particular order within each group, the following awk one-liner might do:
awk '{split($4,a,","); for(i in a) print $1,$2,$3,a[i]}' input.txt
This works by splitting your 4th column into an array, then for each element of the array, printing the "new" four columns.
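To see what split is doing, here is a quick check on one sample line; the value returned by split is the element count:
echo 'A1 A2 A3 A4,A5,A6' | awk '{n=split($4,a,","); print "count=" n, a[1], a[2], a[3]}'
which prints count=3 A4 A5 A6.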
If order is important -- that is, A4 must come before A5, and so on -- then you can use a classic for loop. Note that length() on an array is a GNU awk extension, so the portable form captures split()'s return value instead:
awk '{n=split($4,a,","); for(i=1;i<=n;i++) print $1,$2,$3,a[i]}' input.txt
But that's awk. And you're asking about bash.
The following might work:
#!/usr/bin/env bash
# read the whole input file into an array, one line per element
mapfile -t arr < input.txt
for s in "${arr[@]}"; do
    t=($s)                          # split the line into whitespace-separated words
    mapfile -t -d, u <<<"${t[3]}"   # split the 4th column on commas into a second array
    for v in "${u[@]}"; do
        printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
    done
done
This copies your entire input file into the elements of an array, and then steps through that array, mapping each 4th-column into a second array. It then steps through that second array, printing the first three columns from the first array, along with the current field from the second array.
It's obviously similar in structure to the awk alternative, but much more cumbersome to read and code.
Note the ${v%$'\n'} on the printf line. This strips off the last field's trailing newline, which doesn't get stripped by mapfile because we're using an alternate delimiter.
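You can watch that happen by dumping the second array, assuming bash 4.4 or later for mapfile -d:
mapfile -t -d, u <<<'A4,A5,A6'
declare -p u
which prints declare -a u=([0]="A4" [1]="A5" [2]=$'A6\n'), with the stray newline visible on the last element.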
Note also that there's no reason you have to copy all your input into an array; I just did it that way to demonstrate a little more of mapfile. You could of course use the old standard,
while read s; do
...
done < input.txt
if you prefer.
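Spelled out with the same splitting logic, that variant might look like this (still a sketch, and still assuming bash 4.4+ for mapfile -d):
#!/usr/bin/env bash
while read -r s; do
    t=($s)                          # split the line into words
    mapfile -t -d, u <<<"${t[3]}"   # split the 4th column on commas
    for v in "${u[@]}"; do
        printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
    done
done < input.txt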
In the great Miller, there is the nest verb to do exactly this. With
mlr --nidx --ifs "\t" nest --explode --values --across-records -f 4 --nested-fs "," input.tsv
you will have:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
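Miller also offers a shorthand for this explode-values-across-records pattern; assuming a reasonably recent Miller, the following should be equivalent:
mlr --nidx --ifs "\t" nest --evar "," -f 4 input.tsv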