Reshaping from wide to long format

2020-02-02 03:47发布

问题:

I am trying to use unix to transform a tab delimited file from a short/wide format to long format, in a similar way as the reshape function in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I hope to keep columns 1, 2, and 3 the same for each starting row, but have column 4 be one of the values from the initial column 4. This example probably makes it more clear than I can describe verbally:

current file:  
A1  A2  A3  A4,A5,A6  
B1  B2  B3  B4,B5,B6  
C1  C2  C3  C4,C5,C6  

goal:  
A1  A2  A3  A4  
A1  A2  A3  A5  
A1  A2  A3  A6  
B1  B2  B3  B4  
B1  B2  B3  B5  
B1  B2  B3  B6  
C1  C2  C3  C4  
C1  C2  C3  C5  
C1  C2  C3  C6  

As someone just becoming familiar with this language, my initial thought was to use sed to find the commas replace with a hard return

sed 's/,/&\n/' data.frame

I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.

sed 's/,/&\n{print $1, $2, $3}/' data.frame

Not to my surprise, the output looked like this:

A1  A2  A3  A4  
{print $1, $2, $3}  A5  
{print $1, $2, $3}  A6  
B1  B2  B3  B4  
{print $1, $2, $3}  B5  
{print $1, $2, $3}  B6  
C1  C2  C3  C4  
{print $1, $2, $3}  C5  
{print $1, $2, $3}  C6  

It seems like an approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values, I think that it may involve using an adaptation of the following script, but I am having a hard time understanding all of the components.

NR==FNR{a[$1, $2, $3]=1}

Thanks in advance for your thoughts on this.

回答1:

You can a write simple read loop for this and use brace expansion for parsing the comma delimited field:

#!/bin/bash

while read -r f1 f2 f3 c1; do
  # split the comma delimited field 'c1' into its constituents
  for c in ${c1//,/ }; do
     printf "$f1 $f2 $f3 $c\n"
  done
done < input.txt

Output:

A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6


回答2:

As solution without calling an external program :

#!/bin/bash

data_file="d"

while IFS=" " read -r f1 f2 f3 r
do
  IFS="," read f4 f5 f6 <<<"$r"
  printf "$f1 $f2 $f3 $f4\n$f1 $f2 $f3 $f5\n$f1 $f2 $f3 $f6\n"
done <"$data_file"


回答3:

If you don't need the output to be in any particular order within a group of the fourth column, the following awk one-liner might do:

awk '{split($4,a,","); for(i in a) print $1,$2,$3,a[i]}' input.txt

This works by splitting your 4th column into an array, then for each element of the array, printing the "new" four columns.

If order is important -- that is, A4 must come before A5, etc, then you can use a classic for loop:

awk '{split($4,a,","); for(i=1;i<=length(a);i++) print $1,$2,$3,a[i]}' input.txt

But that's awk. And you're asking about bash.

The following might work:

#!/usr/bin/env bash

mapfile -t arr < input.txt

for s in "${arr[@]}"; do
  t=($s)
  mapfile -t -d, u <<<"${t[3]}"
  for v in "${u[@]}"; do
    printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
  done
done

This copies your entire input file into the elements of an array, and then steps through that array, mapping each 4th-column into a second array. It then steps through that second array, printing the first three columns from the first array, along with the current field from the second array.

It's obviously similar in structure to the awk alternative, but much more cumbersome to read and code.

Note the ${v%$'\n'} on the printf line. This strips off the last field's trailing newline, which doesn't get stripped by mapfile because we're using an alternate delimiter.

Note also that there's no reason you have to copy all your input into an array, I just did it that way to demonstrate a little more of mapfile. You could of course use the old standard,

while read s; do
   ...
done < input.txt

if you prefer.



回答4:

In the great Miller there is the nest verb to do it

With

mlr --nidx --ifs "\t" nest --explode --values  --across-records -f 4 --nested-fs ","  input.tsv

you will have

A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6