An efficient way to transpose a file in Bash

Posted 2018-12-31 07:57

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it would probably be slower to execute than the native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still very slow. Any thoughts/suggestions/brilliant ideas? :-)

25 Answers
怪性笑人.
#2 · 2018-12-31 08:26

Another option is to use rs:

rs -c' ' -C' ' -T

-c changes the input column separator, -C changes the output column separator, and -T transposes rows and columns. Do not use -t instead of -T, because it uses an automatically calculated number of rows and columns that is not usually correct. rs, which is named after the reshape function in APL, comes with BSDs and OS X, but it should be available from package managers on other platforms.

A second option is to use Ruby:

ruby -e'puts readlines.map(&:split).transpose.map{|x|x*" "}'

A third option is to use jq:

jq -R .|jq -sr 'map(./" ")|transpose|map(join(" "))[]'

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.

呛了眼睛熬了心
#3 · 2018-12-31 08:27

Some *nix standard-utility one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce I/O and memory), these snippets can actually be faster than some of the top answers.

Call the input file foo.

  1. If we know foo has four columns:

    for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
    
  2. If we don't know how many columns foo has:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
    

    xargs has a size limit and would therefore produce incomplete output on a long file. The size limit is system dependent, e.g.:

    { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
    

    Maximum length of command we could actually use: 2088944

  3. tr & echo:

    for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
    

    ...or if the number of columns is unknown:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n); do 
        cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
    done
    
  4. Using set which, like xargs, has similar command-line-size limitations:

    for f in 1 2 3 4 ; do set -- $(cut -d ' ' -f $f foo) ; echo "$@" ; done
    
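As an end-to-end check, method 2 can be exercised on the sample data from the question (file names here are illustrative):

```shell
# Create a sample whitespace-separated file.
printf 'X column1 column2\nrow1 0 1\nrow2 3 4\n' > foo

# Transpose without knowing the column count in advance:
# count the words in the header, then emit one column per output line.
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done > transposed

cat transposed
# → X row1 row2
#   column1 0 3
#   column2 1 4
```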
心情的温度
#4 · 2018-12-31 08:28

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).
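As a quick check (assuming a python3 interpreter is available), the one-liner can be run on the sample input; the file names are illustrative:

```shell
# Sample whitespace-separated input.
printf 'X c1 c2\nr1 0 1\n' > input

# Run the one-liner: split each non-blank line, zip to transpose, re-join.
python3 -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

cat output
# → X r1
#   c1 0
#   c2 1
```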

与风俱净
#5 · 2018-12-31 08:29

Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped C Python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command-line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])

main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
忆尘夕之涩
#6 · 2018-12-31 08:30

The only improvement I can see over your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

/bin/rm output 2> /dev/null

cols=`head -n 1 input | wc -w` 
for (( i=1; i <= $cols; i++))
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
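For reference, the loop can be exercised on the sample data like this (a sketch: the file names are illustrative, and `seq` replaces the bash-only `(( ))` loop for portability):

```shell
# Sample tab-separated input, as in the question.
printf 'X\tcolumn1\tcolumn2\nrow1\t0\t1\nrow2\t3\t4\n' > input

# One awk pass per column: print field $i of every line, tab-separated.
cols=$(head -n 1 input | wc -w)
for i in $(seq 1 $cols)
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done > output

cat output
```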
皆成旧梦
#7 · 2018-12-31 08:31

Pure Bash, no additional processes. A nice exercise:

declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done
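To try it out, the script above can be saved to a file and run with the input file as its first argument (the file names here are illustrative; note each output field is followed by a tab, including the last one):

```shell
# Save the pure-Bash transposer to a file.
cat > transpose.sh <<'EOF'
declare -a array=( )                      # we build a 1-D array
read -a line < "$1"                       # read the headline
COLS=${#line[@]}                          # save number of columns
index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n"
done
EOF

# Run it on a small sample.
printf 'X c1 c2\nr1 0 1\n' > sample
bash transpose.sh sample
```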