-->

Bash script to concatenate text files with specifi

Posted 2019-07-18 10:13

Question:

Within a certain directory I have many subdirectories, each containing a bunch of text files. I’m trying to write a script that concatenates only those files in each directory that have the string ‘R1’ in their filename into one file within that specific directory, and those that have ‘R2’ into another. This is what I wrote, but it’s not working.

#!/bin/bash

for f in */*.fastq; do

    if grep 'R1' $f ; then
        cat "$f" >> R1.fastq
    fi

    if grep 'R2' $f ; then
        cat "$f" >> R2.fastq
    fi

done

I get no errors and the files are created as intended but they are empty files. Can anyone tell me what I’m doing wrong?

Thank you all for the fast and detailed responses! I think I wasn't very clear in my question: I need the script to concatenate only the files within each specific directory, so that each directory gets its own new files (R1 and R2). I tried doing

cat /*R1*.fastq >*/R1.fastq 

but it gave me an ambiguous redirect error. I also tried Charles Duffy's for loop, looping through the directories with a nested loop to run through each file within a directory, like so

for f in */; do
   for d in "$f"/*.fastq;do
     case "$d" in
       *R1*) cat "$d" >&3
       *R2*) cat "$d" >&4
     esac
   done 3>R1.fastq 4>R2.fastq
done

but it was giving an unexpected token error regarding ')'.

Sorry in advance if I'm missing something elementary, I'm still very new to bash.

Answer 1:

A Note To The Reader

Please review edit history on the question in considering this answer; several parts have been made less relevant by question edits.

One cat Per Output File

For the purpose at hand, you can probably just let shell globbing do all the work (if R1 or R2 will be in the filenames, as opposed to the directory names):

set -x # log what's happening!
cat */*R1*.fastq >R1.fastq
cat */*R2*.fastq >R2.fastq

One find Per Output File

If it's a really large number of files, by contrast, you might need find:

find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' -exec cat '{}' + >R1.fastq
find . -mindepth 2 -maxdepth 2 -type f -name '*R2*.fastq' -exec cat '{}' + >R2.fastq

...this is because of the OS-dependent limit on command-line length; the find command given above will put as many arguments onto each cat command as possible for efficiency, but will still split them up into multiple invocations where otherwise the limit would be exceeded.
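To see the limit and the batching in practice, here is a minimal sketch. The scratch directory and the sample file names are made up for illustration; getconf ARG_MAX reports the system's limit on the combined size of arguments and environment, and the inline sh script simply prints how many paths find handed to each invocation.

```shell
#!/bin/bash
# Limit, in bytes, on the arguments plus environment passed to one exec call.
getconf ARG_MAX

# Scratch directory with a few hypothetical sample files.
demo=$(mktemp -d)
mkdir "$demo/sampleA" "$demo/sampleB"
printf 'a1\n' >"$demo/sampleA/x_R1_001.fastq"
printf 'b1\n' >"$demo/sampleB/y_R1_001.fastq"
printf 'a2\n' >"$demo/sampleA/x_R2_001.fastq"

# With "-exec ... {} +", find appends as many paths as fit to each invocation.
# The inline script reports the argument count it received, one line per batch.
batches=$(cd "$demo" && find . -mindepth 2 -maxdepth 2 -type f \
  -name '*R1*.fastq' -exec sh -c 'echo "$#"' sh {} +)
echo "arguments per batch: $batches"

rm -rf "$demo"
```

With only two matching files there is a single batch; more than one line of output would appear only once the file list outgrows the limit.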


Iterate-And-Test

If you really do want to iterate over everything, and then test the names, consider a case statement for the job, which is much more efficient than using grep to check just one line:

for f in */*.fastq; do
  case $f in
    *R1*) cat "$f" >&3 ;;   # each case branch must end with ;;
    *R2*) cat "$f" >&4 ;;
  esac
done 3>R1.fastq 4>R2.fastq

Note the use of file descriptors 3 and 4 to write to R1.fastq and R2.fastq respectively -- that way we're only opening the output files once (and thus truncating them exactly once) when the for loop starts, and reusing those file descriptors rather than re-opening the output files at the beginning of each cat. (That said, running cat once per file -- which find -exec {} + avoids -- is probably more overhead on balance).
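The truncate-once behaviour can be checked with a small experiment. This is only a sketch: the directory names s1 and s2, the file names, and their contents are invented for the demonstration. A stale R1.fastq is planted first, to show that opening the descriptor on the loop discards it once, while every cat inside the loop appends to the same open file.

```shell
#!/bin/bash
demo=$(mktemp -d)
cd "$demo" || exit 1

# Hypothetical sample layout.
mkdir s1 s2
printf 'r1-a\n' >s1/a_R1.fastq
printf 'r2-a\n' >s1/a_R2.fastq
printf 'r1-b\n' >s2/b_R1.fastq

# Leftover output from an earlier run.
printf 'stale\n' >R1.fastq

# The redirections on "done" open (and truncate) each output exactly once;
# every cat in the loop body writes through the already-open descriptor.
for f in */*.fastq; do
  case $f in
    *R1*) cat "$f" >&3 ;;
    *R2*) cat "$f" >&4 ;;
  esac
done 3>R1.fastq 4>R2.fastq

cat R1.fastq   # the stale line is gone; both R1 inputs are present
```

Had the loop body used >>R1.fastq instead, the stale line would have survived, and with a plain >R1.fastq only the last input would remain.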


Operating Per-Directory

All of the above can be updated to work on a per-directory basis quite trivially. For example:

for d in */; do
  find "$d" -name R1.fastq -prune -o -name '*R1*.fastq' -exec cat '{}' + >"$d/R1.fastq"
  find "$d" -name R2.fastq -prune -o -name '*R2*.fastq' -exec cat '{}' + >"$d/R2.fastq"
done

There are only two significant changes:

  • We no longer specify -mindepth, which previously ensured that our input files came only from subdirectories.
  • We're excluding R1.fastq and R2.fastq from our input files, so we never try to use the same file as both input and output. This is a consequence of the prior change: Previously, our output files couldn't be considered as input because they didn't meet the minimum depth.
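The iterate-and-test approach can be adapted per directory in the same spirit. The following is a sketch under assumed names (the scratch directory, s1, s2, and the sample files are all hypothetical), not the only way to do it. Because the output files themselves end in .fastq, they are skipped explicitly, and matching is done on the basename so that an R1 or R2 appearing in a directory name cannot trigger a match.

```shell
#!/bin/bash
demo=$(mktemp -d)

# Hypothetical sample layout: two directories with a few reads each.
mkdir "$demo/s1" "$demo/s2"
printf 's1-r1-a\n' >"$demo/s1/a_R1.fastq"
printf 's1-r1-b\n' >"$demo/s1/b_R1.fastq"
printf 's1-r2-a\n' >"$demo/s1/a_R2.fastq"
printf 's2-r1-a\n' >"$demo/s2/a_R1.fastq"

for d in "$demo"/*/; do            # $d already ends with a slash
  for f in "$d"*.fastq; do
    case ${f##*/} in               # test the file name only, not the path
      R1.fastq|R2.fastq) ;;        # never read our own output files
      *R1*) cat "$f" >&3 ;;
      *R2*) cat "$f" >&4 ;;
    esac
  done 3>"${d}R1.fastq" 4>"${d}R2.fastq"
done

cat "$demo/s1/R1.fastq"
```

Note that a directory with no R2 inputs (s2 here) still gets an empty R2.fastq, since the redirection creates the file whether or not anything is written to it.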


Answer 2:

Your grep is searching the contents of each file rather than its name. You could rewrite it this way:

for f in */*.fastq; do
  [[ -f $f ]] || continue
  if [[ $f = *R1* ]]; then
    cat "$f" >> R1.fastq
  elif [[ $f = *R2* ]]; then
    cat "$f" >> R2.fastq
  fi
done
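The difference between the two tests is easy to see on a single file. In this sketch the file name and the FASTQ record are invented: the name contains R1, but the contents do not.

```shell
#!/bin/bash
demo=$(mktemp -d)
f="$demo/sample_R1_001.fastq"          # hypothetical file name
printf '@read1\nACGT\n+\nIIII\n' >"$f"  # made-up record with no "R1" in it

# grep PATTERN FILE looks inside the file, so this test fails here.
if grep -q 'R1' "$f"; then content_match=yes; else content_match=no; fi

# [[ string = pattern ]] glob-matches the string itself, here the file name.
if [[ ${f##*/} = *R1* ]]; then name_match=yes; else name_match=no; fi

echo "contents: $content_match, name: $name_match"
# prints: contents: no, name: yes

rm -rf "$demo"
```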


Answer 3:

Using find in a for loop might suit this:

for i in R1 R2; do
  # Match only .fastq inputs, so the output file ($i.txt) is never picked up as input.
  find . -type f -name "*${i}*.fastq" -exec cat '{}' + >"$i.txt"
done


Tags: bash fastq