How do I pass Biopython SeqIO.convert() over multi

Posted 2019-02-20 20:53

Question:

I’m writing a Python script (version 2.7) that converts every input file (.nexus format) in a specified directory into .fasta format. Biopython's SeqIO.convert handles the conversion perfectly for individually specified files, but when I try to automate the process over a directory using os.walk I’m unable to correctly pass the pathname of each input file to SeqIO.convert. Where am I going wrong? Do I need to use join() from the os.path module and pass the full path names on to SeqIO.convert?

    #Import modules
    import sys
    import re
    import os
    import fileinput

    from Bio import SeqIO

    #Specify directory of interest
    PSGDirectory = "/Users/InputDirectory"
    #Create a class that will run the SeqIO.convert function repeatedly
    def process(filename):
      count = SeqIO.convert("files", "nexus", "files.fa", "fasta", alphabet= IUPAC.ambiguous_dna)
    #Make sure os.walk works correctly
    for path, dirs, files in os.walk(PSGDirectory):
       print path
       print dirs
       print files

    #Now recursively do the count command on each file inside PSGDirectory
    for files in os.walk(PSGDirectory):
       print("Converted %i records" % count)
       process(files)      

When I run the script I get this error message:

    Traceback (most recent call last):
      File "nexus_to_fasta.psg", line 45, in <module>
        print("Converted %i records" % count)
    NameError: name 'count' is not defined

This conversation was very helpful but I don’t know where to insert the join() function statements. Here is an example of one of my nexus files.

Thanks for your help!

Answer 1:

There are a few things going on.

First, your process function isn't returning 'count'. You probably want:

def process(filename):
   return SeqIO.convert("files", "nexus", "files.fa", "fasta", alphabet=IUPAC.ambiguous_dna)
   # assuming SeqIO.convert actually returns the number you want

Also, when you write for files in os.walk(PSGDirectory) you're operating on the 3-tuple that os.walk returns, not individual files. You want to do something like this (note the use of os.path.join):

for root, dirs, files in os.walk(PSGDirectory):
    for filename in files:
        fullpath = os.path.join(root, filename)
        print(process(fullpath))
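To make the shape of what os.walk yields more concrete, here is a minimal sketch. The helper name `iter_file_paths` is made up for illustration and is not part of os or Biopython; it only shows that each step of the walk is a (root, dirs, files) tuple and that a bare filename has to be joined with its root before it can be opened.

```python
import os


def iter_file_paths(top):
    """Yield the full path of every file found under `top` (hypothetical helper)."""
    # os.walk yields one (root, dirs, files) tuple per directory it visits.
    for root, dirs, files in os.walk(top):
        for filename in files:
            # `filename` is a bare name; joining it with `root` gives a usable path.
            yield os.path.join(root, filename)
```

Looping over `iter_file_paths(PSGDirectory)` would then hand one full path at a time to `process`.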

Update:

So I looked at the documentation for SeqIO.convert and it expects to be called with:

  • in_file - an input handle or filename
  • in_format - input file format, lower case string
  • out_file - an output handle or filename
  • out_format - output file format, lower case string
  • alphabet - optional alphabet to assume

in_file is the name of the file to convert, and originally you were calling SeqIO.convert with the literal string "files" instead of the filename passed to process.

So your process function should probably be something like this:

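# needs: from Bio import SeqIO and from Bio.Alphabet import IUPAC
def process(filename):
    # filename should be the full path built with os.path.join
    return SeqIO.convert(filename, "nexus", filename + '.fa', "fasta", alphabet=IUPAC.ambiguous_dna)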
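Putting the pieces together, the whole job might look roughly like the sketch below. This is not taken from the Biopython documentation: `convert_directory` and its parameters are hypothetical names, and the converter is passed in as a callable so the directory logic can be tried without Biopython installed. Unlike the `filename + '.fa'` version above, this sketch assumes you want the input extension replaced rather than appended to.

```python
import os


def convert_directory(top, convert, in_ext=".nexus", out_ext=".fasta"):
    """Convert every file under `top` whose name ends in `in_ext`.

    `convert(in_path, out_path)` is any callable that returns the number
    of records written. Returns a dict mapping input path to that count.
    """
    counts = {}
    for root, dirs, files in os.walk(top):
        for filename in files:
            if not filename.endswith(in_ext):
                continue  # skip anything that is not an input file
            in_path = os.path.join(root, filename)
            # Swap the extension: /x/a.nexus becomes /x/a.fasta
            out_path = os.path.splitext(in_path)[0] + out_ext
            counts[in_path] = convert(in_path, out_path)
    return counts
```

With Biopython available, `convert` could be something like `lambda src, dst: SeqIO.convert(src, "nexus", dst, "fasta")`, and each count could then be printed with `"Converted %i records" % count` after the call rather than before it.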