Open every file/subfolder in directory and print r

Posted 2019-07-08 22:14

At the moment I am working with this code:

from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


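@contextlib.contextmanager
def stdout2file(fname):
    """Temporarily redirect everything printed to stdout into the file fname."""
    import sys
    f = open(fname, 'w', encoding="utf8")
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()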

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        # '**' together with recursive=True also descends into subfolders (Python 3.5+).
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            for item in soup.find_all("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    # Print only the file name, not the whole path.
                    print(os.path.basename(file), end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())


trade_spider()

So far this works perfectly. But now I am stuck on another issue. If I search a folder that contains only files and no subfolders, this works without problems. However, if I run this code on a folder that has subfolders, it doesn't work (it prints nothing!). Furthermore, I would like my results printed to a .txt file without the whole path in it. The result should look like:

Filename.html| RegEX Match| HTML text

I already get this result, but only in PyCharm and not in a separate .txt file.

To sum up, I have 2 questions:

  1. How can I also walk through the subfolders of my defined directory? -> Would os.walk() be an option for that?
  2. How can I print my results to a .txt file? -> Would sys.stdout work for that?

Any help appreciated on this issue!

UPDATE: It only prints the results of the first file into my "output.txt" file (at least I think it is the first, as it is the last file in my only subfolder and recursive=True is set). Any idea why it is not looping through all the other files?

UPDATE_2: Question resolved! Final Code can be seen above!

1 answer
时光不老,我们不散
#2 · 2019-07-08 22:53

For walking through subdirectories, there are two options:

  1. Use ** with glob and the argument recursive=True, i.e. glob.glob('**/*.html', recursive=True). This only works in Python 3.5+. I would also recommend using glob.iglob instead of glob.glob if the directory tree is large, since it yields matches one at a time instead of building the whole list first.

  2. Use os.walk and check the filenames (whether they end in ".html") manually or with fnmatch.filter.
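To illustrate the second option, here is a minimal sketch using os.walk together with fnmatch.filter. The function name find_html_files is made up for this example and is not part of the question's code.

```python
import fnmatch
import os


def find_html_files(root):
    """Yield the path of every .html file below root, subfolders included.

    find_html_files is a hypothetical helper name chosen for this sketch.
    """
    # os.walk visits root and every directory beneath it.
    for dirpath, dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names matching the pattern.
        for name in fnmatch.filter(filenames, "*.html"):
            yield os.path.join(dirpath, name)
```

Looping over find_html_files(".") after the os.chdir call should then visit the same files as the recursive glob pattern, and it also works on Python versions older than 3.5.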


Regarding the printing into a file, there are again several ways:

  1. Just execute the script and redirect stdout, i.e. python3 myscript.py >myfile.txt

  2. Replace calls to print with a call to the .write() method of a file object opened in write mode.

  3. Keep using print, but give it the argument file=myfile where myfile is again a writable file object.
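As a rough sketch of the third option: print accepts both a file= and a sep= argument, so one call can produce a whole "Filename.html| RegEX Match| HTML text" line. The helper write_rows and the shape of its rows argument are assumptions made up for this illustration.

```python
def write_rows(rows, out):
    """Write one 'filename| name| text' line per row to the writable object out.

    write_rows is a hypothetical helper; rows is assumed to be an iterable
    of (filename, name, text) tuples.
    """
    for filename, name, text in rows:
        # sep="| " puts the separator between the three fields,
        # file=out sends the line to out instead of the console.
        print(filename, name, text, sep="| ", file=out)
```

It could be called inside a block such as with open("output.txt", "w", encoding="utf8") as out:, passing out as the second argument, so the file is closed automatically when the block ends.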

edit: Maybe the most unobtrusive method would be the following. First, include this somewhere:

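import contextlib
@contextlib.contextmanager
def stdout2file(fname):
    """Temporarily redirect everything printed to stdout into the file fname."""
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()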

And then, in front of the line in which you loop over the files, add this line (and indent the loop accordingly):

with stdout2file("output.txt"):
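As a side note, the standard library already ships a similar context manager, contextlib.redirect_stdout (Python 3.4+). A minimal sketch of how it could be used instead of a hand-written one follows; the wrapper name print_to_file is an assumption for this example only.

```python
import contextlib


def print_to_file(fname, func):
    """Call func() while everything it prints goes into the file fname.

    print_to_file is a hypothetical wrapper, not part of the original code.
    """
    # The file is closed and stdout restored when the with block ends,
    # even if func() raises an exception.
    with open(fname, "w", encoding="utf8") as f:
        with contextlib.redirect_stdout(f):
            func()
```

With this, print_to_file("output.txt", some_function) would send the printed output of some_function to the file, provided the function prints instead of returning its results.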