At the moment I am working with this code:
```python
from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())


trade_spider()
```
So far this works perfectly. But now I am stuck with another issue. If I search within a folder which has no subfolders but only files, this works without problems. However, if I try to run this code on a folder that has subfolders, it doesn't work (it prints nothing!). Furthermore, I would like to get my results printed into a .txt file without having the whole path in it. The result should look like:
Filename.html| RegEX Match| HTML text
I do get this result already, but only in PyCharm and not in a separate .txt file.
To sum up, I have two questions:
- How can I also walk through subfolders in my defined directory? -> would os.walk() be an option for that?
- How can I print my results into a .txt file? -> would sys.stdout work for that?
Any help appreciated on this issue!
UPDATE: It only prints the first results of the first file into my "output.txt" file (at least I think it is the first, as it is the last file in my only subfolder and recursive=True is activated). Any idea why it is not looping through all the other files?
UPDATE_2: Question resolved! Final Code can be seen above!
For walking in subdirectories, there are two options (both sketched below):

- Use `**` with glob and the argument `recursive=True` (`glob.glob('**/*.html')`). This only works in Python 3.5+. I would also recommend using `glob.iglob` instead of `glob.glob` if the directory tree is large.
- Use `os.walk` and check the filenames (whether they end in `".html"`) manually or with `fnmatch.filter`.
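A minimal sketch of both approaches, assuming the same `*.html` pattern as in your code (the starting directory `'.'` is just a placeholder):

```python
import fnmatch
import glob
import os

# Option 1: recursive glob (Python 3.5+); iglob yields the paths lazily
for path in glob.iglob('**/*.html', recursive=True):
    print(path)

# Option 2: os.walk plus fnmatch.filter, which also works on older Python versions
for root, dirs, files in os.walk('.'):
    for name in fnmatch.filter(files, '*.html'):
        print(os.path.join(root, name))
```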
Regarding the printing into a file, there are again several ways:
- Just execute the script and redirect stdout, i.e. `python3 myscript.py >myfile.txt`.
- Replace calls to `print` with a call to the `.write()` method of a file object in write mode.
- Keep using `print`, but give it the argument `file=myfile`, where `myfile` is again a writable file object (these last two variants are sketched below).
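For illustration, a small sketch of the last two variants; the file name `myfile.txt` and the example line are placeholders:

```python
with open('myfile.txt', 'w') as myfile:
    # variant 2: use the file object's write() method directly
    myfile.write('Filename.html| RegEX Match| HTML text\n')
    # variant 3: keep print(), but send its output into the file
    print('Filename.html', 'RegEX Match', 'HTML text', sep='| ', file=myfile)
```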
edit: Maybe the most unobtrusive method would be the following. First, include this somewhere:
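This is presumably the `stdout2file` context manager that also appears in the final code above:

```python
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    # temporarily redirect everything that print() writes into the given file
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()
```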
And then, in front of the line in which you loop over the files, add this line (and appropriately indent):
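Presumably this is the line that now appears in the final code above:

```python
with stdout2file("output.txt"):
    ...  # everything printed inside this block goes into output.txt
```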