Context: Every week, I receive a list of lab results in the form of an HTML file. Each weekly file contains about 3,000 results, and each result has between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because the cell in its first row, first column always has the text "Lab Results".
Problem: The following code works great when I process one file at a time. That is, instead of looping over the directory, I point get_data = open() to a specific file. However, I want to grab the data from the past few years and would rather not handle each file individually. Therefore, I used the glob module and a for loop to cycle through all the files in the directory. The issue is that I get a MemoryError by the time I reach the third file in the directory.
The Question: Is there a way to clear/reset the memory between each file? That way, I could cycle through all the files in the directory and not paste in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.
Thank you.
from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):
    get_data = open(FileName,'r').read()
    soup = BeautifulSoup(get_data)
    VerifyTable = "Clinical Results"
    tables = soup.findAll('table')
    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text
            complete_row = v1.strip() + ";" + v2.strip()
            print (complete_row)
            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()
    del get_data
    del soup
    del tables
    gc.collect()
print ("done")
I'm very much a beginner programmer, and I faced the same problem. I did three things that seemed to solve the problem:
I think the second change probably solved it, but I didn't have time to check, and I don't want to change working code.
For this code, the solution would be something like this:
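The code that originally followed is not shown here, so what follows is only a hedged sketch of one common approach, not necessarily the three changes mentioned above. It assumes the memory is being held by the parse tree: each file is read inside a with block, the parser is named explicitly ("html.parser", the one bundled with Python), and soup.decompose() is called once the needed text has been copied out. The names extract_rows and process_files are made up for this illustration, and the marker text "Clinical Results" is taken from the question's code.

```python
import glob

from bs4 import BeautifulSoup

VERIFY_TABLE = "Clinical Results"  # marker text used in the question's code


def extract_rows(html):
    """Return 'v1;v2' strings for every table whose first cell is the marker.

    Hypothetical helper: the name and signature are not from the original post.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    try:
        for table in soup.find_all("table"):
            trs = table.find_all("tr")
            if len(trs) < 2:
                continue  # no data row to read
            first_cells = trs[0].find_all("td")
            if not first_cells:
                continue
            if first_cells[0].get_text().strip() != VERIFY_TABLE:
                continue
            cells = trs[1].find_all("td")
            if len(cells) >= 2:
                # get_text() returns plain strings, so nothing kept in `rows`
                # points back into the parse tree.
                rows.append(cells[0].get_text().strip() + ";" + cells[1].get_text().strip())
    finally:
        # Tear down the parse tree explicitly instead of relying on del/gc.
        soup.decompose()
    return rows


def process_files(pattern, out_path):
    """Append the rows from every file matching `pattern` to `out_path`."""
    with open(out_path, "a") as out_file:
        for file_name in sorted(glob.glob(pattern)):
            # The with block closes each input file before the next is opened.
            with open(file_name, "r") as in_file:
                html = in_file.read()
            for row in extract_rows(html):
                out_file.write(row + "\n")


# Example call, using the paths from the question:
# process_files("\\Research Results\\*", "Results_File.txt")
```

Keeping the parsing inside a function also means that soup, the table list, and the raw HTML go out of scope after every file, so there is nothing left to del by hand. Whether this removes the MemoryError depends on what is actually holding the memory, so treat it as something to try rather than a guaranteed fix.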