BeautifulSoup MemoryError When Opening Several Files

Posted 2020-04-11 18:05

Context: Every week, I receive a list of lab results as an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because its first cell, first column always has the text "Lab Results".

Problem: The following code works great when I process one file at a time, that is, when I point get_data = open() at a specific file instead of looping over the directory. However, I want to grab the data from the past few years and would rather not do each file individually, so I used the glob module and a for loop to cycle through all the files in the directory. The issue I am having is that I get a MemoryError by the time I reach the third file in the directory.

The Question: Is there a way to clear/reset the memory between files? That way, I could cycle through all the files in the directory rather than pasting in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.

Thank you.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

1 Answer
Melony · 2020-04-11 18:36

I'm a beginner programmer and I faced the same problem. I did three things that seemed to solve it:

  1. Also call garbage collection (gc.collect()) at the beginning of each iteration.
  2. Move the parsing into a function, so all the global variables become local variables and are freed when the function returns.
  3. Call soup.decompose() to release the parse tree once you are done with it.

I think the second change is probably what solved it, but I haven't had time to verify that and I don't want to change working code.

For this code, the solution would look something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
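    # Collect garbage at the start of each call to release memory left over from the previous file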
    gc.collect()

    # Read the file inside a with-block so the handle is closed automatically
    # (in the original, get_data was a string, so get_data.close() would raise an AttributeError)
    with open(file, 'r') as in_file:
        get_data = in_file.read()

    # Specify the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    soup = BeautifulSoup(get_data, "html.parser")

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
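        # The target table is identified by its first cell containing the text "Clinical Results"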
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    # Release the parse tree explicitly and run the garbage collector before returning
    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")