BeautifulSoup MemoryError When Opening Several Files

Posted 2020-04-11 18:05

Context: Every week, I receive a list of lab results as an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because its first cell, first column always has the text "Lab Results".

Problem: The following code works great when I process one file at a time, that is, when I point get_data = open() at a specific file instead of looping over the directory. However, I want to grab the data from the past few years and would rather not do each file individually, so I used the glob module and a for loop to cycle through all the files in the directory. The issue I am having is that I get a MemoryError by the time I reach the third file in the directory.

The Question: Is there a way to clear/reset the memory between files? That way, I could cycle through all the files in the directory rather than pasting in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.

Thank you.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

1 Answer
Melony · 2020-04-11 18:36

I'm a beginner programmer and I faced the same problem. I did three things that seemed to solve it:

  1. Also call garbage collection (gc.collect()) at the beginning of each iteration.
  2. Move the parsing into a function, so all the global variables become local variables and are freed when the function returns.
  3. Call soup.decompose() to release the parse tree once you are done with it.

I think the second change is probably what solved it, but I haven't had time to verify that and I don't want to change working code.

For this code, the solution would look something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
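    # Collect garbage at the start of each call to release memory left over from the previous file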
    gc.collect()

    # Read the file inside a with-block so the handle is closed automatically
    # (in the original, get_data was a string, so get_data.close() would raise an AttributeError)
    with open(file, 'r') as in_file:
        get_data = in_file.read()

    # Specify the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    soup = BeautifulSoup(get_data, "html.parser")

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
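        # The target table is identified by its first cell containing the text "Clinical Results"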
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    # Release the parse tree explicitly and run the garbage collector before returning
    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")