Using BeautifulSoup to parse lines separated by <br /> tags

Posted 2019-02-16 19:54

I have a page that looks like this:

Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />

Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.

5 answers
趁早两清
#2 · 2019-02-16 20:16

Perhaps you could use this function:

def partition_by(pred, iterable):
    current = None
    current_flag = None
    chunk = []
    for item in iterable:
        if current is None:
            current = item
            current_flag = pred(current)
            chunk = [current]
        elif pred(item) == current_flag:
            chunk.append(item)
        else:
            yield chunk
            current = item
            current_flag = not current_flag
            chunk = [current]
    if len(chunk) > 0:
        yield chunk

Add a predicate that checks whether a node is a <br /> tag or a newline:

def is_br(bs):
    try:
        return bs.name == u'br'
    except AttributeError:
        return False

def is_br_or_nl(bs):
    return is_br(bs) or u'\n' == bs

(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)

Then, with cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator(), partition_by(is_br_or_nl, cs) yields:

[[u'Company A'],
 [<br />],
 [u'\n123 Main St.'],
 [<br />],
 [u'\nSuite 101'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
 [u'\nCompany B'],
 [<br />],
 [u'\n456 Main St.'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]]

That should be easy enough to process.

To generalise this, you'd probably have to write a predicate to check whether its argument is something you care about... Then you could use it with partition_by to have everything else lumped together. Note that the things which you care about are lumped together as well -- you basically have to process every item of every second list produced by the resulting generator, starting with the first one which includes things you care about.
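As a rough sketch of that last processing step: the standard library's itertools.groupby does the same kind of consecutive grouping as partition_by, so it is used here instead. The node list is a hypothetical stand-in made of plain strings (with "<br />" marking a break tag) rather than real BeautifulSoup objects, and is_separator and group_entries are illustrative names, not part of any library. The assumption is that a run containing two or more breaks ends an entry.

```python
from itertools import groupby

# Hypothetical stand-in for the parsed nodes: plain strings instead of
# BeautifulSoup objects, with "<br />" marking a line-break tag.
nodes = [
    "Company A", "<br />", "\n123 Main St.", "<br />", "\nSuite 101", "<br />",
    "\nSomeplace, NY 1234", "<br />", "\n", "<br />", "\n", "<br />", "\n", "<br />",
    "\nCompany B", "<br />", "\n456 Main St.", "<br />",
    "\nSomeplace, NY 1234", "<br />", "\n", "<br />", "\n", "<br />",
]

def is_separator(node):
    """True for a break marker or a bare newline."""
    return node == "<br />" or node == "\n"

def group_entries(nodes):
    """Collect text nodes into entries; a run with 2+ breaks closes an entry."""
    entries, current = [], []
    for sep, chunk in groupby(nodes, key=is_separator):
        chunk = list(chunk)
        if not sep:
            # A run of text nodes: keep them as fields of the current entry.
            current.extend(text.strip() for text in chunk)
        elif chunk.count("<br />") >= 2 and current:
            # Several breaks in a row: the current entry is finished.
            entries.append(current)
            current = []
    if current:
        entries.append(current)
    return entries

entries = group_entries(nodes)
for entry in entries:
    print(entry)
```

With real parsed nodes, the separator test would be is_br_or_nl from above, and the break count would check each node's tag name instead of comparing strings.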

祖国的老花朵
#3 · 2019-02-16 20:32

You can do a little preprocessing before anything else. For example, strip out all newlines, then replace every run of two or more <br /> tags with some other delimiter such as |. After that you can split out your fields.

html="""
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""
import re
newhtml = html.replace("\n", "")
# Collapse every run of two or more <br /> tags into a single "|" delimiter
pat = re.compile(r"(<br />){2,}")
print(pat.sub("|", newhtml))

Output:

$ ./python.py
Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|

Now the company entries are separated by pipes.
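From there, getting the individual fields is a matter of splitting twice. The following is only a sketch of that follow-up step; it assumes the data itself never contains a "|" character, and the names delimited and companies are made up for illustration.

```python
import re

html = """
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""

# Same preprocessing as above: drop newlines, then turn runs of breaks into "|".
flat = html.replace("\n", "")
delimited = re.sub(r"(?:<br />){2,}", "|", flat)

# Split on "|" to get one string per company (skipping the empty tail),
# then split each one on the remaining single <br /> tags to get its fields.
companies = [entry.split("<br />") for entry in delimited.split("|") if entry]

for fields in companies:
    print(fields)
```

If a pipe could appear in the real data, some other delimiter that cannot occur in the text would be a safer choice.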

smile是对你的礼貌
#4 · 2019-02-16 20:38

Once you have this HTML fragment, just use a regex to replace each <br /> (together with an optional newline after it) with a single newline, then split on runs of multiple newlines. This should give you individual paragraphs, one per entry, which you can process manually.
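A minimal sketch of that idea, using the sample from the question. The pattern is an assumption about how the breaks are written (it accepts <br>, <br/> and <br />), and the variable names are illustrative only.

```python
import re

html = """Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""

# Replace each break tag, plus the newline after it if there is one,
# with a single newline.
text = re.sub(r"<br\s*/?>\n?", "\n", html)

# Two or more newlines in a row now mark the gap between entries.
blocks = re.split(r"\n{2,}", text.strip())

# Each block is one entry; its lines are the fields.
entries = [block.split("\n") for block in blocks]

for entry in entries:
    print(entry)
```

Because the split is on two or more newlines, it does not matter whether an entry is followed by two blank breaks or three.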

神经病院院长
#5 · 2019-02-16 20:38

I had a similar issue. This is how I solved it:

html=html.replace('<br>','<br />')
成全新的幸福
#6 · 2019-02-16 20:40

You should look into the .strings attribute found on tags, then call "\n".join() on it.
