Using BeautifulSoup to parse lines separated by <b

I have a page that looks like this:

Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />

Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.

Tags: python parsing beautifulsoup

5 Answers

趁早两清

Post #2 · 2019-02-16 20:16

Perhaps you could use this function:

def partition_by(pred, iterable):
    current = None
    current_flag = None
    chunk = []
    for item in iterable:
        if current is None:
            current = item
            current_flag = pred(current)
            chunk = [current]
        elif pred(item) == current_flag:
            chunk.append(item)
        else:
            yield chunk
            current = item
            current_flag = not current_flag
            chunk = [current]
    if len(chunk) > 0:
        yield chunk

Add a predicate that checks whether an item is a <br /> tag or a newline:

def is_br(bs):
    try:
        return bs.name == u'br'
    except AttributeError:
        return False

def is_br_or_nl(bs):
    return is_br(bs) or u'\n' == bs

(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)

Then, with cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator(), partition_by(is_br_or_nl, cs) yields:

[[u'Company A'],
 [<br />],
 [u'\n123 Main St.'],
 [<br />],
 [u'\nSuite 101'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
 [u'\nCompany B'],
 [<br />],
 [u'\n456 Main St.'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]]

That should be easy enough to process.

To generalise this, write a predicate that returns true for the items you care about and pass it to partition_by; everything else then gets lumped together. Note that the items you care about are lumped together as well, so you basically have to process every item of every second list produced by the generator, starting with the first list that contains items you care about.
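
As a rough sketch of that processing step, the chunks can be folded into address entries by treating any separator chunk that holds two or more <br /> items as an entry boundary. To keep it runnable without BeautifulSoup, the plain string "<br />" stands in for a parsed tag, and partition_by is re-expressed with itertools.groupby. The names BR and group_entries are made up for illustration and are not part of the answer above.

```python
from itertools import groupby

BR = "<br />"  # hypothetical stand-in for a parsed <br /> tag


def is_br_or_nl(item):
    return item == BR or item == "\n"


def partition_by(pred, iterable):
    # Same chunking as the hand-written generator: consecutive items
    # with the same pred() result end up in one list.
    for _flag, group in groupby(iterable, key=pred):
        yield list(group)


def group_entries(items):
    """Collect text items into entries, split wherever 2+ BRs occur in a row."""
    entries, current = [], []
    for chunk in partition_by(is_br_or_nl, items):
        if is_br_or_nl(chunk[0]):
            # Separator chunk: a run of two or more BRs closes the entry.
            if chunk.count(BR) >= 2 and current:
                entries.append(current)
                current = []
        else:
            current.extend(text.strip() for text in chunk)
    if current:
        entries.append(current)
    return entries


items = ["Company A", BR, "\n123 Main St.", BR, "\nSomeplace, NY 1234",
         BR, "\n", BR, "\n", BR, "\nCompany B", BR, "\n456 Main St.",
         BR, "\n", BR, "\n"]
entries = group_entries(items)
```

With real parsed children, is_br_or_nl from the answer would replace the string comparison, and chunk.count(BR) would become a count of items for which is_br is true.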


祖国的老花朵

Post #3 · 2019-02-16 20:32

You can do a little manipulation before anything else: remove all newlines, then replace every run of two or more <br /> tags with some other delimiter such as |. After that you can pull out your fields.

html="""
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""
import re
newhtml = html.replace("\n", "")
# Two or more consecutive <br /> tags mark the gap between entries.
pat = re.compile(r"(<br />){2,}")
print(pat.sub("|", newhtml))

Output:

$ ./python.py
Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|

Now the company entries are separated by pipes.
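
One possible follow-up, sketched here under the assumption that no address contains a literal | character, is to split the pipe-delimited string into entries and then split each entry on the remaining single <br /> tags. The variable names and the shortened sample are illustrative only.

```python
import re

# Shortened sample in the same shape as the page in the question.
html = """
Company A<br />
123 Main St.<br />
Suite 101<br />
<br />
<br />
Company B<br />
456 Main St.<br />
<br />
<br />
<br />
"""

flat = html.replace("\n", "")
# Runs of two or more <br /> tags become the entry delimiter.
piped = re.sub(r"(?:<br />){2,}", "|", flat)

# Drop the empty piece after the trailing pipe, then split the fields.
entries = [entry.split("<br />") for entry in piped.split("|") if entry]

for fields in entries:
    print(fields)
```

If the data might contain a pipe, a delimiter that cannot occur in the text would be a safer choice.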


smile是对你的礼貌

Post #4 · 2019-02-16 20:38

Once you have this HTML fragment, just use a regex to replace each <br /> and the optional newline after it with a single newline, then split on runs of multiple newlines. This should give you individual paragraphs, one per entry, which you can process manually.
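
A minimal sketch of that idea follows. The pattern also accepts <br> and <br/> spellings, which is an assumption beyond what the question shows, and the sample is shortened.

```python
import re

# Shortened sample; one gap has two blank <br /> lines, the other three.
html = """Company A<br />
123 Main St.<br />
<br />
<br />
Company B<br />
456 Main St.<br />
<br />
<br />
<br />
"""

# Each <br /> plus the newline after it (if any) collapses to one newline.
text = re.sub(r"<br\s*/?>\n?", "\n", html)

# Two or more newlines in a row separate entries; each entry is then
# broken into its individual lines.
paragraphs = [block.splitlines() for block in re.split(r"\n{2,}", text.strip())]

for lines in paragraphs:
    print(lines)
```

Because the split is on two or more newlines, it does not matter whether an entry is followed by two or three empty <br /> lines.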


神经病院院长

Post #5 · 2019-02-16 20:38

I had a similar issue. This is how I solved it:

html=html.replace('<br>','<br />')


成全新的幸福

Post #6 · 2019-02-16 20:40

You should look into the .strings attribute found in tags, then use "\n".join() on that.
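
Here is a hedged sketch of how that might look with the bs4 package and the built-in "html.parser" backend; both are assumptions about the setup. The text nodes in this fragment already carry their own newlines, so this version concatenates them with "".join() rather than "\n".join(), which would double every line break, and then splits on blank lines to recover the entries.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample in the same shape as the page in the question.
html = """Company A<br />
123 Main St.<br />
<br />
<br />
Company B<br />
456 Main St.<br />
"""

soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node in document order; the <br /> tags
# themselves contribute no text.
text = "".join(soup.strings)

# A blank line (two or more newlines in a row) marks an entry boundary.
entries = [block.splitlines() for block in re.split(r"\n{2,}", text.strip())]

for lines in entries:
    print(lines)
```

If only a flat list of lines is needed, "\n".join(soup.stripped_strings) is a shorter option, but it drops the blank lines that separate one company from the next.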
