I have a page that looks like this:
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.
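Perhaps you could use a `partition_by` function that splits a sequence into runs of consecutive items, according to whether a predicate holds for each item.
Add a predicate that checks for a `<br />` tag or a newline. (Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)
Then use `partition_by(is_br_or_nl, cs)`, with `cs` set to `BeautifulSoup.BeautifulSoup(your_example_html).childGenerator()`, to yield alternating groups of separators and text. That should be easy enough to process.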
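Since the function itself is not shown here, the following is only a minimal sketch of what `partition_by` might look like, built on `itertools.groupby`. To stay self-contained it runs on plain strings rather than BeautifulSoup nodes, so the `is_br_or_nl` predicate and the sample `cs` list are illustrative assumptions, not the original code.

```python
from itertools import groupby


def partition_by(pred, items):
    """Yield lists of consecutive items that share the same truth value of pred."""
    for _, group in groupby(items, key=lambda item: bool(pred(item))):
        yield list(group)


def is_br_or_nl(item):
    # Hypothetical predicate for plain strings: true for a <br /> marker
    # or for whitespace-only text such as a bare newline.
    return item.strip() in ("", "<br />")


# Stand-in for the child nodes of the parsed document (an assumption).
cs = ["Company A", "<br />", "\n", "123 Main St.",
      "<br />", "\n", "<br />", "\n", "Company B"]

groups = list(partition_by(is_br_or_nl, cs))

# Every second list, starting with the first, holds the text we care about.
text_groups = groups[0::2]
```

With real BeautifulSoup nodes the predicate would need to inspect the node type and tag name instead of comparing strings.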
To generalise this, you'd probably have to write a predicate to check whether its argument is something you care about. Then you could use it with `partition_by` to have everything else lumped together. Note that the things you care about are lumped together as well: you basically have to process every item of every second list produced by the resulting generator, starting with the first one that includes things you care about.

You can do a little bit of manipulation before anything else. For example, change all newlines to blanks, then substitute two or more consecutive occurrences of `<br />` with some other delimiter such as `|`. After that you can get your fields. In the output, each company's information is then separated by pipes.
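The newline-then-pipe manipulation could be sketched in Python as below. This is an illustration under stated assumptions, not the code from the answer: the sample `html` string mirrors the page in the question, the regular expressions are my own, and the approach assumes the data itself never contains a `|` character.

```python
import re

# Sample input mirroring the page in the question.
html = "\n".join([
    "Company A<br />", "123 Main St.<br />", "Suite 101<br />",
    "Someplace, NY 1234<br />", "<br />", "<br />", "<br />",
    "Company B<br />", "456 Main St.<br />",
    "Someplace, NY 1234<br />", "<br />", "<br />", "<br />", "",
])

# Step 1: change all newlines to blanks.
flat = html.replace("\n", " ")

# Step 2: collapse every run of two or more <br /> tags into a single pipe.
piped = re.sub(r"(?:\s*<br\s*/?>\s*){2,}", "|", flat)

# Step 3: split records on pipes, then split fields on the remaining <br /> tags.
records = [
    [field.strip() for field in re.split(r"<br\s*/?>", record) if field.strip()]
    for record in piped.split("|")
    if record.strip()
]
```

Because a run of two or more tags is treated as a separator, the same code handles entries separated by two blank lines as well as three.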
Once you have this HTML fragment, just use a regex to replace each `<br />` followed by an optional newline with a single newline, then split on multiple newlines. This should result in multiple individual paragraphs which you can process manually.

I had a similar issue. This is how I solved it:
You should look into the `.strings` attribute found on tags, then use `"\n".join()` on that.
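As a rough sketch of the `.strings` idea, the snippet below assumes BeautifulSoup 4 (the `bs4` package) with the standard-library `html.parser`. It also assumes, as in the question's page, that each `<br />` is followed by a newline, so that the empty lines between entries survive the join; the final split into entries is my addition, not part of the suggestion above.

```python
import re

from bs4 import BeautifulSoup

# Sample input mirroring the page in the question.
html = "\n".join([
    "Company A<br />", "123 Main St.<br />", "Suite 101<br />",
    "Someplace, NY 1234<br />", "<br />", "<br />", "<br />",
    "Company B<br />", "456 Main St.<br />",
    "Someplace, NY 1234<br />", "<br />", "<br />", "<br />", "",
])

soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node; whitespace-only nodes become empty
# strings after strip(), so they turn into blank lines in the joined text.
text = "\n".join(s.strip() for s in soup.strings)

# Entries are separated by one or more blank lines; fields by a single newline.
entries = [block.split("\n") for block in re.split(r"\n{2,}", text.strip())]
```

If the `<br />` tags were written back to back with no whitespace between them, there would be no text nodes to mark the gap, and the entries would need to be separated some other way.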