So I have the HTML from an NPR page, and I want to use regex to extract just certain URLs (the URLs of specific stories nested within the page). The actual links appear in the text (retrieved manually) as:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">
Obviously, I cannot continue to retrieve these manually if I want to be able to use this on a consistent basis. So far, I have this code:
import nltk
import re
f = open("/Users/shannonmcgregor/Desktop/npr.txt")
npr_lines = f.readlines()
f.close()
I have this code to grab everything between (
for line in npr_lines:
    re.findall('<a href="?\'?([^"\'>]*)', line)
But that grabs all URLs. I tried adding something like:
(parallels|thetwo-way|a-marines)
but that returns nothing. So what am I doing wrong? How do I combine the larger URL stripper with these specific words that target the given URLs?
Please and thank you :)
You can use the re.search function to match the regex against each line and print the line if it matches.
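The answer's original snippet is not preserved here, so the following is only a minimal sketch of the re.search approach. It assumes the href values are double-quoted, uses a few hypothetical sample lines in place of npr.txt (the second URL is made up as a non-matching case), and collects the captured URL rather than the whole line. The names `pattern` and `matching_urls` are illustrative.

```python
import re

# Hypothetical sample lines standing in for the contents of npr.txt.
npr_lines = [
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">',
    '<a href="http://www.npr.org/sections/news/">',  # made-up non-story link
    '<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">',
]

# Capture the href value only when it contains one of the target words.
# The keyword group sits inside the capture, between two "anything but a
# quote or >" runs, so the whole URL is still returned.
pattern = re.compile(
    r'<a href="([^"\'>]*(?:parallels|thetwo-way|a-marines)[^"\'>]*)"'
)


def matching_urls(lines):
    """Return the captured URL from every line where the pattern matches."""
    urls = []
    for line in lines:
        match = pattern.search(line)
        if match:
            urls.append(match.group(1))
    return urls


for url in matching_urls(npr_lines):
    print(url)
```

A likely reason the original attempt returned nothing is that the keyword group was placed where it had to match immediately, instead of being allowed to appear anywhere inside the quoted URL.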
You can do this by using a lookahead:
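The regex from this answer did not survive, so here is one possible lookahead version, offered as a sketch rather than the answer's exact pattern. It assumes double-quoted href values. The sample `html` string and the name `lookahead_pattern` are hypothetical, and the second link is a made-up non-matching case.

```python
import re

# Hypothetical page fragment; only the first link is a target story URL.
html = (
    '<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/'
    'comets-rugged-landscape-makes-landing-a-roll-of-the-dice">\n'
    '<a href="http://www.npr.org/about/">\n'
)

# The lookahead (?=...) checks, without consuming any text, that one of the
# keywords appears somewhere before the closing quote. Only then does the
# capture group grab the full URL, exactly as the original pattern did.
lookahead_pattern = re.compile(
    r'<a href="(?=[^"\'>]*(?:parallels|thetwo-way|a-marines))([^"\'>]*)'
)

lookahead_urls = lookahead_pattern.findall(html)
for url in lookahead_urls:
    print(url)
```

Because the lookahead only asserts a condition, the capturing part of the original URL-grabbing regex can stay unchanged, which keeps the filter and the extraction separate.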
Another option is to use a tool that is specifically designed for parsing HTML and XML files, such as BeautifulSoup.
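The BeautifulSoup code from this answer is missing. To keep the example free of third-party dependencies, the sketch below swaps in the standard library's html.parser module to show the same parser-based idea: let a parser find the a tags, then filter their href values by keyword. It is not the answer's original code, and the class name `StoryLinkParser`, the sample markup, and the non-matching link are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Keywords that identify the story URLs we want to keep.
TARGET = re.compile(r"parallels|thetwo-way|a-marines")


class StoryLinkParser(HTMLParser):
    """Collects href values of <a> tags whose URL contains a target word."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and TARGET.search(href):
            self.urls.append(href)


# Hypothetical page fragment with one story link and one unrelated link.
sample_html = (
    '<p><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/'
    'for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">Story</a> '
    '<a href="http://www.npr.org/about/">About</a></p>'
)

link_parser = StoryLinkParser()
link_parser.feed(sample_html)
link_parser.close()
print(link_parser.urls)
```

A parser-based approach does not depend on how the href attribute is quoted or where it sits within the tag, which is the main reason it tends to be more robust than a regex run over raw HTML.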