I have an HTML file with about 150 anchor tags. I need only the links from these tags, e.g. <a href="http://www.google.com"></a>. I want to get only the http://www.google.com part.
When I run a grep,
cat website.htm | grep -E '<a href=".*">' > links.txt
this returns the entire matching line, not just the link I want, so I tried using a cut command:
cat drawspace.txt | grep -E '<a href=".*">' | cut -d’”’ --output-delimiter=$'\n' > links.txt
Except that it is wrong and doesn't work; it gives me some error about wrong parameters... So I assume that the file was supposed to be passed along too. Maybe like cut -d’”’ --output-delimiter=$'\n' grepedText.txt > links.txt.
But I wanted to do this in one command if possible... So I tried doing an AWK command.
cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’
But this wouldn't run either. It was asking me for more input, because I wasn't finished....
I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables were messed up and rather than fix that I tried installing grep on Windows, but that gave me the same error....
The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.
P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.
I also have a Mac and a PC, and I switch back and forth between them to use their shell/batch/grep/terminal commands, so a solution for either will help me.
I also want to point out that I'm in the correct directory.
HTML:
<tr valign="top">
<td class="beginner">
B03
</td>
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
<a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
<a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>
Expected output:
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.
$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
You can use grep for this:
grep -Po '(?<=href=")[^"]*' file
It prints everything after href=" until the next double quote appears.
With your given input it returns:
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
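One caveat: -P (Perl-compatible regular expressions) is a GNU grep feature, and the BSD grep shipped with macOS does not support it. A minimal sketch of a portable alternative, assuming every href value is wrapped in double quotes, is to match the whole attribute with grep -o and then keep the second quote-delimited field with cut. The function name extract_hrefs and the sample file are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print the value of every double-quoted href in a file.
extract_hrefs() {
    # grep -o prints each match on its own line, e.g. href="http://...".
    # cut splits that on the double quote and keeps field 2, the URL itself.
    grep -o 'href="[^"]*"' "$1" | cut -d'"' -f2
}

# Usage on a small temporary sample resembling the HTML in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
EOF

extract_hrefs "$sample"

rm -f "$sample"
```

This also shows what the cut attempt in the question was missing: cut needs a field selection such as -f2 in addition to the delimiter.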
Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">'; you can get rid of the useless use of cat with grep '<a href=".*">' drawspace.txt.
Another example
$ cat a
hello <a href="httafasdf">asdas</a>
hello <a href="hello">asdas</a>
other things
$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
My guess is your PC or Mac will not have the lynx command installed by default (it's available for free on the web), but lynx will let you do things like this:
$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html
Output:
References
- file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
- http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
It is then a simple matter to grep for the http: lines. There may even be lynx options to print just the http: lines (lynx has many, many options).
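The filtering step can be sketched roughly as follows. This assumes lynx is installed and that its reference list resembles the output shown; page.html is a placeholder file name and keep_http_links is a hypothetical helper, not a lynx feature.

```shell
#!/bin/sh
# Hypothetical helper: keep only the http/https URLs from text on stdin.
keep_http_links() {
    # -E enables extended regular expressions, -o prints only the matched URL,
    # so list markers, headings, and file:// entries are dropped.
    grep -Eo 'https?://[^[:space:]]+'
}

# Intended use (requires lynx; page.html is a placeholder):
#   lynx -dump -listonly page.html | keep_http_links

# Demonstration on a reference list like the one shown, without needing lynx:
printf '%s\n' \
    'References' \
    '- file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html' \
    '- http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1' |
    keep_http_links
```

Matching the URL itself rather than the whole line keeps the result independent of how lynx prefixes each reference.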
As per triplee's comment, parsing HTML or XML files with regular expressions is generally discouraged. Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data (such as XML, HTML, JSON, ...), they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. For handling such delicate files, you need a bit more finesse, using a more targeted set of tools.
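To illustrate the kind of cost meant here, the sketch below runs a quote-based grep pattern over a small made-up HTML fragment. The sample markup and the function name regex_links are assumptions for demonstration only. The pattern reports a link that sits inside a comment and misses one written with single quotes, two cases that an HTML-aware tool is designed to handle.

```shell
#!/bin/sh
# Made-up HTML fragment: one normal link, one single-quoted link,
# and one link that is commented out.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="http://example.com/visible">visible</a>
<a href='http://example.com/single-quoted'>single quoted</a>
<!-- <a href="http://example.com/commented-out">old</a> -->
EOF

# Text-based extraction: every double-quoted href value, wherever it appears.
regex_links() {
    grep -o 'href="[^"]*"' "$1" | cut -d'"' -f2
}

# The output includes the commented-out link and omits the single-quoted one.
regex_links "$sample"

rm -f "$sample"
```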
In case of parsing XML or HTML, one can easily use xmlstarlet.
In case of an XHTML file, you can use:
xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
-t -m '//x:a/@href' -v . -n
where -N gives the XHTML namespace, if any. This is recognized by
<html xmlns="http://www.w3.org/1999/xhtml">
However, as HTML pages are often not well-formed XML, it might be handy to clean them up a bit using tidy. In the example case above, this gives:
$ tidy -q -numeric -asxhtml --show-warnings no file.html \
| xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
-t -m '//x:a/@href' -v . -n
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Use grep to extract all the anchor tags with links in them and then use sed to pull out the URLs:
grep -oh '<a href="[^"]*">' *.html | sed 's/\(<a href="\|\">\)//g' > link.txt