Can you provide examples of parsing HTML?-第5页回答

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.

For the sake of consistency, I ask that the example be parsing an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

标签： html language-agnostic html-parsing

29条回答

皆成旧梦

2楼-- · 2018-12-31 16:15

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using DOM API, without using XPATH or STP API)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=> 
("http://foo.com/" "http://bar.com/" "http://baz.com/")

0人赞添加讨论(0) 举报

何处买醉

3楼-- · 2018-12-31 16:16

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using firebug console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

Used another each function for this one, I think it's cleaner when chaining methods.

0人赞添加讨论(0) 举报

裙下三千臣

4楼-- · 2018-12-31 16:16

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

0人赞添加讨论(0) 举报

浅入江南

5楼-- · 2018-12-31 16:16

Language: PHP Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put @ symbol before $doc->loadHTMLFile to suppress invalid html parsing warnings

0人赞添加讨论(0) 举报

梦醉为红颜

6楼-- · 2018-12-31 16:17

Using phantomjs, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

run:

$ ../path/to/bin/phantomjs extract-links.js

0人赞添加讨论(0) 举报

上一页 1 2 3 4 5

Can you provide examples of parsing HTML?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间