In .Net, I found this great library, HtmlAgilityPack that allows you to easily parse non-well-formed HTML using XPath. I've used this for a couple years in my .Net sites, but I've had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?
相关问题
- Views base64 encoded blob in HTML with PHP
- how to define constructor for Python's new Nam
- streaming md5sum of contents of a large remote tar
- How to get the background from multiple images by
- Evil ctypes hack in python
BeautifulSoup is a good Python library for dealing with messy HTML in clean ways.
For Ruby, I highly recommend Hpricot that Jb Evain pointed out. If you're looking for a faster libxml-based competitor, Nokogiri (see http://tenderlovemaking.com/2008/10/30/nokogiri-is-released/) is pretty good too (it supports both XPath and CSS searches like Hpricot but is faster). There's a basic wiki and some benchmarks.