What's the best way to validate that a document follows some version of HTML (preferably one that I can specify)? I'd like to know where the failures occur, as in a web-based validator, except in a native Python app.
I think that HTML Tidy will do what you want. There is a Python binding for it.
You can install the HTML validator locally and write a client that requests the validation.
I wrote a program that validates a list of URLs from a text file. I only sent a HEAD request to get the validation status, but a GET returns the full results. Look at the validator's API; it has plenty of options.
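The HEAD approach could look roughly like the sketch below. It assumes the legacy W3C markup validator, which reports its verdict in X-W3C-Validator-* response headers; the local install URL and the function names are hypothetical, so adjust them to your setup.

```python
import urllib.parse
import urllib.request

# Hypothetical address of a locally installed validator; change as needed.
LOCAL_VALIDATOR = "http://localhost/w3c-validator/check"


def summarize_headers(headers):
    """Pull the validation verdict out of the validator's response headers.

    Assumes the X-W3C-Validator-* headers of the legacy W3C validator.
    """
    return {
        "status": headers.get("X-W3C-Validator-Status"),
        "errors": int(headers.get("X-W3C-Validator-Errors", 0)),
        "warnings": int(headers.get("X-W3C-Validator-Warnings", 0)),
    }


def validation_status(page_url, validator=LOCAL_VALIDATOR):
    """Send a HEAD request for page_url and return the summarized verdict."""
    query = urllib.parse.urlencode({"uri": page_url})
    request = urllib.request.Request(validator + "?" + query, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize_headers(response.headers)
```

Calling validation_status on each line of the text file gives a quick valid/invalid overview; switch to a GET when you need the individual error messages.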
XHTML is easy, use lxml.
HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
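In my case the Python W3C/HTML validation packages that turned up with
pip search w3c
did not work (as of September 2016). I solved this by calling the W3C Validator API directly with python requests.
More documentation is available for both python requests and the W3C Validator API.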
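A minimal sketch of talking to the validator directly, not the code from this answer. It assumes the W3C Nu validator endpoint with JSON output and uses the standard library's urllib rather than requests so it has no dependencies; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.request

# Assumed endpoint: the public W3C Nu validator with JSON output.
NU_VALIDATOR_URL = "https://validator.w3.org/nu/?out=json"


def check_html(html, url=NU_VALIDATOR_URL):
    """POST an HTML string to the validator and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=utf-8",
            "User-Agent": "html-check-sketch/0.1",  # hypothetical client name
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def format_messages(result):
    """Turn the validator's message list into 'line:column: type: text' strings."""
    lines = []
    for msg in result.get("messages", []):
        lines.append("{}:{}: {}: {}".format(
            msg.get("lastLine", "?"),
            msg.get("lastColumn", "?"),
            msg.get("type", "unknown"),
            msg.get("message", ""),
        ))
    return lines
```

Something like print("\n".join(format_messages(check_html(open("page.html").read())))) then lists where each failure occurs.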
Try tidylib. You can get some really basic bindings as part of the elementtidy module (builds elementtrees from HTML documents). http://effbot.org/downloads/#elementtidy
Parsing the log should give you pretty much everything you need.
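As a rough illustration of parsing the log, the sketch below assumes Tidy's usual "line N column M - Level: message" report lines and skips anything that does not match. The function name parse_tidy_log is made up for this example.

```python
import re

# Matches Tidy report lines such as:
#   line 1 column 1 - Warning: missing <!DOCTYPE> declaration
TIDY_LINE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - (?P<level>\w+): (?P<message>.*)$"
)


def parse_tidy_log(log):
    """Return a list of dicts, one per recognized line of a Tidy error log."""
    entries = []
    for raw in log.splitlines():
        match = TIDY_LINE.match(raw.strip())
        if match is None:
            continue  # summary lines and blank lines are ignored
        entries.append({
            "line": int(match.group("line")),
            "column": int(match.group("column")),
            "level": match.group("level"),
            "message": match.group("message"),
        })
    return entries
```

Filtering the result on entry["level"] == "Error" separates real failures from warnings.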
PyTidyLib is a nice Python binding for HTML Tidy. Its tidy_document function returns the cleaned-up document together with the error log.
Moreover, it's compatible with both legacy HTML Tidy and the new tidy-html5.