I would like to extract the content matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).
I would like to do it in a single command line with one of the best parsers, such as Saxon-PE or BaseX.
So far, the shortest solution I seem to have found uses these two lines:
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"
but all it returns is the following, which is not the block of HTML code I expected:
<?xml version="1.0" encoding="UTF-8"?>
I have two questions:
- What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
- Since Saxon-PE has embedded TagSoup capability (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html), how can I integrate my two lines into a single line (see the sketch just below)?
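The shape I am after would be roughly this (only a sketch; I am assuming that saxon:parse-html() from the page linked above can be fed the raw file through unparsed-text(), which may require XQuery 3.0 to be enabled, and that tagsoup-1.2.1.jar must be on the classpath):

java -cp "saxon9pe.jar;tagsoup-1.2.1.jar" net.sf.saxon.Query -qs:"declare namespace saxon='http://saxon.sf.net/'; saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"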
I found the correct command line to launch the query without TagSoup:
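Something along these lines (a sketch; the important point is that TagSoup lower-cases the element names and puts them in the XHTML namespace, so the name test has to be *:div rather than DIV):

java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"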
Note that inverting the two types of quotes, like this, does not work (on Win7):
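That is, something like this (shown only to illustrate the quoting; cmd.exe does not treat single quotes as quoting characters, so they reach Saxon literally):

java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:'//*:div[@id="ps-content"]'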
Does anyone know how to add the TagSoup preprocessing step to the same command line?
My last failed attempts to integrate TagSoup into the same command line:
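They were variations on pointing Saxon at the TagSoup parser directly, roughly like this (a sketch, assuming the -x option of net.sf.saxon.Query accepts a SAX parser class name and that tagsoup-1.2.1.jar is on the classpath):

java -cp "saxon9pe.jar;tagsoup-1.2.1.jar" net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -s:"page.html" -qs:"//*:div[@id='ps-content']"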