How to extract data using JTidy and XPath

I have to extract the company name and face value from http://money.rediff.com/companies/20-microns-ltd/15110088

I noticed that this task can be accomplished with the XPath API. Since this is an HTML page, I am using the JTidy parser.

This is the XPath for the face value that I have to extract:

/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]

This is my code:

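import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Fetch the page; the stream only needs to be opened once.
URL url = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
InputStream is = url.openStream();

// Let JTidy turn the HTML into a W3C DOM, with its diagnostics suppressed.
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);

// Compile an XPath expression and evaluate it against the DOM.
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM, XPathConstants.NODESET);
// This prints the node list object itself, not the text of the matched nodes.
System.out.println(result.toString());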

Please guide me further, because I cannot find the right way to get the values out from here.

Tags: xpath, jtidy

1 answer

等我变得足够好

#2 · 2019-09-02 15:39

Try not to use "full" (absolute) XPaths.

//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]

is better than

/html/body/.../.../.../.../.../...
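As a rough sketch of how an id-anchored relative expression can be evaluated and its text read back, the following uses only the JDK's XPath API on a small, made-up XHTML string. The class name RelativeXPathDemo, the sample markup and the values in it are assumptions for illustration, not the real rediff page. With JTidy, the Document returned by tidy.parseDOM would take the place of the one parsed here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical, simplified stand-in for the real page (assumed markup and values).
    static final String SAMPLE =
        "<html><body>"
        + "<div id='header'><h1>Sample Co</h1></div>"
        + "<div id='leftcontainer'><table>"
        + "<tr><td>Face Value</td><td>5.00</td></tr>"
        + "<tr><td>Market Lot</td><td>1</td></tr>"
        + "</table></div>"
        + "</body></html>";

    // Parse well-formed markup into a W3C DOM (JTidy would do this for real HTML).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluate an expression and return the string value of the first match.
    static String text(Document dom, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return xPath.evaluate(expression, dom).trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluate an expression as a node set and return how many nodes matched.
    static int count(Document dom, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, dom, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse(SAMPLE);
        System.out.println(text(dom, "//div[@id='header']/h1"));
        System.out.println(text(dom, "//div[@id='leftcontainer']//table//tr[1]/td[2]"));
        System.out.println(count(dom, "//div[@id='leftcontainer']//tr") + " row(s)");
    }
}
```

Evaluating to a string gives the text content directly, whereas a NODESET result has to be cast to NodeList and walked node by node.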

Most HTML pages are neither valid nor even well-formed, so the DOM structure may change when the page is processed by a real-world HTML parser. For example, a <tbody> may be inserted under <table> if there isn't one. Worse, different HTML parsers can generate different DOM trees, so an XPath that works with one parser may fail with another. I prefer descendant steps such as table//tr[4] over table/tbody/tr[4] or table/tr[4], so that I can forget about <tbody>. Such expressions are more robust against messy real-world HTML.
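To illustrate the <tbody> point, here is a minimal sketch that runs the same expressions against two hand-written table fragments, one with <tbody> and one without, using only the JDK's XPath API. The class name TbodyWildcardDemo and both fragments are made up for the example; they stand in for the trees that different parsers might produce.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class TbodyWildcardDemo {
    // Two hypothetical DOM shapes for the same table (assumed markup and values).
    static final String WITHOUT_TBODY =
        "<table><tr><td>Sector</td><td>Mining</td></tr>"
        + "<tr><td>Face Value</td><td>5.00</td></tr></table>";
    static final String WITH_TBODY =
        "<table><tbody><tr><td>Sector</td><td>Mining</td></tr>"
        + "<tr><td>Face Value</td><td>5.00</td></tr></tbody></table>";

    // Evaluate an expression against a markup string; an empty string means no match.
    static String eval(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/table/tbody/tr[2]/td[2]", // only matches when <tbody> is present
            "/table/tr[2]/td[2]",       // only matches when <tbody> is absent
            "/table//tr[2]/td[2]"       // matches both shapes
        };
        for (String expression : expressions) {
            System.out.println(expression
                    + " | without tbody: '" + eval(WITHOUT_TBODY, expression) + "'"
                    + " | with tbody: '" + eval(WITH_TBODY, expression) + "'");
        }
    }
}
```

Only the descendant form finds the value in both fragments; each child-only path comes back empty for one of them.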

You can use FirePath, a plugin for Firebug (which is itself a plugin for Firefox), to debug XPath expressions.

P.S. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.
