How can I change the HTML content of a tag in Java? For example:
before:
<html>
<head>
</head>
<body>
<div>text<div>**text**</div>text</div>
</body>
</html>
after:
<html>
<head>
</head>
<body>
<div>text<div>**new text**</div>text</div>
</body>
</html>
I tried JTidy, but it doesn't support getTextContent. Is there any other solution?
Thanks. I want to parse HTML that is not well-formed. I tried TagSoup, but when I have this code:
<body>
sometext <div>text</div>
</body>
and I want to change "sometext" to "someAnotherText", {bodyNode}.getTextContent() gives me "sometext text". When I call setTextContent("someAnotherText" + {bodyNode}.getTextContent()) and serialize the structure, the result is <body>someAnotherText sometext text</body>, without the <div> tags. This is a problem for me.
Unless you are absolutely sure that the HTML will be valid and well-formed, I'd strongly recommend using an HTML parser such as TagSoup, Jericho, NekoHTML, or HTML Parser. The first two are especially good at parsing any kind of crap :)
For example, with HTML Parser (because the implementation is very easy), you can use a visitor by providing your own NodeVisitor:
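import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

public class MyNodeVisitor extends NodeVisitor {
    // Called for each text node found during the traversal.
    @Override
    public void visitStringNode(Text string) {
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}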
Then create a Parser, parse the HTML string, and visit the returned node list:
Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null);
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());
This is just one way to implement it, and it is pretty straightforward.
Provided that your HTML is well-formed XML (if it is not, you can use JTidy to tidy it), you can parse it with a DOM or SAX parser. DOM is probably easier if your document is not huge.
Something like this will do the trick if your text is the only child of a node with id="id":
Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
Element e = d.getElementById("id");
Node text = e.getFirstChild();
text.setNodeValue(process(text.getNodeValue()));
You may save d afterwards to a file.
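As a rough sketch of the whole round trip using only the JDK: the code below parses a well-formed string, changes only the first text node directly under the root element (so nested elements such as a <div> survive, unlike calling setTextContent on the parent), and serializes the result with a Transformer. The class name DomTextEdit and the helper replaceFirstText are made up for illustration, and the input is assumed to be well-formed already.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library mentioned above.
class DomTextEdit {

    // Replaces the first text node that is a direct child of the root element
    // and returns the serialized document. Child elements are left untouched.
    static String replaceFirstText(String xml, String newText) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Node root = d.getDocumentElement();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.TEXT_NODE) {
                    n.setNodeValue(newText);
                    break; // only the first direct text node is changed
                }
            }

            // Serialize the modified tree back to markup.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(d), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Parsing or serialization failed, e.g. the input was not well-formed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                replaceFirstText("<body>sometext <div>text</div></body>", "someAnotherText "));
    }
}
```

To write to a file instead of a string, the StreamResult can be constructed from a java.io.File.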
There are a bunch of Open source Java HTML parsers listed here.
I'm not sure what's most commonly used, but this one (just called HTML parser) will probably do what you want. It has functions to modify your tree and write it back out.
In general, you have an HTML document that you want to extract data from, and you know its general structure.
There are several parser libraries, but the best one is Jsoup. You can use its DOM methods to navigate your document and update values. In your case, you need to read your file and use the setter methods, such as text(String).
Sample XHTML file:
<?xml version="1.0" encoding="UTF-8"?>
<!--
To change this license header, choose License Headers in Project Properties.
To change this template file, choose Tools | Templates
and open the template in the editor.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example</title>
</head>
<body>
<p id="content">Hello World</p>
</body>
</html>
Java code:
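File input = new File("D:\\Projects\\Odata Project\\Odata\\src\\web\\html\\inscription_template.xhtml");
// A null charset lets Jsoup detect the encoding itself.
org.jsoup.nodes.Document doc = Jsoup.parse(input, null);
org.jsoup.nodes.Element content = doc.getElementById("content");
// text(String) replaces the element's text and returns the element, so this prints the updated <p>.
System.out.println(content.text("Hi How are you ?"));
// text() without arguments returns the element's current text.
System.out.println(content.text());
// Printing the document shows the full modified HTML.
System.out.println(doc);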
Output after execution:
<p id="content">Hi How are you ?</p>
Hi How are you ?
<!--?xml version="1.0" encoding="UTF-8"?-->
<!--
To change this license header, choose License Headers in Project Properties.
To change this template file, choose Tools | Templates
and open the template in the editor.
--><!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example</title>
</head>
<body>
<p id="content">Hi How are you ?</p>
</body>
</html>