In Java, what is the best way to split a string into an array of blocks, when the delimiters at the beginning of each block are different from the delimiters at the end of each block?
For example, suppose I have String string = "abc 1234 xyz abc 5678 xyz".
I want to apply some sort of complex split in order to obtain {"1234","5678"}.
The first thing that comes to mind is:
String[] parts = string.split("abc");
for (String part : parts)
{
    String[] blocks = part.split("xyz");
    String data = blocks[0];
    // Do some stuff with the 'data' string
}
Is there a simpler / cleaner / more efficient way of doing it?
My purpose (as you've probably guessed) is to parse an XML document.
I want to split a given XML string into the Inner-XML blocks of a given tag.
For example:
String xml = "<tag>ABC</tag>White Spaces Only<tag>XYZ</tag>";
String[] blocks = Split(xml,"<tag>","</tag>"); // should be {"ABC","XYZ"}
How would you implement String[] Split(String str,String prefix,String suffix)?
Thanks
The best is to use one of the dedicated XML parsers.
See this discussion about best XML parser for Java.
I found this DOM XML parser example as a simple and good one.
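Since the linked example is not reproduced here, the following is only a minimal sketch of what a DOM-based approach could look like. The class name DomTagText, the helper textOf, and the wrapping <doc> root element are hypothetical and chosen for illustration; the XML from the question has no single root element, so the sketch assumes one is present. Checked parser exceptions are rethrown as unchecked ones to keep the call site short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomTagText {

    // Hypothetical helper: returns the text content of every element named
    // tagName, in document order. Assumes xml is a well-formed document.
    public static String[] textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList elements = doc.getElementsByTagName(tagName);
            String[] result = new String[elements.getLength()];
            for (int i = 0; i < elements.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants.
                result[i] = elements.item(i).getTextContent();
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The <doc> wrapper is an assumption; the question's string has no root.
        String xml = "<doc><tag>ABC</tag>White Spaces Only<tag>XYZ</tag></doc>";
        System.out.println(String.join(",", textOf(xml, "tag"))); // ABC,XYZ
    }
}
```

Note that getTextContent() returns text only; if the blocks contain nested markup that has to be preserved, the child nodes would need to be serialized instead.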
IMHO the best solution is to parse the XML, which is not a one-line thing.
Here is sample code from another question on SO that parses the document and then navigates it with XPath:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

String xml = "<resp><status>good</status><msg>hi</msg></resp>";
InputSource source = new InputSource(new StringReader(xml));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(source);
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
String msg = xpath.evaluate("/resp/msg", document);
String status = xpath.evaluate("/resp/status", document);
System.out.println("msg=" + msg + ";" + "status=" + status);
You can write a regular expression for this type of string…
How about something like \s*((^abc)|(xyz\s*abc)|(\s*xyz$))\s*, which says abc at the beginning, or xyz at the end, or xyz abc in the middle (modulo some spaces)? This produces an empty value at the beginning, but aside from that, it seems like it'd do what you want.
import java.util.Arrays;

public class RegexDelimitersExample {
    public static void main(String[] args) {
        final String string = "abc 1234 xyz abc 5678 xyz";
        final String pattern = "\\s*((^abc)|(xyz\\s*abc)|(\\s*xyz$))\\s*";
        final String[] parts_ = string.split( pattern );
        // parts_[0] is "", because there's nothing before ^abc,
        // so a copy of the rest of the array is what we want.
        final String[] parts = Arrays.copyOfRange( parts_, 1, parts_.length );
        System.out.println( Arrays.deepToString( parts ));
    }
}
[1234, 5678]
Depending on how you want to handle spaces, you could adjust this as necessary. E.g.,
\s*((^abc)|(xyz\s*abc)|(\s*xyz$))\s* # original
(^abc\s*)|(\s*xyz\s*abc\s*)|(\s*xyz$) # no spaces on outside
... # ...
…but you shouldn't use it for XML.
As I noted in the comments, though, this will work for splitting a non-nested string that has these sorts of delimiters. You won't be able to handle nested cases (e.g., abc abc 12345 xyz xyz) using regular expressions, so you won't be able to handle general XML (which seemed to be your intent). If you actually need to parse XML, use a tool designed for XML (e.g., a parser, an XPath query, etc.).
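For the non-nested case, a variation on the same idea is to match the blocks instead of splitting on the delimiters. The following is a rough sketch of the Split(str, prefix, suffix) signature from the question using Pattern and Matcher rather than String.split. The class name DelimitedBlocks is hypothetical, and trimming each block is an assumption made so that the first example yields "1234" rather than " 1234 ".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedBlocks {

    // Hypothetical helper: returns the text between each prefix/suffix pair.
    // Assumes blocks are not nested; trims surrounding whitespace (assumption).
    public static String[] split(String str, String prefix, String suffix) {
        // Pattern.quote makes the delimiters literal; (.*?) is reluctant, so
        // each match ends at the nearest following suffix. DOTALL lets a
        // block span line breaks.
        Pattern pattern = Pattern.compile(
                Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(str);

        List<String> blocks = new ArrayList<>();
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] blocks = split("abc 1234 xyz abc 5678 xyz", "abc", "xyz");
        System.out.println(String.join(",", blocks)); // 1234,5678
    }
}
```

This avoids the empty leading element that split produces, but it has the same limitation as above: with nested delimiters each match simply stops at the first suffix it finds.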
Don't use regexes here. But you don't have to do full-fledged XML parsing either. Use XPath. The expression to search for in your example would be
//tag/text()
The code needed is:
import org.w3c.dom.NodeList;
import org.xml.sax.*;
import javax.xml.xpath.*;

public class Test {
    public static void main(String[] args) throws Exception {
        InputSource ins = new InputSource("c:/users/ndh/hellos.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList list = (NodeList)xpath.evaluate("//bar/text()", ins, XPathConstants.NODESET);
        for (int i = 0; i < list.getLength(); i++) {
            System.out.println(list.item(i).getNodeValue());
        }
    }
}
where my example xml file is
<?xml version="1.0"?>
<foo>
    <bar>hello</bar>
    <bar>ohayoo</bar>
    <bar>hola</bar>
</foo>
This is the most declarative way to do it.
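To connect this back to the in-memory string from the question, here is a hedged sketch that runs the same kind of XPath query against a String instead of a file. The class name InnerTextByXPath, the helper innerText, and the <root> wrapper are hypothetical. The wrapper is there because the question's string has no single root element, and the sketch assumes the fragment is otherwise well-formed and that tagName is a plain element name.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerTextByXPath {

    // Hypothetical helper: returns the text nodes directly inside every
    // <tagName> element of the fragment.
    public static String[] innerText(String fragment, String tagName) {
        // Wrap the fragment so the parser sees exactly one root (assumption).
        String xml = "<root>" + fragment + "</root>";
        InputSource source = new InputSource(new StringReader(xml));
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//" + tagName + "/text()", source, XPathConstants.NODESET);
            String[] result = new String[nodes.getLength()];
            for (int i = 0; i < nodes.getLength(); i++) {
                result[i] = nodes.item(i).getNodeValue();
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tag>ABC</tag>White Spaces Only<tag>XYZ</tag>";
        System.out.println(String.join(",", innerText(xml, "tag"))); // ABC,XYZ
    }
}
```

The text between the two elements is a child of the wrapper, not of a tag element, so the query skips it.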