Parsing XML with REGEX in Java

Given the XML snippet below, I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control, so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is: [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic and so cannot be expressed literally in the regex. The tag names TargetCenter and Trace are static and could be in the regex but if there is a way to avoid hardcoding that would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed, and it has the problem that it erroneously includes {Trace:7.19879} in the results. Relying on newlines within the XML or any other apparent formatting is not an option.

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to only include data elements and ignore the rest?

Tags: java xml regex

8 answers

与风俱净

#2 · 2019-01-02 21:32

Sorry to give you yet another "Don't use regex" answer, but seriously. Please use Commons-Digester, JAXP (bundled with Java 5+) or JAXB (bundled with Java 6+) as it will save you from a boatload of hurt.
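
For readers who are not bound by the asker's no-parser constraint, the JAXP route could look roughly like the sketch below. The class name JaxpDataElements, the SAMPLE constant and the "name:value" string output are assumptions made for illustration; only the standard javax.xml.parsers and org.w3c.dom APIs are used. It also assumes the elements carry no namespace prefix, as in the sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JaxpDataElements {

    // The sample document from the question.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
          + "<StandardDataObject xmlns=\"myns\">\n"
          + "  <DataElements>\n"
          + "    <EmpStatus>2.0</EmpStatus>\n"
          + "    <Expenditure>95465.00</Expenditure>\n"
          + "    <StaffType>11.A</StaffType>\n"
          + "    <Industry>13</Industry>\n"
          + "  </DataElements>\n"
          + "  <InteractionElements>\n"
          + "    <TargetCenter>92f4-MPA</TargetCenter>\n"
          + "    <Trace>7.19879</Trace>\n"
          + "  </InteractionElements>\n"
          + "</StandardDataObject>";

    // Returns "name:value" for each child element of the first DataElements element.
    static List<String> listDataElements(String xml) {
        List<String> result = new ArrayList<>();
        try {
            // Default factory is not namespace-aware, so xmlns is treated as a plain attribute.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList parents = doc.getElementsByTagName("DataElements");
            if (parents.getLength() == 0) {
                return result; // no DataElements element: nothing to report
            }
            NodeList children = parents.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip the whitespace text nodes between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    result.add(child.getNodeName() + ":" + child.getTextContent());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [EmpStatus:2.0, Expenditure:95465.00, StaffType:11.A, Industry:13]
        System.out.println(listDataElements(SAMPLE));
    }
}
```

Because the child names are read from the tree, nothing about the dynamic tag names or the static TargetCenter and Trace tags has to be hardcoded.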

千与千寻千般痛.

#3 · 2019-01-02 21:32

You should listen to everyone. A lightweight parser is a bad idea.

However, if you are really that hard headed about it, you should be able to tweak your code to exclude the tags outside of the DataElements tag.

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private static final String START_TAG = "<DataElements>";
private static final String END_TAG = "</DataElements>";
private List<DataElement> listDataElements(String input) {
    String cs = input.substring(input.indexOf(START_TAG) + START_TAG.length(), input.indexOf(END_TAG));
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

This will fail horribly if the DataElements tag does not exist.

Once again, this is a bad idea, and you will likely be revisiting this piece of code some time in the future in the form of a bug report.
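
If the missing-tag case matters, one possible guard is to check the indexOf results before calling substring. The sketch below is an illustration, not part of the original answer: the class name GuardedSubstring is made up, the pattern is the one from the question, and results are returned as "name:value" strings because the DataElement class is never shown in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedSubstring {
    // The regex from the question, unchanged.
    private static final Pattern PATTERN_1 =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");
    private static final String START_TAG = "<DataElements>";
    private static final String END_TAG = "</DataElements>";

    // Returns "name:value" strings, or an empty list when either tag is missing.
    static List<String> listDataElements(String input) {
        List<String> list = new ArrayList<>();
        int start = input.indexOf(START_TAG);
        if (start < 0) {
            return list; // no opening tag
        }
        int from = start + START_TAG.length();
        int end = input.indexOf(END_TAG, from); // only look after the opening tag
        if (end < 0) {
            return list; // no closing tag
        }
        Matcher matcher = PATTERN_1.matcher(input.substring(from, end));
        while (matcher.find()) {
            list.add(matcher.group(1) + ":" + matcher.group(2));
        }
        return list;
    }

    public static void main(String[] args) {
        String xml = "<StandardDataObject xmlns=\"myns\"><DataElements>"
                + "<EmpStatus>2.0</EmpStatus><Industry>13</Industry></DataElements>"
                + "<InteractionElements><Trace>7.19879</Trace></InteractionElements>"
                + "</StandardDataObject>";
        System.out.println(listDataElements(xml));        // prints [EmpStatus:2.0, Industry:13]
        System.out.println(listDataElements("<Other/>")); // prints []
    }
}
```

Returning an empty list is only one choice; throwing a descriptive exception may suit callers that treat a missing DataElements block as an error.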

妖精总统

#4 · 2019-01-02 21:33

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processing instructions, then namespaces, ... It cannot work; use an XML parser.
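
A small illustration of the comment case, using the regex from the question: an element that has been commented out is still reported, whereas an XML parser would skip it. The class name CommentPitfall and the Old element are made up for this demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentPitfall {
    // The regex from the question, unchanged.
    private static final Pattern PATTERN_1 =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");

    // Collects every match as a "name:value" string.
    static List<String> matchAll(String xml) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN_1.matcher(xml);
        while (matcher.find()) {
            found.add(matcher.group(1) + ":" + matcher.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // <Old> sits inside an XML comment, so it is not part of the document's content.
        String xml = "<DataElements><!-- <Old>1</Old> -->"
                + "<EmpStatus>2.0</EmpStatus></DataElements>";
        // The regex cannot tell comments from content:
        System.out.println(matchAll(xml)); // prints [Old:1, EmpStatus:2.0]
    }
}
```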

千与千寻千般痛.

#5 · 2019-01-02 21:34

This should work in Java, if you can assume that between the DataElements tags, everything has the form <tag>value</tag>, i.e. no attributes and no nested elements.

// Stage 1: isolate the content between the DataElements tags (DOTALL lets . span newlines).
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
Matcher matcher = regex.matcher(subjectString);
// Stage 2: match <tag>value</tag>; the backreference \\1 requires a matching closing tag.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}
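
For anyone who wants to try the two-stage approach directly, here is a self-contained sketch run against the sample from the question. The DataElement class below is a hypothetical stand-in, since the asker's real class is never shown, and the class name TwoStageRegex is likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStageRegex {

    // Hypothetical stand-in for the asker's DataElement class.
    static final class DataElement {
        final String name;
        final String value;

        DataElement(String name, String value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return "{" + name + ":" + value + "}";
        }
    }

    // Stage 1: the content between the DataElements tags.
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Stage 2: <tag>value</tag> with a matching closing tag.
    private static final Pattern PAIR = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
          + "<StandardDataObject xmlns=\"myns\">\n"
          + "  <DataElements>\n"
          + "    <EmpStatus>2.0</EmpStatus>\n"
          + "    <Expenditure>95465.00</Expenditure>\n"
          + "    <StaffType>11.A</StaffType>\n"
          + "    <Industry>13</Industry>\n"
          + "  </DataElements>\n"
          + "  <InteractionElements>\n"
          + "    <TargetCenter>92f4-MPA</TargetCenter>\n"
          + "    <Trace>7.19879</Trace>\n"
          + "  </InteractionElements>\n"
          + "</StandardDataObject>";

    static List<DataElement> listDataElements(CharSequence cs) {
        List<DataElement> list = new ArrayList<>();
        Matcher block = BLOCK.matcher(cs);
        if (block.find()) {
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                list.add(new DataElement(pair.group(1), pair.group(2)));
            }
        }
        return list;
    }

    public static void main(String[] args) {
        // prints [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]
        System.out.println(listDataElements(SAMPLE));
    }
}
```

Note that the second pattern requires a non-empty value, so an element such as <Industry></Industry> would be skipped under this sketch.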

梦寄多情

#6 · 2019-01-02 21:35

You really should be using an XML library for this.

If you have to use RE, why not do it in two stages? First match <DataElements>.*?</DataElements>, then apply what you have now to that match.
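
One way to sketch these two stages while keeping the asker's regex untouched is Matcher.region, which restricts the second matcher to the span found by the first without copying the input. The class name RegionTwoStage and the "name:value" string output are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionTwoStage {
    // Stage 1: locate the DataElements block; group 1 is its content.
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Stage 2: the regex from the question, unchanged.
    private static final Pattern PATTERN_1 =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");

    static List<String> listDataElements(CharSequence cs) {
        List<String> list = new ArrayList<>();
        Matcher block = BLOCK.matcher(cs);
        if (!block.find()) {
            return list; // no DataElements block
        }
        // Limit the search to the characters captured by group 1.
        Matcher matcher = PATTERN_1.matcher(cs).region(block.start(1), block.end(1));
        while (matcher.find()) {
            list.add(matcher.group(1) + ":" + matcher.group(2));
        }
        return list;
    }

    public static void main(String[] args) {
        // No newlines or indentation, since the question rules out relying on formatting.
        String xml = "<StandardDataObject xmlns=\"myns\"><DataElements>"
                + "<EmpStatus>2.0</EmpStatus><Industry>13</Industry></DataElements>"
                + "<InteractionElements><Trace>7.19879</Trace></InteractionElements>"
                + "</StandardDataObject>";
        System.out.println(listDataElements(xml)); // prints [EmpStatus:2.0, Industry:13]
    }
}
```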

查无此人

#7 · 2019-01-02 21:41

Is there any reason you're not using a proper XML parser instead of regexes? This would be trivial with the right library.
