I have an XML file that has some unwanted characters in it:
<data>
<tag>blar </tag><tagTwo> bo </tagTwo>
some extra
characters not enclosed that I want to remove
<anothertag>bbb</anothertag>
</data>
I thought the following non-greedy substitution would remove the characters that were not properly encased in <sometag></sometag>
re.sub("</([a-zA-Z]+)>.*?<","</\\1><",text)
  </([a-zA-Z]+)>   remember the tag
  .*?<             read everything until the next '<' (non-greedy)
  "</\\1><"        put the tag back without the extra characters and reopen the next tag
  text             the xml text
This regex seems only to find the position indicated with the [[]]
in </tag>[[]]<tagTwo>
What am I doing wrong?
EDIT: The motivation for this question has been solved (see comments: I had a stray & in the XML file which was causing it not to parse; it had nothing to do with the characters that I want to delete). However, I am still curious whether the regex approach is possible (and what was wrong with my attempt), so I am not deleting the question.
The dot does not match newlines unless you specify the re.DOTALL flag. With that flag set, your substitution should work fine. (If it does not, my Python is at fault, not the regex. Please correct.)
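To make the difference concrete, here is a minimal sketch that runs the substitution from the question with and without the flag. The variable names are only illustrative, and `text` is assumed to hold the sample from the question:

```python
import re

# Sample from the question; the stray text sits on its own lines.
text = """<data>
<tag>blar </tag><tagTwo> bo </tagTwo>
some extra
characters not enclosed that I want to remove
<anothertag>bbb</anothertag>
</data>"""

pattern = r"</([a-zA-Z]+)>.*?<"

# Without re.DOTALL the dot stops at a newline, so the only match is the
# empty gap in "</tag><tagTwo>" and the text comes back unchanged.
without_flag = re.sub(pattern, r"</\1><", text)

# With re.DOTALL the dot also matches newlines, so the stray lines between
# "</tagTwo>" and "<anothertag>" are consumed and dropped.
with_flag = re.sub(pattern, r"</\1><", text, flags=re.DOTALL)

print(without_flag == text)
print(with_flag)
```

Note that this also removes the newlines between adjacent tags, because they are part of what sits between a closing tag and the next `<`.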
I think it is good practice to be as precise as possible when defining character classes that are to be repeated. This helps to prevent catastrophic backtracking. Therefore, I'd use
[^<]*
instead of
.*?
with the added bonus that it now finds stray characters after the last tag. This would no longer need the re.DOTALL flag, since [^<] does match newlines.
In ipython:
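A minimal sketch of such a session, written as a plain script rather than ipython prompts. It assumes the sample from the question as `text`, adds a hypothetical trailing line to show the match after the last tag, and assumes the pattern drops the trailing `<` (otherwise nothing after the final closing tag could match):

```python
import re

# Sample from the question, plus a hypothetical stray line after </data>.
text = """<data>
<tag>blar </tag><tagTwo> bo </tagTwo>
some extra
characters not enclosed that I want to remove
<anothertag>bbb</anothertag>
</data>
trailing junk"""

# [^<]* matches any run of characters other than '<', newlines included,
# so no re.DOTALL flag is needed. Assumption: the trailing '<' is left out
# of the pattern so that text after the final closing tag is matched too.
pattern = r"</([a-zA-Z]+)>[^<]*"

cleaned = re.sub(pattern, r"</\1>", text)
print(cleaned)
```

Be aware that this removes everything that follows any closing tag up to the next `<`, so it would also delete legitimate text that comes after a nested element in mixed content.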