I wrote the batch script below to read XML files (file_1.xml and file_2.xml), extract the string between the tags, and write it to a TXT file. The problem is that some strings contain double quotation marks, and the script then treats these characters as syntax rather than as part of the strings...
Content of file_1.xml:
<AAA>C086002-T1111</AAA>
<AAA>C086002-T1222 </AAA>
<AAA>C086002-TR333 "</AAA>
<AAA>C086002-T5444 </AAA>
Content of file_2.xml:
<AAA>C086002-T5555 </AAA>
<AAA>C086002-T1666</AAA>
<AAA>C086002-T1777 "</AAA>
<AAA>C086002-T1888 "</AAA>
My code:
@echo off
setlocal enabledelayedexpansion
for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (
for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (
(for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (
set code=%%a
set code=!code:""=!
set code=!code: =!
echo !code!
)) >> result.txt
)
)
I get this in result.txt:
C086002-T1111
C086002-T1222
C086002-T5444
C086002-T5555
C086002-T1666
In fact, 3 out of the 8 lines are missing. These lines include double quotation marks or follow lines that include double quotation marks...
How can I handle these characters so that they are treated as part of the strings?
Please note - parsing XML with batch is a risky business because XML generally ignores white space. Any script you write could probably be broken by simply reformatting the XML into another equivalent valid form. That being said...
I haven't traced the problem through to fully explain your observed behavior, but the unbalanced quote is causing a problem with this line:
(for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (
You can eliminate that problem and get your code to sort of work by stripping out any quotes beforehand.
@echo off
setlocal enabledelayedexpansion
if exist result.txt del result.txt
for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (
for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (
set code=%%z
set code=!code:"=!
set code=!code: =!
(for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "!code!" ^| Findstr /r "<AAA>"') do (
echo %%a
)) >> result.txt
)
)
But you have a potentially major problem. DELIMS does not specify a string; it specifies a list of characters. So your DELIMS=<AAA></AAA> is equivalent to DELIMS=<>/A. If your element value ever contains an A or a /, then your code will fail.
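A quick way to see this is to run the same option string against a value that contains an A. The value C08A002 below is hypothetical, made up only to trigger the split; this is a minimal sketch meant to be saved and run as a .bat file.

```bat
@echo off
rem The delims option is read as a set of four characters, not as two tags.
rem The sample value C08A002 is hypothetical and contains an A on purpose.
for /f "tokens=1,2 delims=<AAA></AAA>" %%a in ("<AAA>C08A002</AAA>") do (
  echo first token:  %%a
  echo second token: %%b
)
```

The value is cut at the embedded A, so the first token comes out as C08 and the second as 002 instead of one intact string.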
There is a much better way:
First off, you can use FINDSTR to collect all your <AAA>----</AAA> lines from all files in one pass, without any loop:
findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"
Each matching line will be output as the file path, followed by a colon, followed by the matching line, as in:
D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>
The file path can never contain < or >, so you can use the following to iterate the result, capturing the appropriate token:
for /f "delims=<> tokens=3" %%A in ( ...
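To see why token 3 is the one to capture, the sketch below feeds the sample FINDSTR output line from above to FOR /F as a literal string, so no files are needed. It asks for tokens 1 through 3 only to make the split visible.

```bat
@echo off
rem Tokenize one sample line of FINDSTR output, passed as a literal string.
rem Splitting on the angle brackets gives the path plus colon as token 1,
rem the tag name as token 2, and the element value as token 3.
for /f "delims=<> tokens=1-3" %%A in ("D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>") do (
  echo path:  %%A
  echo tag:   %%B
  echo value: %%C
)
```

Note that the first token keeps its trailing colon, which matters if the file name is ever reused in the output.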
Finally, you can put parentheses around the entire loop and redirect just once. I'm assuming you want each run to create a new file, so I use > instead of >>.
@echo off
setlocal enabledelayedexpansion
>result.txt (
for /f "delims=<> tokens=3" %%A in (
'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
) do (
set code=%%A
set code=!code:"=!
set code=!code: =!
echo(!code!
)
)
Assuming that you only need to trim leading or trailing spaces/quotes, the solution is even simpler. It does require odd syntax to specify a quote as a DELIM character. Note that there are two spaces between the last ^ and %%B. The first space is escaped and is taken as a DELIM character. The second, unescaped space terminates the FOR /F options string.
@echo off
>result.txt (
for /f "delims=<> tokens=3" %%A in (
'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
) do for /f delims^=^"^  %%B in ("%%A") do echo(%%B
)
UPDATE in response to comment
I'm assuming your data value will never contain a colon.
If you want to append the source file name to each line of output, then you simply need to alter the first FOR /F to capture the first token (the source file) as well as the third token (the data value). The first token will contain the full path as well as a trailing colon. The second FOR /F appends the file to the source data string using the ~nx modifier to get just the name and extension (no drive or path), and a colon is added to the DELIMS option so the trailing colon is trimmed off.
@echo off
>result.txt (
for /f "delims=<> tokens=1,3" %%A in (
'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
) do for /f delims^=:^"^  %%C in ("%%B;%%~nxA") do echo %%C
)
If I keep @dbenham's suggestion and extend it to echo the file name:
@echo off
>result.txt (
for /f %%f in ("D:\depart\*.xml") do (
for /f "delims=<> tokens=3" %%A in ('findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"') do (
for /f delims^=^"^  %%B in ("%%A") do (
echo %%B;%%f
)
)
)
)
Thanks for your opinion on this code!