I have a file in XML format (it consists just of the root start and end tags and the children of the root). The text elements of the children contain the ampersand symbol &. A bare & is not allowed in a well-formed XML document, and when I tried to process the file using the DOM API in Java and an XML parser, I got parsing errors. Therefore, I replaced & with &amp;, and then I processed the file successfully: I had to extract the values of the text elements into different plain text files.
When I opened these newly created text files, I expected to see &amp;, but there was & instead. Why is this? I have stored the text in text files without any extension (my original file with the XML format also did not have an .xml extension), and I see just & in the text of the new file, no matter how I open it: as a txt file or as an xml file (these are some of the options in my XML editor). What happens exactly? Does Java (?) convert &amp; to & automatically? Or is there some default encoding? Well, &amp; stands for &, and I suppose there is some "invisible" automatic conversion, but I am confused about when and how this happens. Here are examples of my original file and the extracted file which I get after processing the original file with Java:
This is my "negative.review" file in XML format:
<review>
<review_text>
I will not wear it as it is too big & looks funny on me.
</review_text>
</review>
This is my extracted file "negative_1":
I will not wear it as it is too big & looks funny on me.
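To make the question concrete, here is a minimal sketch of the kind of processing I mean. It is not my actual code: the class name AmpersandDemo, the method extractReviewText and the inline XML string are made up for illustration, and only standard JDK classes are used. It parses a small document containing &amp; with the DOM API and prints the text of the element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class AmpersandDemo {

    // Hypothetical helper: parses the XML string and returns the text
    // content of the first <review_text> element.
    static String extractReviewText(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("review_text").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The source text contains the escaped form &amp; ...
        String xml = "<review><review_text>too big &amp; looks funny</review_text></review>";
        // ... but the string returned by the DOM contains a plain &.
        System.out.println(extractReviewText(xml));
    }
}
```

With this sketch the printed line contains & and not &amp;, so the conversion seems to happen while the parser builds the DOM tree, before I write anything to a text file.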
For me it is important to keep the original data as it is (without any conversions/replacements), so I thought I would have to process the extracted file "negative_1" to convert &amp; back to &. As you see, it seems I don't have to do this. But I don't understand why :(.
Thank you in advance!