How is it possible to extract the textual content of an XML document, preferably using XSLT?
For a fragment such as
<record>
<tag1>textual content</tag1>
<tag2>textual content</tag2>
<tag2>textual content</tag2>
</record>
the desired result is:
textual content, textual content, textual content
What's the best output format (table, CSV, etc.) so that the content can be processed further, for example for text mining?
Thanks
Update
To extend the question: how is it possible to extract the content of each record separately? For example, for the XML below:
<Records>
<record id="1">
<tag1>textual co</tag1>
<tag2>textual con</tag2>
<tag2>textual cont</tag2>
</record>
<record id="2">
<tag1>some text</tag1>
<tag2>some tex</tag2>
<tag2>some te</tag2>
</record>
</Records>
The desired result should be something like:
(textual co, textual con, textual cont) , (some text, some tex, some te)
or a better format for further processing.
Just an (updated) answer for the first part of the question: for the input in the question, a stylesheet built around the template described next produces the desired comma-separated line.
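The stylesheet from this answer is not reproduced here. As a minimal sketch of what the description suggests (an assumption, not necessarily the author's code), an XSLT 1.0 stylesheet with text output could loop over the children of record and emit a separator after all but the last one:

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML -->
  <xsl:output method="text"/>

  <xsl:template match="record">
    <!-- Visit every child element of the record in document order;
         whitespace-only text nodes are not selected by "*" -->
    <xsl:for-each select="*">
      <xsl:value-of select="."/>
      <!-- Add the separator after every child except the last one -->
      <xsl:if test="position() != last()">
        <xsl:text>, </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```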
The template matching record prints the value of each child element and appends a comma and a space unless it is the last element.
This is shorter and more generic in that it does not name any elements. It also exploits XSLT's built-in templates, which give the language default behaviour and reduce the amount you have to code. Assuming XSLT 1.0.
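The stylesheet itself is missing from this copy of the answer. One way such a generic version might look, as a sketch under the assumption that the input is the single record fragment from the question, is to give a separator to every element that has a following sibling and leave everything else to the built-in rules:

```xslt
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Drop whitespace-only text nodes (indentation, line breaks) -->
  <xsl:strip-space elements="*"/>

  <!-- Any element with a following sibling: emit its text, then a separator.
       The last element falls through to the built-in rules,
       which simply copy its text to the output. -->
  <xsl:template match="*[following-sibling::*]">
    <xsl:value-of select="."/>
    <xsl:text>, </xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

No element name appears in the stylesheet, so it would not need to change if tag1 or tag2 were renamed. It is not suitable as-is for the multi-record input, where a record element itself has a following sibling.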
Below is a shorter variation of lingamurthyCS's answer that lets the built-in template rule handle the last text node. It's analogous to my previous answer.
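The code for this variation is not included in this copy. A sketch of the idea for the multi-record input, assuming XSLT 1.0 and the Records/record structure from the updated question (the exact templates are a guess, not the author's code):

```xslt
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Drop whitespace-only text nodes between elements -->
  <xsl:strip-space elements="*"/>

  <!-- Wrap each record in parentheses; separate records with " , " -->
  <xsl:template match="record">
    <xsl:text>(</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>)</xsl:text>
    <xsl:if test="following-sibling::record">
      <xsl:text> , </xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Text of every field except the last in its record gets a separator;
       the last field's text is handled by the built-in text rule. -->
  <xsl:template match="record/*[following-sibling::*]/text()">
    <xsl:value-of select="."/>
    <xsl:text>, </xsl:text>
  </xsl:template>
</xsl:stylesheet>
```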
However, this particular job is better suited to XQuery.
Paste your XML into http://try.zorba.io/queries/xquery and just stick /string-join(*,',') on the end of it, directly after the closing </record> tag.
Exercise for the OP to translate that into XSLT 2.0 if that is what they are using.
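For reference, one possible translation, sketched under the assumption that an XSLT 2.0 processor is available and that record is the document element:

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/record">
    <!-- string-join atomizes each child element to its string value
         and joins the values with a comma, as in the XQuery expression -->
    <xsl:value-of select="string-join(*, ',')"/>
  </xsl:template>
</xsl:stylesheet>
```

In XSLT 2.0 the same effect can also be had with the separator attribute, as in xsl:value-of select="*" separator=",". Use ', ' as the separator if a space after each comma is wanted.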
You can use the following XSLT:
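The stylesheet is missing from this copy of the answer. A minimal sketch in the same spirit, assuming XSLT 1.0 and the single record fragment (not necessarily what the author posted), uses apply-templates with a position test:

```xslt
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="record">
    <!-- Select only element children so position() ignores whitespace nodes -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="record/*">
    <!-- Separator goes before every field except the first -->
    <xsl:if test="position() != 1">
      <xsl:text>, </xsl:text>
    </xsl:if>
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```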
And for the update in the question, you can use the following XSLT:
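Again the stylesheet did not survive in this copy. As a sketch under the assumption of XSLT 1.0 and the Records/record structure shown in the update, two nested loops can produce the parenthesised groups:

```xslt
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/Records">
    <xsl:for-each select="record">
      <xsl:text>(</xsl:text>
      <!-- Inner loop: fields of the current record, comma-separated -->
      <xsl:for-each select="*">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()">
          <xsl:text>, </xsl:text>
        </xsl:if>
      </xsl:for-each>
      <xsl:text>)</xsl:text>
      <!-- Outer loop: " , " between records, none after the last -->
      <xsl:if test="position() != last()">
        <xsl:text> , </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For further processing, replacing the parentheses and the record separator with a single line break (&#10; inside xsl:text) would give one record per line, which most text-mining tools can read directly. Field values that themselves contain commas would then need quoting.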