Extracting textual content from XML documents usin

Posted 2019-09-22 10:40

How is it possible to extract the textual content of an XML document, preferably using XSLT?

For a fragment such as

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>

the desired result is:

textual content, textual content, textual content

What is the best output format (table, CSV, etc.) so that the content can be processed further, for example for text mining?

Thanks

Update

To extend the question: how can the content of each record be extracted separately? For example, for the XML below:

<Records>
<record id="1">
    <tag1>textual co</tag1>
    <tag2>textual con</tag2>
    <tag2>textual cont</tag2>
</record>
<record id="2">
    <tag1>some text</tag1>
    <tag2>some tex</tag2>
    <tag2>some te</tag2>
</record>
</Records>

The desired result should look like:

(textual co, textual con, textual cont) , (some text, some tex, some te)

or a better format for further processing.

3 answers
来,给爷笑一个
#2 · 2019-09-22 10:59

Just an (updated) answer for the first part of the question. For the input in the question, the following XSLT

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<!-- For each record, print its child elements' text separated by ", " -->
<xsl:template match="record">
    <xsl:for-each select="child::*">
        <xsl:value-of select="normalize-space()"/>
        <xsl:if test="position() != last()">, </xsl:if>
    </xsl:for-each>
</xsl:template>
</xsl:transform>

has the result

textual content, textual content, textual content

The template matching record prints the whitespace-normalized value of each child element and appends ", " unless it is the last child.

beautiful°
#3 · 2019-09-22 11:08

This is shorter and more generic in that it does not name any elements. It also exploits XSLT's built-in templates, which give the language default behaviour and reduce the amount of code you have to write. It assumes XSLT 1.0.

Below is a shorter variation of lingamurthyCS's answer that lets the built-in template rule handle the last text node. It's analogous to my previous answer.

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<!-- Every element except the last child of its parent: emit its text plus a comma -->
<xsl:template match="*[position() != last()]">
    <xsl:value-of select="."/><xsl:text>,</xsl:text>
</xsl:template>
</xsl:transform>
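
For reference, the built-in template rules that this stylesheet relies on behave roughly as if the following templates were also declared. This is only a sketch based on the XSLT 1.0 specification; in practice they are never written out.

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Built-in rule for the root node and elements: keep walking down the tree -->
<xsl:template match="*|/">
    <xsl:apply-templates/>
</xsl:template>

<!-- Built-in rule for text and attribute nodes: copy the string value to the output -->
<xsl:template match="text()|@*">
    <xsl:value-of select="."/>
</xsl:template>

<!-- Built-in rule for comments and processing instructions: output nothing -->
<xsl:template match="processing-instruction()|comment()"/>
</xsl:transform>
```

The last child element of each parent does not match *[position() != last()], so it falls through to the first rule, and its text node is then emitted by the second rule without a trailing comma.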

However, this particular job is better suited to XQuery.

Paste your XML into http://try.zorba.io/queries/xquery and append /string-join(*,',') to it, like so:

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>/string-join(*,',')

Translating that into XSLT 2.0, if that is what they are using, is left as an exercise for the OP.
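
As a starting point, here is a minimal sketch of what such a translation might look like. It assumes an XSLT 2.0 processor (Saxon, for example) and takes only the record element name from the question; the trailing newline per record is an illustrative choice, not part of the original XQuery.

```xml
<xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<!-- Drop whitespace-only text nodes so the built-in rules do not echo indentation -->
<xsl:strip-space elements="*"/>

<!-- One line per record: string values of its child elements, joined with ", " -->
<xsl:template match="record">
    <xsl:value-of select="string-join(*, ', ')"/>
    <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:transform>
```

In XSLT 2.0 the same join can also be written as xsl:value-of with select="*" and separator=", ", which avoids the explicit string-join call.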

Melony?
#4 · 2019-09-22 11:18

You can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
    <xsl:apply-templates select="//text()"/>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>

And for the update in the question, you can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/*">
    <xsl:apply-templates/>
</xsl:template>
<xsl:template match="*">(<xsl:apply-templates select=".//text()"/>)<xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>
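
On the question of an output format suited to further processing, one option is CSV with one row per record. The following is a rough XSLT 1.0 sketch under two assumptions that are not stated in the question: every record carries an id attribute, as in the updated example, and the text values contain no double quotes, so no quote escaping is attempted.

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:strip-space elements="*"/>

<!-- One CSV row per record: the id attribute, then each child element's text -->
<xsl:template match="record">
    <xsl:value-of select="@id"/>
    <xsl:for-each select="*">
        <!-- Assumption: values contain no double quotes, so they are quoted as-is -->
        <xsl:text>,"</xsl:text>
        <xsl:value-of select="normalize-space()"/>
        <xsl:text>"</xsl:text>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:transform>
```

Quoting every field keeps commas inside the text from being read as column separators, which most CSV readers used for text mining handle directly.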