Extracting textual content from XML documents using XSLT

How can I extract the textual content of an XML document, preferably using XSLT?

For a fragment such as

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>

the desired result is:

textual content, textual content, textual content

What is the best output format (table, CSV, etc.) for making the content processable in further operations, such as text mining?

Thanks

Update

To extend the question: how can I extract the content of each record separately? For example, for the XML below:

<Records>
<record id="1">
    <tag1>textual co</tag1>
    <tag2>textual con</tag2>
    <tag2>textual cont</tag2>
</record>
<record id="2">
    <tag1>some text</tag1>
    <tag2>some tex</tag2>
    <tag2>some te</tag2>
</record>
</Records>

The desired result would be something like:

(textual co, textual con, textual cont) , (some text, some tex, some te)

or any format better suited to further processing.

Tags: xml xslt text-mining

3 Answers

来，给爷笑一个

#2 · 2019-09-22 10:59

Just an (updated) answer to the first part of the question: for the input in the question, the following XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Plain-text output; doctype, indent and XML-declaration settings do not apply to method="text" -->
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="record">
    <!-- Emit each child element's whitespace-normalized text -->
    <xsl:for-each select="*">
      <xsl:value-of select="normalize-space()"/>
      <!-- Separator after every child except the last -->
      <xsl:if test="position() != last()">, </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

has the result

textual content, textual content, textual content

The template matching record prints the whitespace-normalized value of each child element and appends ", " after every child except the last one.
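
For anyone who wants to cross-check the stylesheet's output outside an XSLT processor, here is a minimal sketch of the same logic in Python, using only the standard library. This is a swapped-in technique, not XSLT; the function name join_children and the SAMPLE constant are hypothetical, and the whitespace handling only approximates XPath's normalize-space().

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question.
SAMPLE = """<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>"""


def join_children(xml_text, sep=", "):
    """Join the whitespace-normalized text of each child of the root element.

    join_children is a hypothetical helper name used for illustration only.
    """
    root = ET.fromstring(xml_text)
    values = []
    for child in root:
        # itertext() yields all descendant text, like an element's string value in XPath.
        raw = "".join(child.itertext())
        # split() + join collapses whitespace runs, roughly like normalize-space().
        values.append(" ".join(raw.split()))
    # str.join puts the separator between items only, so no trailing separator.
    return sep.join(values)


if __name__ == "__main__":
    print(join_children(SAMPLE))
```

Joining a list with a separator plays the role of the position() != last() test in the stylesheet: the separator only ever appears between values.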


beautiful°

#3 · 2019-09-22 11:08

This is shorter and more generic in that it does not name any elements. It also exploits XSLT's built-in template rules, which give the language default behaviour and reduce the amount of code you have to write. This assumes XSLT 1.0.

Below is a shorter variation of lingamurthyCS's answer that lets the built-in template rule handle the last text node. It's analogous to my previous answer.

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="*[position() != last()]">
    <xsl:value-of select="."/><xsl:text>,</xsl:text>    
</xsl:template>
</xsl:transform>

However, this particular job is better suited to XQuery.

Paste your XML into http://try.zorba.io/queries/xquery and just stick a /string-join(*,',') on the end of it, like so:

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>/string-join(*,',')

Exercise for the OP to translate that into XSLT 2.0 if that is what they are using.


Melony?

#4 · 2019-09-22 11:18

You can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
    <xsl:apply-templates select="//text()"/>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>

And for the update in the question, you can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/*">
    <xsl:apply-templates/>
</xsl:template>
<xsl:template match="*">(<xsl:apply-templates select=".//text()"/>)<xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>
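
On the "better format for further processing" part of the question, one option (not necessarily the best one) is a CSV row per record, which most text-mining tools can read directly. The following is a rough sketch in Python rather than XSLT, using only the standard library. The helper names records_to_rows and rows_to_csv are hypothetical, and putting the record id in the first column is an assumption about what is useful downstream.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample input copied from the question's update.
RECORDS = """<Records>
<record id="1">
    <tag1>textual co</tag1>
    <tag2>textual con</tag2>
    <tag2>textual cont</tag2>
</record>
<record id="2">
    <tag1>some text</tag1>
    <tag2>some tex</tag2>
    <tag2>some te</tag2>
</record>
</Records>"""


def records_to_rows(xml_text):
    """Return one list per <record>: its id followed by each child's text.

    Hypothetical helper; including the id column is an assumption.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        # Collapse whitespace in each child's full text content.
        texts = [" ".join("".join(child.itertext()).split()) for child in record]
        rows.append([record.get("id", "")] + texts)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(records_to_rows(RECORDS)), end="")
```

Compared with the parenthesized output, a real CSV writer also takes care of quoting when the text itself contains commas, which the plain ", " separator in the stylesheets does not.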

