Batch processing tab-delimited files in XSLT

Posted 2019-07-23 14:28

I have an XML file with a list of 92 tab-delimited text files:

<?xml version="1.0" encoding="UTF-8"?>
<dumpSet>
  <dump filename="file_one.txt"/>
  <dump filename="file_two.txt"/>
  <dump filename="file_three.txt"/>
  ...
</dumpSet>

The first row in each file contains the field names for the subsequent rows. The sample below is just an example: the names and number of elements will vary by record, and most will have around 50 field names.

Title   Translated Title    Watch Video Interviewee Interviewer 
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John
...

Oxygen XML Editor has an Import function that can convert text files to XML, but--as far as I know--this cannot be done in a batch process with multiple files. So far, the batch processing part has not been a problem. I am using XSLT 2.0's unparsed-text() function to pull in the content from the files in the list. However, I am struggling to group the XML output correctly. Example of desired output:

<collection>
  <record>
    <title>Interview with Barack Obama</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Obama, Barack</interviewee>
    <interviewer>Walters, Barbara</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  <record>
    <title>Interview with Sarah Palin</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Palin, Sarah</interviewee>
    <interviewer>Couric, Katie</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  ...
</collection>

Right now, here is the kind of output I am getting:

<collection>
  <record>
    <title>title</title>
    <value>Interview with Barack Obama</value>
    <value>Interview with Sarah Palin</value>
    <translatedtitle>translatedtitle</translatedtitle>
    <value/>
    <value/>
    <watchvideo>watchvideo</watchvideo>
    <value/>
    <value/>
    <interviewee>interviewee</interviewee>
    <value>Obama, Barack</value>
    <value>Palin, Sarah</value>
    <interviewer>interviewer</interviewer>
    <value>Walters, Barbara</value>
    <value>Couric, Katie</value>
    <videographer>videographer</videographer>
    <value>Smith, John</value>
    <value>Smith, John </value>
    <value/>
    <value/>
  </record>
</collection>

That is, I'm not able to group the output by record. Here's the current code I'm working with, based on an example in Doug Tidwell's XSLT book:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0">

    <xsl:param name="i" select="1"/>
    <xsl:param name="increment" select="1"/>
    <xsl:param name="operator" select="'&lt;='"/>
    <xsl:param name="testVal" select="100"/>    

    <xsl:template match="/">
        <collections>
            <collection>
                <xsl:for-each select="dumpSet/dump">

                    <!-- Pull in external tab-delimited files -->  
                    <xsl:for-each select="unparsed-text(concat('../2013-04-26/',@filename),'UTF-8')">
                        <record>

                            <!-- Call recursive template to loop through elements. -->
                            <xsl:call-template name="for-loop">
                                <xsl:with-param name="i" select="$i"/>
                                <xsl:with-param name="increment" select="$increment"/>
                                <xsl:with-param name="operator" select="$operator"/>
                                <xsl:with-param name="testVal" select="$testVal"/>
                            </xsl:call-template>
                        </record>
                    </xsl:for-each>
                </xsl:for-each>
            </collection>
        </collections>
    </xsl:template>

    <xsl:template name="for-loop">
        <xsl:param name="i"/>
        <xsl:param name="increment"/>
        <xsl:param name="operator"/>
        <xsl:param name="testVal"/>
        <xsl:variable name="testPassed">
            <xsl:choose>
                <xsl:when test="$operator = '&lt;='">
                    <xsl:if test="$i &lt;= $testVal">
                        <xsl:text>true</xsl:text>
                    </xsl:if>
                </xsl:when>
            </xsl:choose>
        </xsl:variable>
        <xsl:if test="$testPassed = 'true'">

            <!-- Separate the header from the tab-delimited file. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[1]">

                <!-- Spit out the field names. -->
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <xsl:element name="{replace(lower-case(translate(.,'-.','')),' ','')}">
                        <xsl:value-of select="replace(lower-case(translate(.,'-.','')),' ','')"/>
                    </xsl:element>
                </xsl:for-each>
            </xsl:for-each>

            <!-- For the following rows, loop through the field values. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[position()&gt;1]">
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <value>
                        <xsl:value-of select="."/>
                    </value>
                </xsl:for-each>
            </xsl:for-each>

            <!-- Call the template to increment. -->  
            <xsl:call-template name="for-loop">
                <xsl:with-param name="i" select="$i + $increment"/>
                <xsl:with-param name="increment" select="$increment"/>
                <xsl:with-param name="operator" select="$operator"/>
                <xsl:with-param name="testVal" select="$testVal"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

How should I change this to group the output by record?

2 answers
爷、活的狠高调
Reply #2 · 2019-07-23 14:39

It might be easier if you use xsl:analyze-string to parse each record. There might be a better way to get the element names from the header than what is in my example, but I didn't have time to think about this too long.

Notes:

You may have to change the encoding for unparsed-text(). I usually pass the encoding in as a parameter so I don't have to modify the stylesheet. Maybe the encoding could be added to <dump/>?

It would be a good idea to use unparsed-text-available() to see if the file exists and can be read with the specified encoding.

Also, you may want to do a check to make sure the value from the header is a valid QName. For example if you have an apostrophe in the header, you'll get an error. Maybe it would be better to use the field names from the header as an attribute value instead of an element name. (Like: <field name="Interviewee">Obama, Barack</field>)

Here's my example:

XML Input

<dumpSet>
  <dump filename="file_one.txt"/>
</dumpSet>

file_one.txt

Title   Translated Title    Watch Video Interviewee Interviewer Videographer
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <xsl:variable name="text" select="unparsed-text(@filename, 'iso-8859-1')"/>
        <xsl:variable name="header">
            <xsl:analyze-string select="$text" regex="(..*)">
                <xsl:matching-substring>
                    <xsl:if test="position()=1">
                        <xsl:value-of select="regex-group(1)"/>
                    </xsl:if>                   
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:variable name="headerTokens" select="tokenize($header,'\t')"/>
        <xsl:analyze-string select="$text" regex="(..*)">
            <xsl:matching-substring>
                <xsl:if test="not(position()=1)">
                    <record>
                        <xsl:analyze-string select="." regex="([^\t][^\t]*)\t?|\t">
                            <xsl:matching-substring>
                                <xsl:variable name="pos" select="position()"/>
                                <xsl:element name="{replace(normalize-space(lower-case($headerTokens[$pos])),' ','')}">
                                    <xsl:value-of select="normalize-space(regex-group(1))"/>                            
                                </xsl:element>                              
                            </xsl:matching-substring>
                        </xsl:analyze-string>
                    </record>
                </xsl:if>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>

Output

<collection>
   <record>
      <title>Interview with Barack Obama</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Obama, Barack</interviewee>
      <interviewer>Walters, Barbara</interviewer>
   </record>
   <record>
      <title>Interview with Sarah Palin</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Palin, Sarah</interviewee>
      <interviewer>Couric, Katie</interviewer>
      <videographer>Smith, John</videographer>
   </record>
</collection>
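
The notes at the top of this answer could be combined roughly as follows. This is only a sketch, not a tested replacement for the stylesheet above: the `encoding` stylesheet parameter and the `encoding` attribute on `<dump/>` are hypothetical names, and it swaps xsl:analyze-string for plain tokenize() to keep things short. Field names go into a `name` attribute, so a header containing an apostrophe or a space cannot produce an invalid element name.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>

    <!-- Hypothetical parameter: default encoding, overridable without editing the stylesheet. -->
    <xsl:param name="encoding" select="'UTF-8'"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <!-- Hypothetical @encoding attribute on <dump/> takes precedence over the parameter. -->
        <xsl:variable name="enc" select="string((@encoding, $encoding)[1])"/>
        <xsl:choose>
            <!-- Only read the file if it exists and decodes with the chosen encoding. -->
            <xsl:when test="unparsed-text-available(@filename, $enc)">
                <!-- Split into lines (LF or CRLF) and drop blank lines. -->
                <xsl:variable name="lines"
                    select="tokenize(unparsed-text(@filename, $enc), '\r?\n')[normalize-space()]"/>
                <!-- The first line holds the field names. -->
                <xsl:variable name="names" select="tokenize($lines[1], '\t')"/>
                <xsl:for-each select="$lines[position() gt 1]">
                    <xsl:variable name="values" select="tokenize(., '\t')"/>
                    <record>
                        <!-- One <field> per header column; missing values become empty. -->
                        <xsl:for-each select="$names">
                            <xsl:variable name="pos" select="position()"/>
                            <field name="{normalize-space(.)}">
                                <xsl:value-of select="normalize-space($values[$pos])"/>
                            </field>
                        </xsl:for-each>
                    </record>
                </xsl:for-each>
            </xsl:when>
            <xsl:otherwise>
                <!-- Report unreadable files instead of aborting the whole batch. -->
                <xsl:message>
                    <xsl:text>Cannot read </xsl:text>
                    <xsl:value-of select="@filename"/>
                    <xsl:text> as </xsl:text>
                    <xsl:value-of select="$enc"/>
                </xsl:message>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>
```

With Saxon, a stylesheet parameter like this can normally be set on the command line as `encoding=iso-8859-1`. Note that a relative @filename passed to unparsed-text() is resolved against the stylesheet's base URI, not the input document's.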
走好不送
Reply #3 · 2019-07-23 14:58

Please try this XSLT to get an idea of how you might generate the desired output. You will need to add your translate() call wherever you need it.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <collections>
      <collection>
        <xsl:for-each select="dumpSet/dump">
          <xsl:for-each select="tokenize(unparsed-text(@filename,'UTF-8'),'\n')[not(position()=1)]">
            <record>
              <title><xsl:value-of select="tokenize(.,'\t')[1]"/></title>
              <translatedtitle><xsl:value-of select="tokenize(.,'\t')[2]"/></translatedtitle>
              <watchvideo><xsl:value-of select="tokenize(.,'\t')[3]"/></watchvideo>
              <interviewee><xsl:value-of select="tokenize(.,'\t')[4]"/></interviewee>
              <interviewer><xsl:value-of select="tokenize(.,'\t')[5]"/></interviewer>
              <videographer><xsl:value-of select="tokenize(.,'\t')[6]"/></videographer>
            </record>
          </xsl:for-each>
        </xsl:for-each>
      </collection>
    </collections>
  </xsl:template>

</xsl:stylesheet>

output:

<collections xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <collection>
      <record>
         <title>Interview with Barack Obama</title>
         <translatedtitle/>
         <watchvideo>Obama, Barack</watchvideo>
         <interviewee>Walters, Barbara</interviewee>
         <interviewer>&#xD;</interviewer>
         <videographer/>
      </record>
      <record>
         <title>Interview with Sarah Palin</title>
         <translatedtitle/>
         <watchvideo>Palin, Sarah</watchvideo>
         <interviewee>Couric, Katie</interviewee>
         <interviewer>Smith, John</interviewer>
         <videographer/>
      </record>
   </collection>
</collections>
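
The `&#xD;` in the first record suggests the input file uses CRLF line endings, and the hard-coded element names only fit this one file layout. A possible variation, offered as a sketch under those assumptions, splits lines on `\r?\n` and takes the element names from the header row. The `f:element-name` function and its `urn:example:local-functions` namespace are made up for illustration; it strips everything except lowercase letters, digits and underscores, which is stricter than XML requires but avoids invalid element names.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local-functions"
    exclude-result-prefixes="#all">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical helper: turn a header cell such as "Watch Video" into "watchvideo". -->
    <xsl:function name="f:element-name" as="xs:string">
        <xsl:param name="label" as="xs:string"/>
        <xsl:variable name="clean" select="replace(lower-case($label), '[^a-z0-9_]', '')"/>
        <!-- Element names cannot be empty or start with a digit, so prefix those cases. -->
        <xsl:sequence select="if (matches($clean, '^[a-z_]')) then $clean
                              else concat('field_', $clean)"/>
    </xsl:function>

    <xsl:template match="/">
        <collections>
            <collection>
                <xsl:for-each select="dumpSet/dump">
                    <!-- Split on LF or CRLF so no stray carriage return ends up in a value. -->
                    <xsl:variable name="lines"
                        select="tokenize(unparsed-text(@filename, 'UTF-8'), '\r?\n')[normalize-space()]"/>
                    <!-- Element names come from the first line of each file. -->
                    <xsl:variable name="names"
                        select="for $h in tokenize($lines[1], '\t') return f:element-name($h)"/>
                    <xsl:for-each select="$lines[position() gt 1]">
                        <xsl:variable name="values" select="tokenize(., '\t')"/>
                        <record>
                            <!-- Pair each header name with the value at the same position. -->
                            <xsl:for-each select="$names">
                                <xsl:variable name="pos" select="position()"/>
                                <xsl:element name="{.}">
                                    <xsl:value-of select="normalize-space($values[$pos])"/>
                                </xsl:element>
                            </xsl:for-each>
                        </record>
                    </xsl:for-each>
                </xsl:for-each>
            </collection>
        </collections>
    </xsl:template>
</xsl:stylesheet>
```

Because the inner loop runs over the header names rather than the values, a row with fewer cells than the header still gets one element per column, with the missing ones left empty.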