I have an XML file with a list of 92 tab-delimited text files:
<?xml version="1.0" encoding="UTF-8"?>
<dumpSet>
<dump filename="file_one.txt"/>
<dump filename="file_two.txt"/>
<dump filename="file_three.txt"/>
...
</dumpSet>
The first row in each file contains the field names for the subsequent rows. This is just an example. The names and number of elements will vary by record. Most will have around 50 field names.
Title Translated Title Watch Video Interviewee Interviewer
Interview with Barack Obama Obama, Barack Walters, Barbara
Interview with Sarah Palin Palin, Sarah Couric, Katie Smith, John
...
Oxygen XML Editor has an Import function that can convert text files to XML, but--as far as I know--this cannot be done in a batch process with multiple files. So far, the batch processing part has not been a problem. I am using XSLT 2.0's unparsed-text() function to pull in the content from the files in the list. However, I am struggling to group the XML output correctly. Example of desired output:
<collection>
<record>
<title>Interview with Barack Obama</title>
<translatedtitle></translatedtitle>
<watchvideo></watchvideo>
<interviewee>Obama, Barack</interviewee>
<interviewer>Walters, Barbara</interviewer>
<videographer>Smith, John</videographer>
</record>
<record>
<title>Interview with Sarah Palin</title>
<translatedtitle></translatedtitle>
<watchvideo></watchvideo>
<interviewee>Palin, Sarah</interviewee>
<interviewer>Couric, Katie</interviewer>
<videographer>Smith, John</videographer>
</record>
...
</collection>
Right now, here is the kind of output I am getting:
<collection>
<record>
<title>title</title>
<value>Interview with Barack Obama</value>
<value>Interview with Sarah Palin</value>
<translatedtitle>translatedtitle</translatedtitle>
<value/>
<value/>
<watchvideo>watchvideo</watchvideo>
<value/>
<value/>
<interviewee>interviewee</interviewee>
<value>Obama, Barack</value>
<value>Palin, Sarah</value>
<interviewer>interviewer</interviewer>
<value>Walters, Barbara</value>
<value>Couric, Katie</value>
<videographer>videographer</videographer>
<value>Smith, John</value>
<value>Smith, John </value>
<value/>
<value/>
</record>
</collection>
That is, I'm not able to group the output by record. Here's the current code I'm working with, based on an example in Doug Tidwell's XSLT book:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0">
<xsl:param name="i" select="1"/>
<xsl:param name="increment" select="1"/>
<xsl:param name="operator" select="'&lt;='"/>
<xsl:param name="testVal" select="100"/>
<xsl:template match="/">
<collections>
<collection>
<xsl:for-each select="dumpSet/dump">
<!-- Pull in external tab-delimited files -->
<xsl:for-each select="unparsed-text(concat('../2013-04-26/',@filename),'UTF-8')">
<record>
<!-- Call recursive template to loop through elements. -->
<xsl:call-template name="for-loop">
<xsl:with-param name="i" select="$i"/>
<xsl:with-param name="increment" select="$increment"/>
<xsl:with-param name="operator" select="$operator"/>
<xsl:with-param name="testVal" select="$testVal"/>
</xsl:call-template>
</record>
</xsl:for-each>
</xsl:for-each>
</collection>
</collections>
</xsl:template>
<xsl:template name="for-loop">
<xsl:param name="i"/>
<xsl:param name="increment"/>
<xsl:param name="operator"/>
<xsl:param name="testVal"/>
<xsl:variable name="testPassed">
<xsl:choose>
<xsl:when test="$operator = '&lt;='">
<xsl:if test="$i &lt;= $testVal">
<xsl:text>true</xsl:text>
</xsl:if>
</xsl:when>
</xsl:choose>
</xsl:variable>
<xsl:if test="$testPassed = 'true'">
<!-- Separate the header from the tab-delimited file. -->
<xsl:for-each select="tokenize(.,'\r|\n')[1]">
<!-- Spit out the field names. -->
<xsl:for-each select="tokenize(.,'\t')[$i]">
<xsl:element name="{replace(lower-case(translate(.,'-.','')),' ','')}">
<xsl:value-of select="replace(lower-case(translate(.,'-.','')),' ','')"/>
</xsl:element>
</xsl:for-each>
</xsl:for-each>
<!-- For the following rows, loop through the field values. -->
<xsl:for-each select="tokenize(.,'\r|\n')[position()>1]">
<xsl:for-each select="tokenize(.,'\t')[$i]">
<value>
<xsl:value-of select="."/>
</value>
</xsl:for-each>
</xsl:for-each>
<!-- Call the template to increment. -->
<xsl:call-template name="for-loop">
<xsl:with-param name="i" select="$i + $increment"/>
<xsl:with-param name="increment" select="$increment"/>
<xsl:with-param name="operator" select="$operator"/>
<xsl:with-param name="testVal" select="$testVal"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
How should I change this to group the output by record?
It might be easier if you use xsl:analyze-string to parse each record. There might be a better way to get the element names from the header than what is in my example, but I didn't have time to think about this too long.

Notes:

You may have to change the encoding for unparsed-text(). I usually pass the encoding in as a parameter so I don't have to modify the stylesheet. Maybe the encoding could be added to <dump/>?

It would be a good idea to use unparsed-text-available() to see if the file exists and can be read with the specified encoding.

Also, you may want to do a check to make sure the value from the header is a valid QName. For example, if you have an apostrophe in the header, you'll get an error. Maybe it would be better to use the field names from the header as an attribute value instead of an element name. (Like: <field name="Interviewee">Obama, Barack</field>)

Here's my example:
XML Input
file_one.txt
XSLT 2.0
Output
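A minimal sketch of per-row grouping that follows the notes above. It is not the stylesheet from this answer: it swaps xsl:analyze-string for tokenize(), since tokenize() returns an empty string for each empty field between consecutive tabs, which keeps values aligned with the header by position. The $dir and $encoding parameters, the f: namespace URI, and the f:name function are assumptions made for illustration. f:name removes every character outside a-z and 0-9, which is stricter than the translate()/replace() cleanup in the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="#all" version="2.0">

  <xsl:output indent="yes"/>

  <!-- Assumed parameters: folder holding the text files, and their encoding. -->
  <xsl:param name="dir" select="'../2013-04-26/'"/>
  <xsl:param name="encoding" select="'UTF-8'"/>

  <!-- Hypothetical helper: turn a header cell into a usable element name. -->
  <xsl:function name="f:name" as="xs:string">
    <xsl:param name="header" as="xs:string"/>
    <xsl:variable name="n" select="replace(lower-case($header), '[^a-z0-9]', '')"/>
    <!-- Element names cannot be empty or start with a digit. -->
    <xsl:sequence select="if (matches($n, '^[a-z]')) then $n else concat('field', $n)"/>
  </xsl:function>

  <xsl:template match="/">
    <collection>
      <xsl:apply-templates select="dumpSet/dump"/>
    </collection>
  </xsl:template>

  <xsl:template match="dump">
    <xsl:variable name="href" select="concat($dir, @filename)"/>
    <xsl:choose>
      <xsl:when test="unparsed-text-available($href, $encoding)">
        <!-- Split the file into lines and drop empty ones. -->
        <xsl:variable name="lines" as="xs:string*"
            select="tokenize(unparsed-text($href, $encoding), '[\r\n]+')[. ne '']"/>
        <!-- Element names come from the first line. -->
        <xsl:variable name="names" as="xs:string*"
            select="for $h in tokenize($lines[1], '\t') return f:name($h)"/>
        <!-- One record per remaining line: the outer loop is over rows. -->
        <xsl:for-each select="$lines[position() gt 1]">
          <xsl:variable name="values" as="xs:string*" select="tokenize(., '\t')"/>
          <record>
            <!-- The inner loop is over columns; missing trailing values come out empty. -->
            <xsl:for-each select="$names">
              <xsl:variable name="pos" select="position()"/>
              <xsl:element name="{.}">
                <xsl:value-of select="normalize-space($values[$pos])"/>
              </xsl:element>
            </xsl:for-each>
          </record>
        </xsl:for-each>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message>Skipping unreadable file: <xsl:value-of select="$href"/></xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The difference from the recursive for-loop in the question is the nesting order: rows are the outer loop and columns the inner one, so each row becomes its own record. To use the attribute form from the notes instead, the xsl:element could be replaced by a literal field element whose name attribute takes the untouched header cell, in which case f:name is not needed.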
Please try this XSLT to get some idea of how you might generate your desired output. You will need to include your translate function wherever you need it.
output: