XSLT to process XML with very loose standards (EAD)

Posted 2019-08-03 04:13

I've been having a hell of a week trying to write XSLT code that can process XML documents that conform to the (very permissive) EAD standards.

The useful information in an EAD document is hard to locate precisely. Different EAD documents can place the same bit of information in entirely different parts of the data tree. In addition, within a single EAD document, the same tag can be used numerous times in different locations for different information. For an example of this, please see this SO post. This makes it hard to design a single XSLT file that properly handles these different files.

In general terms, the problem can be described as:

  • How do I select a specific EAD node which is in an unknown location,
  • Without accidentally selecting unwanted nodes that have the same name()?
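To make the ambiguity concrete, here is a made-up, heavily trimmed EAD fragment (not taken from any real finding aid; the titles and identifiers are invented for illustration). The same unittitle tag describes the whole collection in one place and a single component in another, so a naive //unittitle would pick up both:

```xml
<ead>
  <archdesc level="collection">
    <did>
      <!-- describes the collection as a whole -->
      <unittitle>Smith family papers</unittitle>
      <unitid>MS-001</unitid>
    </did>
    <dsc>
      <c01 level="series">
        <did>
          <!-- same tag name, but it describes one component only -->
          <unittitle>Correspondence</unittitle>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>
```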

I've finally put together the XSLT I needed, and thought it would be best to drop a generic version of the code here so others can benefit from it or improve upon it.

I'd love to tag this question with an "EAD" tag, but I don't have enough rep. If anyone with the appropriate amount of rep thinks it would be useful, please do so.

1 answer
\"骚年 ilove
Reply #2 · 2019-08-03 04:54

First a quick description of the solution, followed by the code.

  1. Check whether this EAD document contains component (child) records (designated with a <cXX> tag). If not, we don't have to worry about duplicate EAD tags. The tags can still be buried under arbitrary wrappers; to find them, see step 3.
  2. If child records exist, be careful not to process the <dsc> tag until the other tags have been processed. To find the other tags, see step 3; then go to step 4 to process the child records.
  3. Recurse through the various wrappers with a template that matches them and calls apply-templates on every element node farther down the tree.
  4. We are now processing a child record. Do this by repeating step 2 (carefully process all other tags before tackling the children of this child record), then step 4 again for each nested child record.

Here's the (generic version of the) XSLT code I came up with:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="ISO-8859-1" indent="yes"/>

<xsl:template match="/ead">
<records>
    <xsl:choose>
        <xsl:when test="//dsc">
            <!-- if there are <cXX> nodes, we'll handle the main record differently.
                 <cXX> nodes are always found in the 'dsc' node, which contains nothing else -->
            <xsl:call-template name="carefully_process"/>
        </xsl:when>
        <xsl:otherwise>
            <record>
                <!-- Just process the existing nodes -->
                <xsl:apply-templates select="*"/>
            </record>
        </xsl:otherwise>
    </xsl:choose>
</records>
</xsl:template>

<xsl:template name="carefully_process">
    <!-- first we'll process all the nodes for the main
         record. Then we'll call the child records -->
    <record>
        <!-- have to be careful not to process //archdesc/dsc yet -->
        <xsl:apply-templates select="*[not(self::archdesc)]"/>
        <xsl:apply-templates select="archdesc/*[not(self::dsc)]"/>

    <!-- Now we can close off the master record, -->
    </record>
    <!-- and process the child records -->
    <xsl:apply-templates select="/ead/archdesc/dsc"/>
</xsl:template>

<xsl:template match="dsc">
    <!-- Start processing the child records (we use for-each to get a good position()) -->
    <xsl:for-each select="*[starts-with(name(),'c0') or starts-with(name(),'c1') or name() = 'c']">
        <xsl:apply-templates select=".">
            <!-- we pass the unittitle and unitid of the master record, so that child
                 records can be linked to it. We pass the position of the child so that
                 a unitid can be created if it doesn't exist -->
            <xsl:with-param name="partitle" select="normalize-space(/ead/archdesc/did/unittitle)"/>
            <xsl:with-param name="parid" select="normalize-space(/ead/archdesc/did/unitid)"/>
            <xsl:with-param name="pos" select="position()"/>
        </xsl:apply-templates>
    </xsl:for-each>
</xsl:template>

<!-- process child nodes -->
<xsl:template match="*[starts-with(name(),'c0') or starts-with(name(),'c1') or name() = 'c']" >
<xsl:param name="partitle"/>
<xsl:param name="parid"/>
<xsl:param name="pos"/>
    <!-- start this child record -->
    <record>

        <!-- EAD does not require a unitid, but my code does.
             If it doesn't exist, create it -->
        <xsl:if test="not(./did/unitid)">
            <atom name="unitid">
                <xsl:value-of select="$parid"/><xsl:text>-</xsl:text><xsl:value-of select="$pos"/>
            </atom>
        </xsl:if>

        <!-- get the level of this component -->
        <atom name="eadlevel">
            <xsl:value-of select="concat(translate(substring(@level,1,1),'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ'),substring(@level,2))"/>
        </atom>

        <!-- Do *something* to attach this record to its parent.
             Probably involves $partitle and $parid. For example: -->
        <ref>
            <atom name="unittitle"><xsl:value-of select="$partitle"/></atom>
            <atom name="unitid"><xsl:value-of select="$parid"/></atom>
        </ref>

        <!-- now process all the other nodes -->
        <xsl:apply-templates select="*[not(starts-with(name(),'c0') or starts-with(name(),'c1') or name() = 'c')]"/>

    <!-- finish this child record -->
    </record>

    <!-- prep the variables we'll need for attaching any child records (<cXX+1>) to this record -->
    <xsl:variable name="this_title">
        <xsl:value-of select="normalize-space(./did/unittitle)"/>
    </xsl:variable> 
    <xsl:variable name="this_id">
        <xsl:if test="./did/unitid">
            <xsl:value-of select="normalize-space(./did/unitid)"/>
        </xsl:if>
        <xsl:if test="not(./did/unitid)">
            <xsl:value-of select="$parid"/><xsl:text>-</xsl:text><xsl:value-of select="$pos"/>
        </xsl:if>
    </xsl:variable>

    <!-- now process the children of this node -->
    <xsl:for-each select="*[starts-with(name(),'c0') or starts-with(name(),'c1') or name() = 'c']">
        <xsl:apply-templates select=".">
            <xsl:with-param name="partitle" select="$this_title"/>
            <xsl:with-param name="parid" select="$this_id"/>
            <xsl:with-param name="pos" select="position()"/>
        </xsl:apply-templates>
    </xsl:for-each>
</xsl:template>

<!-- these are usually just wrappers. Go one level deeper -->
<xsl:template match="descgrp|eadheader|revisiondesc|filedesc|titlestmt|profiledesc|archdesc|archdescgrp|daogrp|langusage|did|frontmatter">
    <xsl:apply-templates select="*"/>
</xsl:template>

<!-- below this point, add templates for processing specific EAD units
     of information. For example, the template might look like

<xsl:template match="titleproper">
    <atom name="titleproper">
        <xsl:value-of select="normalize-space(.)"/>
    </atom>
</xsl:template>
-->

<!-- instead of having a template for each EAD information unit, consider
     a generic template that handles them all the same way. For example:
-->
<xsl:template match="*">
    <atom>
        <xsl:attribute name="name"><xsl:value-of select="name()"/></xsl:attribute>
        <xsl:value-of select="normalize-space(.)"/>
    </atom>
</xsl:template>

</xsl:stylesheet>
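
For anyone who wants to try the stylesheet, a tiny test document along these lines should exercise every branch. It is invented for illustration (the titles and identifiers are made up), and it assumes a DTD-style EAD file with no XML namespace, since the templates above match un-prefixed element names.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- hypothetical test input, not a real finding aid -->
<ead>
  <eadheader>
    <filedesc>
      <titlestmt>
        <!-- reached by recursing through three wrapper elements -->
        <titleproper>Sample finding aid</titleproper>
      </titlestmt>
    </filedesc>
  </eadheader>
  <archdesc level="collection">
    <did>
      <unittitle>Smith family papers</unittitle>
      <unitid>MS-001</unitid>
    </did>
    <dsc>
      <!-- first-level component with no unitid of its own -->
      <c01 level="series">
        <did>
          <unittitle>Correspondence</unittitle>
        </did>
        <!-- nested component that does carry a unitid -->
        <c02 level="file">
          <did>
            <unittitle>Letters, 1901</unittitle>
            <unitid>MS-001-A</unitid>
          </did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
```

Run against this input, the stylesheet should emit three <record> elements inside <records>: one for the collection, one for the c01 and one for the c02. The c01 has no unitid, so it gets the generated value MS-001-1 (the parent id, a hyphen, and its position), and the c02 is linked back to that generated id through its <ref> block.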