XSLT: Merging two log files with different structu

Posted 2019-01-27 06:30

Question:

As asked by Dimitre Novatchev I created a new question, as some parts of the old question changed.

(Link to the old question: Merging two different XML log files (trace and messages) using date and timestamp?)

I need to merge two XML log files (up to 700MB). One log file contains a trace with position updates. The other log file contains the received messages. There can be multiple received messages without a position update in between, and vice versa.

Both logs have timestamps including milliseconds (123 in this example):

  • The trace log uses <date> (e.g. 14.7.2012 11:08:07.123)
  • The message log uses a unix timestamp in milliseconds, <timeStamp> (e.g. 1342264087123)

There are also other <timeStamp> elements included in the message log, but only the one within the path messageList/Message/originator/originatorPosition/timeStamp is relevant.

The following structures are slightly simplified, as additional content like "acceleration" etc. is left out. This additional content just needs to be copied together with the rest of the messages/items.

The structure of the position trace looks like:

<itemList>
    <item>
        <date>14.7.2012 12:13:05.123</date>
        <FilteredPosition>
            <Latitude>51.12235</Latitude>
            <Longitude>9.347214</Longitude>
        </FilteredPosition>
    </item>
    <item>
        <date>14.7.2012 12:13:07.456</date>
        <FilteredPosition>
            <Latitude>51.12235</Latitude>
            <Longitude>9.347214</Longitude>
        </FilteredPosition>
    </item>
</itemList>

The structure of the message log looks like this:

<messageList>
    <Message>
        <messageId>1234</messageId>
        <originator>
            <originatorPosition>
                <nodeId>2345</nodeId>
                <timeStamp>1342264087061</timeStamp>
            </originatorPosition>
            <senderPosition>
                <nodeId>2345</nodeId>
                <timeStamp>1342264087234</timeStamp>
            </senderPosition>
            <medium></medium>
        </originator>
        <MessagePayload>
           <generationTime>
              <timeStamp>1342264087</timeStamp>
              <milliSec>42</milliSec>
           </generationTime>
        </MessagePayload>
    </Message>
    <Message>
        <messageId>1234</messageId>
        <originator>
            <originatorPosition>
                <nodeId>2345</nodeId>
                <timeStamp>1342264088064</timeStamp>
            </originatorPosition>
            <senderPosition>
                <nodeId>2345</nodeId>
                <timeStamp>1342264088254</timeStamp>
            </senderPosition>
            <medium></medium>
        </originator>
        <MessagePayload>
           <generationTime>
              <timeStamp>1342264088</timeStamp>
              <milliSec>42</milliSec>
           </generationTime>
        </MessagePayload>
    </Message>
</messageList>

During merging, the timestamps have to be read and compared (converting between <date> and <timeStamp>, including milliseconds, with dates in the format "14.7.2012 11:08:07.123"), and all positions and messages have to be output in chronological order.

The position data can be added as is. Each message, however, should be placed inside <item> tags, a <date> tag should be added (derived from the message's unix time with milliseconds), and the <Message> tag should be replaced by <m:Message type="received">. The items are placed within the root <itemList>, just as in the position trace.

A result could look like this:

<itemList>
    <item>
        <date>14.7.2012 12:13:05.123</date>
        <FilteredPosition>
            <Latitude>51.12235</Latitude>
            <Longitude>9.347214</Longitude>
        </FilteredPosition>
    </item>
    <item>
        <date>14.7.2012 12:13:07.061</date>
        <m:Message type="received">
            <messageId>1234</messageId>
            <originator>
                <originatorPosition>
                    <nodeId>2345</nodeId>
                    <timeStamp>1342264087061</timeStamp>
                </originatorPosition>
                <senderPosition>
                    <nodeId>2345</nodeId>
                    <timeStamp>1342264087234</timeStamp>
                </senderPosition>
                <medium></medium>
            </originator>
            <MessagePayload>
               <generationTime>
                  <timeStamp>1342264087</timeStamp>
                  <milliSec>42</milliSec>
               </generationTime>
            </MessagePayload>
        </m:Message>
    </item>
    <item>
        <date>14.7.2012 12:13:07.456</date>
        <FilteredPosition>
            <Latitude>51.12235</Latitude>
            <Longitude>9.347214</Longitude>
        </FilteredPosition>
    </item>
    <item>
        <date>14.7.2012 12:13:08.064</date>
        <m:Message type="received">
            <messageId>1234</messageId>
            <originator>
                <originatorPosition>
                    <nodeId>2345</nodeId>
                    <timeStamp>1342264088064</timeStamp>
                </originatorPosition>
                <senderPosition>
                    <nodeId>2345</nodeId>
                    <timeStamp>1342264088254</timeStamp>
                </senderPosition>
                <medium></medium>
            </originator>
            <MessagePayload>
               <generationTime>
                  <timeStamp>1342264088</timeStamp>
                  <milliSec>42</milliSec>
               </generationTime>
            </MessagePayload>
        </m:Message>
    </item>
</itemList>

There are also some <item> elements that do not contain a timestamp (and no "FilteredPosition") inside the position log file. These items can be ignored and do not need to be copied.

I'd appreciate any help with the XSLT code, as I'm quite new to this topic... :-/

Answer 1:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:m="http://www.example.com/"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output indent="yes" method="xml"/>

    <!-- The two source-documents. -->
    <xsl:variable name="doc1" select="doc('log1.xml')"/>
    <xsl:variable name="doc2" select="doc('log2.xml')"/>

    <!-- Timezone adjustment -->
    <xsl:variable name="timezoneAdjustment" select="1"/>

    <!-- Root template to start the transformation. -->
    <xsl:template match="/">
        <!-- Transform and collect all the elements -->
        <xsl:variable name="data" as="node()*">
            <xsl:apply-templates select="$doc1/itemList/item"/>
            <xsl:apply-templates select="$doc2/messageList/Message"/>
        </xsl:variable>
        <!-- Sort by the timestamp, and discard the wrapper. -->
        <itemList>
            <xsl:for-each select="$data">
                <xsl:sort select="@timestamp" data-type="number"/>
                <xsl:copy-of select="item"/>
            </xsl:for-each>
        </itemList>
    </xsl:template>

    <!--
        Template to transform <item> elements in the first format.
        It just parses the date, and adds a wrapper with the timestamp.
    -->
    <xsl:template match="item[date]">
        <xsl:variable name="dateTimeString" select="date" as="xs:string"/>
        <xsl:variable name="datePart" select="substring-before($dateTimeString,' ')"/>
        <xsl:variable name="day" select="xs:integer(substring-before($datePart,'.'))"/>
        <xsl:variable name="month" select="xs:integer(substring-before(substring-after($datePart,'.'),'.'))"/>
        <xsl:variable name="year" select="xs:integer(substring-after(substring-after($datePart,'.'),'.'))"/>
        <xsl:variable name="timePart" select="substring-after($dateTimeString,' ')"/>
        <xsl:variable name="reformatted" select="concat(format-number($year,'0000'),'-',format-number($month,'00'),'-',format-number($day,'00'),'T',$timePart)"/>
        <xsl:variable name="timestamp" select="( xs:dateTime($reformatted) - xs:dateTime('1970-01-01T00:00:00') - $timezoneAdjustment * xs:dayTimeDuration('PT1H') ) div xs:dayTimeDuration('PT0.001S')"/>
        <wrapper timestamp="{$timestamp}">
            <xsl:copy-of select="self::*"/>
        </wrapper>
    </xsl:template>

    <!--
        Template to transform <Message> elements in the second log format.
        It generates an item with the date, and wraps it with the timestamp.
    -->
    <xsl:template match="Message[originator/originatorPosition/timeStamp]">
        <xsl:variable name="timestamp" select="originator/originatorPosition/timeStamp" as="xs:integer"/>
        <xsl:variable name="date" select="xs:dateTime('1970-01-01T00:00:00') + $timezoneAdjustment * xs:dayTimeDuration('PT1H') + $timestamp * xs:dayTimeDuration('PT0.001S')"/>
        <wrapper timestamp="{$timestamp}">
            <item>
                <date>
                    <xsl:value-of select="format-dateTime($date,'[D1].[M1].[Y0001] [H01]:[m01]:[s01].[f001]')"/>
                </date>
                <m:Message type="received">
                    <xsl:copy-of select="*"/>
                </m:Message>
            </item>
        </wrapper>
    </xsl:template>

</xsl:stylesheet>

EDIT: I added a variable for timezone adjustment for Messages.
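
To pick a value for timezoneAdjustment, it may help to check one known pair of values first. The following is only a minimal sketch, assuming the sample pair from the question (14.7.2012 11:08:07.123 and 1342264087123) and an adjustment of 0 hours. The template name main and the variable names traceDate, unixMillis, parts, iso and fromDate are made up for this check. It uses the same duration arithmetic as the stylesheet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="text"/>

    <!-- Assumption: hours by which the trace log's clock is ahead of UTC. -->
    <xsl:variable name="timezoneAdjustment" select="0"/>

    <!-- Sample pair taken from the question. -->
    <xsl:variable name="traceDate" select="'14.7.2012 11:08:07.123'"/>
    <xsl:variable name="unixMillis" select="1342264087123"/>

    <xsl:template name="main">
        <!-- Split the "D.M.YYYY" part into day, month and year. -->
        <xsl:variable name="parts" select="tokenize(substring-before($traceDate, ' '), '\.')"/>
        <!-- Rebuild it as an ISO 8601 string that xs:dateTime accepts. -->
        <xsl:variable name="iso" select="concat(
            $parts[3], '-',
            format-number(number($parts[2]), '00'), '-',
            format-number(number($parts[1]), '00'), 'T',
            substring-after($traceDate, ' '))"/>
        <!-- Milliseconds since 1970-01-01, shifted back by the adjustment. -->
        <xsl:variable name="fromDate" select="(xs:dateTime($iso)
            - xs:dateTime('1970-01-01T00:00:00')
            - $timezoneAdjustment * xs:dayTimeDuration('PT1H'))
            div xs:dayTimeDuration('PT0.001S')"/>
        <!-- Print the computed value and whether it equals the unix timestamp. -->
        <xsl:value-of select="$fromDate, $fromDate = $unixMillis"/>
    </xsl:template>

</xsl:stylesheet>
```

With Saxon this can be started without a source document by naming the initial template, for example with -it:main on the command line. For the sample pair, the check should print the millisecond value followed by true when the adjustment is 0. If your real logs only match with a different value, use that value in the stylesheet above.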

EDIT: Fixed the attribute names, so the items will sort correctly.
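
On the items without a <date>: in the stylesheet above they match no template, so only their text content reaches $data, and the final xsl:copy-of select="item" then finds nothing to copy for them. If you would rather drop them explicitly, one option is to pre-filter the trace. This is a minimal sketch, assuming the trace file has the <itemList>/<item> structure shown in the question:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <xsl:output indent="yes" method="xml"/>

    <!-- Rebuild the root element, keeping only items that carry a <date>. -->
    <xsl:template match="/itemList">
        <itemList>
            <xsl:copy-of select="item[date]"/>
        </itemList>
    </xsl:template>

</xsl:stylesheet>
```

The same effect can probably be had inside the merge stylesheet by selecting $doc1/itemList/item[date] in the first xsl:apply-templates.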