My company processes a lot of product feeds using Hadoop. We have a process that extracts exactly one product node at a time and makes it a line in a file. We then use XSL to convert the product XML to a single-line, triple-pipe-delimited record. This has worked well so far. However, I ran into an issue with one client: they made some changes, and the new XML file uses namespaces in a way that caused things to break. (I had to modify the links in the XML so I could post it; I changed http to httc.) The original XML file was set up like this:
<?xml version="1.0" encoding="utf-8"?>
<CATALOG APIKEY="88ac00e4f3e16e44" xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES />
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
The new XML file is set up like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CATALOG APIKEY="88ac00e4f3e16e44" xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES>
<CATEGORY ID="103510">
<CATEGORYNAME>Kaspersky Promotion</CATEGORYNAME>
</CATEGORY>
</CATEGORIES>
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
When we convert the products to single lines, we take only everything between and including the PRODUCT start and end tags.
When we did this with the new file, it failed because the namespace declarations were dropped. So I modified the process to wrap the product in an element that carries the namespace declarations. The text now being sent to the XSL conversion is:
<wrapper xmlns="urn:rrXML" xmlns:xsd="httc://www.w3.org/2001/XMLSchema" xmlns:xsi="httc://www.w3.org/2001/XMLSchema-instance">
<PRODUCTS>
<PRODUCT ID="692174">
<PRODUCTNAME>HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</PRODUCTNAME>
<PRODUCTDESCRIPTION></PRODUCTDESCRIPTION>
<PRODUCTSKU>100005487</PRODUCTSKU>
<LISTPRICE>$499.99</LISTPRICE>
<SALEPRICE xsi:type="xsd:string">$499.99</SALEPRICE>
<PRODUCTURL>/.product.100005487.html</PRODUCTURL>
<IMAGEURL>httc://images.test-static.com/image/media/150-__1</IMAGEURL>
<RATING xsi:type="xsd:string">0.0</RATING>
<BRAND>HEWLETT PACKARD</BRAND>
<INSTOCK>1</INSTOCK>
<REVIEWS xsi:type="xsd:string">0</REVIEWS>
<KEYWORDS></KEYWORDS>
<ACTIONBUTTONURL></ACTIONBUTTONURL>
<PARENTPRODUCTID>100005487</PARENTPRODUCTID>
<CATEGORIES>
<CATEGORY ID="103510">
<CATEGORYNAME>Kaspersky Promotion</CATEGORYNAME>
</CATEGORY>
</CATEGORIES>
<ATTRIBUTES>
<ATTRIBUTE NAME="Categories">Kaspersky Promotion</ATTRIBUTE>
<ATTRIBUTE NAME="FSA">False</ATTRIBUTE>
<ATTRIBUTE NAME="HIDEPRICEFROMBROWSE">False</ATTRIBUTE>
<ATTRIBUTE NAME="ADDTOCARTFROMSEARCH">0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMINQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="ITEMMAXQTY">1.0</ATTRIBUTE>
<ATTRIBUTE NAME="MERCHANDISINGDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="DISCOUNTDESC"></ATTRIBUTE>
<ATTRIBUTE NAME="ALTTEXT">HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW</ATTRIBUTE>
<ATTRIBUTE NAME="MAPITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="MEMBERONLYITEM">False</ATTRIBUTE>
<ATTRIBUTE NAME="Brand">HP</ATTRIBUTE>
<ATTRIBUTE NAME="Graphic Card">Intel HD Graphics</ATTRIBUTE>
<ATTRIBUTE NAME="Hard Drive Size">500 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Operating System">Windows ®</ATTRIBUTE>
<ATTRIBUTE NAME="RAM Included">4 GB</ATTRIBUTE>
<ATTRIBUTE NAME="Screen Size">15.6 in.</ATTRIBUTE>
</ATTRIBUTES>
</PRODUCT>
</wrapper>
The XSL I am trying to use is:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="httc://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" indent="no" />
<xsl:strip-space elements="*" />
<xsl:template match="PRODUCT">
<!-- skuId --><xsl:value-of select="PRODUCTSKU"/>
<xsl:text>|||</xsl:text>
<!-- parentSkuId --><xsl:value-of select="PARENTPRODUCTID"/>
<xsl:text>|||</xsl:text>
<!-- globalSkuID --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyKey Path --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- TaxonomyText --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- upc --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- mpn --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- model_Number --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- Name --><xsl:value-of select="PRODUCTNAME"/>
<xsl:text>|||</xsl:text>
<!-- shortDescription --><xsl:text></xsl:text>
<xsl:text>|||</xsl:text>
<!-- longDescription --><xsl:value-of select="PRODUCTDESCRIPTION"/>
<xsl:text>|||</xsl:text>
<!-- price --><xsl:value-of select="SALEPRICE"/>
<xsl:text>|||</xsl:text>
<!-- comparePrice --><xsl:value-of select="LISTPRICE"/>
<xsl:text>|||</xsl:text>
<!-- productPage --><xsl:value-of select="PRODUCTURL"/>
<xsl:text>|||</xsl:text>
<!-- thumbnail --><xsl:value-of select="IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- fullImage --><xsl:value-of select="IMAGEURL"/>
<xsl:text>|||</xsl:text>
<!-- rating --><xsl:value-of select="RATING"/>
<xsl:text>|||</xsl:text>
<!-- brand --><xsl:value-of select="BRAND"/>
<xsl:text>|||</xsl:text>
<!-- isActive --><xsl:value-of select="INSTOCK"/>
<xsl:text>|||</xsl:text>
<!-- ReviewCount --><xsl:value-of select="REVIEWS"/>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyKeys -->
<xsl:for-each select="CATEGORIES/CATEGORY">
<xsl:value-of select="@ID" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AlternateTaxonomyNames -->
<xsl:for-each select="CATEGORIES/CATEGORY/CATEGORYNAME">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- AttributeNames -->
<xsl:for-each select="ATTRIBUTES/ATTRIBUTE">
<xsl:value-of select="@NAME" /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>|||</xsl:text>
<!-- Attribute Values -->
<xsl:for-each select="ATTRIBUTES/ATTRIBUTE">
<xsl:value-of select="." /><xsl:text>^</xsl:text>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
This results in output that is just the text of the product node's children concatenated together, like: HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW100005487$499.99$499.99/.product.100005487.htmlhttc://images.test-static.com/image/media/150-__10.0HEWLETT PACKARD10100005487
I'm guessing it has something to do with the namespaces they are including, but I don't know enough about XSL to figure out what. Please help.
You have to add the namespace of the XML document to the XSLT by declaring a prefix bound to the same namespace URI, e.g. xmlns:u="urn:rrXML". Then you can access the elements in the XML with this prefix, meaning you get the value using <xsl:value-of select="u:PRODUCTSKU"/> instead of <xsl:value-of select="PRODUCTSKU"/>. When the missing closing PRODUCTS tag is added to your input XML, the XSLT with this prefix applied to every element name produces the output
100005487|||100005487|||||||||||||||||||||HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW|||||||||$499.99|||$499.99|||/.product.100005487.html|||httc://images.test-static.com/image/media/150-__1|||httc://images.test-static.com/image/media/150-__1|||0.0|||HEWLETT PACKARD|||1|||0|||103510^|||Kaspersky Promotion^|||Categories^FSA^HIDEPRICEFROMBROWSE^ADDTOCARTFROMSEARCH^ITEMMINQTY^ITEMMAXQTY^MERCHANDISINGDESC^DISCOUNTDESC^ALTTEXT^MAPITEM^MEMBERONLYITEM^Brand^Graphic Card^Hard Drive Size^Operating System^RAM Included^Screen Size^|||Kaspersky Promotion^False^False^0^1.0^1.0^^^HP Pavilion g6t Laptop 3rd generation Intel® Core™ i5-3210M 2.5GHz SuperMulti 8X DVD+/-R/RW^False^False^HP^Intel HD Graphics^500 GB^Windows ®^4 GB^15.6 in.^
in one line, if that's really the intended output.
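As a rough illustration of the change described above, here is an abridged sketch of the stylesheet with the prefix applied. It covers only a handful of the scalar fields, so it will not reproduce the full line shown above; the remaining fields would follow the same pattern. The prefix `u` is an arbitrary choice (only the URI `urn:rrXML` has to match the feed), and the XSLT namespace is written with the real `http://` scheme here, on the assumption that `httc://` in the question is only the poster's placeholder.

```xml
<?xml version="1.0"?>
<!-- Sketch only: "u" is an arbitrary prefix; the URI urn:rrXML must match the feed. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:u="urn:rrXML">
  <xsl:output method="text" indent="no"/>
  <xsl:strip-space elements="*"/>

  <!-- Matches PRODUCT in the urn:rrXML namespace. An unprefixed "PRODUCT"
       in XSLT 1.0 matches only elements that are in no namespace. -->
  <xsl:template match="u:PRODUCT">
    <!-- skuId --><xsl:value-of select="u:PRODUCTSKU"/>
    <xsl:text>|||</xsl:text>
    <!-- parentSkuId --><xsl:value-of select="u:PARENTPRODUCTID"/>
    <xsl:text>|||</xsl:text>
    <!-- Name --><xsl:value-of select="u:PRODUCTNAME"/>
    <xsl:text>|||</xsl:text>
    <!-- price --><xsl:value-of select="u:SALEPRICE"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This also explains the concatenated text in the question: because the unprefixed `match="PRODUCT"` never matched, the built-in template rules processed the whole tree and simply copied every text node to the output.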
You are guessing correctly - and a short search should have revealed the answer: assign a prefix to the namespace and use that prefix when addressing the elements of the XML source, for example:
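A minimal sketch of what that could look like for the repeated elements in the question's stylesheet, assuming the prefix `u` (any prefix works as long as it is bound to `urn:rrXML`) and the real `http://` XSLT namespace rather than the `httc://` placeholder. Only the category and attribute loops are shown. Note that the element steps take the prefix while `@ID` and `@NAME` do not, since unprefixed attributes are in no namespace.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the prefix "u" is an assumption; the URI must equal the feed's default namespace. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:u="urn:rrXML">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="u:PRODUCT">
    <!-- AlternateTaxonomyKeys: every element step is prefixed, the attribute is not -->
    <xsl:for-each select="u:CATEGORIES/u:CATEGORY">
      <xsl:value-of select="@ID"/><xsl:text>^</xsl:text>
    </xsl:for-each>
    <xsl:text>|||</xsl:text>
    <!-- AttributeNames -->
    <xsl:for-each select="u:ATTRIBUTES/u:ATTRIBUTE">
      <xsl:value-of select="@NAME"/><xsl:text>^</xsl:text>
    </xsl:for-each>
    <xsl:text>|||</xsl:text>
    <!-- Attribute Values -->
    <xsl:for-each select="u:ATTRIBUTES/u:ATTRIBUTE">
      <xsl:value-of select="."/><xsl:text>^</xsl:text>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```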