I'm trying to load RSS data exported from WordPress into a MarkLogic database. The data looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/">
<item>
<wp:post_id>1</wp:post_id>
<wp:post_title>title 1</wp:post_title>
<dc:creator>bob</dc:creator>
</item>
<item>
<wp:post_id>2</title>
<wp:post_title>title 1</wp:post_title>
<dc:creator>john</dc:creator>
</item>
</rss>
However, when I run the mlcp command, I get the following warnings and the data is not inserted into the database:
WARN mapreduce.ContentWriter: XDMP-DOCNONSBIND: No namespace binding for prefix wp
WARN mapreduce.ContentWriter: XDMP-DOCNONSBIND: No namespace binding for prefix dc
The mlcp command I used is:
./mlcp.sh import -host localhost -port 8088 -username admin -password admin \
  -input_file_path data.xml -mode local \
  -input_file_type aggregates -aggregate_record_element item -aggregate_uri_id post_id \
  -output_uri_prefix /resources/ -output_uri_suffix .xml
Any idea how I can fix this?
Thank you!
Seong
Your test case has one malformed line: <wp:post_id>2</title>. When I fix that and run mlcp-Hadoop2-1.2-3 with 7.0-4, I see one warning per item element:
15/01/12 14:16:14 WARN mapreduce.ContentWriter: XDMP-DOCNONSBIND: No namespace binding for prefix wp at /resources/1.xml line 2
15/01/12 14:16:14 WARN mapreduce.ContentWriter: XDMP-DOCNONSBIND: No namespace binding for prefix wp at /resources/2.xml line 2
This looks like an mlcp bug to me. Your namespace declarations are above the level of the item element, and they aren't being sent up to the server.
As a workaround, you could edit the XML. Or you could try http://marklogic.github.io/recordloader/ with something like this:
$ recordloader.sh -DCONNECTION_STRING=xcc://admin:admin@localhost:8088 \
-DRECORD_NAME=item -DID_NAME="#AUTO" data.xml
See http://marklogic.github.io/recordloader/ for other options.
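If you go the "edit the XML" route, one way to do it is to copy the namespace declarations down onto each item element before running mlcp. The following is a minimal Python sketch of that idea, not something I have tested against mlcp itself. It assumes the item elements sit directly under rss as in the question (anything else under the root is dropped), and the function name push_namespaces_down is just an illustrative name:

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI bindings copied from the <rss> root element in the question.
NAMESPACES = {
    "excerpt": "http://wordpress.org/export/1.2/excerpt/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}


def push_namespaces_down(rss_xml):
    """Return the feed with every <item> declaring the namespaces it uses.

    Hypothetical helper: only <item> elements are kept, re-wrapped in a
    bare <rss version="2.0"> root.
    """
    # Register the prefixes so the serializer writes wp:/dc: instead of ns0:/ns1:.
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(rss_xml)
    # Serializing each item on its own makes ElementTree put the xmlns
    # declarations for the prefixes it uses on the <item> start tag.
    items = [
        ET.tostring(item, encoding="unicode").strip()
        for item in root.iter("item")
    ]
    return '<rss version="2.0">\n' + "\n".join(items) + "\n</rss>\n"
```

You would read data.xml (after fixing the malformed line), pass its contents to this function, write the result to a new file, and point -input_file_path at that file. Each item then carries its own xmlns:wp and xmlns:dc declarations, so the record no longer depends on bindings declared on the root.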
It looks like an mlcp bug to me as well. However, before giving up, try adding a default namespace to the root element, so that it looks like this:
<rss version="2.0" xmlns="http://yournamespace.com/"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/">