I'm looking into chunking my data source for optimal data import into Solr and was wondering if it is possible to use a master URL that chunks the data into sections.
For example, File 1 may have
<chunks>
<chunk url="http://localhost/chunker?start=0&stop=100" />
<chunk url="http://localhost/chunker?start=100&stop=200" />
<chunk url="http://localhost/chunker?start=200&stop=300" />
<chunk url="http://localhost/chunker?start=300&stop=400" />
<chunk url="http://localhost/chunker?start=400&stop=500" />
<chunk url="http://localhost/chunker?start=500&stop=600" />
</chunks>
with each chunk URL leading to something like
<items>
<item data1="info1" />
<item data1="info2" />
<item data1="info3" />
<item data1="info4" />
</items>
I'm working with 500+ million records, so I think the data will need to be chunked to avoid memory issues (I ran into those when using the SqlEntityProcessor). I would also like to avoid making 500+ million web requests, as I think that could get expensive.
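For context, the SQL import that hit memory problems looked roughly like the sketch below (the driver, connection string, and table/column names are placeholders, not my real setup); even tuning the JdbcDataSource batchSize did not get around the memory pressure at this row count:
<dataConfig>
<dataSource name="sql" type="JdbcDataSource" driver="some.jdbc.Driver" url="jdbc:yourdb://dbserver/db" user="u" password="p" batchSize="500" />
<document>
<!-- one huge query over 500M+ rows; this is where memory became an issue -->
<entity name="record" dataSource="sql" processor="SqlEntityProcessor" query="SELECT id, info FROM records">
<field column="id" name="id" />
<field column="info" name="info" />
</entity>
</document>
</dataConfig>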
Due to the lack of examples on the internet, I figured I would post what I ended up using:
<?xml version="1.0" encoding="utf-8"?>
<result>
<dataCollection func="chunked">
<data info="test" info2="test" />
<data info="test" info2="test" />
<data info="test" info2="test" />
<data info="test" info2="test" />
<data info="test" info2="test" />
<data info="test" info2="test" />
<data hasmore="true" nexturl="http://server.domain.com/handler?start=0&end=1000000000&page=1&pagesize=10" />
</dataCollection>
</result>
It's important to note that I explicitly specify that there is more data on the next page and provide a URL to that next page. This is consistent with the Solr documentation for DataImportHandlers, which specifies that a paginated feed should tell the system that it has more and where to get the next batch.
<dataConfig>
<dataSource name="b" type="URLDataSource" baseUrl="http://server/" encoding="UTF-8" />
<document>
<entity name="continue"
dataSource="b"
url="handler?start=${dataimport.request.startrecord}&end=${dataimport.request.stoprecord}&pagesize=100000"
stream="true"
processor="XPathEntityProcessor"
forEach="/result/dataCollection/data"
transformer="DateFormatTransformer"
connectionTimeout="120000"
readTimeout="300000"
>
<field column="id" xpath="/result/dataCollection/data/@info" />
<field column="id" xpath="/result/dataCollection/data/@info" />
<field column="$hasMore" xpath="/result/dataCollection/data/@hasmore" />
<field column="$nextUrl" xpath="/result/dataCollection/data/@nexturl" />
</entity>
</document>
</dataConfig>
Note the $hasMore and $nextUrl fields. You may want to play with the timeouts. I also recommend allowing the page size to be specified (it helps with tweaking settings to get optimal processing speed). I'm indexing at about 12.5K records per second using a multicore (3) Solr instance on a single server with a quad-core Xeon processor and 32GB of RAM.
The app paginating the results runs on the same system as the SQL Server storing the data. I'm also passing in the start and stop positions to minimize configuration changes when we eventually load balance the Solr server.
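Since ${dataimport.request.startrecord} and ${dataimport.request.stoprecord} are just request parameters, kicking off a chunk looks something like this (host, port, and core name are placeholders, assuming the handler is registered at /dataimport):
http://localhost:8983/solr/core0/dataimport?command=full-import&clean=false&startrecord=0&stoprecord=1000000000
If you also template the page size as ${dataimport.request.pagesize} instead of hard-coding it, you can tune it from the request the same way.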
The entity can be nested to do what you wanted originally. The inner entity can refer to the outer field like this: url="${chunk.link}", where chunk is the outer entity name and link is the field name.
<?xml version="1.0" encoding="windows-1250"?>
<dataConfig>
<dataSource name="b" type="URLDataSource" baseUrl="http://server/" encoding="UTF-8" />
<document>
<entity name="chunk"
dataSource="b"
url="path/to/chunk.xml"
stream="true"
processor="XPathEntityProcessor"
forEach="/chunks/chunk"
transformer="DateFormatTransformer"
connectionTimeout="120000"
readTimeout="300000" >
<field column="link" xpath="/chunks/chunk/@url" />
<entity name="item"
dataSource="b"
url="${chunk.link}"
stream="true"
processor="XPathEntityProcessor"
forEach="/items/item"
transformer="DateFormatTransformer"
connectionTimeout="120000"
readTimeout="300000" >
<field column="info" xpath="/items/item/@info" />
</entity>
</entity>
</document>
</dataConfig>