Recently my team started using hbase-indexer on CDH to index HBase table columns into Solr. After deploying the hbase-indexer server (called the Key-Value Store Indexer in CDH) and starting our tests, we found that the number of rows in the HBase table differs from the number of documents in the Solr index:
We used Phoenix to count the HBase table rows:
0: jdbc:phoenix:slave1,slave2,slave3:2181> SELECT /*+ NO_INDEX */ COUNT(1) FROM C_PICRECORD;
+------------------------------------------+
| COUNT(1) |
+------------------------------------------+
| 4084355 |
+------------------------------------------+
And we used the Solr Web UI to count the documents in the Solr index:
numFound : 4060479
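For anyone who wants to script this count instead of reading it from the Web UI, here is a minimal Python sketch. It assumes the standard Solr select handler with q=*:*, rows=0 and wt=json, which returns the total under response.numFound. The host, port and collection name (c_picrecord) are hypothetical placeholders, not values from our cluster, and the sample response body is hand-written around the numFound shown above.

```python
import json
from urllib.parse import urlencode


def solr_count_url(base_url, collection):
    """Build a select URL that matches every document but returns none of them."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


def parse_num_found(response_body):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(response_body)["response"]["numFound"]


# Hypothetical endpoint and collection name; adjust to your deployment.
url = solr_count_url("http://localhost:8983/solr", "c_picrecord")
print(url)

# Hand-written sample body in the shape Solr returns for wt=json.
sample_body = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 4060479, "start": 0, "docs": []}}'
)
print(parse_num_found(sample_body))
```

In a real check you would fetch the URL (for example with urllib.request.urlopen) and pass the response text to parse_num_found.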
We could not find any errors in the hbase-indexer log or the Solr log, but the two counts really are different. Has anyone run into this situation? I don't know what to do next.
All right, we solved the problem recently.
The Solr numFound differs from the HBase table row count because hbase-indexer mistakenly deleted some rows instead of inserting them. We found this through the hbase-indexer metrics: https://github.com/NGDATA/hbase-indexer/wiki/Metrics
We used jconsole to watch the JMX metrics and found:
indexer deletes count = hbase table row count - solr numfound
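The check we did by hand can be written as a small Python sketch. The function names are made up for illustration. The two counts are the ones from the question; the deletes value is simply what the relation above implies for those counts, since the counter itself has to be read from JMX.

```python
def index_gap(hbase_rows, solr_num_found):
    """Number of HBase rows that have no document in Solr."""
    return hbase_rows - solr_num_found


def deletes_explain_gap(hbase_rows, solr_num_found, indexer_deletes):
    """True when the indexer's delete counter accounts for the whole gap."""
    return index_gap(hbase_rows, solr_num_found) == indexer_deletes


hbase_rows = 4084355      # Phoenix COUNT(1) from the question
solr_num_found = 4060479  # numFound from the Solr Web UI

gap = index_gap(hbase_rows, solr_num_found)
print(gap)  # 23876

# Placeholder: the delete count implied by the relation above, not a quoted metric.
indexer_deletes = 23876
print(deletes_explain_gap(hbase_rows, solr_num_found, indexer_deletes))  # True
```

If the delete counter is smaller than the gap, something else (for example events that were never delivered to the indexer) is also losing documents.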
Finally we debugged the hbase-indexer source code and found the code that causes this problem. It may be a bug in hbase-indexer; please see: https://github.com/NGDATA/hbase-indexer/issues/78
My understanding:
HBase row count - Solr document count (numFound) = missing records
4084355 - 4060479 = 23876 (records that are present in HBase but missing in Solr)
The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables.
NRT indexing works on incremental data, not on the whole data set.
In my experience, these are the possible reasons:
1) NRT indexing worked initially and then stopped (for example because of health issues). Records written to HBase while it was down would cause a discrepancy in the counts.
2) NRT indexing relies on the WAL (write-ahead log). If the WAL is switched off while inserting records into HBase (which is possible, for performance reasons), NRT indexing won't work.
Possible solutions:
1) Delete the Solr documents and reload the data into Solr from HBase. You can run the HBase batch indexer over the whole data set (the batch indexer does not work on incremental data; it works on the whole data set).
2) As part of the data-flow pipeline, write a MapReduce program to insert the data into Solr (this is what we did in one of our implementations).