Solr delta-import not working

2020-02-13 06:09发布

问题:

Full import and deletedPkQuery works , I've traced the database server both the deltaQuery and deletedPkQuery are executed.

ive executed these queries manually many times and they do indeed return row(s), but

It does not fetch any rows. Last thing i did was to output the FILE_ID as id on all the queries. Still doesnt work.

<dataConfig>

<dataSource name="db" type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://localhost:1433;databaseName=norway_operations;responseBuffering=adaptive;selectMethod=cursor" user="noropuser" password="noropuser" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED"holdability="CLOSE_CURSORS_AT_COMMIT"/>
<dataSource name="bin" type="BinFileDataSource"  basePath="D:\OPG_FILESTORE"/>

<document>

    <entity name="file" dataSource="db" pk="id" query="select FILE_ID as id, CATEGORY_ID, CATEGORY_NAME, FILENAME, FILE_MIME_TYPE, PATH, LAST_MODIFIED as last_modified from DOCUMENTS"
            deltaQuery="select FILE_ID as id from DOCUMENTS where LAST_MODIFIED > '${dataimporter.last_index_time}'" 
            deltaImportQuery="select FILE_ID as id, CATEGORY_ID, CATEGORY_NAME, FILENAME, FILE_MIME_TYPE, PATH, LAST_MODIFIED as last_modified from DOCUMENTS where FILE_ID = '${dih.delta.id}'" 
            deletedPkQuery="delete from PK_DELETE_HISTORY output DELETED.PK AS id where PK_NAME = 'FILE_ID'" >

        <field column="id" name="id" />
        <field column="CATEGORY_ID" name="categoryId" />
        <field column="CATEGORY_NAME" name="category" />
        <field column="FILENAME" name="filename" />
        <field column="FILE_MIME_TYPE" name="content_type" />
        <field column="last_modified" name="last_modified" />

        <entity name="tika" processor="TikaEntityProcessor" url="${file.PATH}" parser="org.apache.tika.parser.AutoDetectParser" format="text" dataSource="bin" onError="continue">                
            <field column="text" name="content" />
            <field column="title" name="title"/>
            <field column="subject" name="subject"/>
            <field column="description" name="description"/>
            <field column="comments" name="comments"/>
            <field column="author" name="author"/>
            <field column="keywords" name="keywords"/>
            <field column="url"  name="url"/>
            <field column="content_type" name="content_type" />                
            <field column="links"  name="links" />                
        </entity>            
    </entity>        
</document>

Trace

declare @p1 int
set @p1=180150003
declare @p5 int
set @p5=-1
exec sp_cursoropen @p1 output,N'select FILE_ID as id from DOCUMENTS where LAST_MODIFIED > ''2014-02-06 15:02:40''',16,8193,@p5 output
select @p1, @p5

When i run this manually it returns 1 row

Response:

    <?xml version="1.0" encoding="UTF-8" ?> 
- <response>
- <lst name="responseHeader">
  <int name="status">0</int> 
  <int name="QTime">31</int> 
  </lst>
- <lst name="initArgs">
- <lst name="defaults">
  <str name="config">db-data-config.xml</str> 
  <int name="rows">0</int> 
  <int name="start">0</int> 
  </lst>
  </lst>
  <str name="command">delta-import</str> 
  <str name="mode">debug</str> 
  <arr name="documents" /> 
  <lst name="verbose-output" /> 
  <str name="status">idle</str> 
  <str name="importResponse" /> 
- <lst name="statusMessages">
  <str name="Total Requests made to DataSource">2</str> 
  <str name="Total Rows Fetched">0</str> 
  <str name="Total Documents Skipped">0</str> 
  <str name="Delta Dump started">2014-02-06 15:32:20</str> 
  <str name="Identifying Delta">2014-02-06 15:32:20</str> 
  <str name="Deltas Obtained">2014-02-06 15:32:20</str> 
  <str name="Building documents">2014-02-06 15:32:20</str> 
  <str name="Total Changed Documents">0</str> 
  <str name="Total Documents Processed">0</str> 
  <str name="Time taken">0:0:0.16</str> 
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str> 
  </response>

回答1:

Things might worth looking for:

1.Timestamp saved in dataimport.properties config file

This happens to me before

Running delta-import (successfully) will update the {dataimporter.last_index_time} in conf/dataimport.properties file. And next time, your query may run based on the new timestamp, which may return zero rows unless you updated the database.

2. dataimporter.delta.id and dataimporter.last_index_time

dataimporter.delta.id should be dih.delta.id

last_index_time remains in the dataimporter namespace. **dataimporter.last_index_time** works at least in solr 4.2.0. dih.last_index_time might works too as it was mentioned in the solr wiki, but I haven't test it

3. Timestamp need to be converted to proper DataTime datatype depends on the DB .

In case of SQL server:

LAST_MODIFIED_DATETIME > convert(datetime,'${dataimporter.last_index_time}')


回答2:

There are some bugs in certain versions with last_index_time. You haven't indicated which Solr version you're on, but most people these days are on 4.x.

Also, there are some bugs where the old dataimporter property namespace doesn't work. With 4.x you should be using the dih property namespace, which means dih.last_index_time and dih.delta.id instead of dataimporter.* for the property names.



回答3:

I had same issue and figured out that deltaImportQuery is case sensitive

Made my id Column as "ID"

deltaImportQuery="select id,state,name,place,city from temp where ID='${dih.delta.ID}



回答4:

I'm running SOLR in Tomcat 7 on windows. Tracing the ODBC connection i see that the language is set to Norwegian. (Norwegian = Norsk) in Norwegian ;)

set arithabort off
set numeric_roundabort off
set ansi_warnings on
set ansi_padding on
set ansi_nulls on
set concat_null_yields_null on
set cursor_close_on_commit off
set implicit_transactions off
set language Norsk
set dateformat dmy
set datefirst 1
set transaction isolation level read committed

JVM is started with these args

-Duser.region=US
-Duser.language=en
-Duser.timezone=Europe/Oslo

It didn't make any difference whether is was set Norwegian or English

Adding a propertyWriter tag to the configuration file fixed the problem.

<dataConfig>
<propertyWriter dateFormat="yyyy-dd-MM HH:mm:ss" type="SimplePropertiesWriter" directory="D:/tmp" filename="knowledgebase.dih.properties" locale="English (United States)" />
<dataSource name="db" type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://localhost:1433;databaseName=norway_operations;responseBuffering=adaptive;selectMethod=cursor" user="noropuser" password="noropuser" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT" />

<document>

    <entity type="a" name="knowledge" dataSource="db" pk="BASE_ID" query="select * from vKNOWLEDGE_BASE"
            deltaQuery="select BASE_ID from vKNOWLEDGE_BASE where '${dataimporter.last_index_time}' &lt; TIMESTAMP" 
            deltaImportQuery="select * from vKNOWLEDGE_BASE where BASE_ID = '${dataimporter.delta.BASE_ID}'" 
            deletedPkQuery="delete from PK_DELETE_HISTORY output DELETED.PK AS BASE_ID where PK_NAME = 'BASE_ID'" >

        <field column="BASE_ID" name="id" />
        <field column="CATEGORY_ID" name="categoryId" />
        <field column="CATEGORY_NAME" name="category" />
        <field column="DESCRIPTION" name="description" />
        <field column="SOLUTION" name="solution" />
        <field column="USER_FULL_NAME" name="author" />
        <field column="SOFTWARE_VERSION" name="software_version" />
        <field column="TIMESTAMP" name="last_modified" />

        <entity name="keywords" dataSource="db" pk="KEYWORD_ID" query="select KNOWLEDGE_KEYWORDS.* from KNOWLEDGE_KEYWORDS_TO_BASE left join KNOWLEDGE_KEYWORDS on (KNOWLEDGE_KEYWORDS_TO_BASE.KEYWORD_ID = KNOWLEDGE_KEYWORDS.KEYWORD_ID) where  KNOWLEDGE_KEYWORDS_TO_BASE.BASE_ID = '${knowledge.BASE_ID}'">                
            <field column="KEYWORD_NAME" name="keywords" />              
        </entity>      
    </entity>        
</document>

It's also possible to add a language option to the JdbcDataSource url.

jdbc:sqlserver://localhost:1433;databaseName=XXX;responseBuffering=adaptive;selectMethod=cursor;language=XXX

I did not test this, but i assume this would also fix the problem if it had been set to english, because in the SQL server query the language is set to Norwegian, but the date format used in the where clause to compare the LAST_MODIFIED column was yyyy-MM-dd HH:mm:ss and the default format for Norwegian is yyyy-dd-MM HH:mm:ss.



回答5:

Solr seems to save timestamps in dataimport.properties in UTC timezone, so you need to convert you timezone in database to UTC before compare to values in dataimport.properties.

e.g

-- for mysql, following would convert `update_date` to utc before compare in where clause
deltaQuery="select id from book where status = 0 and CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') &gt; '${dih.last_index_time}';"