I just installed Solr, edited the schema.xml, and am now trying to index some test data and search on it.
In the XML file I'm sending to Solr, one of my fields looks like this:
<field name="PageContent"><![CDATA[<p>some text in a paragrah tag</p>]]></field>
There's HTML there, so I've wrapped it in CDATA.
In my Solr schema.xml, the definition for that field looks like this:
<field name="PageContent" type="text" indexed="true" stored="true"/>
When I ran the POSTing tool, everything went OK, but when I search for content which I know is inside the PageContent field, I get no results.
However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent.
Am I doing something wrong? What's the issue?
To clarify on the error:
I've uploaded a "doc" with the following data:
<field name="PageID">928</field>
<field name="PageName">some name</field>
<field name="PageContent"><![CDATA[<p>html content</p>]]></field>
In my schema I've defined the fields as such:
<field name="PageID" type="integer" indexed="true" stored="true" required="true"/>
<field name="PageName" type="text" indexed="true" stored="true"/>
<field name="PageContent" type="text" indexed="true" stored="true"/>
And:
<uniqueKey>PageID</uniqueKey>
<defaultSearchField>PageName</defaultSearchField>
Now, when I use the Solr admin tool and search for "some name", I get a result. But if I search for "html content", "html", "content" or "928", I get no results.
Why?
You mentioned that your default search field is set to PageName, so I wouldn't expect a search for "content" to return anything.
You probably meant to put "PageContent:content" in the search box to find data in that field. If you want to search against multiple fields, check out the DisMax request handler: http://wiki.apache.org/solr/DisMaxRequestHandler. The Solr admin console is not a great tool for playing around with all the DisMax search options; it is easier to just manipulate the URL for that.
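As a rough sketch of "manipulating the URL", here is how a fielded query such as PageContent:content could be built in Python. The base URL (http://localhost:8983/solr/select) is an assumption based on the default example server, and fielded_query_url is just an illustrative helper name; adjust both to your setup.

```python
from urllib.parse import urlencode

# Assumption: Solr's example server on its default port; change to match your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def fielded_query_url(field, term, base=SOLR_SELECT):
    """Build a select URL that searches one named field instead of the default field."""
    params = {"q": f"{field}:{term}"}
    # urlencode percent-encodes the colon, which Solr decodes back to field:term.
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(fielded_query_url("PageContent", "content"))
```

Pasting the printed URL into a browser should run the same search the admin form would, but against PageContent rather than the default field.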
Regardless, I agree with the previous poster: if your analysis chain isn't set up properly to deal with HTML, you are likely to get all sorts of unexpected search results. Strip the HTML out and index text only.
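If you go the route of stripping the HTML in your application before posting to Solr, a minimal sketch using only Python's standard library could look like the following. The names strip_html and _TextExtractor are made up for illustration, and this is not a robust sanitizer, just enough to pull the text out of simple markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of markup with whitespace collapsed.

    Text chunks are joined with a space so adjacent block elements do not
    run together; a tag in the middle of a word would therefore split it.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_html("<p>html content</p>"))
```

The result is what you would put in the PageContent field instead of the CDATA-wrapped markup.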
If you want your standard query handler to search against all your fields, you can change it in your solrconfig.xml (I always add a second query handler instead of modifying "standard"). The qf parameter is the list of fields you want to search against. It's a space-separated list.
<requestHandler name="standard" class="solr.DisMaxRequestHandler">
<lst name="defaults">
<str name="echoParams">all</str>
<str name="hl">true</str>
<str name="fl">*</str>
<str name="qf">PageName PageContent</str>
</lst>
</requestHandler>
You are making sure that your data has been committed before you attempt to search on it, right?
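In case the commit step is the missing piece: Solr accepts a <commit/> message on its update handler. A minimal sketch of building that request in Python follows; the update URL is an assumption (default example server) and build_commit_request is a hypothetical helper name. The request is only sent if you run the file directly.

```python
import urllib.request

# Assumption: default example server; point this at your own update handler.
SOLR_UPDATE = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE):
    """Build (but do not send) a POST carrying Solr's <commit/> XML message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Sending requires a running Solr instance at SOLR_UPDATE.
    with urllib.request.urlopen(build_commit_request()) as response:
        print(response.status)
```

Until a commit has happened, newly posted documents will not show up in search results.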
Also, if the content you are feeding into the field is raw HTML, it's probably best to actually strip the HTML before it is indexed. You can do this in your application or using Solr's solr.HTMLStripWhitespaceTokenizerFactory, like:
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
You declare this in the fieldtype definition for "text". You might want to create a new field type just for your HTML, maybe something like text_html, defined like so:
<fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
I am not sure what you mean by:
"However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it doesn't search in PageContent."
Can you please elaborate?
fl is the list of fields returned by the query. qf is the list you wanted to refer to, and it doesn't support wildcards.
The only way to search all fields without listing them is to have a copyField that catches all values (indexed only, not stored), then mimic searching against all fields by searching against it.
In my schema.xml I have something like the following, which copies the value of each field ending with _t into the text field.
<defaultSearchField>text</defaultSearchField>
<copyField source="*_t" dest="text" maxChars="3000"/>
The parameter fl does not specify the fields to query against, but the fields to return in the response.
You could just add this to schema.xml:
<field name="fieldContainingEverything" type="text" indexed="true" stored="true" multiValued="true" />
<defaultSearchField>fieldContainingEverything</defaultSearchField>
<copyField source="*" dest="fieldContainingEverything" maxChars="3000"/>
Now when indexing, every field is copied to fieldContainingEverything. The problem here is that you lose track of which field the content came from, in case you want to evaluate that information further. I would be glad if someone had an idea about that.
I found a somewhat functional solution:
To describe the scenario in a bit more detail: I have a MySQL database table with a lot of fields to index, and I do so by importing every column without naming each one (SELECT * FROM ...). I want to query the index against every field of the table and know which field matched the query. This is not possible out of the box, as the highlighter just tells you that the field matching the query is fieldContainingEverything. Using the dismax query handler, I found that even though it is said to search in every field, I can't get it to search through fields that are not specified in the qf parameter. The idea now is to additionally index every field by adding:
<dynamicField name="*" type="string" indexed="true" stored="true"/>
to your schema.xml. Now, when you query Solr via dismax with hl=true&hl.fl=*, you add qf=fieldContainingEverything^1 to your parameter list. Solr now searches through every indexed field, but also highlights every field containing the query term. The downside of this method is obviously the increased index size, which I assume should not be that relevant in most cases.