I wrote a TokenFilter which adds tokens to a stream.

1. Tests show it works, but I don't completely understand why.

If someone could shed some light on the semantics I'd be grateful. In particular, at (*), restoring the state: doesn't that mean we either overwrite the current token or the token created before capturing the state?

This is roughly what I did:
    private final LinkedList<String> extraTokens = new LinkedList<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private State savedState;

    @Override
    public boolean incrementToken() throws IOException {
        if (!extraTokens.isEmpty()) {
            // Do we not lose/overwrite the current termAtt token here? (*)
            restoreState(savedState);
            termAtt.setEmpty().append(extraTokens.remove());
            return true;
        }
        if (input.incrementToken()) {
            if (/* condition */) {
                extraTokens.add("fo");
                savedState = captureState();
            }
            return true;
        }
        return false;
    }
Does that mean, for an input stream of the whitespace-tokenized string "a b c"

    (a) -> (b) -> (c) -> ...

where bb is a new synonym for b, that the graph will be constructed like this when restoreState is used?
        (a)
       /   \
     (b)   (bb)
       \   /
        (c)
         |
        ...
2. Attributes

Given the text "foo bar baz", with fo being the stem of foo and qux being a synonym for "bar baz", have I constructed the correct attribute table?
+--------+---------------+-----------+--------------+-----------+
| Term   | startOffset   | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
| foo    | 0             | 3         | 1            | 1         |
| fo     | 0             | 3         | 0            | 1         |
| qux    | 4             | 11        | 0            | 2         |
| bar    | 4             | 7         | 1            | 1         |
| baz    | 8             | 11        | 1            | 1         |
+--------+---------------+-----------+--------------+-----------+
1.

The attribute-based API works like this: every TokenStream in your analyzer chain modifies the state of some Attributes on every call of incrementToken(). The last element in your chain then produces the final tokens.

Whenever the client of your analyzer chain calls incrementToken(), the last TokenStream sets the state of some Attributes to whatever is necessary to represent the next token. If it is unable to do so, it may call incrementToken() on its input to let the previous TokenStream do its work. This goes on until the last TokenStream returns false, indicating that no more tokens are available.

captureState copies the state of all Attributes of the calling TokenStream into a State; restoreState overwrites every Attribute's state with whatever was captured before (and is passed in as the argument).

The way your token filter works is that it calls
input.incrementToken(), so that the previous TokenStream sets the Attributes' state to what would be the next token. Then, if your condition holds (say, termAtt is "b"), it adds "bb" to a queue, saves this state and returns true, so that the client may consume the token. On the next call of incrementToken(), it does not call input.incrementToken(). Whatever the current state is, it represents the previous, already consumed token. The filter then restores the state, so that everything is exactly as it was before, sets "bb" as the current term and returns true, so that the client may consume that token. Only on the call after that does it (again) consume the next token from the previous filter.

This won't actually produce the graph you displayed, but insert "bb" after "b", so it's really

    (a) -> (b) -> (bb) -> (c)

So, why do you save the state in the first place? When producing tokens, you want to make sure that e.g. phrase queries or highlighting will work correctly. When you have the text "a b c" and "bb" is a synonym for "b", you'd expect the phrase query "b c" to work, as well as "bb c". You have to tell the index that both "b" and "bb" are at the same position. Lucene uses a position increment for that, and by default the position increment is 1, meaning that every new token (read: every call of incrementToken()) comes 1 position after the previous one. So, with the final positions, the produced stream is

    a(1) -> b(2) -> bb(3) -> c(4)

while you actually want

    a(1) -> b(2), bb(2) -> c(3)

So, for your filter to produce the graph, you have to set the position increment to 0 for the inserted "bb".

The restoreState makes sure that other attributes, like offsets, token types, etc., are preserved, so you only have to change the ones that are required for your use case. Yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to use it in the right place. And as long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.

2.
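
A stemmer only changes the token; it typically neither produces new tokens nor changes the position increment or offsets. Also, since the position increment means that the current term should come positionIncrement positions after the previous token, you should have qux with an increment of 1, because it is the next token after fo, and bar should have an increment of 0, because it is at the same position as qux. The table would rather look like

+--------+---------------+-----------+--------------+-----------+
| Term   | startOffset   | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
| fo     | 0             | 3         | 1            | 1         |
| qux    | 4             | 11        | 1            | 2         |
| bar    | 4             | 7         | 0            | 1         |
| baz    | 8             | 11        | 1            | 1         |
+--------+---------------+-----------+--------------+-----------+

As a basic rule, for multi-term synonyms, where "ABC" is a synonym for "a b c", you should see that "ABC" has a position increment of 1 and a position length of 3, "a" has a position increment of 0 and a position length of 1, and "b" and "c" each have a position increment of 1 and a position length of 1.
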
Hope this helps to shed some light.