Oracle Text: How to sanitize user input

2019-04-10 18:21发布

问题:

If anyone has experience using Oracle text (CTXSYS.CONTEXT), I'm wondering how to handle user input when the user wants to search for names that may contain an apostrophe.

Escaping the ' seems to work in some cases, but not for 's at the end of the word - s is in the list of stop words, and so seems to get removed.

We currently change simple query text (i.e. anything that's just letters) to %text%, for example:

contains(field, :text) > 0

A search for O'Neil works, but Joe's doesn't.

Has anyone using Oracle Text dealt with this issue?

回答1:

Escape all special characters with backslashes. Curly braces won't work with substring searches as they define complete tokens. Eg %{ello}% won't match the token 'Hello'

Escaped space characters will be included in the search token, so the search string '%stay\ near\ me%' will be treated as a literal string "stay near me" and will not invoke the 'near' operator.

If you are indexing short strings (like names, etc ) and you want Oracle Text to behave exactly as the like operator, you must write your own lexer that won't create tokens for individual words. (Unfortunately CATSEARCH does not support substring search...)

It is probably a good idea to change the searches to use oracle text's semantics, with token matching, but for some applications, the wildcard expansion of multiple (short) tokens and numeric tokens will create too many hits for search strings that the users reasonably would expect to work.

Eg, a search for "%I\ AM\ NUMBER\ 9%" will most likely fail if there are a lot of numeric tokens in the indexed data, since all tokens ending with 'I' and starting with '9' must be searched and merged before the result can be returned.

'I' and 'AM' is probably also in the default stoplist and will be totally ignored, so for this hypothetical application, a null stoplist may be used if these tokens are important.



回答2:

Using PARAMETERS('STOPLIST ctxsys.empty_stoplist') when indexing would include all alphabetical tokens in the index. Accented characters are indexed as well. Non-alphabetical characters are generally treated as whitespace by BASIC_LEXER.

Also, CONTEXT grammar uses a lot of operators that include symbols and reserved words such as WITHIN, NEAR, ABOUT. These all have to be escaped somehow in the input. If you need to search for substrings, the correct approach to escaping is to escape all characters with \. This is an answer to a related question here: Oracle text escaping with curly braces and wildcards. If your requirements is to search for whole terms (names, etc.) you can use simpler {input} escaping.



回答3:

Forget about sanitizing. Why? Refer to http://en.wikipedia.org/wiki/SQL_injection .

It depends on the kind of database interface API you are using. Perl DBI, ODBC, JDBC support parameterized queries or prepared statements. If you're using a native DBI and it doesn't support it, then God bless you.