I have an incoming records filter stored with the logical clause as given below.
Acct1 = 'Y' AND Acct2 = 'N' AND Acct3 = 'N' AND Acct4 = 'N' AND Acct5 = 'N' AND ((Acct6 = 'N' OR Acct7 = 'N' AND Acct1 = 'Y') AND Formatted= 'N' AND Acct9 = 'N' AND (Acct10 = 'N' AND Acct11 = 'N') AND EditableField= 'N' )
The input data for this clause will come from a CSV file like the one below.
Country,Type,Usage,Acct1,Acct2,Acct3,Acct4,Acct5,Acct6,Acct7,Formatted,Acct9,Acct10,Acct11,EditableField
USA,Premium,Corporate,Y,N,Y,N,N,N,Y,N,Y,N,Y,N,
Mexico,Premium,Corporate,Y,N,Y,N,Y,N,Y,N,Y,N,Y,N,
USA,Premium,Corporate,Y,N,Y,N,N,N,N,Y,Y,N,Y,N,
USA,Premium,Corporate,Y,N,Y,N,Y,N,Y,Y,Y,N,Y,N,
I will have to filter out the records in the file based on the conditions defined in the clause. This is an example of one simple clause, but there will be more nested conditions than this, the clause can be changed whenever the user wants, and there will be 10 such clauses that the records have to pass through sequentially.
So I am looking for a way to dynamically interpret the clause and apply it to the incoming records. Please share your suggestions on how to design this, or any example if available.
Here's a complete solution which does not use third-party libraries like ANTLR or JavaCC. Note that while it's extensible, its capabilities are still limited. If you want to create much more complex expressions, you'd better use a grammar generator.
First, let's write a tokenizer which splits the input string into tokens. Here are the token types:
The token class itself:
To simplify the tokenization let's create a regexp which reads the next token from the input string:
Note that it has many groups, one group per `TokenType` in the same order (first comes `WHITESPACE`, then `AND` and so on). Finally the tokenizer method:
I'm using `java.text.ParseException`. Here we apply the regex `Matcher` till the end of the input. If it doesn't match at the current position, we throw an exception. Otherwise we look for the matching group and create a token from it, ignoring the `WHITESPACE` tokens. Finally we add an `EOF` token which indicates the end of the input. The result is returned as a special `TokenStream` object. Here's the `TokenStream` class which will help us to do the parsing:
So we have a tokenizer, hoorah. You can test it right now using
System.out.println(tokenize("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')"));
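The tokenizer steps described above could be put together roughly as follows. This is a minimal sketch, not the original listing: the enclosing class name `ClauseTokenizer`, the constant names other than `WHITESPACE`, `AND`, `OR` and `EOF`, the fields of `Token`, and the two `TokenStream` methods are assumptions chosen for illustration. It also adds `\b` after `AND` and `OR` so that an identifier that merely starts with those letters is not split.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClauseTokenizer {
    // Order matters: regex group i+1 corresponds to TokenType.values()[i].
    enum TokenType { WHITESPACE, AND, OR, EQUALS, LEFT_PAREN, RIGHT_PAREN, IDENTIFIER, LITERAL, EOF }

    static final class Token {
        final TokenType type;
        final int start;   // position in the input, used for error messages
        final String data; // matched text (literal content without the quotes)

        Token(TokenType type, int start, String data) {
            this.type = type;
            this.start = start;
            this.data = data;
        }

        @Override public String toString() { return type + "[" + data + "]"; }
    }

    // One capturing group per token type, EOF excluded.
    static final Pattern TOKENS = Pattern.compile(
            "(\\s+)|(AND\\b)|(OR\\b)|(=)|(\\()|(\\))|(\\w+)|'([^']*)'");

    static final class TokenStream {
        private final List<Token> tokens;
        private int offset = 0;

        TokenStream(List<Token> tokens) { this.tokens = tokens; }

        /** Returns the next token and advances if it has the given type, else null. */
        Token consumeIf(TokenType type) {
            Token token = tokens.get(offset);
            if (token.type != type) return null;
            if (type != TokenType.EOF) offset++; // never step past the EOF marker
            return token;
        }

        /** Like consumeIf, but a mismatch is reported as a parse error. */
        Token consume(TokenType type) throws ParseException {
            Token token = consumeIf(type);
            if (token == null) {
                Token actual = tokens.get(offset);
                throw new ParseException("Expected " + type + " but found " + actual, actual.start);
            }
            return token;
        }

        @Override public String toString() { return tokens.toString(); }
    }

    static TokenStream tokenize(String input) throws ParseException {
        Matcher matcher = TOKENS.matcher(input);
        TokenType[] types = TokenType.values();
        List<Token> tokens = new ArrayList<>();
        int offset = 0;
        while (offset != input.length()) {
            // The next token must begin exactly where the previous one ended.
            if (!matcher.find() || matcher.start() != offset) {
                throw new ParseException("Unexpected token at " + offset, offset);
            }
            for (int i = 0; i < matcher.groupCount(); i++) {
                if (matcher.group(i + 1) != null) {
                    if (types[i] != TokenType.WHITESPACE) {
                        tokens.add(new Token(types[i], offset, matcher.group(i + 1)));
                    }
                    break;
                }
            }
            offset = matcher.end();
        }
        tokens.add(new Token(TokenType.EOF, input.length(), ""));
        return new TokenStream(tokens);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(tokenize("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')"));
    }
}
```

Running `main` prints the token list for the sample expression; an unexpected character such as `#` raises a `ParseException` that carries its offset.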
Now let's write the parser which will create a tree-like representation of our expression. First, the interface `Expr` for all the tree nodes:
Its only method evaluates the expression for a given data set and returns true if the data set matches.
The most basic expression is the `EqualsExpr`, which looks like `Acct1 = 'Y'` or `'Y' = Acct1`:
The `toString()` method is just for information; you can remove it.
Next we will define the `SubExpr` class, which is either an `EqualsExpr` or something more complex in parentheses (if we see a parenthesis):
Next is `AndExpr`, which is a set of `SubExpr` expressions joined by the `AND` operator:
I use the Java 8 Stream API in `toString` for brevity. If you cannot use Java 8, you may rewrite it with a for loop or remove `toString` completely.
Finally we define `OrExpr`, which is a set of `AndExpr` joined by `OR` (usually `OR` has lower priority than `AND`). It's very similar to `AndExpr`:
And the final `parse` method:
So you can parse your expressions to get `Expr` objects, then evaluate them against the rows of your CSV file. I assume that you're able to parse a CSV row into a `Map<String, String>`.
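The grammar described above (an `OrExpr` is `AndExpr`s joined by `OR`, an `AndExpr` is `SubExpr`s joined by `AND`, a `SubExpr` is a parenthesized expression or an `EqualsExpr`) can be sketched as a small recursive-descent parser. To stay self-contained and short, this sketch makes two simplifying assumptions that are not from the answer: it tokenizes into plain strings instead of using `Token`/`TokenStream`, and it represents tree nodes as lambdas implementing `Expr` rather than as separate named classes with `toString`. The class name `ClauseParser` and its helper methods are hypothetical.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClauseParser {
    /** A node of the expression tree; returns true when the row matches. */
    interface Expr { boolean evaluate(Map<String, String> row); }

    // Simplified tokenizer (an assumption of this sketch): literals keep their
    // quotes so the value 'AND' cannot be confused with the operator AND.
    private static final Pattern TOKEN =
            Pattern.compile("\\s*(AND\\b|OR\\b|=|\\(|\\)|\\w+|'[^']*')");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private ClauseParser(String input) throws ParseException {
        Matcher m = TOKEN.matcher(input);
        int offset = 0;
        while (m.find() && m.start() == offset) {
            tokens.add(m.group(1));
            offset = m.end();
        }
        if (!input.substring(offset).trim().isEmpty()) {
            throw new ParseException("Unexpected input at " + offset, offset);
        }
    }

    static Expr parse(String input) throws ParseException {
        ClauseParser p = new ClauseParser(input);
        Expr expr = p.parseOr();
        if (p.pos != p.tokens.size()) throw p.error("Unexpected token");
        return expr;
    }

    // OrExpr := AndExpr (OR AndExpr)*   (OR binds more loosely than AND)
    private Expr parseOr() throws ParseException {
        List<Expr> parts = new ArrayList<>();
        do { parts.add(parseAnd()); } while (accept("OR"));
        return row -> parts.stream().anyMatch(e -> e.evaluate(row));
    }

    // AndExpr := SubExpr (AND SubExpr)*
    private Expr parseAnd() throws ParseException {
        List<Expr> parts = new ArrayList<>();
        do { parts.add(parseSub()); } while (accept("AND"));
        return row -> parts.stream().allMatch(e -> e.evaluate(row));
    }

    // SubExpr := '(' OrExpr ')' | EqualsExpr
    private Expr parseSub() throws ParseException {
        if (accept("(")) {
            Expr inner = parseOr();
            if (!accept(")")) throw error("Expected )");
            return inner;
        }
        String left = next();
        if (!accept("=")) throw error("Expected =");
        String right = next();
        // EqualsExpr: accept both Acct1 = 'Y' and 'Y' = Acct1.
        boolean literalFirst = left.startsWith("'");
        String name = literalFirst ? right : left;
        String quoted = literalFirst ? left : right;
        if (!quoted.startsWith("'") || !name.matches("\\w+")
                || name.equals("AND") || name.equals("OR")) {
            throw error("Expected identifier = 'literal'");
        }
        String value = quoted.substring(1, quoted.length() - 1);
        return row -> value.equals(row.get(name));
    }

    private boolean accept(String token) {
        if (pos < tokens.size() && tokens.get(pos).equals(token)) {
            pos++;
            return true;
        }
        return false;
    }

    private String next() throws ParseException {
        if (pos >= tokens.size()) throw error("Unexpected end of input");
        return tokens.get(pos++);
    }

    private ParseException error(String message) {
        return new ParseException(message + " at token " + pos, pos);
    }

    public static void main(String[] args) throws ParseException {
        Expr expr = parse("Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N')");
        Map<String, String> row = new HashMap<>();
        row.put("Acct1", "Y");
        row.put("Acct2", "Y");
        row.put("Acct3", "N");
        System.out.println(expr.evaluate(row)); // prints true
    }
}
```

Since the clauses change rarely compared with the number of rows, each of the ten clauses would be parsed once and the resulting `Expr` objects applied to every row in sequence.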
Here's a usage example:
I don't know how efficient this will be in Java, but basic string-replace operations could be a simple solution for this.
You start with a query string:
For each line in the CSV, e.g. `Y,N,Y,N,Y,N,Y,N,Y,N,Y,N`, string-replace the column headers in the query by the values; that gives you:
Then replace the comparisons by their boolean value:
- replace `N = 'N'` and `Y = 'Y'` by `Y`
- replace `N = 'Y'` and `Y = 'N'` by `N`
This will result in:
Then loop through a number of string-replace operations which replace truthy values by `Y` and falsy values by `N`:
- replace `Y AND Y` by `Y`
- replace `N AND N`, `N AND Y` and `Y AND N` by `N`
- replace `Y OR Y`, `N OR Y` and `Y OR N` by `Y`
- replace `N OR N` by `N`
- replace `(N)` by `N`
- replace `(Y)` by `Y`
This will gradually reduce the boolean statement:
If the queries include implicit precedence without brackets, like `N AND N OR Y AND Y` where you want `AND` to have precedence over `OR`, always exhaust the possibilities to replace `AND` and brackets before replacing `OR`:
During this reduction, make sure to check whether the string length has decreased after every iteration, to avoid infinite loops caused by malformed queries.
Hint:
A possible solution is to store your Boolean condition values in a single string attribute, like "YNYNNNYNYNYN", or, better, packed as a binary integer. Then, for a given clause, generate a table of all accepted strings. A join operation against that table will return all desired records.
You can even process several clauses in a single go by adjoining the clause number to the accepted strings when generating the table.
Even though the table size can be exponential in the number of conditions, this can remain quite manageable for a moderate number of conditions.
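As a rough in-memory illustration of this idea, the sketch below packs the flags into an `int`, evaluates a clause once for each of the 2^n combinations, and keeps the accepted bit patterns in a `BitSet`; a `BitSet` lookup then stands in for the join. The class name, the three-flag layout, and the hard-coded example clause are assumptions made for the example; in practice the predicate would come from whatever evaluates the stored clause.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

final class AcceptedMaskTable {

    /** Packs a string such as "YNY" into an int: bit i is set when flag i is 'Y'. */
    static int pack(String flags) {
        int mask = 0;
        for (int i = 0; i < flags.length(); i++) {
            if (flags.charAt(i) == 'Y') {
                mask |= 1 << i;
            }
        }
        return mask;
    }

    /** Evaluates the clause once per possible combination and records the accepted ones. */
    static BitSet buildTable(int flagCount, IntPredicate clause) {
        int combinations = 1 << flagCount; // exponential in the number of flags
        BitSet accepted = new BitSet(combinations);
        for (int mask = 0; mask < combinations; mask++) {
            if (clause.test(mask)) {
                accepted.set(mask);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Hypothetical clause: Acct1 = 'Y' AND (Acct2 = 'N' OR Acct3 = 'N'),
        // with Acct1, Acct2, Acct3 stored in bits 0, 1, 2.
        IntPredicate clause = m -> (m & 1) != 0 && ((m & 2) == 0 || (m & 4) == 0);
        BitSet accepted = buildTable(3, clause);
        for (String row : new String[] {"YNY", "YYY", "NNN"}) {
            // Filtering a record is now a single table lookup.
            System.out.println(row + " -> " + accepted.get(pack(row)));
        }
    }
}
```

With the 12 flags from the question the table has 4096 entries per clause, which is small; the exponential growth only becomes a concern with a few dozen flags.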
What you have is an expression written in some language that seems compliant with the grammar of the WHERE clause of SQL. So you need:
This is a simple language, so you can build your parser by hand, or otherwise look at ANTLR or JavaCC. In that case I suggest you take a look at some sample ANTLR or JavaCC grammars. Of course, you don't need a full SQL parser! Just extract the bits you need.
An easier approach is to write the filter expression in some language that can be invoked via the Java scripting interface, like JavaScript or Groovy (or Ruby, Python...). I don't suggest running a find/replace on the input text to transform the SQL-like language into the target language (for example, Python has lowercase `and` and `or` operators), as that would break easily depending on the content of the input string.