This question relates to the answer given in this post.
I want to convert the output from a tree analysis in Weka into a hierarchical table of decision splits and leaf-values (as per the post linked above). I can parse the Weka output to extract the fac
, split
and val
values but I'm struggling to parse the output and generate the correct hierachyid
values.
First thing I note is that the tree description don't map one-to-one with the records in decisions
. There are 20 lines in the Weka output and 21 records in the decisions
table. This is because there are 11 leaf-nodes and 10 splits — each record in decisions
is either a leaf-node or a split.
The Weka output lines correspond to either zero, one or two records in decisions
. For example Ruleset #8 corresponds to no records; ruleset #1 corresponds to one record; ruleset #4 corresponds to two records.
I have the following example output
# Ruleset
1 fac_a < 64
2 | fac_d < 71.5
3 | | fac_a < 49.5
4 | | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
7 | fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
8 fac_a >= 64
9 | fac_d < 83.5
10 | | fac_a < 91
11 | | | fac_e < 93.5
12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
13 | | | | fac_d >= 45
14 | | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
15 | | | | | fac_e >= 21.5
16 | | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
17 | | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
18 | | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
19 | | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
20 | fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
I can determine if a Weak output line generates a split
record in decisions
by parsing for the substring <
. I can determine if a line generates a val
record in decisions
by parsing for the :
. However, I'm struggling to generate the appropriate hierachyid
for both types of record in the decisions
table.
The desired code to autogenerate for this example would be:
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26);
go
What algorithm can I apply to generate the strings such as /0/1/0/0/0/1/1/0/
that I need to attach to each split
or val
record in the decisions
table?
Here's SQL code that may work to turn your Weka output into the rows for the [decisions] table.
Obviously, SQL isn't the natural language to use, but it's what I had open and handy near the rest of the SQL for this question. Ultimately, they key idea is to implement a stack to keep track of the hierarchy. This is terribly kludgy, so I'd examine and test it well before using the idea in whatever language you use for your data-munging script. The overall idea isn't as awful as this looks. The worst of the code is string manipulation; that can be slicked up a great deal if you use a language with regular expression support.
I also junked the hierarchyid type, following Itzik's improvements (noted in the other thread).
Hope this helps.
You'll note that I make no use of the indentation in the Weka output. Instead, I'm making relatively strong assumptions about the nature of the rules and their order. (Every new nested comparison uses the < operator, for example, and a >= with the same value appears later. I also make assumptions about exact numbers of spaces and names like fac_x, some of which the use of regular expressions will obviate.)
As you noted, each of your Weka output lines corresponds to 0, 1, or 2 INSERT statements. I'm restating some of what you said in case it helps you or someone else reading.
Summary
Output lines with < and without . are pure branch nodes (IFs) and correspond to 1
INSERT
with null for the column [val].Output lines with < and : are both branch and assignment nodes, so they correspond to 2
INSERT
s. One with null [val], and one with the hierarchyid extended by0/
and with non-null [val].Output lines with >= and without . are ELSE nodes in your tree. The >= comparison information is redundant in your source and those lines require no INSERT statement.
In this example, no
INSERT
statement is needed for the >= branching (source lines 8, 13, 15), because the >= condition is necessarily true at that point in the decision tree. Those lines of your output are like ELSE statements, where you've redundantly stated what must be true about the factor value at that point. (The decisions could be made correctly even without the ">= ##.#" information from the tree in those lines.)Algorithm outline
Go through your Weka output in order.
INSERT
once (append '0\' to the hierarchyid) for the decision (put NULL in [val]),:
in it,INSERT
another row in the table (appending a second0\
) for the assignment:
in it:
and is an assignment, find it's "sibling" in the decision tree (the most recent row above it at the same indentation level). The sibling's hierarchyid will end in '0\', because it's a < comparison. Change the0\
to1\
andINSERT
with a non-null [val].Hope that helps and can be done practically from what you have.
Here's another set of INSERT statements that reference the line of your Weka output.
I have a version for you that will allow any number of factors and tree depth (with only slight modifications necessary to demo more). I don't know what the performance will be like, but it is potentially good if appropriate indexes are added.
First we load the raw data:
Then we parse this into a
RuleSets
table that encodes the tree in the form needed for a data-probing query:Here's what the resulting data looks like (only leaf node IDs are needed and present):
And I've created some fake sample probe data. Note that here, the factors are in rows, not in columns. If you have
fac_a
throughfac_z
and thenfac_aa
throughfac_zz
, you're still in business.Example probe data:
Finally, here's the query that shows the innermost ID row from the Weka Tree that matches the conditions of the probe row. Please keep in mind that I have not created suitable indexes here, and you should do so. Using the values 25, 50, 75, and 100 for each of the factors, this creates every possibly combination:
Example results:
Please feel free to ask any questions you like--I'd be happy to help you get this working in a test against your own data. I can't promise instant response but I generally do check for activity on SO daily so would at least be able to respond within a day or two in most cases.
See a live demo at SQL Fiddle