可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
This question relates to the answer given in this post.
I want to convert the output from a tree analysis in Weka into a hierarchical table of decision splits and leaf-values (as per the post linked above). I can parse the Weka output to extract the fac
, split
and val
values but I'm struggling to parse the output and generate the correct hierachyid
values.
First thing I note is that the tree description don't map one-to-one with the records in decisions
. There are 20 lines in the Weka output and 21 records in the decisions
table. This is because there are 11 leaf-nodes and 10 splits — each record in decisions
is either a leaf-node or a split.
The Weka output lines correspond to either zero, one or two records in decisions
. For example Ruleset #8 corresponds to no records; ruleset #1 corresponds to one record; ruleset #4 corresponds to two records.
I have the following example output
# Ruleset
1 fac_a < 64
2 | fac_d < 71.5
3 | | fac_a < 49.5
4 | | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
7 | fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
8 fac_a >= 64
9 | fac_d < 83.5
10 | | fac_a < 91
11 | | | fac_e < 93.5
12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
13 | | | | fac_d >= 45
14 | | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
15 | | | | | fac_e >= 21.5
16 | | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
17 | | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
18 | | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
19 | | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
20 | fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
I can determine if a Weak output line generates a split
record in decisions
by parsing for the substring <
. I can determine if a line generates a val
record in decisions
by parsing for the :
. However, I'm struggling to generate the appropriate hierachyid
for both types of record in the decisions
table.
The desired code to autogenerate for this example would be:
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26);
go
What algorithm can I apply to generate the strings such as /0/1/0/0/0/1/1/0/
that I need to attach to each split
or val
record in the decisions
table?
回答1:
As you noted, each of your Weka output lines corresponds to 0, 1, or 2 INSERT statements. I'm restating some of what you said in case it helps you or someone else reading.
Summary
Output lines with < and without . are pure branch nodes (IFs) and correspond to 1 INSERT
with null for the column [val].
Output lines with < and : are both branch and assignment nodes, so they correspond to 2 INSERT
s. One with null [val], and one with the hierarchyid extended by 0/
and with non-null [val].
Output lines with >= and without . are ELSE nodes in your tree. The >= comparison information is redundant in your source and those lines require no INSERT statement.
In this example, no INSERT
statement is needed for the >= branching (source lines 8, 13, 15), because the >= condition is necessarily true at that point in the decision tree. Those lines of your output are like ELSE statements, where you've redundantly stated what must be true about the factor value at that point. (The decisions could be made correctly even without the ">= ##.#" information from the tree in those lines.)
Algorithm outline
Go through your Weka output in order.
- If the line you're at is indented from the previous one,
INSERT
once (append '0\' to the hierarchyid) for the decision (put NULL in [val]),
- If the Weka line has also got
:
in it, INSERT
another row in the table (appending a second 0\
) for the assignment
- If the line you're at is not indented from the previous one, skip it if it has no
:
in it
- If it has
:
and is an assignment, find it's "sibling" in the decision tree (the most recent row above it at the same indentation level). The sibling's hierarchyid will end in '0\', because it's a < comparison. Change the 0\
to 1\
and INSERT
with a non-null [val].
Hope that helps and can be done practically from what you have.
Here's another set of INSERT statements that reference the line of your Weka output.
create table decisions (
did hierarchyid primary key,
fac char,
split decimal(10,4),
val decimal(10,4),
sourceline int
)
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null,1),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null,2),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null,3),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null,4),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44,4),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25,5),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8,6),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6,7),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null,9),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null,10),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1,20),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null,11),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null,12),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9,12),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null,14),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1,14),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null,16),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45,16),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46,17),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97,18),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26,19);
回答2:
Here's SQL code that may work to turn your Weka output into the rows for the [decisions] table.
Obviously, SQL isn't the natural language to use, but it's what I had open and handy near the rest of the SQL for this question. Ultimately, they key idea is to implement a stack to keep track of the hierarchy. This is terribly kludgy, so I'd examine and test it well before using the idea in whatever language you use for your data-munging script. The overall idea isn't as awful as this looks. The worst of the code is string manipulation; that can be slicked up a great deal if you use a language with regular expression support.
I also junked the hierarchyid type, following Itzik's improvements (noted in the other thread).
Hope this helps.
You'll note that I make no use of the indentation in the Weka output. Instead, I'm making relatively strong assumptions about the nature of the rules and their order. (Every new nested comparison uses the < operator, for example, and a >= with the same value appears later. I also make assumptions about exact numbers of spaces and names like fac_x, some of which the use of regular expressions will obviate.)
create table ruleset (
id int primary key,
therule varchar(200)
);
insert into ruleset values
(1,'fac_a < 64'),
(2,'| fac_d < 71.5'),
(3,'| | fac_a < 49.5'),
(4,'| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5,'| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6,'| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7,'| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8,'fac_a >= 64'),
(9,'| fac_d < 83.5'),
(10,'| | fac_a < 91'),
(11,'| | | fac_e < 93.5'),
(12,'| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13,'| | | | fac_d >= 45'),
(14,'| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15,'| | | | | fac_e >= 21.5'),
(16,'| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17,'| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18,'| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19,'| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20,'| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
go
declare @ruleid int = 0;
declare @rulevar char;
declare @rulecomp decimal(10,4);
declare @ruleassign varchar(200);
declare @last int = (select max(id) from ruleset);
declare @rule varchar(200);
declare @resultindentlevel int = 0;
declare @stack table (
id int identity(1,1) primary key,
hier varchar(200),
resultindentlevel int
);
insert into @stack values ('',0);
declare @results table (
hier varchar(200),
line varchar(200)
);
while @ruleid < @last begin
set @ruleid += 1;
set @rule = (select therule+space(1) from ruleset where id=@ruleid);
declare @c char = case when @rule like '%[<]%' then '0' else '1' end;
if @rule not like '%[<:]%' continue;
declare @varpos int = charindex('f',@rule)+4;
set @rulevar = substring(@rule,@varpos,1);
set @rulecomp =
substring(@rule,@varpos+4,charindex(space(1),@rule,@varpos+5)-@varpos-4);
declare @peek varchar(200) =
(select top (1) hier from @stack order by id desc)
--select * from @stack;
if @rule not like '%>%' begin -- handle new condition
set @peek += @c;
if exists (select hier from @results where hier=@peek)
set @peek=left(@peek,len(@peek)-1)+'1';
insert into @results
select @peek,@peek+'|'+@rulevar+'|'+ltrim(str(@rulecomp,15,4))+'||';
insert into @stack values (@peek,0);
end
declare @colon int = charindex(':',@rule);
if @colon > 0 begin -- handle assignment value
set @ruleassign = substring(@rule,@colon+2,200);
insert into @results select @peek+@c,@peek+@c + '|'+@rulevar+'||'+@ruleassign;
end
if @rule like '%>%' delete from @stack where id = (select max(id) from @stack)
end;
update @results set line = ''''+replace(rtrim(line),'|',''',''')+'''';
update @results set line = replace(line,'''''','NULL');
select line from @results;
go
回答3:
I have a version for you that will allow any number of factors and tree depth (with only slight modifications necessary to demo more). I don't know what the performance will be like, but it is potentially good if appropriate indexes are added.
First we load the raw data:
CREATE TABLE dbo.WekaTree (
ID int,
Ruleset varchar(70)
);
INSERT dbo.WekaTree (ID, Ruleset)
VALUES
(1, 'fac_a < 64'),
(2, '| fac_d < 71.5'),
(3, '| | fac_a < 49.5'),
(4, '| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5, '| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6, '| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7, '| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8, 'fac_a >= 64'),
(9, '| fac_d < 83.5'),
(10, '| | fac_a < 91'),
(11, '| | | fac_e < 93.5'),
(12, '| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13, '| | | | fac_d >= 45'),
(14, '| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15, '| | | | | fac_e >= 21.5'),
(16, '| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17, '| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18, '| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19, '| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20, '| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
;
Then we parse this into a RuleSets
table that encodes the tree in the form needed for a data-probing query:
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) FROM C, C D),
Data AS (
SELECT
ID,
Ruleset,
Depth = Len(Ruleset) - Len(Replace(Ruleset, '|', '')) + 1,
Data = Replace(Ruleset, '| ', '')
FROM
dbo.WekaTree
), Depths AS (
SELECT
D.ID,
D.Ruleset,
D.Depth,
F.Factor,
O.Operator,
V.Value,
V.Remainder
FROM
Data D
CROSS APPLY (
SELECT
Factor = Left(D.Data, CharIndex(' ', D.Data) - 1),
OperatorString = Substring(D.Data, CharIndex(' ', D.Data) + 1, 8000)
) F
CROSS APPLY (
SELECT
Operator = Left(F.OperatorString, CharIndex(' ', F.OperatorString) - 1),
ValueString = Substring(F.OperatorString, CharIndex(' ', F.OperatorString) + 1, 8000)
) O
CROSS APPLY (
SELECT
Value = Convert(decimal(10,2), Left(O.ValueString, CharIndex(' ', O.ValueString + ' ') - 1)),
Remainder = Substring(O.ValueString, CharIndex(' ', O.ValueString + ' ') + 3, 8000)
) V
)
SELECT
D.ID,
D.Remainder,
H.Factor,
H.Operator,
H.Value
INTO
dbo.Rulesets
FROM
Depths D
OUTER APPLY (
SELECT
X.Factor,
X.Operator,
Value = Min(X.Value * M.Multiplier) * M.Multiplier
FROM
N
CROSS APPLY (
SELECT TOP 1
*
FROM
Depths D2
WHERE
N.Num = D2.Depth
AND D.ID >= D2.ID
ORDER BY
D2.ID DESC
) X
CROSS APPLY (
SELECT 1 WHERE X.Operator = '<'
UNION ALL SELECT -1 WHERE X.Operator = '>='
) M (Multiplier)
WHERE
N.Num <= D.Depth
GROUP BY
X.Factor,
X.Operator,
M.Multiplier
) H
WHERE
D.Remainder <> ''
ORDER BY
D.ID,
H.Factor,
H.Operator
;
Here's what the resulting data looks like (only leaf node IDs are needed and present):
ID Remainder Factor Operator Value
---- --------------------------- ------ -------- ---------------------------------------
4 19.44 (13/43.71) [13/77.47] fac_a < 49.5
4 19.44 (13/43.71) [13/77.47] fac_d < 23.5
5 24.25 (32/23.65) [16/49.15] fac_a < 49.5
5 24.25 (32/23.65) [16/49.15] fac_d < 71.5
5 24.25 (32/23.65) [16/49.15] fac_d >= 23.5
6 30.8 (10/17.68) [5/22.44] fac_a < 64.0
6 30.8 (10/17.68) [5/22.44] fac_a >= 49.5
6 30.8 (10/17.68) [5/22.44] fac_d < 71.5
7 33.6 (25/53.05) [15/47.35] fac_a < 64.0
7 33.6 (25/53.05) [15/47.35] fac_d >= 71.5
12 31.9 (16/23.25) [3/64.14] fac_a < 91.0
12 31.9 (16/23.25) [3/64.14] fac_a >= 64.0
12 31.9 (16/23.25) [3/64.14] fac_d < 45.0
12 31.9 (16/23.25) [3/64.14] fac_e < 93.5
14 44.1 (5/16.58) [2/21.39] fac_a < 91.0
14 44.1 (5/16.58) [2/21.39] fac_a >= 64.0
14 44.1 (5/16.58) [2/21.39] fac_d < 83.5
14 44.1 (5/16.58) [2/21.39] fac_d >= 45.0
14 44.1 (5/16.58) [2/21.39] fac_e < 21.5
16 33.45 (4/2.89) [1/0.03] fac_a < 77.5
16 33.45 (4/2.89) [1/0.03] fac_a >= 64.0
16 33.45 (4/2.89) [1/0.03] fac_d < 83.5
16 33.45 (4/2.89) [1/0.03] fac_d >= 45.0
16 33.45 (4/2.89) [1/0.03] fac_e < 93.5
16 33.45 (4/2.89) [1/0.03] fac_e >= 21.5
17 39.46 (7/10.21) [1/11.69] fac_a < 91.0
17 39.46 (7/10.21) [1/11.69] fac_a >= 77.5
17 39.46 (7/10.21) [1/11.69] fac_d < 83.5
17 39.46 (7/10.21) [1/11.69] fac_d >= 45.0
17 39.46 (7/10.21) [1/11.69] fac_e < 93.5
17 39.46 (7/10.21) [1/11.69] fac_e >= 21.5
18 45.97 (2/8.03) [1/107.71] fac_a < 91.0
18 45.97 (2/8.03) [1/107.71] fac_a >= 64.0
18 45.97 (2/8.03) [1/107.71] fac_d < 83.5
18 45.97 (2/8.03) [1/107.71] fac_e >= 93.5
19 42.26 (9/9.57) [4/69.03] fac_a >= 91.0
19 42.26 (9/9.57) [4/69.03] fac_d < 83.5
20 47.1 (9/30.24) [6/40.15] fac_a >= 64.0
20 47.1 (9/30.24) [6/40.15] fac_d >= 83.5
And I've created some fake sample probe data. Note that here, the factors are in rows, not in columns. If you have fac_a
through fac_z
and then fac_aa
through fac_zz
, you're still in business.
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) - 1 FROM B, C, C D)
SELECT
N.Num,
F.Factor,
V.Value
INTO
dbo.LookupData
FROM
N
CROSS JOIN (VALUES
(1, 'fac_a'), (4, 'fac_b'), (16, 'fac_c'), (64, 'fac_d'), (256, 'fac_e')
) F (Mult, Factor)
INNER JOIN (VALUES
(0, 25), (1, 50), (2, 75), (3, 100)
) V (Pattern, Value)
ON (N.Num / F.Mult) % 4 = V.Pattern
WHERE
N.Num <= 1023
;
Example probe data:
Num Factor Value
------ ------ -----------
0 fac_a 25
0 fac_b 25
0 fac_c 25
0 fac_d 25
0 fac_e 25
1 fac_a 50
1 fac_b 25
1 fac_c 25
1 fac_d 25
1 fac_e 25
2 fac_a 75
2 fac_b 25
2 fac_c 25
2 fac_d 25
2 fac_e 25
...
1021 fac_a 50
1021 fac_b 100
1021 fac_c 100
1021 fac_d 100
1021 fac_e 100
1022 fac_a 75
1022 fac_b 100
1022 fac_c 100
1022 fac_d 100
1022 fac_e 100
1023 fac_a 100
1023 fac_b 100
1023 fac_c 100
1023 fac_d 100
1023 fac_e 100
Finally, here's the query that shows the innermost ID row from the Weka Tree that matches the conditions of the probe row. Please keep in mind that I have not created suitable indexes here, and you should do so. Using the values 25, 50, 75, and 100 for each of the factors, this creates every possibly combination:
WITH Matches AS (
SELECT
L.Num,
R.ID
FROM
dbo.LookupData L
INNER JOIN dbo.Rulesets R
ON L.Factor = R.Factor
GROUP BY
L.Num,
R.ID
HAVING
Min(CASE WHEN (
R.Operator = '<'
AND L.Value < R.Value
) OR (
R.Operator = '>='
AND L.Value >= R.Value
) THEN 1 ELSE 0 END) = 1
)
SELECT
L.*,
W.*
FROM
dbo.LookupData L
INNER JOIN Matches M
ON L.Num = M.Num
LEFT JOIN dbo.WekaTree W
ON M.ID = W.ID
ORDER BY
L.Num
;
Example results:
Num Factor Value ID Ruleset
--- ------ ----- -- -------------------------------------------------------
0 fac_a 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_b 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_c 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_d 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_e 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
1 fac_a 50 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_b 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_c 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_d 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_e 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
2 fac_a 75 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_b 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_c 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_d 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_e 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
Please feel free to ask any questions you like--I'd be happy to help you get this working in a test against your own data. I can't promise instant response but I generally do check for activity on SO daily so would at least be able to respond within a day or two in most cases.
See a live demo at SQL Fiddle