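This question relates to the answer given in this post.
I want to convert the output of a Weka tree analysis into a hierarchical table of decision splits and leaf values (as described in the post linked above). I can parse the Weka output to extract the fac, split and val values, but I am struggling to generate the correct hierarchyid values.
The first thing I noticed is that the tree description does not map one-to-one onto the records in decisions. There are 20 lines in the Weka output and 21 records in the decisions table. This is because there are 11 leaf nodes and 10 splits, and each record in decisions is either a leaf node or a split.
Each Weka output line corresponds to zero, one or two records in decisions. For example, ruleset #8 corresponds to no records; ruleset #1 corresponds to one record; ruleset #4 corresponds to two records.
I have the following sample output: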
# Ruleset
1 fac_a < 64
2 | fac_d < 71.5
3 | | fac_a < 49.5
4 | | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
7 | fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
8 fac_a >= 64
9 | fac_d < 83.5
10 | | fac_a < 91
11 | | | fac_e < 93.5
12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
13 | | | | fac_d >= 45
14 | | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
15 | | | | | fac_e >= 21.5
16 | | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
17 | | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
18 | | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
19 | | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
20 | fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
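I can determine whether a Weka output line generates a split record in decisions by looking for the substring "<". I can determine whether a line generates a val record in decisions by looking for ":". However, I am struggling to generate the corresponding hierarchyid for both types of record in the decisions table.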
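The parsing step described above can be sketched in Python as follows. This is only an illustration: the regular expression, the parse_line helper and the assumption that every factor is named fac_<something> are mine, not part of the Weka output format, and the leading ruleset number is assumed to have been stripped already.

```python
import re

# Assumed line shape: optional "| " depth markers, a factor named fac_<x>,
# an operator (< or >=), a threshold, and an optional ": value" leaf part.
LINE_RE = re.compile(
    r"^(?P<bars>(?:\|\s*)*)fac_(?P<fac>\w+)\s*(?P<op><|>=)\s*(?P<split>[\d.]+)"
    r"(?:\s*:\s*(?P<val>[\d.]+))?"
)


def parse_line(line):
    """Split one Weka line into depth, factor, operator, threshold and leaf value."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised line: %r" % line)
    val = m.group("val")
    return {
        "depth": m.group("bars").count("|"),  # number of "|" markers
        "fac": m.group("fac"),
        "op": m.group("op"),
        "split": float(m.group("split")),
        "val": float(val) if val is not None else None,  # None for pure splits
    }


print(parse_line("| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]"))
print(parse_line("fac_a >= 64"))
```

The depth field is what the hierarchyid has to be derived from; fac, split and val map directly onto the columns of decisions.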
The code I would like to generate automatically for this example is:
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26);
go
What algorithm can I apply to generate the strings, e.g. /0/1/0/0/0/1/1/0/, that I need to attach to each split or val record in the decisions table?
Answer 1:
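As you mentioned, each Weka output line corresponds to 0, 1, or 2 INSERT statements. I'll restate some of your situation, in case it helps you or others reading.
Summary
- Output lines with < and without : are pure branch nodes (IFs), and correspond to 1 INSERT with a null [val] column.
- Output lines with both < and : are both branch and assignment nodes, so they correspond to 2 INSERTs: one with a null [val], and one whose hierarchyid is extended by 0/ and has a non-null [val].
- Output lines with >= and a : are pure assignment nodes, and correspond to 1 INSERT with a non-null [val] and a hierarchyid ending in 1/.
- Output lines with >= and no : are the ELSE nodes in your tree. The >= comparison information is redundant in the source, and these lines need no INSERT statement.
In this example, no INSERT is needed for the bare >= branches (source lines 8, 13, 15), because the >= condition is necessarily true at that point in the decision tree. Those lines of your output are like ELSE statements in which you redundantly state what must be true of the factor value at that point. (The decision can be made correctly even without the ">= ##.#" information from these lines.)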
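As a quick sanity check of this classification, the following Python sketch counts the rows each line would produce. The records_for_line helper is a hypothetical name of mine, and the sample text is the question's output with the leading ruleset numbers removed.

```python
WEKA_OUTPUT = """\
fac_a < 64
| fac_d < 71.5
| | fac_a < 49.5
| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
fac_a >= 64
| fac_d < 83.5
| | fac_a < 91
| | | fac_e < 93.5
| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
| | | | fac_d >= 45
| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
| | | | | fac_e >= 21.5
| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
"""


def records_for_line(line):
    """Number of decisions rows one Weka line produces under the rules above."""
    has_leaf = ":" in line
    if ">=" in line:
        return 1 if has_leaf else 0  # leaf only, or a redundant ELSE line
    return 2 if has_leaf else 1      # split plus leaf, or split only


counts = [records_for_line(line) for line in WEKA_OUTPUT.splitlines()]
total = sum(counts)
print(counts, total)
```

For the sample, the per-line counts add up to the 21 rows of the decisions table mentioned in the question.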
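Algorithm outline
Walk through your Weka output in order.
- If you're on a line that is indented from the previous one (or on the first line), INSERT once for the decision, with [val] NULL. Its hierarchyid is the enclosing node's hierarchyid with "0/" appended, or "1/" when the line directly above is a bare >= line.
- If that Weka line also has a : in it, INSERT another row into the table for the assignment (append a second 0/).
- If you're on a line that is not indented from the previous one, skip it if it has no : in it.
- If it has a : and is an assignment, find its "sibling" in the decision tree (the nearest line above it with the same indentation level). The node under the sibling's < branch has a hierarchyid ending in "0/", because it is the < comparison. Change that final 0/ to 1/ and INSERT with a non-null [val].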
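One way to read the outline above is sketched below in Python. It is an interpretation rather than a literal transcription: instead of searching backwards for the sibling, it remembers the split opened at each depth in a dictionary. The names build_decisions, split_path and branch are hypothetical, and the regular expression assumes factors named fac_<x> and input lines without the leading ruleset numbers.

```python
import re

LINE_RE = re.compile(
    r"^(?P<bars>(?:\|\s*)*)fac_(?P<fac>\w+)\s*(?P<op><|>=)\s*(?P<split>[\d.]+)"
    r"(?:\s*:\s*(?P<val>[\d.]+))?"
)

WEKA_OUTPUT = """\
fac_a < 64
| fac_d < 71.5
| | fac_a < 49.5
| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
fac_a >= 64
| fac_d < 83.5
| | fac_a < 91
| | | fac_e < 93.5
| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
| | | | fac_d >= 45
| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
| | | | | fac_e >= 21.5
| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
"""


def build_decisions(lines):
    """Return (hierarchyid path, fac, split, val) tuples in source order."""
    split_path = {}  # depth -> path of the split node opened at that depth
    branch = {}      # depth -> "0" inside the < branch, "1" inside the >= branch
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # ignore headers and blank lines
        depth = m.group("bars").count("|")
        if m.group("op") == "<":
            # A < line opens a new split under the enclosing branch.
            if depth == 0:
                path = "/0/"
            else:
                path = split_path[depth - 1] + branch[depth - 1] + "/"
            split_path[depth] = path
            branch[depth] = "0"
            rows.append((path, m.group("fac"), float(m.group("split")), None))
        else:
            # A >= line switches to the right branch of its sibling's split.
            branch[depth] = "1"
        if m.group("val") is not None:
            leaf = split_path[depth] + branch[depth] + "/"
            rows.append((leaf, None, None, float(m.group("val"))))
    return rows


for row in build_decisions(WEKA_OUTPUT.splitlines()):
    print(row)
```

Each tuple can then be formatted into a cast('...' as hierarchyid) row of the INSERT statement. The sketch relies on every >= line being preceded by a < line at the same depth, which holds for the sample but is an assumption about Weka output in general.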
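Hope that helps; you can probably get it done from what you already have.
Here is another set of INSERT statements, which reference your Weka output lines.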
create table decisions (
did hierarchyid primary key,
fac char,
split decimal(10,4),
val decimal(10,4),
sourceline int
)
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null,1),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null,2),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null,3),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null,4),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44,4),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25,5),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8,6),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6,7),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null,9),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null,10),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1,20),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null,11),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null,12),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9,12),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null,14),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1,14),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null,16),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45,16),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46,17),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97,18),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26,19);
Answer 2:
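Here is SQL code that might work to turn your Weka output into rows for the [decisions] table.
Obviously, SQL is not the natural language to use for this, but it is what I had open and handy, given the SQL flavor of the rest of this question. Ultimately, the core idea is to implement a stack that tracks the hierarchy. It is horribly kludgy, so I would check it over and test it well before using the idea in a script written in whatever language you use for data munging. The general idea is not as horrible as this looks. The worst of the code is all string manipulation; if you use a language with regular expression support, it can be tidied up a lot.
I also junked the hierarchyid type, following Itzik's improvement (described in the other thread).
Hope this helps.
You'll notice that I make no use of the indentation in the Weka output. Instead, I am making fairly strong assumptions about the nature and order of the rules. (Every new nested comparison uses the < operator, for example, and a >= with the same value appears later. I also make assumptions about the exact number of spaces and about names like fac_x, some of which regular expressions would eliminate.)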
create table ruleset (
id int primary key,
therule varchar(200)
);
insert into ruleset values
(1,'fac_a < 64'),
(2,'| fac_d < 71.5'),
(3,'| | fac_a < 49.5'),
(4,'| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5,'| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6,'| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7,'| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8,'fac_a >= 64'),
(9,'| fac_d < 83.5'),
(10,'| | fac_a < 91'),
(11,'| | | fac_e < 93.5'),
(12,'| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13,'| | | | fac_d >= 45'),
(14,'| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15,'| | | | | fac_e >= 21.5'),
(16,'| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17,'| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18,'| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19,'| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20,'| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
go
declare @ruleid int = 0;
declare @rulevar char;
declare @rulecomp decimal(10,4);
declare @ruleassign varchar(200);
declare @last int = (select max(id) from ruleset);
declare @rule varchar(200);
declare @resultindentlevel int = 0;
declare @stack table (
id int identity(1,1) primary key,
hier varchar(200),
resultindentlevel int
);
insert into @stack values ('',0);
declare @results table (
hier varchar(200),
line varchar(200)
);
while @ruleid < @last begin
set @ruleid += 1;
set @rule = (select therule+space(1) from ruleset where id=@ruleid);
declare @c char = case when @rule like '%[<]%' then '0' else '1' end;
if @rule not like '%[<:]%' begin
-- ">=" line without ":": flag the open split as being in its right branch
update @stack set resultindentlevel = 1 where id = (select max(id) from @stack);
continue;
end
declare @varpos int = charindex('f',@rule)+4;
set @rulevar = substring(@rule,@varpos,1);
set @rulecomp =
substring(@rule,@varpos+4,charindex(space(1),@rule,@varpos+5)-@varpos-4);
declare @peek varchar(200) =
(select top (1) hier from @stack order by id desc)
--select * from @stack;
if @rule not like '%>%' begin -- handle new condition
set @peek += @c;
if exists (select hier from @results where hier=@peek)
set @peek=left(@peek,len(@peek)-1)+'1';
insert into @results
select @peek,@peek+'|'+@rulevar+'|'+ltrim(str(@rulecomp,15,4))+'||';
insert into @stack values (@peek,0);
end
declare @colon int = charindex(':',@rule);
if @colon > 0 begin -- handle assignment value
set @ruleassign = substring(@rule,@colon+2,200);
insert into @results select @peek+@c,@peek+@c + '|'+@rulevar+'||'+@ruleassign;
end
if @rule like '%>%' begin
-- pop the finished split, then every enclosing split whose right branch is now complete
delete from @stack where id = (select max(id) from @stack);
while exists (select 1 from @stack where resultindentlevel = 1 and id = (select max(id) from @stack))
delete from @stack where id = (select max(id) from @stack);
end
end;
update @results set line = ''''+replace(rtrim(line),'|',''',''')+'''';
update @results set line = replace(line,'''''','NULL');
select line from @results;
go
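For readers who want to try the stack idea outside SQL, here is a minimal Python sketch. It is a variation on the approach rather than a port of the script: it ignores indentation, keeps slash-separated paths, and stores a flag on each stack entry to record that a bare >= line has moved the walk into that split's right branch. The function name, the regular expression and the fac_<x> naming are assumptions of mine.

```python
import re

RULE_RE = re.compile(
    r"fac_(?P<fac>\w+)\s*(?P<op><|>=)\s*(?P<split>[\d.]+)(?:\s*:\s*(?P<val>[\d.]+))?"
)

WEKA_OUTPUT = """\
fac_a < 64
| fac_d < 71.5
| | fac_a < 49.5
| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
fac_a >= 64
| fac_d < 83.5
| | fac_a < 91
| | | fac_e < 93.5
| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
| | | | fac_d >= 45
| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
| | | | | fac_e >= 21.5
| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
"""


def paths_without_indentation(lines):
    """Derive (path, fac, split, val) rows from rule order alone."""
    stack = []  # entries are [split path, currently in its right branch?]
    rows = []
    for line in lines:
        m = RULE_RE.search(line)
        if m is None:
            continue
        val = m.group("val")
        if m.group("op") == "<":
            # New split: child of the split on top of the stack.
            if stack:
                parent, in_right = stack[-1]
                path = parent + ("1" if in_right else "0") + "/"
            else:
                path = "/0/"
            stack.append([path, False])
            rows.append((path, m.group("fac"), float(m.group("split")), None))
            if val is not None:
                rows.append((path + "0/", None, None, float(val)))
        elif val is None:
            # Bare >= line: later splits hang under the right branch.
            stack[-1][1] = True
        else:
            # >= leaf: closes the split on top of the stack, plus every
            # enclosing split whose right branch is thereby finished.
            rows.append((stack[-1][0] + "1/", None, None, float(val)))
            stack.pop()
            while stack and stack[-1][1]:
                stack.pop()
    return rows


for row in paths_without_indentation(WEKA_OUTPUT.splitlines()):
    print(row)
```

Like the SQL, this depends on the ordering assumptions listed above: every split starts with a < line, and its >= counterpart appears after the whole left subtree.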
Answer 3:
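My version will allow for any number of factors and any tree depth (with only slight modification needed to what is demonstrated here). I don't know what performance will be like, but with appropriate indexes added it is potentially good.
First, we load the raw data: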
CREATE TABLE dbo.WekaTree (
ID int,
Ruleset varchar(70)
);
INSERT dbo.WekaTree (ID, Ruleset)
VALUES
(1, 'fac_a < 64'),
(2, '| fac_d < 71.5'),
(3, '| | fac_a < 49.5'),
(4, '| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5, '| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6, '| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7, '| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8, 'fac_a >= 64'),
(9, '| fac_d < 83.5'),
(10, '| | fac_a < 91'),
(11, '| | | fac_e < 93.5'),
(12, '| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13, '| | | | fac_d >= 45'),
(14, '| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15, '| | | | | fac_e >= 21.5'),
(16, '| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17, '| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18, '| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19, '| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20, '| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
;
Then we parse this into the Rulesets table, which encodes the tree in the form needed for querying against probe data:
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) FROM C, C D),
Data AS (
SELECT
ID,
Ruleset,
Depth = Len(Ruleset) - Len(Replace(Ruleset, '|', '')) + 1,
Data = Replace(Ruleset, '| ', '')
FROM
dbo.WekaTree
), Depths AS (
SELECT
D.ID,
D.Ruleset,
D.Depth,
F.Factor,
O.Operator,
V.Value,
V.Remainder
FROM
Data D
CROSS APPLY (
SELECT
Factor = Left(D.Data, CharIndex(' ', D.Data) - 1),
OperatorString = Substring(D.Data, CharIndex(' ', D.Data) + 1, 8000)
) F
CROSS APPLY (
SELECT
Operator = Left(F.OperatorString, CharIndex(' ', F.OperatorString) - 1),
ValueString = Substring(F.OperatorString, CharIndex(' ', F.OperatorString) + 1, 8000)
) O
CROSS APPLY (
SELECT
Value = Convert(decimal(10,2), Left(O.ValueString, CharIndex(' ', O.ValueString + ' ') - 1)),
Remainder = Substring(O.ValueString, CharIndex(' ', O.ValueString + ' ') + 3, 8000)
) V
)
SELECT
D.ID,
D.Remainder,
H.Factor,
H.Operator,
H.Value
INTO
dbo.Rulesets
FROM
Depths D
OUTER APPLY (
SELECT
X.Factor,
X.Operator,
Value = Min(X.Value * M.Multiplier) * M.Multiplier
FROM
N
CROSS APPLY (
SELECT TOP 1
*
FROM
Depths D2
WHERE
N.Num = D2.Depth
AND D.ID >= D2.ID
ORDER BY
D2.ID DESC
) X
CROSS APPLY (
SELECT 1 WHERE X.Operator = '<'
UNION ALL SELECT -1 WHERE X.Operator = '>='
) M (Multiplier)
WHERE
N.Num <= D.Depth
GROUP BY
X.Factor,
X.Operator,
M.Multiplier
) H
WHERE
D.Remainder <> ''
ORDER BY
D.ID,
H.Factor,
H.Operator
;
Here is what the resulting data looks like (only the leaf node IDs are needed now):
ID Remainder Factor Operator Value
---- --------------------------- ------ -------- ---------------------------------------
4 19.44 (13/43.71) [13/77.47] fac_a < 49.5
4 19.44 (13/43.71) [13/77.47] fac_d < 23.5
5 24.25 (32/23.65) [16/49.15] fac_a < 49.5
5 24.25 (32/23.65) [16/49.15] fac_d < 71.5
5 24.25 (32/23.65) [16/49.15] fac_d >= 23.5
6 30.8 (10/17.68) [5/22.44] fac_a < 64.0
6 30.8 (10/17.68) [5/22.44] fac_a >= 49.5
6 30.8 (10/17.68) [5/22.44] fac_d < 71.5
7 33.6 (25/53.05) [15/47.35] fac_a < 64.0
7 33.6 (25/53.05) [15/47.35] fac_d >= 71.5
12 31.9 (16/23.25) [3/64.14] fac_a < 91.0
12 31.9 (16/23.25) [3/64.14] fac_a >= 64.0
12 31.9 (16/23.25) [3/64.14] fac_d < 45.0
12 31.9 (16/23.25) [3/64.14] fac_e < 93.5
14 44.1 (5/16.58) [2/21.39] fac_a < 91.0
14 44.1 (5/16.58) [2/21.39] fac_a >= 64.0
14 44.1 (5/16.58) [2/21.39] fac_d < 83.5
14 44.1 (5/16.58) [2/21.39] fac_d >= 45.0
14 44.1 (5/16.58) [2/21.39] fac_e < 21.5
16 33.45 (4/2.89) [1/0.03] fac_a < 77.5
16 33.45 (4/2.89) [1/0.03] fac_a >= 64.0
16 33.45 (4/2.89) [1/0.03] fac_d < 83.5
16 33.45 (4/2.89) [1/0.03] fac_d >= 45.0
16 33.45 (4/2.89) [1/0.03] fac_e < 93.5
16 33.45 (4/2.89) [1/0.03] fac_e >= 21.5
17 39.46 (7/10.21) [1/11.69] fac_a < 91.0
17 39.46 (7/10.21) [1/11.69] fac_a >= 77.5
17 39.46 (7/10.21) [1/11.69] fac_d < 83.5
17 39.46 (7/10.21) [1/11.69] fac_d >= 45.0
17 39.46 (7/10.21) [1/11.69] fac_e < 93.5
17 39.46 (7/10.21) [1/11.69] fac_e >= 21.5
18 45.97 (2/8.03) [1/107.71] fac_a < 91.0
18 45.97 (2/8.03) [1/107.71] fac_a >= 64.0
18 45.97 (2/8.03) [1/107.71] fac_d < 83.5
18 45.97 (2/8.03) [1/107.71] fac_e >= 93.5
19 42.26 (9/9.57) [4/69.03] fac_a >= 91.0
19 42.26 (9/9.57) [4/69.03] fac_d < 83.5
20 47.1 (9/30.24) [6/40.15] fac_a >= 64.0
20 47.1 (9/30.24) [6/40.15] fac_d >= 83.5
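The same flattening can be sketched in Python to make the idea easier to follow: for each leaf, collect the conditions on the path from the root and keep only the tightest bound per factor and operator (the smallest value for <, the largest for >=). The leaf_conditions function and its regular expression are illustrative assumptions, not a translation of the query.

```python
import re

LINE_RE = re.compile(
    r"^(?P<bars>(?:\|\s*)*)(?P<fac>\w+)\s*(?P<op><|>=)\s*(?P<split>[\d.]+)"
    r"(?:\s*:\s*(?P<val>[\d.]+))?"
)

WEKA_OUTPUT = """\
fac_a < 64
| fac_d < 71.5
| | fac_a < 49.5
| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
fac_a >= 64
| fac_d < 83.5
| | fac_a < 91
| | | fac_e < 93.5
| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
| | | | fac_d >= 45
| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
| | | | | fac_e >= 21.5
| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]
"""


def leaf_conditions(lines):
    """Map each leaf's line number to {(factor, operator): tightest bound}."""
    active = {}  # depth -> most recent (factor, operator, value) at that depth
    result = {}
    for line_no, line in enumerate(lines, start=1):
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        depth = m.group("bars").count("|")
        active[depth] = (m.group("fac"), m.group("op"), float(m.group("split")))
        if m.group("val") is None:
            continue  # not a leaf
        bounds = {}
        for d in range(depth + 1):  # conditions from the root down to this line
            fac, op, value = active[d]
            pick = min if op == "<" else max
            bounds[(fac, op)] = pick(bounds.get((fac, op), value), value)
        result[line_no] = bounds
    return result


for leaf_id, bounds in leaf_conditions(WEKA_OUTPUT.splitlines()).items():
    print(leaf_id, sorted(bounds.items()))
```

Line numbers play the role of the ID column, so the input is assumed to contain exactly the 20 rule lines in order.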
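I also created some fake sample probe data. Note that here the factors are in rows, not in columns. If you have fac_a through fac_z and then fac_aa through fac_zz, you're still in business.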
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) - 1 FROM B, C, C D)
SELECT
N.Num,
F.Factor,
V.Value
INTO
dbo.LookupData
FROM
N
CROSS JOIN (VALUES
(1, 'fac_a'), (4, 'fac_b'), (16, 'fac_c'), (64, 'fac_d'), (256, 'fac_e')
) F (Mult, Factor)
INNER JOIN (VALUES
(0, 25), (1, 50), (2, 75), (3, 100)
) V (Pattern, Value)
ON (N.Num / F.Mult) % 4 = V.Pattern
WHERE
N.Num <= 1023
;
Sample probe data:
Num Factor Value
------ ------ -----------
0 fac_a 25
0 fac_b 25
0 fac_c 25
0 fac_d 25
0 fac_e 25
1 fac_a 50
1 fac_b 25
1 fac_c 25
1 fac_d 25
1 fac_e 25
2 fac_a 75
2 fac_b 25
2 fac_c 25
2 fac_d 25
2 fac_e 25
...
1021 fac_a 50
1021 fac_b 100
1021 fac_c 100
1021 fac_d 100
1021 fac_e 100
1022 fac_a 75
1022 fac_b 100
1022 fac_c 100
1022 fac_d 100
1022 fac_e 100
1023 fac_a 100
1023 fac_b 100
1023 fac_c 100
1023 fac_d 100
1023 fac_e 100
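Finally, here is a query that matches the probe rows against the conditions from the Weka tree and shows the innermost matching ID for each. Please keep in mind that I have not created proper indexes here, and you should do so. The probe data uses the values 25, 50, 75, and 100 for each of the factors, which produces every possible combination: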
WITH Matches AS (
SELECT
L.Num,
R.ID
FROM
dbo.LookupData L
INNER JOIN dbo.Rulesets R
ON L.Factor = R.Factor
GROUP BY
L.Num,
R.ID
HAVING
Min(CASE WHEN (
R.Operator = '<'
AND L.Value < R.Value
) OR (
R.Operator = '>='
AND L.Value >= R.Value
) THEN 1 ELSE 0 END) = 1
)
SELECT
L.*,
W.*
FROM
dbo.LookupData L
INNER JOIN Matches M
ON L.Num = M.Num
LEFT JOIN dbo.WekaTree W
ON M.ID = W.ID
ORDER BY
L.Num
;
Example results:
Num Factor Value ID Ruleset
--- ------ ----- -- -------------------------------------------------------
0 fac_a 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_b 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_c 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_d 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_e 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
1 fac_a 50 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_b 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_c 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_d 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_e 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
2 fac_a 75 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_b 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_c 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_d 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_e 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
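The matching rule (a leaf matches a probe when every one of its conditions holds) can be illustrated with a small Python sketch. The LEAF_RULES entries are copied by hand from the flattened result table for four leaf IDs only, and the helper names are hypothetical.

```python
# Conditions for a few leaf IDs, copied from the flattened result table.
LEAF_RULES = {
    4: [("fac_a", "<", 49.5), ("fac_d", "<", 23.5)],
    5: [("fac_a", "<", 49.5), ("fac_d", "<", 71.5), ("fac_d", ">=", 23.5)],
    6: [("fac_a", "<", 64.0), ("fac_a", ">=", 49.5), ("fac_d", "<", 71.5)],
    12: [("fac_a", "<", 91.0), ("fac_a", ">=", 64.0),
         ("fac_d", "<", 45.0), ("fac_e", "<", 93.5)],
}


def holds(value, op, bound):
    """Evaluate one condition; only < and >= occur in the tree."""
    return value < bound if op == "<" else value >= bound


def matching_leaves(probe, leaf_rules=LEAF_RULES):
    """Return the IDs of all leaves whose every condition the probe satisfies."""
    return [
        leaf_id
        for leaf_id, conditions in sorted(leaf_rules.items())
        if all(holds(probe[fac], op, bound) for fac, op, bound in conditions)
    ]


base = {"fac_a": 25, "fac_b": 25, "fac_c": 25, "fac_d": 25, "fac_e": 25}
for fac_a in (25, 50, 75):
    probe = dict(base, fac_a=fac_a)
    print(fac_a, matching_leaves(probe))
```

With the complete rule set loaded, each probe should match exactly one leaf, since the leaves partition the factor space.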
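Please feel free to ask any questions you like. I'd be happy to help you get this working in tests against your own data. I can't promise an instant response, but I generally do check activity on SO every day, and in most cases I will at least be able to reply within a day or two.
See the live demo on SQL Fiddle.
Source: Convert Weka tree into hierachyid for SQL hierachical table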