I would like to retrieve a column of how many differences in letters in each row. For instance
If you have a a value "test" and another row has a value "testing ", then the differences is 4 letter between "test" and "testing ". The data of the column would be value 4
I have reflected about it and I don't know where to begin
id || value || category || differences
--------------------------------------------------
1 || test || 1 || 4
2 || testing || 1 || null
11 || candy || 2 || -3
12 || ca || 2 || null
In this scenario and context it is no difference between "Test" and "rest".
I think what you are looking for is a measure of edit difference, rather than just counting prefix similarity, for which there are a few common algorithms. Levenshtein's method is one that I've used before and I've seen it implemented as TSQL functions. The answers to this SO question suggest a couple of implementations in TSQL that you might just be able to take and use as-is.
(though take time to test the code and understand the method rather than just copying the code and using it, so that you can understand the output if something seems to go wrong - otherwise you could be creating some technical debt you'll have to pay back later)
Exactly which distance calculation method you want will depend on how you want to count certain things, for instance do you count a substitution as one change or a delete and an insert, and if your strings are long enough for it to matter do you want to consider substring moves, and so forth.
I think you just want len()
and lead()
:
select t.id, t.value, t.category,
(len(lead(value) over (partition by t.category order by t.id) -
len(value)
) as difference
from t;
You read a next record with LEAD
. Then compare the strings with LIKE or other string functions:
select
id, value, category,
case when value like next_value + '%' or next_value like value + '%'
then len(next_value) - len(value)
end as differences
from
(
select id, value, category, lead(value) over (order by id) as next_value
from mytable
) this_and_next;
If you only want to compare values within the same category use a partition clause:
lead(value) over (partition by category order by id)
UPDATE: Please see DhruvJoshi's answer on SQL Server's LEN
. This function doesn't count trailing blanks, as I assumed, so you need his trick in case you want to have them counted. Here is the doc on LEN
confirming this behaviour: https://technet.microsoft.com/en-us/library/ms190329(v=sql.105).aspx
create table #temp
(
id int,
value varchar(30),
category int
)
insert into #temp
select 1,'test',1
union all
select 2,'testing',1
union all
select 1,'Candy',2
union all
select 2,'Ca',2
;with cte
as
(
select id,value,category,lead(value) over (partition by category order by id) as nxtvalue
from #temp
)
select id,value,category,len(replace(nxtvalue,value,'')) as differences
from cte
you can also use self joining query like below:
--create table tbl (id int, value nvarchar(100), category int);
--insert into tbl values
--(1,N'test',1)
--,(2,N' testing',1)
--,(11,N'candy',2)
--,(12,N'ca',2);
select A.*, LEN(B.value)-LEN(A.value) as difference
from tbl A LEFT JOIN tbl B on A.id +1 =B.id and A.category=B.category
--drop table tbl
Update: I noticed that you have oddly positioned the space at the end. SQL server most times does not count the trailing spaces when calculating length. So here's the hack on above query
select A.*, LEN(B.value+'>')-LEN(A.value+'>') as difference
from tbl A LEFT JOIN tbl B on A.id +1 =B.id and A.category=B.category
As pointed out in comments, that Id's may not be consecutive, in such cases
try this :
create table #temp ( rownum int PRIMARY KEY IDENTITY(1,1), id int, value nvarchar(100), category int)
insert into #temp (id, value, category)
select id, value, category from tbl order by id asc
select A.id, A.value, A.category, LEN(B.value+'>')-LEN(A.value+'>') as difference
from #temp A LEFT JOIN #temp B on A.rownum +1 =B.rownum and A.category=B.category