The number of differences in a column

2020-05-09 19:33发布

问题:

I would like to retrieve a column of how many differences in letters in each row. For instance

If you have a a value "test" and another row has a value "testing ", then the differences is 4 letter between "test" and "testing ". The data of the column would be value 4

I have reflected about it and I don't know where to begin

id    ||  value     || category   || differences 
--------------------------------------------------
 1    ||  test      || 1          || 4
 2    ||  testing  || 1          || null   
11    ||  candy     || 2          || -3       
12    ||  ca        || 2          || null      

In this scenario and context it is no difference between "Test" and "rest".

回答1:

I think what you are looking for is a measure of edit difference, rather than just counting prefix similarity, for which there are a few common algorithms. Levenshtein's method is one that I've used before and I've seen it implemented as TSQL functions. The answers to this SO question suggest a couple of implementations in TSQL that you might just be able to take and use as-is.

(though take time to test the code and understand the method rather than just copying the code and using it, so that you can understand the output if something seems to go wrong - otherwise you could be creating some technical debt you'll have to pay back later)

Exactly which distance calculation method you want will depend on how you want to count certain things, for instance do you count a substitution as one change or a delete and an insert, and if your strings are long enough for it to matter do you want to consider substring moves, and so forth.



回答2:

I think you just want len() and lead():

select t.id, t.value, t.category,
       (len(lead(value) over (partition by t.category order by t.id) -
        len(value)
       ) as difference
from t;


回答3:

You read a next record with LEAD. Then compare the strings with LIKE or other string functions:

select
  id, value, category,
  case when value like next_value + '%' or next_value like value + '%' 
       then len(next_value) - len(value)
  end as differences
from
(
  select id, value, category, lead(value) over (order by id) as next_value 
  from mytable
) this_and_next;

If you only want to compare values within the same category use a partition clause:

lead(value) over (partition by category order by id)

UPDATE: Please see DhruvJoshi's answer on SQL Server's LEN. This function doesn't count trailing blanks, as I assumed, so you need his trick in case you want to have them counted. Here is the doc on LEN confirming this behaviour: https://technet.microsoft.com/en-us/library/ms190329(v=sql.105).aspx



回答4:

create table #temp
(
id int,
value varchar(30),
category int
)

insert into #temp
select 1,'test',1
union all
select 2,'testing',1
union all
select 1,'Candy',2
union all
select 2,'Ca',2

;with cte
as
(
select id,value,category,lead(value) over (partition by category order by id) as nxtvalue
from #temp
)
select id,value,category,len(replace(nxtvalue,value,'')) as differences
from cte


回答5:

you can also use self joining query like below:

--create table tbl (id int,  value nvarchar(100), category int);
--insert into tbl values
--(1,N'test',1)
--,(2,N' testing',1)
--,(11,N'candy',2)      
--,(12,N'ca',2);
select A.*, LEN(B.value)-LEN(A.value) as difference
from tbl A LEFT JOIN tbl B on A.id +1 =B.id and A.category=B.category
--drop table tbl

Update: I noticed that you have oddly positioned the space at the end. SQL server most times does not count the trailing spaces when calculating length. So here's the hack on above query

select A.*, LEN(B.value+'>')-LEN(A.value+'>') as difference
from tbl A LEFT JOIN tbl B on A.id +1 =B.id and A.category=B.category

As pointed out in comments, that Id's may not be consecutive, in such cases try this :

create table #temp ( rownum int PRIMARY KEY IDENTITY(1,1), id int, value nvarchar(100), category int)
insert into #temp (id, value, category)
select id, value, category from tbl order by id asc


    select A.id, A.value, A.category, LEN(B.value+'>')-LEN(A.value+'>') as difference
    from #temp A LEFT JOIN #temp B on A.rownum +1 =B.rownum and A.category=B.category