Finding difference between strings in Javascript

2019-04-21 07:39发布

问题:

I'd like to compare two strings (a before and after) and detect exactly where and what changed between them.

For any change, I want to know:

  1. Starting position of the change (inclusive, starting at 0)
  2. Ending position of the change (inclusive, starting at 0) relative to the previous text
  3. The "change"

Assume that strings will change in only one place at a time (for example, never "Bill" -> "Kiln").

Additionally, I need the start and end positions to reflect the type of change:

  • If deletion, the start and end position should be the start and end positions of the deleted text, respectively
  • If replacement, the start and end position should be the start and end positions of the "deleted" text, respectively (the change will be the "added" text)
  • If insertion, the start and end positions should be the same; the entry point of the text
  • If no change, let start and end positions remain zero, with an empty change

For example:

"0123456789" -> "03456789"  
Start: 1, End: 2, Change: "" (deletion)

"03456789" -> "0123456789"  
Start: 1, End: 1, Change: "12" (insertion)

"Hello World!" -> "Hello Aliens!"  
Start: 6, End: 10, Change: "Aliens" (replacement)

"Hi" -> "Hi"  
Start: 0, End: 0, Change: "" (no change)

I was able to somewhat detect the positions of the changed text, but it doesn't work in all cases because in order to do that accurately, I need to know what kind of change is made.

var OldText = "My edited string!";
var NewText = "My first string!";

var ChangeStart = 0;
var NewChangeEnd = 0;
var OldChangeEnd = 0;
console.log("Comparing start:");
for (var i = 0; i < NewText.length; i++) {
    console.log(i + ": " + NewText[i] + " -> " + OldText[i]);
    if (NewText[i] != OldText[i]) {
        ChangeStart = i;
        break;
    }
}
console.log("Comparing end:");
// "Addition"?
if (NewText.length > OldText.length) {
    for (var i = 1; i < NewText.length; i++) {
        console.log(i + "(N: " + (NewText.length - i) + " O: " + (OldText.length - i) + ": " + NewText.substring(NewText.length - i, NewText.length - i + 1) + " -> " + OldText.substring(OldText.length - i, OldText.length - i + 1));
        if (NewText.substring(NewText.length - i, NewText.length - i + 1) != OldText.substring(OldText.length - i, OldText.length - i + 1)) {
            NewChangeEnd = NewText.length - i;
            OldChangeEnd = OldText.length - i;
            break;
        }
    }
// "Deletion"?
} else if (NewText.length < OldText.length) {
    for (var i = 1; i < OldText.length; i++) {
        console.log(i + "(N: " + (NewText.length - i) + " O: " + (OldText.length - i) + ": " + NewText.substring(NewText.length - i, NewText.length - i + 1) + " -> " + OldText.substring(OldText.length - i, OldText.length - i + 1));
        if (NewText.substring(NewText.length - i, NewText.length - i + 1) != OldText.substring(OldText.length - i, OldText.length - i + 1)) {
            NewChangeEnd = NewText.length - i;
            OldChangeEnd = OldText.length - i;
            break;
        }
    }
// Same length...
} else {
    // Do something
}
console.log("Change start: " + ChangeStart);
console.log("NChange end : " + NewChangeEnd);
console.log("OChange end : " + OldChangeEnd);
console.log("Change: " + OldText.substring(ChangeStart, OldChangeEnd + 1));

How do I tell whether or not an insertion, deletion, or replacement took place?


I've searched and came up with a few other similar questions, but they don't seem to help.

回答1:

I have gone through your code and your logic for matching string makes sense to me. It logs ChangeStart, NewChangeEnd and OldChangeEnd correctly and the algorithm flows alright. You just want to know if an insertion, deletion or replacement took place. Here's how I would go about it.

First of all, you need to make sure that after you have got the first point of mis-match i.e. ChangeStart when you then traverse the strings from the end, the index shouldn't cross ChangeStart.

I'll give you an example. Consider the following strings:

 var NewText = "Hello Worllolds!";
 var OldText = "Hello Worlds!";

 ChangeStart -> 10 //Makes sense
 OldChangeEnd -> 8
 NewChangeEnd -> 11

 console.log("Change: " + NewText.substring(ChangeStart, NewChangeEnd + 1)); 
 //Ouputs "lo"

The problem in this case is when it starts matching from the back, the flow is something like this:

 Comparing end: 
  1(N: 12 O: 12: ! -> !) 
  2(N: 11 O: 11: s -> s) 
  3(N: 10 O: 10: d -> d)  -> You need to stop here!

 //Although there is not a mismatch, but we have reached ChangeStart and 
 //we have already established that characters from 0 -> ChangeStart-1 match
 //That is why it outputs "lo" instead of "lol"

Assuming, what I just said makes sense, you just need to modify your for loops like so:

 if (NewText.length > OldText.length) {
 for (var i = 1; i < NewText.length && ((OldText.length-i)>=ChangeStart); i++) {
  ...

    NewChangeEnd = NewText.length - i -1;
    OldChangeEnd = OldText.length - i -1;
  if(//Mismatch condition reached){
         //break..That code is fine.
    }
 }

This condition -> (OldText.length-i)>=ChangeStart takes care of the anomaly that I mentioned and therefore the for loop automatically terminates if this condition is reached. However, just as I mentioned there might be situations where this condition is reached before a mis-match is encountered like I just demonstrated. So you need to update values of NewChangeEnd and OldChangeEnd as 1 less than the matched value. In case of a mis-match, you store the values appropriately.

Instead of an else -if we could just wrap those two conditions in a situation where we know NewText.length > OldText.length is definitely not true i.e. it is either a replacement or a deletion. Again NewText.length > OldText.length also means it could be a replacement or an insertion as per your examples, which makes sense. So the else could be something like:

else {
for (var i = 1; i < OldText.length && ((OldText.length-i)>=ChangeStart); i++) { 

    ...
    NewChangeEnd = NewText.length - i -1;
    OldChangeEnd = OldText.length - i -1;
  if(//Mismatch condition reached){
         //break..That code is fine.
    }
 }

If you have understood the minor changes thus far, identifying the specific cases is really simple:

  1. Deletion - Condition -> ChangeStart > NewChangeEnd. Deleted string from ChangeStart -> OldChangeEnd.

Deleted text -> OldText.substring(ChangeStart, OldChangeEnd + 1);

  1. Insertion - Condition -> ChangeStart > OldChangeEnd. Inserted string at ChangeStart.

Inserted text -> NewText.substring(ChangeStart, NewChangeEnd + 1);

  1. Replacement - If NewText != OldText and the above two conditions are not met, then it is a replacement.

Text in old string that got replaced -> OldText.substring(ChangeStart, OldChangeEnd + 1);

The replacement text -> NewText.substring(ChangeStart, NewChangeEnd + 1);

Start and end positions in the OldText that got replaced -> ChangeStart -> OldChangeEnd

I have created a jsfiddle incorporating the changes that I have mentioned in your code. You might want to check it out. Hope it gets you started in the right direction.



回答2:

I had a similar problem and solved it with the following:

function diff(oldText, newText) {

  // Find the index at which the change began
  var s = 0;
  while(s < oldText.length && s < newText.length && oldText[s] == newText[s]) {
    s++;
  }

  // Find the index at which the change ended (relative to the end of the string)
  var e = 0;
  while(e < oldText.length &&
        e < newText.length &&
        oldText.length - e > s &&
        newText.length - e > s &&
        oldText[oldText.length - 1 - e] == newText[newText.length - 1 - e]) {
    e++;
  }

  // The change end of the new string (ne) and old string (oe)
  var ne = newText.length - e;
  var oe = oldText.length - e;

  // The number of chars removed and added
  var removed = oe - s;
  var added = ne - s;

  var type;
  switch(true) {
    case removed == 0 && added > 0:  // It's an 'add' if none were removed and at least 1 added
      type = 'add';
      break;
    case removed > 0 && added == 0:  // It's a 'remove' if none were added and at least one removed
      type = 'remove';
      break;
    case removed > 0 && added > 0:   // It's a replace if there were both added and removed characters
      type = 'replace';
      break;
    default:
      type = 'none';                 // Otherwise there was no change
      s = 0;
  }

  return { type: type, start: s, removed: removed, added: added };
}

Note, this didn't solve my actual problem though. My issue was that I had an editor with paragraphs, each modelled with text and a collection of markups defined with a start and end index e.g. bold from char 1 to char 5. I was using this to detect changes to the string so I could shift the markup indices accordingly. But consider the string:

xxxxxxxx

The diff function approach can't tell the difference between a character added outside the bold or within it.

In the end, I took a completely different approach - I just parsed the HTML produced by the editor and used that to determine start and end indices of markups.



回答3:

Made my own slightly more performant version inspired by the same tactics as above (looking for differences from front to back and back to front)

function compareText(oldText, newText)
{
    var difStart,difEndOld,difEndNew;

    //from left to right - look up the first index where characters are different
    for(let i=0;i<oldText.length;i++)
    {
        if(oldText.charAt(i) !== newText.charAt(i))
        {
            difStart = i;
            break;
        }
    }

    //from right to left - look up the first index where characters are different
    //first calc the last indices for both strings
    var oldMax = oldText.length - 1;
    var newMax = newText.length - 1;
    for(let i=0;i<oldText.length;i++)
    {
        if(oldText.charAt(oldMax-i) !== newText.charAt(newMax-i))
        {
            //with different string lengths, the index will differ for the old and the new text
            difEndOld = oldMax-i;
            difEndNew = newMax-i;
            break;
        }
    }

    var removed = oldText.substr(difStart,difEndOld-difStart+1);
    var added = newText.substr(difStart,difEndNew-difStart+1);

    return [difStart,added,removed];
}