byte[] array pattern search

2019-01-01 03:44发布

Anyone know a good and effective way to search/match for a byte pattern in an byte[] array and then return the positions.

For example

byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125}

26条回答
情到深处是孤独
2楼-- · 2019-01-01 03:57

You can put the byte array into String and run match by IndexOf. Or you can at least reuse existing algorithms on string matching.

    [STAThread]
    static void Main(string[] args)
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
        byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
        string needle, haystack;

        unsafe 
        {
            fixed(byte * p = pattern) {
                needle = new string((SByte *) p, 0, pattern.Length);
            } // fixed

            fixed (byte * p2 = toBeSearched) 
            {
                haystack = new string((SByte *) p2, 0, toBeSearched.Length);
            } // fixed

            int i = haystack.IndexOf(needle, 0);
            System.Console.Out.WriteLine(i);
        }
    }
查看更多
后来的你喜欢了谁
3楼-- · 2019-01-01 04:01

My version of Foubar's answer above, which avoids searching past the end of the haystack, and allows specifying a starting offset. Assumes needle is not empty or longer than the haystack.

public static unsafe long IndexOf(this byte[] haystack, byte[] needle, long startOffset = 0)
{ 
    fixed (byte* h = haystack) fixed (byte* n = needle)
    {
        for (byte* hNext = h + startOffset, hEnd = h + haystack.LongLength + 1 - needle.LongLength, nEnd = n + needle.LongLength; hNext < hEnd; hNext++)
            for (byte* hInc = hNext, nInc = n; *nInc == *hInc; hInc++)
                if (++nInc == nEnd)
                    return hNext - h;
        return -1;
    }
}
查看更多
深知你不懂我心
4楼-- · 2019-01-01 04:01

I've created a new function using the tips from my answer and the answer by Alnitak.

public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
    if ((superSet == null) || (subSet == null))
    {
       throw new ArgumentNullException();
    }
    if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
    {
        return new List<Int32>();
    }
    var result = new List<Int32>();
    Int32 currentIndex = 0;
    Int32 maxIndex =  superSet.Length - subSet.Length;
    while (currentIndex < maxIndex)
    {
         Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
         if (matchCount ==  subSet.Length)
         {
            result.Add(currentIndex);
         }
         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }
    }
    return result;
}

private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
    Int32 currentOffset = 0;
    while (currentOffset < subSet.Length)
    {
        if (superSet[startIndex + currentOffset] != subSet[currentOffset])
        {
            break;
        }
        currentOffset++;
    }
    return currentOffset;
}

the only part I'm not so happy about is the

         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }

part... I would have like to use a if else to avoid the -1, but this results in a better branch prediction (although I'm not that sure it will matter that much)..

查看更多
一个人的天荒地老
5楼-- · 2019-01-01 04:01

Here's my (not the most performant) solution. It relies on the fact that bytes/latin-1 conversion is lossless, which is not true for bytes/ASCII or bytes/UTF8 conversions.

It's advantages are that It Works (tm) for any byte values (some other solutions work incorrectly with bytes 0x80-0xff) and can be extended to perform more advanced regex matching.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class C {

  public static void Main() {
    byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
    byte[] pattern = {0, 255};
    foreach (int i in FindAll(data, pattern)) {
      Console.WriteLine(i);
    }
  }

  public static IEnumerable<int> FindAll(
    byte[] haystack,
    byte[] needle
  ) {
    // bytes <-> latin-1 conversion is lossless
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string sHaystack = latin1.GetString(haystack);
    string sNeedle = latin1.GetString(needle);
    for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
         m.Success; m = m.NextMatch()) {
      yield return m.Index;
    }
  }
}
查看更多
刘海飞了
6楼-- · 2019-01-01 04:01

I'm a little late to the party How about using Boyer Moore algorithm but search for bytes instead of strings. c# code below.

EyeCode Inc.

class Program {
    static void Main(string[] args) {
        byte[] text         =  new byte[] {12,3,5,76,8,0,6,125,23,36,43,76,125,56,34,234,12,4,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,123};
        byte[] pattern      = new byte[] {12,3,5,76,8,0,6,125};

        BoyerMoore tmpSearch = new BoyerMoore(pattern,text);

        Console.WriteLine(tmpSearch.Match());
        Console.ReadKey();
    }

    public class BoyerMoore {

        private static int ALPHABET_SIZE = 256;

        private byte[] text;
        private byte[] pattern;

        private int[] last;
        private int[] match;
        private int[] suffix;

        public BoyerMoore(byte[] pattern, byte[] text) {
            this.text = text;
            this.pattern = pattern;
            last = new int[ALPHABET_SIZE];
            match = new int[pattern.Length];
            suffix = new int[pattern.Length];
        }


        /**
        * Searches the pattern in the text.
        * returns the position of the first occurrence, if found and -1 otherwise.
        */
        public int Match() {
            // Preprocessing
            ComputeLast();
            ComputeMatch();

            // Searching
            int i = pattern.Length - 1;
            int j = pattern.Length - 1;    
            while (i < text.Length) {
                if (pattern[j] == text[i]) {
                    if (j == 0) { 
                        return i;
                    }
                    j--;
                    i--;
                } 
                else {
                  i += pattern.Length - j - 1 + Math.Max(j - last[text[i]], match[j]);
                  j = pattern.Length - 1;
              }
            }
            return -1;    
          }  


        /**
        * Computes the function last and stores its values in the array last.
        * last(Char ch) = the index of the right-most occurrence of the character ch
        *                                                           in the pattern; 
        *                 -1 if ch does not occur in the pattern.
        */
        private void ComputeLast() {
            for (int k = 0; k < last.Length; k++) { 
                last[k] = -1;
            }
            for (int j = pattern.Length-1; j >= 0; j--) {
                if (last[pattern[j]] < 0) {
                    last[pattern[j]] = j;
                }
            }
        }


        /**
        * Computes the function match and stores its values in the array match.
        * match(j) = min{ s | 0 < s <= j && p[j-s]!=p[j]
        *                            && p[j-s+1]..p[m-s-1] is suffix of p[j+1]..p[m-1] }, 
        *                                                         if such s exists, else
        *            min{ s | j+1 <= s <= m 
        *                            && p[0]..p[m-s-1] is suffix of p[j+1]..p[m-1] }, 
        *                                                         if such s exists,
        *            m, otherwise,
        * where p is the pattern and m is its length.
        */
        private void ComputeMatch() {
            /* Phase 1 */
            for (int j = 0; j < match.Length; j++) { 
                match[j] = match.Length;
            } //O(m) 

            ComputeSuffix(); //O(m)

            /* Phase 2 */
            //Uses an auxiliary array, backwards version of the KMP failure function.
            //suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
            //if there is no such j, suffix[i] = m

            //Compute the smallest shift s, such that 0 < s <= j and
            //p[j-s]!=p[j] and p[j-s+1..m-s-1] is suffix of p[j+1..m-1] or j == m-1}, 
            //                                                         if such s exists,
            for (int i = 0; i < match.Length - 1; i++) {
                int j = suffix[i + 1] - 1; // suffix[i+1] <= suffix[i] + 1
                if (suffix[i] > j) { // therefore pattern[i] != pattern[j]
                    match[j] = j - i;
                } 
                else {// j == suffix[i]
                    match[j] = Math.Min(j - i + match[i], match[j]);
                }
            }

            /* Phase 3 */
            //Uses the suffix array to compute each shift s such that
            //p[0..m-s-1] is a suffix of p[j+1..m-1] with j < s < m
            //and stores the minimum of this shift and the previously computed one.
            if (suffix[0] < pattern.Length) {
                for (int j = suffix[0] - 1; j >= 0; j--) {
                    if (suffix[0] < match[j]) { match[j] = suffix[0]; }
                }
                {
                    int j = suffix[0];
                    for (int k = suffix[j]; k < pattern.Length; k = suffix[k]) {
                        while (j < k) {
                            if (match[j] > k) {
                                match[j] = k;
                            }
                            j++;
                        }
                    }
                }
            }
        }


        /**
        * Computes the values of suffix, which is an auxiliary array, 
        * backwards version of the KMP failure function.
        * 
        * suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
        * if there is no such j, suffix[i] = m, i.e. 

        * p[suffix[i]..m-1] is the longest prefix of p[i..m-1], if suffix[i] < m.
        */
        private void ComputeSuffix() {        
            suffix[suffix.Length-1] = suffix.Length;            
            int j = suffix.Length - 1;
            for (int i = suffix.Length - 2; i >= 0; i--) {  
                while (j < suffix.Length - 1 && !pattern[j].Equals(pattern[i])) {
                    j = suffix[j + 1] - 1;
                }
                if (pattern[j] == pattern[i]) { 
                    j--; 
                }
                suffix[i] = j + 1;
            }
        }

    }

}
查看更多
像晚风撩人
7楼-- · 2019-01-01 04:03

Speed isn't everything. Did you check them for consistency?

I didn't test all the code listed here. I tested my own code (which wasn't totally consistent, I admit) and IndexOfSequence. I found that for many tests IndexOfSequence was quite a bit faster than my code but with repeated testing I found that it was less consistent. In particular it seems to have the most trouble with finding patterns at the end of the array but it will miss them in the middle of the array too sometimes.

My test code isn't designed for efficiency, I just wanted to have a bunch of random data with some known strings inside. That test pattern is roughly like a boundary marker in an http form upload stream. That's what I was looking for when I ran across this code so I figured I'd test it with the kind of data I'll be searching for. It appears that the longer the pattern is the more often IndexOfSequence will miss a value.

private static void TestMethod()
{
    Random rnd = new Random(DateTime.Now.Millisecond);
    string Pattern = "-------------------------------65498495198498";
    byte[] pattern = Encoding.ASCII.GetBytes(Pattern);

    byte[] testBytes;
    int count = 3;
    for (int i = 0; i < 100; i++)
    {
        StringBuilder TestString = new StringBuilder(2500);
        TestString.Append(Pattern);
        byte[] buf = new byte[1000];
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        testBytes = Encoding.ASCII.GetBytes(TestString.ToString());

        List<int> idx = IndexOfSequence(ref testBytes, pattern, 0);
        if (idx.Count != count)
        {
            Console.Write("change from {0} to {1} on iteration {2}: ", count, idx.Count, i);
            foreach (int ix in idx)
            {
                Console.Write("{0}, ", ix);
            }
            Console.WriteLine();
            count = idx.Count;
        }
    }

    Console.WriteLine("Press ENTER to exit");
    Console.ReadLine();
}

(obviously I converted IndexOfSequence from an extension back into a normal method for this testing)

Here's a sample run of my output:

change from 3 to 2 on iteration 1: 0, 2090,
change from 2 to 3 on iteration 2: 0, 1045, 2090,
change from 3 to 2 on iteration 3: 0, 1045,
change from 2 to 3 on iteration 4: 0, 1045, 2090,
change from 3 to 2 on iteration 6: 0, 2090,
change from 2 to 3 on iteration 7: 0, 1045, 2090,
change from 3 to 2 on iteration 11: 0, 2090,
change from 2 to 3 on iteration 12: 0, 1045, 2090,
change from 3 to 2 on iteration 14: 0, 2090,
change from 2 to 3 on iteration 16: 0, 1045, 2090,
change from 3 to 2 on iteration 17: 0, 1045,
change from 2 to 3 on iteration 18: 0, 1045, 2090,
change from 3 to 1 on iteration 20: 0,
change from 1 to 3 on iteration 21: 0, 1045, 2090,
change from 3 to 2 on iteration 22: 0, 2090,
change from 2 to 3 on iteration 23: 0, 1045, 2090,
change from 3 to 2 on iteration 24: 0, 2090,
change from 2 to 3 on iteration 25: 0, 1045, 2090,
change from 3 to 2 on iteration 26: 0, 2090,
change from 2 to 3 on iteration 27: 0, 1045, 2090,
change from 3 to 2 on iteration 43: 0, 1045,
change from 2 to 3 on iteration 44: 0, 1045, 2090,
change from 3 to 2 on iteration 48: 0, 1045,
change from 2 to 3 on iteration 49: 0, 1045, 2090,
change from 3 to 2 on iteration 50: 0, 2090,
change from 2 to 3 on iteration 52: 0, 1045, 2090,
change from 3 to 2 on iteration 54: 0, 1045,
change from 2 to 3 on iteration 57: 0, 1045, 2090,
change from 3 to 2 on iteration 62: 0, 1045,
change from 2 to 3 on iteration 63: 0, 1045, 2090,
change from 3 to 2 on iteration 72: 0, 2090,
change from 2 to 3 on iteration 73: 0, 1045, 2090,
change from 3 to 2 on iteration 75: 0, 2090,
change from 2 to 3 on iteration 76: 0, 1045, 2090,
change from 3 to 2 on iteration 78: 0, 1045,
change from 2 to 3 on iteration 79: 0, 1045, 2090,
change from 3 to 2 on iteration 81: 0, 2090,
change from 2 to 3 on iteration 82: 0, 1045, 2090,
change from 3 to 2 on iteration 85: 0, 2090,
change from 2 to 3 on iteration 86: 0, 1045, 2090,
change from 3 to 2 on iteration 89: 0, 2090,
change from 2 to 3 on iteration 90: 0, 1045, 2090,
change from 3 to 2 on iteration 91: 0, 2090,
change from 2 to 1 on iteration 92: 0,
change from 1 to 3 on iteration 93: 0, 1045, 2090,
change from 3 to 1 on iteration 99: 0,

I don't mean to pick on IndexOfSequence, it just happened to be the one I started working with today. I noticed at the end of the day that it seemed to be missing patterns in the data so I wrote my own pattern matcher tonight. It's not as fast though. I'm going to tweak it a bit more to see if I can get it 100% consistent before I post it.

I just wanted to remind everyone that they should test things like this to make sure they give good, repeatable results before you trust them in production code.

查看更多
登录 后发表回答