strpos() with multiple needles?

2020-02-10 02:20发布

问题:

I am looking for a function like strpos() with two significant differences:

  1. To be able to accept multiple needles. I mean thousands of needles at ones.
  2. To search for all occurrences of the needles in the haystack and to return an array of starting positions.

Of course it has to be an efficient solution not just a loop through every needle. I have searched through this forum and there were similar questions to this one, like:

  • Using an array as needles in strpos
  • Define multiple needles using stripos
  • Can't search an array in PHP in_array for the presence of multiple needles

but nether of them was what I am looking for. I am using strpos just to illustrate my question better, probably something entirely different has to be used for this purpose.

I am aware of Zend_Search_Lucene and I am interested if it can be used to achieve this and how (just the general idea)?

Thanks a lot for Your help and time!

回答1:

Here's some sample code for my strategy:

function strpos_array($haystack, $needles, $offset=0) {
    $matches = array();

    //Avoid the obvious: when haystack or needles are empty, return no matches
    if(empty($needles) || empty($haystack)) {
        return $matches;
    }

    $haystack = (string)$haystack; //Pre-cast non-string haystacks
    $haylen = strlen($haystack);

    //Allow negative (from end of haystack) offsets
    if($offset < 0) {
        $offset += $heylen;
    }

    //Use strpos if there is no array or only one needle
    if(!is_array($needles)) {
        $needles = array($needles);
    }

    $needles = array_unique($needles); //Not necessary if you are sure all needles are unique

    //Precalculate needle lengths to save time
    foreach($needles as &$origNeedle) {
        $origNeedle = array((string)$origNeedle, strlen($origNeedle));
    }

    //Find matches
    for(; $offset < $haylen; $offset++) {
        foreach($needles as $needle) {
            list($needle, $length) = $needle;
            if($needle == substr($haystack, $offset, $length)) {
                $matches[] = $offset;
                break;
            }
        }
    }

    return($matches);
}

I've implemented a simple brute force method above that will work with any combination of needles and haystacks (not just words). For possibly faster algorithms check out:

  • Aho–Corasick string matching algorithm

Other Solution

function strpos_array($haystack, $needles, $theOffset=0) {
    $matches = array();

    if(empty($haystack) || empty($needles)) {
        return $matches;
    }

    $haylen = strlen($haystack);

    if($theOffset < 0) {  // Support negative offsets
        $theOffest += $haylen;
    }

    foreach($needles as $needle) {
        $needlelen = strlen($needle);
        $offset = $theOffset;

        while(($match = strpos($haystack, $needle, $offset)) !== false) {
            $matches[] = $match;
            $offset = $match + $needlelen;
            if($offset >= $haylen) {
                break;
            }
        }
    }

    return $matches;
}


回答2:

try preg match for multiple

if (preg_match('/word|word2/i', $str))

Checking for multiple strpos values



回答3:

I know this doesn't answer the OP's question but wanted to comment since this page is at the top of Google for strpos with multiple needles. Here's a simple solution to do so (again, this isn't specific to the OP's question - sorry):

    $img_formats = array('.jpg','.png');
    $missing = array();
    foreach ( $img_formats as $format )
        if ( stripos($post['timer_background_image'], $format) === false ) $missing[] = $format;
    if (count($missing) == 2)
        return array("save_data"=>$post,"error"=>array("message"=>"The background image must be in a .jpg or .png format.","field"=>"timer_background_image"));

If 2 items are added to the $missing array that means that the input doesn't satisfy any of the image formats in the $img_formats array. At that point you know that you can return an error, etc. This could easily be turned into a little function:

    function m_stripos( $haystack = null, $needles = array() ){
        //return early if missing arguments 
        if ( !$needles || !$haystack ) return false; 
        // create an array to evaluate at the end
        $missing = array(); 
        //Loop through needles array, and add to $missing array if not satisfied
        foreach ( $needles as $needle )
            if ( stripos($haystack, $needle) === false ) $missing[] = $needle;
        //If the count of $missing and $needles is equal, we know there were no matches, return false..
        if (count($missing) == count($needles)) return false; 
        //If we're here, be happy, return true...
        return true;
    }

Back to our first example using then the function instead:

    $needles = array('.jpg','.png');
    if ( !m_strpos( $post['timer_background_image'], $needles ) )
        return array("save_data"=>$post,"error"=>array("message"=>"The background image must be in a .jpg or .png format.","field"=>"timer_background_image"));

Of course, what you do after the function returns true or false is up to you.



回答4:

It seems you are searching for whole words. In this case, something like this might help. As it uses built-in functions, it should be faster than custom code, but you have to profile it:

$words = str_word_count($str, 2);

$word_position_map = array();

foreach($words as $position => $word) {
    if(!isset($word_position_map[$word])) {
        $word_position_map[$word] = array();
    }
    $word_position_map[$word][] = $position;
}

// assuming $needles is an array of words
$result = array_intersect_key($word_position_map, array_flip($needles));

Storing the information (like the needles) in the right format will improve the runtime ( e.g. as you don't have to call array_flip).

Note from the str_word_count documentation:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

So make sure you set the locale right.



回答5:

You could use a regular expression, they support OR operations. This would however make it fairly slow, compared to strpos.