PHP - Compare multidimensional sub-arrays to each

Posted 2020-07-26 06:56

Question:

Introduction - This question was updated on 27 May 2018:

I have one multidimensional PHP array containing 6 sub-arrays. Each sub-array contains 20 sub-sub-arrays, and each sub-sub-array has 2 elements: a string (header) and an array with an unspecified number of keywords (keywords).

I want to compare each of the 120 sub-sub-arrays to the 100 sub-sub-arrays contained in the remaining 5 sub-arrays. For example, sub-sub-array 1 in sub-array 1 is compared to sub-sub-arrays 1 through 20 in each of sub-arrays 2 through 6, and so forth.

If enough keywords in two sub-sub-arrays are deemed identical, and the headers are as well, both judged by Levenshtein distance, the sub-sub-arrays will be merged.
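
To make the keyword criterion concrete, here is a minimal sketch of a single keyword comparison. The helper name keywords_match is hypothetical; the limit of 3 mirrors $lev_point_value from the script below, and the sample words are taken from the example data. Note that levenshtein() is case-sensitive, and that with a limit of 3 any two three-letter words count as a match, which may or may not be what is wanted.

```php
<?php
// Hypothetical helper: two keywords "match" when their Levenshtein
// distance is at most $limit (the role $lev_point_value plays below).
function keywords_match(string $a, string $b, int $limit): bool
{
    return levenshtein($a, $b) <= $limit;
}

$limit = 3;
$pairs = [
    ["Coconut", "Coconuts"], // one insertion
    ["Trees", "tree"],       // one substitution (T -> t) and one deletion
    ["Big", "Cow"],          // three substitutions: unrelated words, still within the limit
];

foreach ($pairs as [$a, $b]) {
    $verdict = keywords_match($a, $b, $limit) ? "match" : "no match";
    echo $a, " / ", $b, ": distance ", levenshtein($a, $b), " (", $verdict, ")\n";
}
```

Lowering the limit, or scaling it with word length, would reduce accidental matches between short words.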


Example script

I have written a script doing exactly this, but for two separate arrays to demonstrate my goal:

<?php
// Maximum Levenshtein distance between two keywords. Lower or raise it to change the threshold for whether two keywords are deemed identical.
$lev_point_value = 3;

// Minimum number of identical keywords (those that passed $lev_point_value) needed to merge arrays. Lower or raise it to change the threshold for how many keywords two arrays must have in common to be merged.
$merge_tag_value = 4;

// Minimum Levenshtein distance between two headers needed to merge arrays. Lower or raise it to change the threshold for whether two headers are deemed identical.
$merge_head_value = 22;

// Array1 - A story about a monkey, includes a header and keywords.
$array1 = array (
    "header" => "This is a story about a monkey.",
    'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing" )
);

// Array2 - Another, but slightly different story about a monkey, includes a header and keywords.
$array2 = array (
    "header" => "This is another, but different story, about a monkey.",
    'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts" )
);

// Function comparing keywords between two arrays. Counts keyword pairs whose Levenshtein distance is at most $lev_point_value; each such pair increments $merged_tag, which is then returned.
function sim_tag_index($array1, $array2, $lev_point_value) {
    $merged_tag = 0;
    foreach ($array1['keywords'] as $item1) {
        foreach ($array2['keywords'] as $item2) {
            if (levenshtein($item1, $item2) <= $lev_point_value) {
                $merged_tag++;
            }
        }
    }
    return $merged_tag;
}

// Function comparing headers between two arrays using levenshtein distance, which is then returned as $merged_head.
function sim_header_index($array1, $array2) {
    $merged_head = (levenshtein($array1['header'], $array2['header']));
    return $merged_head;
}

// Function running sim_tag_index against $merge_tag_value; if it passes, it runs sim_header_index against $merge_head_value, and if this passes as well, it merges the arrays.
function merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value) {
    $group = array();
    if (sim_tag_index($array1, $array2, $lev_point_value) >= $merge_tag_value) {
        if (sim_header_index($array1, $array2) >= $merge_head_value) {
            $group = (array_unique(array_merge($array1["keywords"],$array2["keywords"])));
        }
    }
    return $group;
}

// Printing function merge_on_sim.
print_r (merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value));
?>

Question:

How can I expand or rewrite my script to go through multiple sub-sub-arrays, comparing them to all other sub-sub-arrays, found in other sub-arrays, and then merge sub-sub-arrays that are deemed identical enough?


Multidimensional Array Structure

$array = array (
    // Sub-array 1
    array (
        // Story 'Monkey 1' - Has identical sub-sub-arrays 'Monkey 2' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is a story about a monkey.",
            'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing")
        ),
        // Story 'Cat 1' - Has identical sub-sub-array 'Cat 2' and will be merged with it.
        array (
            "header" => "Here's a catarific story about a cat",
            'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon")
        )
    ),
    // Sub-array 2
    array ( 
        // Story 'Monkey 2' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is another, but different story, about a monkey.",
            'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts")
        ),
        // Story 'Cat 2' - Has identical sub-sub-array 'Cat 1' and will be merged with it.
        array (
            "header" => "Here's a different story about a cat",
            'keywords' => array( "meauwe", "ball", "cat", "kitten", "claws", "sleep", "fish", "purr")
        )
    ),
    // Sub-array 3
    array ( 
        // Story 'Monkey 3' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 2' and will be merged with them.
        array (
            "header" => "This is a third story about a monkey.",
            'keywords' => array( "Jungle", "tree", "monkey", "Bonobo", "Fun", "Dance", "climbing", "Coconut", "pretty")
        ),
        // Story 'Fireman 1' - Has no identical sub-sub-arrays and will not be merged.
        array (
            "header" => "This is a story about a fireman",
            'keywords' => array( "fire", "explosion", "burning", "rescue", "happy", "help", "water", "car")
        )
    )
);

Wanted Multidimensional Array

$array = array (
    // Story 'Monkey 1', 'Monkey 2' and 'Monkey 3' merged.
    array (
        "header" => array( "This is a story about a monkey.", "This is another, but different story, about a monkey.", "This is a third story about a monkey."),
        'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing", "Fun", "Dance", "Cow", "Coconuts", "Jungle", "tree", "pretty")
    ),
    // Story 'Cat 1' and 'Cat 2' merged.
    array (
        "header" => array( "Here's a catarific story about a cat", "Here's a different story about a cat"),
        'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon", "ball", "cat", "kitten", "sleep", "purr")
    )
);

Answer 1:

I'll give it a go!

I use preg_grep to find keywords that also appear in the other sub-arrays, then count how many matching keywords there are.
That count is compared against the threshold. It is currently set to 2, and because the comparison is "greater than", more than two matching keywords are needed for a match.

//flatten array to make it simpler
$new =[];
foreach($array as $subarr){
    $new = array_merge($new, $subarr);
}

$threshold = 2;
$merged=[];
// initialise the result so var_dump() also works when nothing is merged
$res=[];
foreach($new as $key => $story){
    // create regex pattern to find similar items
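    // preg_quote escapes regex metacharacters and the "/" delimiter inside keywords
    $words = "/" . implode("|", array_map(function($word){
        return preg_quote($word, "/");
    }, $story["keywords"])) . "/i";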
    foreach($new as $key2 => $story2){
        // only loop new items and items that has not been merged already
        if($key2 > $key && !in_array($key2, $merged)){
            // If the count of words from preg_grep is above threshold it's mergable
            if(count(preg_grep($words, $story2["keywords"])) > $threshold){
                // debug
                //echo $key . " " . $key2 . "\n";
                //echo $story["header"] . " = " . $story2["header"] ."\n\n";

                // if the item does not exist create it first to remove notices
                if(!isset($res[$key])) $res[$key] = ["header" => [], "keywords" =>[]];

                // add headers
                $res[$key]["header"][] = $story["header"];
                $res[$key]["header"][] = $story2["header"];

                // only keep unique 
                $res[$key]["header"] = array_unique($res[$key]["header"]);

                // add keywords and remove duplicates
                $res[$key]["keywords"] = array_merge($res[$key]["keywords"], $story["keywords"], $story2["keywords"]);
                $res[$key]["keywords"] = array_unique($res[$key]["keywords"]);

                // add key2 to merged so that we don't merge this again.
                $merged[] = $key2;
            }
        }
    }
}

var_dump($res);

https://3v4l.org/6cKRq

The output matches the "wanted" array in your question.
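
As a side note that is not part of the original answer, the matching step can be isolated into a small function to see what preg_grep actually counts. The helper name count_keyword_hits is hypothetical, and the keyword lists are the two monkey stories from the question. The pattern matches substrings case-insensitively, so the result depends on which story supplies the pattern.

```php
<?php
// Hypothetical helper: count the keywords in $candidate that contain
// (case-insensitively) at least one keyword from $reference.
function count_keyword_hits(array $reference, array $candidate): int
{
    if ($reference === []) {
        return 0; // an empty pattern "//i" would match every keyword
    }
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $escaped = array_map(function ($word) {
        return preg_quote($word, "/");
    }, $reference);
    $pattern = "/" . implode("|", $escaped) . "/i";

    return count(preg_grep($pattern, $candidate));
}

$monkey1 = ["Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing"];
$monkey2 = ["Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts"];

// "Coconuts" contains "Coconut", so it is a hit in this direction ...
echo count_keyword_hits($monkey1, $monkey2), "\n"; // 5
// ... but "Coconut" does not contain "Coconuts", so the reverse count is lower.
echo count_keyword_hits($monkey2, $monkey1), "\n"; // 4
```

Because the count is asymmetric, a pair of stories may pass the threshold in one direction only; comparing in both directions, or switching to a distance-based test, would avoid that.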



Answer 2:

As I understand your situation, you have just one multidimensional array. The first level has 6 arrays, every first level array contains 20 (second level) arrays. Each second level array contains 1 (third level) array with 2 elements: string and keywords.

You wish to compare each string and its keywords against all other strings and keywords that exist in the multidimensional array.

Let's do some thoughtful looping with array_diff_key() & range()

foreach($L1arrays as $index=>$L1focus){  // 6 arrays with indexes 0 - 5
    // now we want to check against all siblings that haven't been checked yet
    foreach(array_diff_key($L1arrays,range(0,$index)) as $L1comparison){  // no dupe loops
       // When $index=0, second foreach will only loop through $L1arrays indexed 1-5
       // When $index=1, second foreach will only loop through $L1arrays indexed 2-5
       // When $index=2, second foreach will only loop through $L1arrays indexed 3-5
       // When $index=3, second foreach will only loop through $L1arrays indexed 4-5
       // When $index=4, second foreach will only loop through $L1arrays[5]
       // When $index=5, it will not loop because it has already been compared
       //                to all other indexes via previous loops.

       /*
          ... in this section you can loop through the 20 sub-arrays of $L1focus
              and compare them against the 20 sub-arrays of each $L1comparison
       */
    }
}
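
One possible way to fill in the placeholder section is sketched below. It is not from the original answer, and the function names (similar_pairs, merge_similar) and the tiny data set are made up for illustration. It flattens the stories first, so the $j > $i loop plays the same "no duplicate comparisons" role as array_diff_key() above. It also makes some simplifying assumptions: only keywords are compared (the header test from the question is left out), keywords are lower-cased before calling levenshtein(), and stories are merged transitively.

```php
<?php
// Count keyword pairs whose Levenshtein distance is at most $maxDistance.
function similar_pairs(array $a, array $b, int $maxDistance): int
{
    $count = 0;
    foreach ($a as $wordA) {
        foreach ($b as $wordB) {
            if (levenshtein(strtolower($wordA), strtolower($wordB)) <= $maxDistance) {
                $count++;
            }
        }
    }
    return $count;
}

// Compare each story with the stories of every other sub-array and
// merge those sharing at least $minPairs similar keyword pairs.
function merge_similar(array $groups, int $maxDistance, int $minPairs): array
{
    // Flatten, remembering which sub-array each story came from.
    $stories = [];
    foreach ($groups as $groupIndex => $group) {
        foreach ($group as $story) {
            $stories[] = ['group' => $groupIndex, 'story' => $story];
        }
    }

    // $root[$i] is the label of the cluster that story $i currently belongs to.
    $root = array_keys($stories);
    $total = count($stories);
    for ($i = 0; $i < $total; $i++) {
        for ($j = $i + 1; $j < $total; $j++) {
            // Skip stories from the same sub-array and stories already clustered together.
            if ($stories[$i]['group'] === $stories[$j]['group'] || $root[$i] === $root[$j]) {
                continue;
            }
            $pairs = similar_pairs($stories[$i]['story']['keywords'], $stories[$j]['story']['keywords'], $maxDistance);
            if ($pairs >= $minPairs) {
                // Relabel the whole cluster of $j so that merges are transitive.
                $old = $root[$j];
                $new = $root[$i];
                foreach ($root as $k => $label) {
                    if ($label === $old) {
                        $root[$k] = $new;
                    }
                }
            }
        }
    }

    // Collect headers and keywords per cluster.
    $clusters = [];
    foreach ($stories as $i => $entry) {
        $clusters[$root[$i]]['header'][] = $entry['story']['header'];
        $clusters[$root[$i]]['keywords'] = array_merge($clusters[$root[$i]]['keywords'] ?? [], $entry['story']['keywords']);
    }

    // Keep only clusters that actually merged more than one story.
    $result = [];
    foreach ($clusters as $cluster) {
        if (count($cluster['header']) > 1) {
            $result[] = ['header' => $cluster['header'], 'keywords' => array_values(array_unique($cluster['keywords']))];
        }
    }
    return $result;
}

// Made-up sample data: three sub-arrays, four stories.
$L1arrays = [
    [
        ['header' => 'Monkey one', 'keywords' => ['monkey', 'tree', 'banana']],
        ['header' => 'Fire', 'keywords' => ['fire', 'water', 'rescue']],
    ],
    [['header' => 'Monkey two', 'keywords' => ['Monkey', 'trees', 'jungle']]],
    [['header' => 'Monkey three', 'keywords' => ['jungle', 'Trees', 'vine']]],
];

$mergedStories = merge_similar($L1arrays, 1, 2);
print_r($mergedStories);
```

With these sample values, "Monkey one" and "Monkey three" share only one similar keyword pair, yet both end up in the same cluster because each is similar enough to "Monkey two". Whether that transitive behaviour is desired is a design decision; dropping the relabelling loop would restrict merging to direct matches.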

Now to discuss the comparison logic. I don't understand from the current information in your question how you wish to decide whether one array is sufficiently similar to another. I wish you had put more effort into creating a realistic Wanted Multidimensional Array. The second array has a duplicate key of 0 and has two empty elements in it, so it is rather unhelpful.

Do you wish to compare keywords or string values, or both? How strict is the comparison (what is the 'similarity threshold')?

PHP is a great performer for this type of work. I have removed the javascript tag from your question.

Please do a bit of work/research (leveraging my looping suggestion) and refine exactly how you wish to compare the deepest array values and edit your question. To call my attention to this question again, write a comment starting with @mickmackusa

Good luck!