Introduction - This question was updated on 27 May 2018:
I have one multidimensional PHP array containing 6 sub-arrays, each containing 20 sub-sub-arrays. Each sub-sub-array in turn contains 2 elements: a string (header) and an array holding an unspecified number of keywords (keywords).
I want to compare each of the 120 sub-sub-arrays with the 100 sub-sub-arrays contained in the remaining 5 sub-arrays. For example, sub-sub-array 1 in sub-array 1 is compared with sub-sub-arrays 1 through 20 in each of sub-arrays 2 through 6, and so forth.
If enough keywords in two sub-sub-arrays are deemed identical, and the headers are as well (both judged by Levenshtein distance), the sub-sub-arrays will be merged.
Example script
I have written a script doing exactly this, but for two separate arrays to demonstrate my goal:
<?php
// Maximum Levenshtein distance between two keywords. Lower or raise it to change the threshold for whether two keywords are deemed identical.
$lev_point_value = 3;
// Minimum number of identical keywords (those within $lev_point_value) needed to merge arrays. Lower or raise it to change how many keywords two arrays must have in common to be merged.
$merge_tag_value = 4;
// Maximum Levenshtein distance between two headers for the arrays to be merged. Lower or raise it to change the threshold for whether two headers are deemed identical.
$merge_head_value = 22;

// Array1 - A story about a monkey, includes a header and keywords.
$array1 = array(
    "header" => "This is a story about a monkey.",
    'keywords' => array("Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing")
);

// Array2 - Another, but slightly different story about a monkey, includes a header and keywords.
$array2 = array(
    "header" => "This is another, but different story, about a monkey.",
    'keywords' => array("Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts")
);

// Compares the keywords of two arrays. Every keyword pair whose Levenshtein distance is at most $lev_point_value increases $merged_tag, which is then returned.
function sim_tag_index($array1, $array2, $lev_point_value) {
    $merged_tag = 0;
    foreach ($array1['keywords'] as $item1) {
        foreach ($array2['keywords'] as $item2) {
            if (levenshtein($item1, $item2) <= $lev_point_value) {
                $merged_tag++;
            }
        }
    }
    return $merged_tag;
}

// Compares the headers of two arrays and returns their Levenshtein distance.
function sim_header_index($array1, $array2) {
    return levenshtein($array1['header'], $array2['header']);
}

// Checks sim_tag_index against $merge_tag_value; if that passes, checks sim_header_index against $merge_head_value; if that passes as well, merges the keywords.
function merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value) {
    $group = array();
    if (sim_tag_index($array1, $array2, $lev_point_value) >= $merge_tag_value) {
        // Headers count as similar when their distance is at most $merge_head_value.
        if (sim_header_index($array1, $array2) <= $merge_head_value) {
            $group = array_unique(array_merge($array1['keywords'], $array2['keywords']));
        }
    }
    return $group;
}

// Print the result of merge_on_sim.
print_r(merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value));
?>
Question:
How can I expand or rewrite my script to go through multiple sub-sub-arrays, comparing them to all other sub-sub-arrays, found in other sub-arrays, and then merge sub-sub-arrays that are deemed identical enough?
Multidimensional Array Structure
$array = array (
    // Sub-array 1
    array (
        // Story 'Monkey 1' - Has identical sub-sub-arrays 'Monkey 2' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is a story about a monkey.",
            'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing")
        ),
        // Story 'Cat 1' - Has identical sub-sub-array 'Cat 2' and will be merged with it.
        array (
            "header" => "Here's a catarific story about a cat",
            'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon")
        )
    ),
    // Sub-array 2
    array (
        // Story 'Monkey 2' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is another, but different story, about a monkey.",
            'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts")
        ),
        // Story 'Cat 2' - Has identical sub-sub-array 'Cat 1' and will be merged with it.
        array (
            "header" => "Here's a different story about a cat",
            'keywords' => array( "meauwe", "ball", "cat", "kitten", "claws", "sleep", "fish", "purr")
        )
    ),
    // Sub-array 3
    array (
        // Story 'Monkey 3' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 2' and will be merged with them.
        array (
            "header" => "This is a third story about a monkey.",
            'keywords' => array( "Jungle", "tree", "monkey", "Bonobo", "Fun", "Dance", "climbing", "Coconut", "pretty")
        ),
        // Story 'Fireman 1' - Has no identical sub-sub-arrays and will not be merged.
        array (
            "header" => "This is a story about a fireman",
            'keywords' => array( "fire", "explosion", "burning", "rescue", "happy", "help", "water", "car")
        )
    )
);
Wanted Multidimensional Array
$array = array (
    // Story 'Monkey 1', 'Monkey 2' and 'Monkey 3' merged.
    array (
        "header" => array( "This is a story about a monkey.", "This is another, but different story, about a monkey.", "This is a third story about a monkey."),
        'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing", "Fun", "Dance", "Cow", "Coconuts", "Jungle", "tree", "pretty")
    ),
    // Story 'Cat 1' and 'Cat 2' merged.
    array (
        "header" => array( "Here's a catarific story about a cat", "Here's a different story about a cat"),
        'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon", "ball", "cat", "kitten", "sleep", "fish", "purr")
    )
);
As I understand your situation, you have just one multidimensional array. The first level has 6 arrays, and every first-level array contains 20 (second-level) arrays. Each second-level array contains 1 (third-level) array with 2 elements: "string" and "keywords". You wish to compare each "string" and "keywords" against all other "string"s and "keywords"s that exist in the multidimensional array.
Let's do some thoughtful looping with array_diff_key() & range().
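One possible shape for that looping, as a minimal sketch rather than a finished solution: for each first-level group, array_diff_key() combined with range() drops the groups already handled, so every pair of stories from different groups is visited exactly once. The helper name each_cross_group_pair(), its callback signature, and the small $demo array are hypothetical and only serve the illustration; the actual similarity test would go inside the callback.

```php
<?php
// Hypothetical helper (an assumption for illustration): visits every pair of
// stories that live in different first-level groups, each pair exactly once.
function each_cross_group_pair(array $groups, callable $visit): void
{
    $groups = array_values($groups); // make sure the keys are 0..n-1
    $last = count($groups) - 1;
    foreach ($groups as $i => $group) {
        if ($i === $last) {
            break; // no later groups left to compare against
        }
        // Remove groups 0..$i so that only the later groups remain.
        $others = array_diff_key($groups, array_flip(range(0, $i)));
        foreach ($group as $a => $story) {
            foreach ($others as $j => $otherGroup) {
                foreach ($otherGroup as $b => $otherStory) {
                    // Positions are passed as [group index, story index].
                    $visit([$i, $a], [$j, $b], $story, $otherStory);
                }
            }
        }
    }
}

// Tiny stand-in data: 3 groups of 2 stories each.
$demo = [
    [['header' => 'A1'], ['header' => 'A2']],
    [['header' => 'B1'], ['header' => 'B2']],
    [['header' => 'C1'], ['header' => 'C2']],
];

$pairs = [];
each_cross_group_pair($demo, function ($posA, $posB, $x, $y) use (&$pairs) {
    $pairs[] = $x['header'] . '-' . $y['header'];
});

echo count($pairs), "\n"; // 12
```

With 6 groups of 20 stories the same loop would visit 6000 pairs, and stories within the same group are never compared with each other.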
Now to discuss the comparison logic. From the current information in your question, I don't understand how you wish to decide whether one array is sufficiently similar to another. I wish you had put more effort into creating a realistic Wanted Multidimensional Array: the second array has a duplicate key of 0 and two empty elements in it, so it is rather unhelpful.
Do you wish to compare "keywords" or "string" values, or both? How strict is the comparison (what is the 'similarity threshold')?
PHP is a great performer for this type of work. I have removed the javascript tag from your question.
Please do a bit of work/research (leveraging my looping suggestion), refine exactly how you wish to compare the deepest array values, and edit your question. To call my attention to this question again, write a comment starting with @mickmackusa.
Good luck!
I'll give it a go!
I use preg_grep to find items that also appear in the other sub-arrays. Then I use count to see how many matching keywords there are.
That count is where the threshold comes in. Currently I set it to 2, which means two matching keywords count as a match.
https://3v4l.org/6cKRq
The output matches the "wanted" array in your question.
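The linked code is not reproduced here, so the following is only a minimal sketch of the preg_grep-plus-count idea under stated assumptions, not the script behind the link. The helper name count_shared_keywords() is hypothetical, and matching whole keywords case-insensitively is an assumption made for the illustration.

```php
<?php
// Hypothetical helper (not the code behind the 3v4l link): counts how many
// keywords in $b also occur in $a, compared case-insensitively.
function count_shared_keywords(array $a, array $b): int
{
    if ($a === []) {
        return 0;
    }
    // Build one anchored alternation such as /^(?:Trees|Monkey|Big)$/i.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $a);
    $pattern = '/^(?:' . implode('|', $quoted) . ')$/i';

    // preg_grep() returns the elements of $b that match the pattern.
    return count(preg_grep($pattern, $b));
}

$threshold = 2; // assumed: two shared keywords count as a match

$monkey1 = ['Trees', 'Monkey', 'Flying', 'Drink', 'Vacation', 'Coconut', 'Big', 'Bonobo', 'Climbing'];
$monkey2 = ['Monkey', 'Big', 'Trees', 'Bonobo', 'Fun', 'Dance', 'Cow', 'Coconuts'];
$fireman = ['fire', 'explosion', 'burning', 'rescue', 'happy', 'help', 'water', 'car'];

$sharedMonkey  = count_shared_keywords($monkey1, $monkey2);
$sharedFireman = count_shared_keywords($monkey1, $fireman);

var_dump($sharedMonkey >= $threshold);  // bool(true)
var_dump($sharedFireman >= $threshold); // bool(false)
```

Because the pattern is anchored, "Coconuts" does not match "Coconut" in this sketch; a fuzzier test such as levenshtein() would be needed to treat those two as the same keyword.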