I need to compare file system wildcard expressions to see whether their results would overlap, by only examining/comparing the expressions.
For sake of example, we are building a utility that would sort files from one (or more locations) into a separate folders based on file system wildcard expressions. For example: *.txt goes into folder a, *.doc goes into folder b, and so on. The wildcard characters we would support would be * and ?
I want to be able to determine from just analyzing the wildcard expressions, whether they would conflict/overlap.
For example, if I have the following expressions:
*.x.y
*.y
They would conflict (overlap) because the second expression *.y would include *.x.y results. (ex. A.x.y would match both expressions)
I am approaching this by building a tree structure using the using all of the expressions, figuring that the very act of building a tree will fail if the expressions conflict.
For example:
*.x
a.b
a.c
b.d
might create a tree like
+-*-.-x
|
start +--+
| +-b
| |
+-a-.-+-c
|
|
+-b-.-d
If I try to add the pattern b.x, the tree would be successful following the *.x path, and thereby say that the pattern already exists.
Am I heading in the correct direction? Or is there an known algorithm for attacking this?
To check whether two wildcard patterns could match the same filename, you can look at this problem as creating a grid of comparisons between pairs of characters, and then checking whether there exists a diagonal path. The illustration below shows how wildcard patterns ab?.c??
and a*bc.*
can be checked for possible conflict:
When a match between two identical literal characters is found, you move diagonally to the next characters to check. (indicated with green arrow)
When a literal character and a single-character wild card ?
are encountered, there are two possibilities: either the wild card matches the character (move diagonally), or the wildcard matches empty space, and you skip over it. (indicated with purple arrows)
When a multi-character wild card *
is encountered, three possibilities need to be taken into account: the wild card matches an empty space before the other character, the wild card matches the other character, or the wild card matches multiple characters. (indicated with blue arrows)
code example 1 (iterative)
Below is a simple javascript implementation which iterates over the grid diagonally, marks cells which can be reached from the current cell, and then checks whether the bottom right cell is reachable. Run the code snippet to see a few examples. (update: top-to-bottom left-to-right will do fine instead of diagonally)
function wildConflict(wild1, wild2) {
var grid = [[true]], width = wild1.length, height = wild2.length;
for (var x = 1; x <= width; x++) grid[x] = [];
for (var y = 0; y < height; y++) {
for (var x = 0; x < width; x++) {
if (grid[x][y]) {
var a = wild1.charAt(x), b = wild2.charAt(y);
if (a == '*' || b == '*' || a == '?' || b == '?' || a == b) grid[x + 1][y + 1] = true;
if (a == '*' || b == '*' || a == '?') grid[x + 1][y] = true;
if (a == '*' || b == '*' || b == '?') grid[x][y + 1] = true;
}
}
}
return grid[width][height] == true;
}
var a = ["a", "a", "a*", "abc", "a*", "*.x.y", "*.x.y", "a*b*", "a*bc.*", "a?c.de"];
var b = ["a", "b", "abc", "a?", "*b", "*.y", "*.x", "a*c*", "ab?.c??", "ac.d??"];
for (var i in a) document.write(""" + a[i] + "" ↔ "" + b[i] + "" → " + wildConflict(a[i], b[i]) + "<BR>");
code example 2 (recursive)
A simple recursive implementation has the drawback of potentially checking some character pairs more than once. It doesn't need the 2D-array, but the recursions obviously use memory too.
Note that when a multi-character wild card *
is encountered, the algorithm recurses with only two possibilities: jump over the one character, or jump over the other character; jumping over both characters (i.e. the wild card matches exactly one character) is taken care of in the next step, when the wild card is compared to the next character.
function wildConflict(wild1, wild2) {
var w1 = wild1.split(''), w2 = wild2.split('');
return conflict(0, 0);
function conflict(p1, p2) {
if (p1 == w1.length || p2 == w2.length) {
if ((p1 == w1.length && p2 == w2.length)
|| (p1 == w1.length - 1 && (w1[p1] == '*' || w1[p1] == '?'))
|| (p2 == w2.length - 1 && (w2[p2] == '*' || w2[p2] == '?'))) {
return true;
}
else return false; // premature end
}
else if (w1[p1] == '*' || w2[p2] == '*' || (w1[p1] == '?' && w2[p2] == '?')) {
return conflict(p1 + 1, p2) || conflict(p1, p2 + 1);
}
else if (w1[p1] == '?') {
return conflict(p1 + 1, p2) || conflict(p1 + 1, p2 + 1);
}
else if (w2[p2] == '?') {
return conflict(p1, p2 + 1) || conflict(p1 + 1, p2 + 1);
}
else if (w1[p1] == w2[p2]) {
return conflict(p1 + 1, p2 + 1);
}
else return false; // unequal literals
}
}
var x = ["a", "a", "a*", "abc", "a*", "*.x.y", "*.x.y", "a*b*", "a*bc.*", "a?c.de"];
var y = ["a", "b", "abc", "a?", "*b", "*.y", "*.x", "a*c*", "ab?.c??", "ac.d??"];
for (var i in x) document.write(""" + x[i] + "" ↔ "" + y[i] + "" → " + wildConflict(x[i], y[i]) + "<BR>");
Turn each wildcard expression into a finite automaton that matches it.
Compute the intersection of the finite automatons.
Use dynamic programming to see whether the intersection can ever match.
If you don't recognize these concepts, see Algorithm for exclusion of numbers for an attempt of mine to explain it a few years ago. (At that point for counting things that matched a collection of regular expressions, but the principles are identical.)
I think you might be able to turn the patterns into regular expressions and then see if they match eachother? Solution here is based on the rules for Directory.GetFiles on MSDN -- I feel like there's SOMETHING wrong with it but I'm not sure what.
here's a basic implementation
private bool Equivalent(string patternOne, string patternTwo)
{
// convert both patterns to regexes based on rules for Directory.GetFiles
var expressionOne = FilePatternToRegex(patternOne);
var expressionTwo = FilePatternToRegex(patternTwo);
// if either regex matches the opposite pattern, we've got a conflict
return expressionTwo.IsMatch(patternOne) || expressionOne.IsMatch(patternTwo);
}
Regex FilePatternToRegex(string pattern)
{
// separate extension and filename
var extension = Path.GetExtension(pattern);
var filename = Path.GetFileNameWithoutExtension(pattern);
// escape filename
filename = EscapeFilePattern(filename);
// 3 character extensions are a special case -- should be greedy eg xls matches xlsx
// extension.Length == 4 bc its dot AND 3 characters
if (extension.Length == 4 && !extension.Contains("*") && !extension.Contains("?"))
{
extension = extension + ".*";
}
else
{
// all other extension lengths just get escaped like normal regexes
extension = EscapeFilePattern(extension);
}
// our final pattern should also only match at the string start/end
var finalPattern = "\\A" + filename + extension + "\\z";
return new Regex(finalPattern);
}
string EscapeFilePattern(string pattern)
{
// escape star and question mark bc they are filepattern significant
pattern = pattern.Replace("*", "%S%").Replace("?", "%Q%");
// escape all other special regex characters
pattern = Regex.Escape(pattern);
// turn star and question mark into their regex equivalents
pattern = pattern.Replace("%S%", ".+").Replace("%Q%", ".");
return pattern;
}
EDIT: based on further discussion in comments, this is broken. Proof using code sample that it's broken:
var dir = new DirectoryInfo(Environment.CurrentDirectory).CreateSubdirectory(Guid.NewGuid().ToString());
var path = Path.Combine(dir.FullName, "abc");
File.WriteAllText(path, "*");
// verify both patterns match our file
Assert.AreEqual(path, dir.GetFiles("a*c*")[0].FullName);
Assert.AreEqual(path, dir.GetFiles("a*b*")[0].FullName);
// current regex based solution thinks they are NOT equivalent
// when they are
Assert.IsFalse(Equivalent("a*c*", "a*b*"));