Using RegEx to find a set of file names and their paths

Posted 2019-03-20 16:12

Good morning guys

Is there a good way to use regular expressions in C# to find all file names and their paths within a string variable?

For example, if you have this string:

string s = @"Hello John

these are the files you have to send us today: <file>C:\Development\Projects 2010\Accounting\file20101130.csv</file>, <file>C:\Development\Projects 2010\Accounting\orders20101130.docx</file>

also we would like you to send <file>C:\Development\Projects 2010\Accounting\customersupdated.xls</file>

thank you";

The result would be:

C:\Development\Projects 2010\Accounting\file20101130.csv
C:\Development\Projects 2010\Accounting\orders20101130.docx
C:\Development\Projects 2010\Accounting\customersupdated.xls

EDIT: Following @Jim's suggestion, I added <file> tags to the string to make it easier to extract the file names.

3 answers
可以哭但决不认输i
Reply #2 · 2019-03-20 16:51

Here's something I came up with:

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main()
    {
        string s = @"Hello John these are the files you have to send us today: 
            C:\projects\orders20101130.docx also we would like you to send 
            C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext  

            Thank you";

        Extract(s);

    }

    private static readonly Regex rx = new Regex
        (@"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);

        foreach (Match match in matches)
        {
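            // Print the full path (whole match) and the file name (capture group 1).
            Console.WriteLine("'{0}', file: '{1}'", match.Value, match.Groups[1].Value);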
        }
    }

}

Produces: (see on ideone)

'C:\projects\orders20101130.docx', file: 'orders20101130.docx'
'C:\some\file.txt', file: 'file.txt'
'C:\someother.file', file: 'someother.file'
'd:\some file\with spaces.ext', file: 'with spaces.ext'

The regex is not very robust (it makes a few assumptions), but it works for your examples as well.
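
To illustrate one of those assumptions: the file name part of the pattern is greedy and only stops at a colon or a backslash, so it backtracks to the last dot that is followed by word characters. If the surrounding prose contains another dot, the match can run past the real file name. A small sketch of that, where the input sentence and the names GreedyDemo and FirstMatch are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Same pattern as in the program above.
    private static readonly Regex rx = new Regex(
        @"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    // Returns the first matched path, or null if nothing matches.
    public static string FirstMatch(string text)
    {
        Match m = rx.Match(text);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Hypothetical input: the "p.m." after the path contains a dot
        // followed by a word character, so the match extends up to it.
        Console.WriteLine(FirstMatch(@"send C:\some\file.txt by 5 p.m. please"));
        // prints: C:\some\file.txt by 5 p.m
    }
}
```

That is why delimiters such as the <file> tags make the extraction much more reliable.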


Here is a version of the program if you use <file> tags. Change the regex and Extract to:

private static readonly Regex rx = new Regex
    (@"<file>(.+?)</file>", RegexOptions.IgnoreCase);

static void Extract(string text)
{
    MatchCollection matches = rx.Matches(text);

    foreach (Match match in matches)
    {
        Console.WriteLine("'{0}'", match.Groups[1]);
    }
}

Also available on ideone.

虎瘦雄心在
Reply #3 · 2019-03-20 17:09

If you use <file> tags and the final text can be treated as well-formed XML (as inner XML, i.e. content without a root element), you can probably do:

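// Requires: using System.Xml;
var doc = new XmlDocument();
// Wrap the text in a root element so it parses as a single XML document.
doc.LoadXml(String.Concat("<root>", input, "</root>"));

var files = doc.SelectNodes("//file");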

or

var doc = new XmlDocument();

doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.InnerXml = input;

var nodes = doc.SelectNodes("//file");

Both methods work and are highly object-oriented, especially the second one, and they should also perform rather better.

See also - Don't parse (X)HTML using RegEx
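
A fuller sketch of the first variant, assuming the message really is well-formed once it is wrapped in a root element. The names XmlFileExtractor and ExtractFiles are made up for illustration. Keep in mind that LoadXml throws an XmlException if the text contains a bare & or <, so this is only an option when you control the input.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlFileExtractor
{
    // Wraps the text in a root element, parses it, and returns
    // the text content of every <file> element in document order.
    public static List<string> ExtractFiles(string input)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + input + "</root>");

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//file"))
        {
            result.Add(node.InnerText);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened version of the string from the question.
        string input = @"files to send: <file>C:\Development\Projects 2010\Accounting\file20101130.csv</file>, <file>C:\Development\Projects 2010\Accounting\orders20101130.docx</file>";

        foreach (string path in ExtractFiles(input))
        {
            Console.WriteLine(path);
        }
    }
}
```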

老娘就宠你
Reply #4 · 2019-03-20 17:17

If you put some constraints on your filename requirements, you can use code similar to this:

string s = @"Hello John

these are the files you have to send us today: C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx

also we would like you to send C:\Development\Projects 2010\Accounting\customersupdated.xls

thank you";

Regex regexObj = new Regex(@"\b[a-z]:\\(?:[^<>:""/\\|?*\n\r\0-\37]+\\)*[^<>:""/\\|?*\n\r\0-\37]+\.[a-z0-9\.]{1,5}", RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase);
MatchCollection fileNameMatchCollection = regexObj.Matches(s);
foreach (Match fileNameMatch in fileNameMatchCollection)
{
    MessageBox.Show(fileNameMatch.Value);
}

In this case, I limited extensions to a length of 1-5 characters. You can obviously use another value or restrict the characters allowed in filename extensions further. The list of valid characters is taken from the MSDN article Naming Files, Paths, and Namespaces.
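
As a sketch of what restricting the extensions further could look like, here is a variant that only accepts a fixed list of extensions. The list (csv, docx, xls), the sample text, and the names WhitelistExtractor and Extract are assumptions for illustration. I also wrote the control-character range as \x00-\x1F instead of \0-\37, which covers the same characters, including \n and \r.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitelistExtractor
{
    // Same structure as the pattern above, but the extension must be
    // one of a fixed set (hypothetical list; adjust to your needs).
    private static readonly Regex rx = new Regex(
        @"\b[a-z]:\\(?:[^<>:""/\\|?*\x00-\x1F]+\\)*[^<>:""/\\|?*\x00-\x1F]+\.(?:csv|docx|xls)\b",
        RegexOptions.IgnoreCase);

    // Returns every matched path in the order it appears in the text.
    public static List<string> Extract(string text)
    {
        var result = new List<string>();
        foreach (Match m in rx.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string s = @"send C:\Development\Projects 2010\Accounting\file20101130.csv, C:\temp\notes.tmp and C:\a\b.xls";
        foreach (string path in Extract(s))
        {
            Console.WriteLine(path);
        }
    }
}
```

With this sample text the .csv and .xls paths are returned and the .tmp path is skipped, because its extension is not in the list.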
