Regular expression to convert Markdown to HTML

Posted 2019-04-08 06:30

Question:

How would you write a regular expression to convert Markdown into HTML? For example, you would type in the following:

This would be *italicized* text and this would be **bold** text

This would then need to be converted to:

This would be <em>italicized</em> text and this would be <strong>bold</strong> text

This is very similar to the Markdown edit control used by Stack Overflow.

Clarification

For what it is worth, I am using C#. Also, these are the only real tags/markdown that I want to allow. The amount of text being converted would be less than 300 characters or so.

Answer 1:

The best way is to find a version of the Markdown library ported to whatever language you are using (you did not specify in your question).


Now that you have clarified that you only want STRONG and EM to be processed, and that you are using C#, I recommend you take a look at Markdown.NET to see how those tags are implemented. As you can see, it is in fact two expressions. Here is the code:

// Requires: using System.Text.RegularExpressions;
private string DoItalicsAndBold (string text)
{
    // <strong> must go first:
    text = Regex.Replace (text, @"(\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1", 
                          new MatchEvaluator (BoldEvaluator),
                          RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Then <em>:
    text = Regex.Replace (text, @"(\*|_) (?=\S) (.+?) (?<=\S) \1",
                          new MatchEvaluator (ItalicsEvaluator),
                          RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
    return text;
}

private string ItalicsEvaluator (Match match)
{
    return string.Format ("<em>{0}</em>", match.Groups[2].Value);
}

private string BoldEvaluator (Match match)
{
    return string.Format ("<strong>{0}</strong>", match.Groups[2].Value);
}
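
To try the two expressions above outside the library, here is a minimal, self-contained sketch. The class name `EmphasisDemo` and the lambda evaluators are illustrative assumptions, not part of Markdown.NET; the patterns and options are copied from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, used only to make the snippet runnable.
public static class EmphasisDemo
{
    // Same options the answer's code passes to Regex.Replace.
    private const RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

    public static string DoItalicsAndBold(string text)
    {
        // <strong> goes first, as in the answer's code, so that "**"
        // is not picked up by the single-delimiter <em> pattern.
        text = Regex.Replace(text, @"(\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1",
                             m => "<strong>" + m.Groups[2].Value + "</strong>",
                             Options);

        // Then <em>:
        text = Regex.Replace(text, @"(\*|_) (?=\S) (.+?) (?<=\S) \1",
                             m => "<em>" + m.Groups[2].Value + "</em>",
                             Options);
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(DoItalicsAndBold(
            "This would be *italicized* text and this would be **bold** text"));
    }
}
```

For the sentence from the question, this should print the HTML the question asks for. The lookarounds `(?=\S)` and `(?<=\S)` require a non-space character just inside each delimiter, so a stray `*` surrounded by spaces is left alone.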


Answer 2:

A single regex won't do. Every text markup will have its own HTML translator. It is better to look at how the existing converters are implemented to get an idea of how this works.

http://en.wikipedia.org/wiki/Markdown#See_also



Answer 3:

I don't know about C# specifically, but in Perl it would be:
s/\*\*(.*?)\*\*/<strong>$1<\/strong>/g;
s/\*(.*?)\*/<em>$1<\/em>/g;
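
Since the question is about C#, the Perl substitutions above could be translated roughly as follows. This is a sketch under assumptions: the class and method names (`SimpleEmphasis`, `ToHtml`) are made up for illustration, and `<strong>` is used for bold to match the output requested in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named only for this example.
public static class SimpleEmphasis
{
    public static string ToHtml(string text)
    {
        // Bold first, so that "**" is not read as two italic markers.
        text = Regex.Replace(text, @"\*\*(.*?)\*\*", "<strong>$1</strong>");

        // Then italics: the shortest run between two single asterisks.
        text = Regex.Replace(text, @"\*(.*?)\*", "<em>$1</em>");
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(ToHtml(
            "This would be *italicized* text and this would be **bold** text"));
    }
}
```

This version is shorter than the lookaround-based patterns, but it is also less careful: it treats any pair of asterisks as delimiters, so an input such as `2 * 3 * 4` would come out with `<em>` tags around ` 3 `. For short, controlled input that may be an acceptable trade-off.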



Answer 4:

I came across the following post, which recommends not doing this. In my case I am looking to keep it simple, but I thought I would post this, per jop's recommendation, in case someone else wants to do this.