C# - Remove spaces in HTML source in between marku

I am currently working on a program that allows me to enter HTML source code into a RichTextBox control and removes the spaces from in between markups. The only problem is, I am not sure how I can differentiate between the spaces BETWEEN the markups and the spaces INSIDE the markups. Obviously, removing the spaces inside the markups would be bad. Any ideas as to how I can tell the difference?

Example: (before white space is removed)

<p>blahblahblah</p>                  <p>blahblahblah</p>

Example: (after white space is removed)

<p>blahblahblah</p><p>blahblahblah</p>

Tags: c# html whitespace

7 answers

Viruses.

#2 · 2020-06-04 12:26

I would be tempted to use a regex to match any whitespace between an end tag and the next begin tag. Regex pattern matching would avoid you having to write logic yourself.
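As a rough sketch of that idea (not a tested production solution), a pattern with lookarounds can match only the whitespace that sits directly after a closing tag and directly before an opening tag, so text inside elements is never touched. The class and method names below (BetweenTagWhitespace, Strip) are made up for illustration. Keep in mind that whitespace between inline elements such as `</b> <i>` is rendered by browsers, so removing it can change how the page looks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class BetweenTagWhitespace
{
    // (?<=</[^>]+>)  the whitespace must come right after a closing tag
    // \s+            one or more whitespace characters (spaces, tabs, line breaks)
    // (?=<[A-Za-z])  and must be followed directly by an opening tag
    private static readonly Regex Gap =
        new Regex(@"(?<=</[^>]+>)\s+(?=<[A-Za-z])", RegexOptions.CultureInvariant);

    public static string Strip(string html)
    {
        // The lookarounds are zero-width, so only the whitespace itself is removed.
        return Gap.Replace(html, string.Empty);
    }
}
```

With the example from the question, BetweenTagWhitespace.Strip(txtSource.Text) would turn the two paragraphs separated by spaces into `<p>blahblahblah</p><p>blahblahblah</p>`, while spaces inside a paragraph stay as they are.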


做个烂人

#3 · 2020-06-04 12:28

The solution in the link that Rasik posted here works for you too:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

The regular expression matches each tag together with any whitespace around it and replaces the whole match with the tag alone.

Edit: a better solution that works for Micheal's example:

Regex.Replace(txtSource.Text,
            @"\s*(?<capture><(?<markUp>\w+)>.*<\/\k<markUp>>)\s*", "${capture}", RegexOptions.Singleline);

This regular expression matches a complete element, from an opening tag through its matching closing tag, leaves whatever is inside unchanged, and removes the whitespace outside it. There are other cases to consider too, such as tags without a closing tag.


对你真心纯属浪费

#4 · 2020-06-04 12:35

You could attempt to use a regular expression to strip out the whitespace. However, the expression would have to be rather complex to differentiate between opening and closing tags and to handle nested tags.

Instead, you might parse the HTML input using a library like the Html Agility Pack and then rebuild the HTML string from the document model. This will not only strip out extra white space, it will also validate the HTML (even automatically correct common mistakes).


\"骚年 ilove

#5 · 2020-06-04 12:35

My solution (similar to how Linearize works in the XML Tools plug-in for Notepad++):

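   using System;
   using System.Text.RegularExpressions;

   internal static class CONST
   {
      // A line break (one or more CR/LF characters) plus any indentation that follows it.
      internal static Regex linarize_regex = new Regex(@"[\r\n]+[\x20\t]*", RegexOptions.CultureInvariant | RegexOptions.Compiled);
      // The same, but directly after a tag; the tag is captured so it can be put back.
      internal static Regex tag_linarize_regex = new Regex(@"(?<tag><[^>]*?>)[\r\n]+[\x20\t]*", RegexOptions.CultureInvariant | RegexOptions.Compiled);
   }
   internal static class UTILS
   {
      internal static string linarize_html(string html)
      {
         try
         {
            // Drop line breaks and indentation that directly follow a tag.
            html = CONST.tag_linarize_regex.Replace(html, "${tag}");
            // Collapse every remaining line break (inside text) to a single space.
            html = CONST.linarize_regex.Replace(html, " ");
            return html;
         }
         catch (Exception)
         {
            // On failure, return the string as processed so far.
            return html;
         }
      }
   }
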

够拽才男人

#6 · 2020-06-04 12:42

You can do the following in C# using a regular expression. Note that this pattern only trims whitespace at the very start of the string, so it needs adapting to handle whitespace between tags.

public static string TrimSpaces(string str)
{
    // ^\s+ matches whitespace at the start of the string only.
    return System.Text.RegularExpressions.Regex.Replace(str, @"^\s+", string.Empty);
}

Also, take a look at this other Stack Overflow thread; it may help:

Using regular expression to trim html


forever°为你锁心

#7 · 2020-06-04 12:44

Technically speaking, all spaces are part of some HTML element. The top-most element, i.e., the document, "owns" the spaces between separate <p> nodes in your example, for instance.

So I think you're asking if you can remove the space between nodes at the same level. In this case you'll need to keep track of the element nesting level and the previous element. For example, given a series of <td> elements that occur within the same <tr> element, you can detect the end of one </td> and the beginning of the next <td> element, and ignore all the whitespace in between.

You may be able to simplify the process and simply ignore any whitespace between a closing </x> tag and the next opening tag <y> (but there may be some difficulties with this approach that I can't think of off the top of my head).
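The simplified variant in the last paragraph could be sketched roughly as follows. This is only an illustration under some assumptions: it remembers just whether the previous tag was a closing tag (it does not track the nesting level), it treats anything of the form `<...>` as a tag, and the names SiblingWhitespace and Strip are invented for the example. It also makes no exception for elements like <pre>, where whitespace matters.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class SiblingWhitespace
{
    // A token is a tag (<...>), a run of text up to the next '<', or a stray '<'.
    private static readonly Regex Token = new Regex(@"<[^>]+>|[^<]+|<");

    public static string Strip(string html)
    {
        var output = new StringBuilder(html.Length);
        string pending = null;        // whitespace seen after a closing tag, not yet decided
        bool afterClosingTag = false; // was the previous tag a closing tag?

        foreach (Match m in Token.Matches(html))
        {
            string t = m.Value;
            bool isText = t[0] != '<';
            bool isClosing = t.StartsWith("</", StringComparison.Ordinal);
            bool isOpening = t.Length > 1 && t[0] == '<' && char.IsLetter(t[1]);

            if (isText && afterClosingTag && string.IsNullOrWhiteSpace(t))
            {
                // Hold the whitespace until we know what comes next.
                pending = t;
                continue;
            }

            if (pending != null)
            {
                // Drop the held whitespace only when an opening tag follows it.
                if (!isOpening) output.Append(pending);
                pending = null;
            }

            output.Append(t);
            afterClosingTag = isClosing;
        }

        // Whitespace at the very end of the input is kept.
        if (pending != null) output.Append(pending);
        return output.ToString();
    }
}
```

For a row such as `<tr><td>a</td>  <td>b c</td></tr>`, this drops the two spaces between `</td>` and `<td>` and leaves the space inside "b c" alone. Extending it to honor the nesting level, as described above, would mean keeping a stack of open element names instead of a single flag.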
