How can I find and remove CSS references in HTML h

Posted 2019-07-20 01:20

Question:

I have created a service to join, minify and compress css-references on a CMS system. Example:

Before:

<link href="/Files/css1.css" rel="stylesheet" type="text/css"/>
<link href="/Files/css2.css" rel="stylesheet" type="text/css"/>
<link href="/Files/css3.css" rel="stylesheet" type="text/css" media="all"/>

Now you can write:

<link href="/min.ashx?files=/Files/css1.css,/Files/css2.css,/Files/css3.css" rel="stylesheet" type="text/css" />

My next task is to automatically find all CSS references in the head section and replace them with one single line, as in the example above.

I should only replace those that fall within these rules:

  • The href starts with '/Files/', to avoid trying to load external files.
  • Only links without a media attribute, or with media="all", should be included, as the resulting CSS file will only have one media setting.

I have access to the raw HTML of the page, but I am stuck on reliably locating the references, not knowing whether I should parse the page as XML, use a regex, or do something else.

Can anyone point me in the right direction?

Answer 1:

Use HTML Agility Pack. Rough plan of attack:

  1. Load the html content into an HtmlDocument object.

  2. Find the link nodes in the HtmlDocument object via XPath

    // HtmlDocument exposes its root as DocumentNode; SelectNodes returns null when nothing matches
    var nodes = doc.DocumentNode.SelectNodes("//head/link[@type='text/css']");

  3. Retrieve the hrefs from those nodes

    string href = nodes[0].Attributes["href"].Value;

  4. Then replace the nodes with the new node.
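
The four steps above can be put together roughly as follows. This is only a sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name CssLinkCombiner and its helper methods are hypothetical names made up for illustration, and the filter encodes the two rules from the question (href under /Files/, media missing or "all").

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class CssLinkCombiner
{
    // Hypothetical helper: returns the page HTML with every qualifying <link>
    // in <head> replaced by one combined reference to /min.ashx.
    public static string Combine(string html)
    {
        // Step 1: load the raw HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Step 2: find the link nodes. SelectNodes returns null when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//head/link[@type='text/css']");
        if (links == null) return html;

        // Keep only the links that satisfy the rules from the question.
        var targets = links.Where(IsLocalAllMedia).ToList();
        if (targets.Count == 0) return html;

        // Step 3: collect the hrefs into the comma-separated files parameter.
        string files = string.Join(",", targets.Select(n => n.GetAttributeValue("href", "")));

        // Step 4: insert the combined node where the first original was, then drop the originals.
        HtmlNode combined = doc.CreateElement("link");
        combined.SetAttributeValue("href", "/min.ashx?files=" + files);
        combined.SetAttributeValue("rel", "stylesheet");
        combined.SetAttributeValue("type", "text/css");
        targets[0].ParentNode.InsertBefore(combined, targets[0]);
        foreach (HtmlNode node in targets) node.Remove();

        return doc.DocumentNode.OuterHtml;
    }

    // Rule 1: href starts with /Files/. Rule 2: media is absent or "all".
    private static bool IsLocalAllMedia(HtmlNode link)
    {
        string href = link.GetAttributeValue("href", "");
        string media = link.GetAttributeValue("media", "all"); // a missing media attribute is treated as "all"
        return href.StartsWith("/Files/", StringComparison.OrdinalIgnoreCase)
            && media.Equals("all", StringComparison.OrdinalIgnoreCase);
    }
}
```

Calling CssLinkCombiner.Combine(rawHtml) should leave external stylesheets and links with other media values (for example media="print") untouched. The exact serialization of the new tag depends on the library's output options, so it is worth checking the result against a real page.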



Answer 2:

You can find the links that match your rules with regex:

<link href="(/Files/[^"]+)" .* media

It will give you the file path inside the quotes, e.g. '/Files/css1.css'. You can use that result to build up the string you wanted.

C# friendly regex:

@"<link href=""(/Files/[^""]+)"" .* media"

Use the Regex.Match method (or Regex.Matches to get every link on the page) and read the captured groups: http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx
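
Note that the pattern above only matches tags where href comes first and a media attribute follows on the same line, so links without media (which the question also wants included) would be skipped. A slightly more tolerant regex-only variant might look like the sketch below. It uses only System.Text.RegularExpressions; the class and method names are hypothetical, it assumes double-quoted attribute values, and it does not restrict itself to the head section.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssLinkRegex
{
    // Matches a whole <link ...> tag, whatever the attribute order.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // Captures an href that starts with /Files/ (double-quoted values only).
    private static readonly Regex Href =
        new Regex(@"\bhref\s*=\s*""(/Files/[^""]+)""", RegexOptions.IgnoreCase);

    // Captures the media value, if the attribute is present.
    private static readonly Regex Media =
        new Regex(@"\bmedia\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);

    // Returns the /Files/ paths of links whose media is absent or "all".
    public static List<string> FindLocalStylesheets(string html)
    {
        var files = new List<string>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            Match href = Href.Match(tag.Value);
            if (!href.Success) continue; // external or non-/Files/ link

            Match media = Media.Match(tag.Value);
            bool mediaOk = !media.Success
                || media.Groups[1].Value.Equals("all", StringComparison.OrdinalIgnoreCase);
            if (!mediaOk) continue; // e.g. media="print"

            files.Add(href.Groups[1].Value);
        }
        return files;
    }

    // Builds the single combined tag in the form shown in the question.
    public static string BuildCombinedTag(IEnumerable<string> files)
    {
        return "<link href=\"/min.ashx?files=" + string.Join(",", files)
             + "\" rel=\"stylesheet\" type=\"text/css\" />";
    }
}
```

The same LinkTag pattern could be passed to Regex.Replace to strip the original tags once the combined one has been written. Regexes remain fragile against unusual markup (single quotes, comments, tags split across lines), which is the main argument for the parser-based approach.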