ASP.NET - Parse / Query HTML Before Transmission a

Posted 2019-09-11 04:13

Question:

As a web developer I feel too much of my time is spent on CSS. I am trying to come up with a solution where I can write re-usable CSS i.e. classes and reference these classes in the HTML without additional code in ASPX or ASCX files etc. or code-behind files. I want an intermediary which links up HTML elements with CSS classes.

What I want to achieve:

  • Modify HTML immediately before transmission
  • Select elements in the HTML
  • Based on rules defined elsewhere (e.g. in a text file relating to the page currently being processed):
      • Add a CSS class reference to multiple HTML elements
      • Add multiple CSS class references to a single HTML element

How I envisage this working:

  1. Extend ASP.NET functions which generate final HTML
  2. Grab all the HTML as a string
  3. Pass the string into a constructor for an object with querying (e.g. XPath) methods
  4. Go through list of global rules e.g. for child ul of first div then class = "navigation"
  5. Go through list of page specific rules e.g. for child ul of first div then class &= " home"
  6. Get processed HTML from object e.g. obj.ToString
  7. ASP.NET to resume page generation using processed HTML

So what I need to know is:

  1. Where / how can I extend ASP.NET page generation functions (to get all HTML of page)
  2. What classes have element / node querying methods and access to attributes

Thanks for your help in advance.

P.S. I am developing ASP.NET Web Forms websites with VB.NET code-behinds running on IIS 7

Answer 1:

Check out my CsQuery project: https://github.com/jamietre/csquery or on nuget as "CsQuery".

This is a C# (.NET 4) port of jQuery. In basic performance tests (included in the project test suite) selectors are about 100 times faster than HTML Agility Pack + Fizzler (a css selector add-on for HAP); it's plenty fast for manipulating the output stream in real time on a typical web site. If you are amazon.com or something, of course, YMMV.

My initial purpose in developing this was to manipulate HTML from a content management system. Once I had it up and running, I found that using CSS selectors and the jQuery API is a whole lot more fun than using web controls and started using it as a primary HTML manipulation tool for server-rendered pages, and built it out to cover pretty much all of CSS, jQuery and the browser DOM. I haven't touched a web control since.

To intercept HTML in webforms with CsQuery you do this in the page codebehind:

using CsQuery;
using CsQuery.Web;

protected override void Render(HtmlTextWriter writer)
{

    var csqContext = WebForms.CreateFromRender(Page, base.Render, writer);

    // CQ object is like a jQuery object. The "Dom" property of the context
    // returned above represents the output of this page.

    CQ doc = csqContext.Dom;

    doc["li > a"].AddClass("foo");

    // write it
    csqContext.Render();
}

To do the same thing in ASP.NET MVC please see this blog post describing that.

There is basic documentation for CsQuery on GitHub. Apart from getting HTML in and out, it works pretty much like jQuery. The WebForms object above is just to help you handle interacting with the HtmlTextWriter object and the Render method. The general-purpose usage is very simple:

var doc = CQ.Create(htmlString);

// or (useful for scraping and testing)
var doc = CQ.CreateFromUrl(url);

// do stuff with doc, a CQ object that acts like a jQuery object

doc["table tr:first"].Append("<td>A new cell</td>");

Additionally, pretty much the entire browser DOM is available using the same methods you use in a browser. The indexer [0] returns the first element in the selection set, like jQuery; if you are used to writing JavaScript to manipulate HTML it should be very familiar.

// "Select" method is the same as the property indexer [] we used above.
// I go back and forth between them to emphasise their interchangeability.

var element = doc.Select("div > input[type=checkbox]:first-child")[0];
element.Checked = true;

Of course in C# you have a wealth of other general-purpose tools like LINQ at your disposal. Alternatively:

// Single() is the LINQ extension method (requires "using System.Linq;")
var element = doc["div > input[type=checkbox]:first-child"].Single();

element.Checked = true;

When you're done manipulating the document, you'll probably want to get the HTML out:

string html = doc.Render();

That's all there is to it. There are a vast number of methods on the CQ object, covering all the jQuery DOM manipulation techniques. There are also utility methods for handling JSON, and it has extensive support for dynamic and anonymous types to make passing data structures (e.g. a set of CSS classes) as easy as possible -- much like jQuery.
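
As a rough sketch of how the rule lists from the question could map onto this API: the CssRuleApplier class, its Apply method and the sample selectors below are made up for illustration. Only CQ.Create, the selector indexer, AddClass and Render come from CsQuery, and the example assumes the CsQuery NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using CsQuery;

// Hypothetical helper, not part of CsQuery.
public static class CssRuleApplier
{
    // Each rule maps a CSS selector to one or more space-separated class names.
    public static string Apply(string html, IEnumerable<KeyValuePair<string, string>> rules)
    {
        CQ doc = CQ.Create(html);

        foreach (KeyValuePair<string, string> rule in rules)
        {
            // Add the classes one at a time so nothing depends on how
            // AddClass treats a space-separated list.
            string[] classes = rule.Value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string cssClass in classes)
            {
                doc[rule.Key].AddClass(cssClass);
            }
        }

        return doc.Render();
    }
}

public static class Program
{
    public static void Main()
    {
        // A global rule followed by a page-specific rule (both invented here).
        var rules = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("div:first > ul", "navigation"),
            new KeyValuePair<string, string>("div:first > ul", "home")
        };

        string html = "<div><ul><li><a href=\"/\">Home</a></li></ul></div>";
        Console.WriteLine(CssRuleApplier.Apply(html, rules));
    }
}
```

In a real site the rules would presumably be loaded from wherever you keep them (a text file per page, for example) and Apply would be called from the Render override shown earlier.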

Some More Advanced Stuff

I don't recommend doing this unless you are familiar with lower-level tinkering with ASP.NET's HTTP workflow. Nothing here is impossible, but there will be a learning curve if you've never heard of an HttpHandler.

If you want to skip the WebForms engine altogether, you can create an IHttpHandler that automatically parses HTML files. This would definitely perform better than layering on top of the ASPX engine; who knows, it may even be faster than doing a similar amount of server-side processing with web controls. You can then register your handler in web.config for specific extensions (like htm and html).
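
A minimal sketch of such a handler, under some assumptions: the class name HtmlFileHandler and the "li > a" rule are placeholders, the file is small enough to read into memory in one go, and the web.config fragment in the comment targets the IIS 7 integrated pipeline.

```csharp
using System.IO;
using System.Web;
using CsQuery;

// Hypothetical handler name. One way to register it in web.config
// (IIS 7 integrated pipeline assumed):
//
//   <system.webServer>
//     <handlers>
//       <add name="HtmlFileHandler" path="*.html" verb="GET" type="HtmlFileHandler" />
//     </handlers>
//   </system.webServer>
public class HtmlFileHandler : IHttpHandler
{
    // The handler keeps no per-request state, so one instance can be reused.
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        // Map the requested URL to a file on disk.
        string path = context.Request.PhysicalPath;
        if (!File.Exists(path))
        {
            context.Response.StatusCode = 404;
            return;
        }

        // Parse the static file, apply whatever rules you need, and send it.
        CQ doc = CQ.Create(File.ReadAllText(path));
        doc["li > a"].AddClass("foo");

        context.Response.ContentType = "text/html";
        context.Response.Write(doc.Render());
    }
}
```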

Yet another way to automatically intercept is with routing. You can use the MVC routing library in a webforms app with no trouble; here's one description of how to do this. Then you can create a route that matches whatever pattern you want (again, perhaps *.html) and pass handling off to a custom IHttpHandler or class. In this case, you're doing everything: you will need to look at the path, load the file from the file system, parse it with CsQuery, and stream the response.

Using either mechanism, you'll need a way to tell your project what code to run for each page, of course. That is, just because you've created a nifty HTML parser, how do you then tell it to run the correct "code behind" for that page?

MVC does this by just locating a controller with the name of "PageNameController.cs" and calling a method that matches the name of the parameter. You could do whatever you want; e.g. you could add an element:

<script type="controller" src="MyPageController"></script>

Your generic handler code could look for such an element, and then use reflection to locate the correct named class & method to call. This is pretty involved, and beyond the scope of this answer; but if you're looking to build a whole new framework or something this is how you would go about it.
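
To make the reflection idea a little more concrete, here is a rough sketch. Everything in it is an invented convention: the ControllerDispatcher class, the rule that a controller exposes a public Process(CQ) method, and the requirement that the controller lives in the same assembly and has a parameterless constructor.

```csharp
using System;
using System.Reflection;
using CsQuery;

// Hypothetical dispatcher; none of this is part of CsQuery or ASP.NET.
public static class ControllerDispatcher
{
    public static void Dispatch(CQ doc)
    {
        // Look for the marker element described above.
        CQ marker = doc["script[type=controller]"];
        if (marker.Length == 0)
        {
            return;
        }

        string typeName = marker.Attr("src");
        marker.Remove(); // the marker should not reach the browser

        if (string.IsNullOrEmpty(typeName))
        {
            return;
        }

        // Assumes the controller is in this assembly; a namespaced class
        // would need its full name in the src attribute.
        Type type = Assembly.GetExecutingAssembly().GetType(typeName);
        if (type == null)
        {
            return;
        }

        // Invented convention: a public instance method Process(CQ).
        MethodInfo method = type.GetMethod("Process", new[] { typeof(CQ) });
        if (method == null)
        {
            return;
        }

        object controller = Activator.CreateInstance(type);
        method.Invoke(controller, new object[] { doc });
    }
}

// Example "code behind" that the element above would resolve to.
public class MyPageController
{
    public void Process(CQ doc)
    {
        doc["li > a"].AddClass("foo");
    }
}
```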



Answer 2:

Intercepting the content of the page prior to it being sent is rather simple. I did this a while back on a project that compressed content on the fly: http://optimizerprime.codeplex.com/ (It's ugly, but it did its job and you might be able to salvage some of the code). Anyway, what you want to do is the following:

1) Create a Stream object that saves the content of the page until Flush is called. For instance I used this in my compression project: http://optimizerprime.codeplex.com/SourceControl/changeset/view/83171#1795869 Like I said before, it's not pretty. My point is that you'll need to create your own Stream class that does what you want (in this case, give you the string output of the page, let you parse/modify the string, and then output it to the user).

2) Assign your stream to the page's response filter (Page.Response.Filter). Note that you need to do it rather early on so you can catch all of the content. I did this with an HTTP module that ran on the PreRequestHandlerExecute event. But if you did something like this:

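    protected override void OnPreInit(EventArgs e)
    {
        // MyStream is your own class; it is assumed here to wrap the existing
        // filter so it can forward the modified output to the client.
        this.Response.Filter = new MyStream(this.Response.Filter);
        base.OnPreInit(e);
    }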

That would also most likely work.

3) You should be able to use something like Html Agility Pack to parse the HTML and modify it from there.
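
For step 3, a minimal sketch with Html Agility Pack might look like this. The AgilityPackRewriter class and its AddClass method are names made up for the example; the HtmlDocument, SelectNodes, GetAttributeValue and SetAttributeValue calls are the library's own, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class AgilityPackRewriter
{
    // Appends cssClass to the class attribute of every node matched by xpath.
    public static string AddClass(string html, string xpath, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Keep any classes that are already present.
                string existing = node.GetAttributeValue("class", "");
                string combined = existing.Length == 0 ? cssClass : existing + " " + cssClass;
                node.SetAttributeValue("class", combined);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

A rule such as "child ul of first div" from the question could then be written as AddClass(html, "(//div)[1]/ul", "navigation").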

That to me seems like the easiest approach.
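
For reference, here is a stripped-down sketch of the kind of Stream that steps 1 and 2 describe. It is not the code from the project linked above. The name RewritingFilterStream and the rewrite delegate are invented for illustration, and it assumes the response is buffered and that doing the work in Close (which ASP.NET calls on the filter at the end of the request) is acceptable, rather than rewriting on every Flush.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical filter: buffers everything written to it, then rewrites the
// whole page once and forwards it to the original filter stream.
public class RewritingFilterStream : Stream
{
    private readonly Stream _inner;                  // the original Response.Filter
    private readonly Encoding _encoding;             // e.g. Response.ContentEncoding
    private readonly Func<string, string> _rewrite;  // HTML in, HTML out
    private readonly MemoryStream _buffer = new MemoryStream();
    private bool _closed;

    public RewritingFilterStream(Stream inner, Encoding encoding, Func<string, string> rewrite)
    {
        _inner = inner;
        _encoding = encoding;
        _rewrite = rewrite;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Hold the output back instead of sending it to the client.
        _buffer.Write(buffer, offset, count);
    }

    public override void Flush()
    {
        // Intentionally empty: the page may be incomplete at this point.
    }

    public override void Close()
    {
        if (!_closed)
        {
            _closed = true;
            string html = _encoding.GetString(_buffer.ToArray());
            byte[] output = _encoding.GetBytes(_rewrite(html));
            _inner.Write(output, 0, output.Length);
            _inner.Flush();
        }
        base.Close();
    }

    // This stream is write-only, so the remaining members are not supported.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

In a page it would be wired up along the lines of Response.Filter = new RewritingFilterStream(Response.Filter, Response.ContentEncoding, html => html); with the lambda replaced by whatever parsing and modification you need.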