How much work should the constructor for an HTML p

Published 2019-01-30 14:41

爷的心禁止访问

Question:

How much work is it reasonable for an object constructor to do? Should it simply initialize fields and not actually perform any operations on data, or is it okay to have it perform some analysis?

Background: I was writing a class which is responsible for parsing an HTML page and returning various information based on the parsed information. The design of the class is such that the class' constructor does the parsing, throwing an exception if an error occurs. Once the instance is initialized, the parsed values are available without further processing via accessors. Something like:

public class Parser {

    private int numOfWhatevers;
    private String otherValue;

    public Parser(final String html) throws ParsingException {
        /* Parsing logic that sets the private fields */
        /* and throws an exception if something is erroneous. */
    }

    public int getNumOfWhatevers() { return numOfWhatevers; }
    public String getOtherValue()  { return otherValue; }
}

After designing the class I started to wonder if this was correct OO practice. Should the parsing code be placed within a void parseHtml() method and the accessors only return valid values once this method is called? I feel as though my implementation is correct, but I can't help but feel that some OO purists might find it incorrect for some reason and that an implementation such as the following would be better:

public class Parser {

    private final String html;
    private int numOfWhatevers;
    private String otherValue;

    public Parser(final String html) {
        /* Remember html for later parsing. */
        this.html = html;
    }

    public void parseHtml() throws ParsingException {
        /* Parsing logic that sets the private fields */
        /* and throws an exception if something is erroneous. */
    }

    public int getNumOfWhatevers() { return numOfWhatevers; }
    public String getOtherValue()  { return otherValue; }
}

Are there instances where initialization code, such as parsing information, should not occur within the constructor, or am I just being silly and second-guessing myself?

What are the benefits/drawbacks of splitting the parsing from the constructor?

Thoughts? Insights?

Answer 1:

I normally follow one easy principle:

Everything that is mandatory for the correct existence and behavior of the class instance should be passed to, and done in, the constructor.

Every other activity is done by other methods.

The constructor should never:

use other methods of the class with the purpose of using overriding behavior
act on its private attributes via methods

I learned the hard way that while you are in the constructor, the object is in an incoherent, intermediate state which is too dangerous to handle. Some of the resulting unexpected behavior may come from your own code, some from the language architecture and compiler decisions. Never guess, stay safe, be minimal.

In your case, I would use a Parser::parseHtml(file) method. The instantiation of the parser and the parsing are two different operations. When you instantiate a parser, the constructor puts it in the condition to perform its job (parsing). Then you use its method to perform the parsing. You then have two choices:

Either you allow the parser to contain the results of the parsing, and give the clients an interface to retrieve the parsed information (e.g. Parser::getFooValue()). The methods will return null if you haven't performed parsing yet, or if the parsing failed.
Or your Parser::parseHtml() returns a ParsingResult instance, containing what the Parser found.

The second strategy grants you better granularity, as the Parser is now stateless, and the client needs to interact with the methods of the ParsingResult interface. The Parser interface remains sleek and simple. The internals of the Parser class will tend to follow the Builder pattern.

Your comment: "I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest) is a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report any errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown."

Not really. If you return an instance of a Parser, of course it's going to parse. In Qt, when you instantiate a button, of course it's going to be shown. However, you have the method QWidget::show() to manually call before something is visible to the user.

Any object in OOP has two concerns: initialization and operation (ignore finalization, it's not under discussion right now). If you keep these two operations together, you both risk trouble (having an incomplete object operating) and lose flexibility. There are plenty of reasons why you would perform intermediate setup of your object before calling parseHtml(). Example: suppose you want to configure your Parser to be strict (so that it fails if a given column in a table contains a string instead of an integer) or permissive. Or to register a listener object which is notified every time a new parsing is started or ended (think GUI progress bar). This is optional information, and if your architecture makes the constructor the übermethod that does everything, you end up having a huge list of optional method parameters and conditions to handle in a method which is inherently a minefield.
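A minimal Java sketch of that point, under stated assumptions: the names ConfigurableParser, setStrict, addListener and ProgressListener are invented for illustration, and the "parsing" is a toy that reads comma-separated integers rather than HTML. It only shows optional setup happening between construction and the parse call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener, notified once per processed cell (think GUI progress bar).
interface ProgressListener {
    void onCellProcessed(int index);
}

// Hypothetical parser: construction only makes it ready; parse() does the work.
class ConfigurableParser {
    private boolean strict = false;   // permissive by default
    private final List<ProgressListener> listeners = new ArrayList<>();

    // Optional configuration, applied after construction and before parsing.
    void setStrict(boolean strict) { this.strict = strict; }
    void addListener(ProgressListener listener) { listeners.add(listener); }

    // Toy stand-in for real parsing: reads comma-separated integers.
    // Strict mode rejects non-integer cells; permissive mode skips them.
    List<Integer> parse(String row) {
        List<Integer> values = new ArrayList<>();
        String[] cells = row.split(",");
        for (int i = 0; i < cells.length; i++) {
            try {
                values.add(Integer.parseInt(cells[i].trim()));
            } catch (NumberFormatException e) {
                if (strict) {
                    throw new IllegalArgumentException("Not an integer: " + cells[i]);
                }
            }
            for (ProgressListener listener : listeners) {
                listener.onCellProcessed(i);
            }
        }
        return values;
    }
}
```

A client would construct the parser, optionally call setStrict(true) or addListener(...), and only then call parse(...). None of the optional settings has to appear in the constructor's parameter list.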

"Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality."

On the contrary. If you know that you are going to use the parsing functionality on a lot of files, and there's a significant chance that the files are going to be accessed and parsed again later on, it is the internal responsibility of the Parser to perform smart caching of what it has already seen. The client is totally oblivious to whether this caching is performed or not. It is still calling the parsing, and still obtaining a result object, but it is getting the answer much faster. I think there's no better demonstration of separation of concerns than this. You boost performance with absolutely no change in the contract interface or the whole software architecture.

However, note that I am not advocating that you should never use a constructor call to perform parsing. I am just claiming that it's potentially dangerous and you lose flexibility. There are plenty of examples out there where the constructor is at the center of the actual activity of the object, but there are also plenty of examples of the opposite. Example (although biased, it arises from C style): in Python, I would consider something like this very weird
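As a rough Java sketch of caching hidden behind an unchanged interface (assumptions: CachingParser and getParseRuns are made-up names, ParsingResult here is a toy that only carries a count of '<' characters, and the cache is keyed on the full input string, which a real parser might replace with a hash or file identity):

```java
import java.util.HashMap;
import java.util.Map;

// Toy result type: just the number of '<' characters seen.
final class ParsingResult {
    private final int tagCount;
    ParsingResult(int tagCount) { this.tagCount = tagCount; }
    int getTagCount() { return tagCount; }
}

// Hypothetical parser that caches results internally.
class CachingParser {
    private final Map<String, ParsingResult> cache = new HashMap<>();
    private int parseRuns = 0;   // exists only to make the caching observable

    // Same contract with or without the cache: input in, result out.
    ParsingResult parseHtml(String html) {
        ParsingResult cached = cache.get(html);
        if (cached != null) {
            return cached;
        }
        ParsingResult result = doParse(html);
        cache.put(html, result);
        return result;
    }

    // Toy stand-in for the expensive parsing step.
    private ParsingResult doParse(String html) {
        parseRuns++;
        int count = 0;
        for (char c : html.toCharArray()) {
            if (c == '<') count++;
        }
        return new ParsingResult(count);
    }

    int getParseRuns() { return parseRuns; }
}
```

Calling parseHtml twice with the same input runs doParse once; the caller's code is identical either way.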

f = file()
f.setReadOnly()
f.open(filename)

instead of the actual

f = file(filename,"r")

But I am sure there are IO access libraries using the first approach (with the second as a sugar-syntax approach).

Edit: finally, remember that while it's easy and compatible to add a constructor "shortcut" in the future, it is not possible to remove this functionality if you find it dangerous or problematic. Additions to the interface are much easier than removals, for obvious reasons. Sugary behavior must be weighed against the future support you have to provide for that behavior.

Answer 2:

"Should the parsing code be placed within a void parseHtml() method and the accessors only return valid values once this method is called?"

Yes.

"The design of the class is such that the class' constructor does the parsing"

This prevents customization, extension, and -- most importantly -- dependency injection.

There will be times when you want to do the following

Construct a parser.
Add Features to the parser: Business Rules, Filters, Better Algorithms, Strategies, Commands, whatever.
Parse.

Generally, it's best to do as little as possible in a constructor so that you are free to extend or modify.
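The construct / add features / parse sequence above might look like this in Java. This is only a sketch: FilteringParser and setFilter are hypothetical names, the injected "feature" is a plain java.util.function.Predicate, and the parsing is a toy whitespace split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical parser whose injectable "feature" is a filter over tokens.
class FilteringParser {
    private Predicate<String> filter = token -> true;   // default: keep everything

    // Step 2: inject a feature after construction, before parsing.
    void setFilter(Predicate<String> filter) { this.filter = filter; }

    // Step 3: parse. Toy stand-in: split on whitespace and apply the filter.
    List<String> parse(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && filter.test(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A unit test can inject a trivial filter in step 2 without subclassing, which is the kind of freedom a parse-in-constructor design takes away.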

Edit

"Couldn't extensions simply parse the extra information in their constructors?"

Only if they don't have any kind of features that need to be injected. If you want to add features -- say a different strategy for constructing the parse tree -- your subclasses have to also manage this feature addition before they parse. It may not amount to a simple super() because the superclass does too much.

"Also, parsing in the constructor allows me to fail early"

Kind of. Failing during construction is a weird use case. Failing during construction makes it difficult to construct a parser like this...

class SomeClient {
    Parser p = new Parser();
    void aMethod() {...}
}

Usually a construction failure means you're out of memory. There's rarely a good reason to catch construction exceptions because you're doomed anyway.

You're forced to build the parser in a method body because its arguments are too complex.

In short, you've removed options from the clients of your parser.

"It's inadvisable to inherit from this class to replace an algorithm."

That's funny. Seriously. It's an outrageous claim. No algorithm is optimal for all possible use cases. Often a high-performance algorithm uses a lot of memory. A client may want to replace the algorithm with a slower one that uses less memory.

You can claim perfection, but it's rare. Subclasses are the norm, not an exception. Someone will always improve on your "perfection". If you limit their ability to subclass your parser, they'll simply discard it for something more flexible.

"I don't see needing step 2 as described in the answer."

A bold statement. Dependencies, Strategies and related injection design patterns are common requirements. Indeed, they're so essential for unit testing that a design which makes it difficult or complex often turns out to be a bad design.

Limiting the ability to subclass or extend your parser is a bad policy.

Bottom Line.

Assume nothing. Write a class with as few assumptions about its use cases as possible. Parsing at construction time makes too many assumptions about client use cases.

Answer 3:

A constructor should do whatever is necessary to put that instance into a runnable, valid, ready-to-use state. If that means some validation or analysis, I'd say it belongs there. Just be careful about how much the constructor does.

There might be other places in your design where validation fits as well.

If the input values are coming from a UI, I'd say that it should have a hand in ensuring valid input.

If the input values are being unmarshalled from an incoming XML stream, I'd think about using schemas to validate it.

Answer 4:

I'd probably just pass enough to initialize the object and then have a 'parse' method. The idea is that expensive operations should be as obvious as possible.

Answer 5:

You should try to keep the constructor from doing unnecessary work. In the end, it all depends on what the class should do, and how it should be used.

For instance, will all the accessors be called after constructing your object? If not, then you've processed data unnecessarily. Also, there's a bigger risk of throwing a "senseless" exception (oh, while trying to create the parser, I got an error because the file was malformed, but I didn't even ask it to parse anything...)

On the other hand, you might need fast access to this data once the object is built, and be able to afford a long construction. In that case parsing in the constructor might be OK.

Anyway, if the building process is complicated, I'd suggest using a creational pattern (factory, builder).

Answer 6:

It is a good rule of thumb to only initialize fields in constructors, and otherwise do as little as possible to initialize the Object. Using Java as an example, you could run into problems if you call methods in your constructor, especially if you subclass your Object. This is because, due to the order of operations in the instantiation of Objects, a subclass's instance variable initializers do not run until after the super constructor has finished. If an overridden method called from the super constructor dereferences such a field, the field is still null and a NullPointerException is thrown.

Suppose you have a superclass

class Test {

   Test () {
      doSomething();
   }

   void doSomething() {
     ...
   }
 }

and you have a subclass:

class SubTest extends Test {
    Object myObj = new Object();

    @Override
    void doSomething() {
        System.out.println(myObj.toString()); // throws a NullPointerException          
    }
 }

This is an example specific to Java, and while different languages handle this sort of ordering differently, it serves to drive the point home.
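To make the ordering observable without crashing, here is a small self-contained Java variant of the same situation. The class and field names (Base, Derived, nameWasNullInConstructor) are invented for this sketch; the field that records the observation deliberately has no initializer, because an initializer would run after Base() returns and overwrite the recorded value.

```java
class Base {
    Base() {
        describe();   // overridable call from a constructor: this is the hazard
    }

    void describe() { }
}

class Derived extends Base {
    // Not final on purpose: a final String constant would be inlined by the compiler.
    private String name = "derived";    // assigned only after Base() has returned

    // No initializer here, so the value recorded during Base() is not overwritten.
    boolean nameWasNullInConstructor;

    @Override
    void describe() {
        // Runs while Base() is still executing, before Derived's initializers.
        nameWasNullInConstructor = (name == null);
    }

    String getName() { return name; }
}
```

After new Derived(), nameWasNullInConstructor is true even though getName() returns "derived": the overridden method ran before the subclass field was assigned.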

Edit, in answer to your comment:

Though I would normally shy away from methods in constructors, in this case you have a few options:

In your constructor, set the HTML string as a field in your Class, and parse every time your getters are called. This most likely will not be very efficient.
Set the HTML as a field on your object, and then introduce a dependency on parse(): either it must be called right after the constructor has finished, or you add some sort of lazy parsing by putting something like 'ensureParsed()' at the head of your accessors. I don't like this all that much, as you could have the HTML around after you've parsed, and your ensureParsed() call could be coded to set all of your parsed fields, thereby introducing a side effect to your getter.
You could call parse() from your constructor and run the risk of throwing an exception. As you say, you are setting the fields to initialize the Object, so this is really OK. With regard to the Exception, stating that there was an illegal argument passed into a constructor is acceptable. If you do this, you should be careful to ensure that you understand the way that your language handles the creation of Objects as discussed above. To follow up with the Java example above, you can do this without fear if you ensure that only private methods (and therefore not eligible for overriding by subclasses) are called from within a constructor.
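The third option could be sketched in Java roughly as follows. TitleParser and parseTitle are hypothetical names, and extracting a <title> element by string search stands in for real parsing. The helper is private (and static), so no subclass can override it, and bad input is reported as an IllegalArgumentException from the constructor.

```java
// Hypothetical class: the constructor delegates to a private helper, which
// subclasses cannot override, and rejects bad input with IllegalArgumentException.
class TitleParser {
    private final String title;

    TitleParser(String html) {
        this.title = parseTitle(html);
    }

    // Toy stand-in for real parsing: extracts the text between <title> tags.
    private static String parseTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("No <title> element found");
        }
        return html.substring(start + "<title>".length(), end);
    }

    String getTitle() { return title; }
}
```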

Answer 7:

Misko Hevery has a nice story on this subject, from a unit testing perspective, here.

Answer 8:

The constructor should create a valid object. If in your case that requires reading and parsing information, then so be it.

If the object can be used for other purposes without parsing the information first, then consider making two constructors, or a separate method.

Answer 9:

A constructor should set up the object to be used.

So whatever that is. That may include taking action on some data or just setting fields. It will vary from class to class.

In the case you are speaking of, an HTML parser, I would opt for creating the class and then calling a parseHtml method. The reason is that this gives you a future opportunity to set items in the class that affect how the HTML is parsed.

Answer 10:

In this particular case, I would say there are two classes here: a parser and a parse result.

public class Parser {
    public Parser() {
        // Do what is necessary to construct a parser.
        // Perhaps we need to initialize a Unicode library, UTF-8 decoder, etc
    }
    public ParseResult parseHTMLString(final String html) throws ParsingException
    {
        // Parser would do actual work here
        return new ParseResult(1, 2);
    }
}
public class ParseResult
{
    private int field1;
    private int field2;
    public ParseResult(int _field1, int _field2)
    {
        field1 = _field1;
        field2 = _field2;
    }
    public int getField1()
    {
        return field1;
    }
    public int getField2()
    {
        return field2;
    }
}

If your parser could work on partial sets of data, I'd suspect it would be suitable to add another class into the mix. Possibly a PartialParseResult?

Answer 11:

I think when you create a class ($obj = new class), the class should not affect the page at all, and should be relatively low processing.

For instance:

If you have a user class, it should be checking for incoming login/logout parameters, along with cookies, and assigning them to class variables.

If you have a database class, it should make the connection to the database so it is ready when you are going to start a query.

If you have a class that deals with a particular form, it should go get the form values.

In a lot of my classes, I check for certain parameters to define an 'action', like add, edit or delete.

All of these things don't really affect the page, so it wouldn't matter too much if you created them or not. They are simply ready for when you are going to call that first method.

Answer 12:

I would not do the parsing in the constructor. I would do everything necessary to validate the constructor parameters, and to ensure that the HTML can be parsed when needed.

But I'd have the accessor methods do the parsing if the HTML is not parsed by the time they need it to be. The parsing can wait until that time - it does not need to be done in the constructor.

Suggested code, for discussion purposes:

public class MyHtmlScraper {
    private TextReader _htmlFileReader;
    private bool _parsed;

    public MyHtmlScraper(string htmlFilePath) {
        _htmlFileReader = new StreamReader(htmlFilePath);
        // If done in the constructor, DoTheParse would be called here
    }

    private string _parsedValue1;
    public string Accessor1 {
        get {
            EnsureParsed();
            return _parsedValue1;
        }
    }

    private string _parsedValue2;
    public string Accessor2 {
        get {
            EnsureParsed();
            return _parsedValue2;
        }
    }

    private void EnsureParsed(){
        if (_parsed) return;
        DoTheParse();
        _parsed = true;
    }

    private void DoTheParse() {
        // parse the file here, using _htmlFileReader
        // parse into _parsedValue1, 2, etc.
    }
}

With this code in front of us, we can see there's very little difference between doing all the parsing in the constructor, and doing it on demand. There's a test of a boolean flag, and the setting of the flag, and the extra calls to EnsureParsed in each accessor. I'd be surprised if that extra code were not inlined.

This isn't a huge big deal, but my inclination is to do as little as possible in the constructor. That allows for scenarios where the construction needs to be fast. These will no doubt be situations you have not considered, like deserialization.

Again, it's not a huge big deal, but you can avoid doing the work in the constructor, and it's not expensive to do the work elsewhere. I admit, it's not like you're off doing network I/O in the constructor (unless, of course, a UNC file path is passed in), and you're not going to have to wait long in the constructor (unless there are networking problems, or you generalize the class to be able to read the HTML from places other than a file, some of which might be slow).

But since you don't have to do it in the constructor, my advice is simply - don't.

And if you do, it could be years before it causes an issue, if at all.

Answer 13:

Why not just pass the parser to the constructor? This would allow you to change the implementation without changing the model:

public interface IParser
{
    Dictionary<string, object> ParseDocument(string document);
}

public class HtmlParser : IParser
{
    // Properties, etc...

    public Dictionary<string, object> ParseDocument(string document){
         //Do what you need to, return the collection of properties
         return someDictionaryOfHtmlObjects;
    }
}

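public class HtmlScrapper
{
    private readonly IParser parser;
    private readonly string htmlDocument;
    private Dictionary<string, object> myDictionaryOfHtmlObjects;

    public HtmlScrapper(IParser parser, string htmlDocument){
         // Store the injected parser and the document for later parsing
         this.parser = parser;
         this.htmlDocument = htmlDocument;
    }

    public void ParseDocument(){
         this.myDictionaryOfHtmlObjects =
                  parser.ParseDocument(this.htmlDocument);
    }

}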

This should give you some flexibility in changing/improving how your application performs without needing to rewrite this class.

Answer 14:

In my case, the entire contents of the HTML file are passed through a String. The string is no longer required once it is parsed and is fairly large (a few hundred kilobytes). So it would be best to not keep it in memory. The object shouldn't be used for other cases. It was designed to parse a certain page. Parsing something else should prompt the creation of a different object to parse that.

It sounds very much as though your object isn't really a parser. Does it just wrap a call to a parser and present the results in a (presumably) more usable fashion? If so, you need to call the parser in the constructor, as your object would be in a non-useful state otherwise.

I'm not sure how the "object-oriented" part helps here. If there's only one object and it can only process one specific page then it's not clear why it needs to be an object. You could do this just as easily in procedural (i.e. non-OO) code.

For languages that only have objects (e.g. Java) you could just create a static method in a class that has no accessible constructor, and then invoke the parser and return all of the parsed values in a Map or similar collection.
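A bare-bones Java sketch of that static-method shape, with invented names (PageParsing, parse) and invented keys ("length", "hasBody"); the two values it computes are placeholders for whatever the real parser extracts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical utility class: no instances, one static entry point.
final class PageParsing {
    private PageParsing() { }   // no accessible constructor

    // Toy stand-in for real parsing: returns a couple of derived values in a Map.
    static Map<String, String> parse(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("length", String.valueOf(html.length()));
        values.put("hasBody", String.valueOf(html.contains("<body")));
        return values;
    }
}
```

The input string is not retained anywhere, which matches the asker's wish not to keep a few hundred kilobytes of HTML in memory after parsing.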

Answer 15:

A possible option is to move the parsing code to a separate function, make the constructor private, and have a static function parse( html ) that constructs the object and immediately calls the parse function.
This way you avoid the problems with parsing in the constructor (inconsistent state, problems when calling overridden functions, ...). But the client code still gets all the advantages (one call to get the parsed html or an 'early' error).
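In Java this could be sketched as below. ParsedPage, doParse and getLinkCount are hypothetical names, and counting occurrences of "<a " is a toy substitute for actual HTML parsing.

```java
// Hypothetical class with a private constructor and a static factory.
class ParsedPage {
    private int linkCount;

    private ParsedPage() { }   // clients cannot construct an unparsed instance

    // Static factory: construct, then parse. The caller sees a single call
    // and gets either a fully parsed object or an early error.
    static ParsedPage parse(String html) {
        if (html == null) {
            throw new IllegalArgumentException("html must not be null");
        }
        ParsedPage page = new ParsedPage();
        page.doParse(html);
        return page;
    }

    // Toy stand-in for real parsing: counts occurrences of "<a ".
    private void doParse(String html) {
        int count = 0;
        for (int i = html.indexOf("<a "); i >= 0; i = html.indexOf("<a ", i + 1)) {
            count++;
        }
        linkCount = count;
    }

    int getLinkCount() { return linkCount; }
}
```

Because doParse runs on a fully constructed object, the constructor-time hazards mentioned above do not apply, yet client code still writes one expression: ParsedPage.parse(html).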

Answer 16:

As quite a few have commented, the general rule is to only do initialization in constructors and never to call, say, virtual methods (you will get a compiler warning if you try; pay attention to that warning :) ). In your specific case I wouldn't go for the parseHtml method either. An object should be in a valid state when it's constructed; you shouldn't have to do stuff to the object before you can really use it.

Personally I'd go for a factory method: expose a class with no public constructors and create it using a factory method instead. Let your factory method do the parsing and pass the parsed result to a private/protected constructor.

Take a look at System.Web.WebRequest if you want to see a sample of some similar logic.

Answer 17:

I agree with the posters here arguing for minimal work in the constructor, really just putting the object into a non-zombie state, and then having verb functions like parseHTML().

One point I'd like to make, although I don't want to cause a flame war, is to consider the case of a non-exception environment. I know you're talking about C#, but I try to keep my programming models as similar as possible between C++ and C#. For various reasons, I don't use exceptions in C++ (think embedded video game programming); I use return code errors.

In this case, I can't throw exceptions in a constructor, so I tend to not have a constructor do anything which could fail. I leave that to the accessor functions.
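The answer is about C++, but the same shape can be sketched in Java to match the other examples here. Everything named below (ParseStatus, StatusParser, the specific status values and checks) is an assumption for illustration: the constructor cannot fail, and the fallible step reports a status code instead of throwing.

```java
// Hypothetical status codes returned instead of exceptions.
enum ParseStatus { OK, EMPTY_INPUT, MISSING_HTML_TAG }

// Hypothetical parser for an exception-free style of error handling.
class StatusParser {
    private boolean parsed = false;

    StatusParser() { }   // does nothing that could fail

    // The fallible work lives in a method that reports a status code.
    ParseStatus parseHtml(String html) {
        if (html == null || html.isEmpty()) {
            return ParseStatus.EMPTY_INPUT;
        }
        if (!html.contains("<html")) {
            return ParseStatus.MISSING_HTML_TAG;
        }
        parsed = true;   // toy stand-in for storing parsed values
        return ParseStatus.OK;
    }

    boolean isParsed() { return parsed; }
}
```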

Answer 18:

In general, a constructor should:

Initialize all the fields.
Leave the resulting object in a valid state.

However, I would not use the constructor in the way you have. Parsing should be separated from using the parsing results.

Generally when I write a parser I write it as a singleton. I don't store any fields in the object except the single instance; instead, I only use local variables within the methods. Theoretically these could just be static (class-level) methods, but that would mean that I couldn't make them virtual.

Answer 19:

I personally put nothing in constructors and have a set of initialization functions. I find standard constructor methods have limited and cumbersome reuse.