Gather info from XHTML in Java: parser + visitor?

Posted 2019-09-06 18:18

Question:

I have to write a piece of code that loads a remote web page, searches it for links, visits those pages, and gathers some info from certain tags...

How would you do this? Is the visitor pattern of any help here? If so, how could I use it?

Thanks

Answer 1:

Some comments/suggestions

  • Not sure the visitor pattern is a good fit here. A typical scenario for the visitor pattern is when the algorithm for an operation differs depending on the type of object it is applied to.
  • The crude way to solve this is to embed the algorithm in the object itself, but this mixes data and operations (against the spirit of separation of concerns).
  • The visitor pattern helps here by separating the algorithm from the data it is applied to.
  • Please check out an example for a better understanding of the visitor pattern.
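
To make the points above concrete, here is a minimal sketch of the visitor pattern. The element types (Link, Heading) and the DescribeVisitor operation are hypothetical names invented for illustration; they are not from the question. The operation lives in the visitor and has a different implementation per element type, while the element classes only hold data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical element types, invented for illustration only.
interface PageElement {
    <R> R accept(ElementVisitor<R> visitor);
}

// One method per concrete element type.
interface ElementVisitor<R> {
    R visitLink(Link link);
    R visitHeading(Heading heading);
}

final class Link implements PageElement {
    final String href;
    Link(String href) { this.href = href; }
    @Override public <R> R accept(ElementVisitor<R> visitor) { return visitor.visitLink(this); }
}

final class Heading implements PageElement {
    final String text;
    Heading(String text) { this.text = text; }
    @Override public <R> R accept(ElementVisitor<R> visitor) { return visitor.visitHeading(this); }
}

// A single operation whose algorithm differs per element type.
final class DescribeVisitor implements ElementVisitor<String> {
    @Override public String visitLink(Link link) { return "link -> " + link.href; }
    @Override public String visitHeading(Heading heading) { return "heading: " + heading.text; }
}

final class VisitorDemo {
    // Applies the visitor to each element; dispatch happens inside accept().
    static List<String> describeAll(List<PageElement> elements) {
        List<String> out = new ArrayList<>();
        ElementVisitor<String> visitor = new DescribeVisitor();
        for (PageElement element : elements) {
            out.add(element.accept(visitor));
        }
        return out;
    }

    public static void main(String[] args) {
        List<PageElement> page =
                Arrays.asList(new Heading("News"), new Link("http://example.com/a"));
        describeAll(page).forEach(System.out::println);
    }
}
```

Adding a second operation means writing another ElementVisitor implementation, without touching Link or Heading. That is the payoff of the pattern, and it only matters when the behaviour really varies by element type.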

In your case

  • The objects are the web pages and links, and the operations are visit, parse, and extract information.
  • The same set of operations is applied to all the web pages and links.
  • So the operation algorithm does not change across different web pages and links, and hence the visitor pattern is not suitable.
  • Technically you can still use the visitor pattern, but that's not what it is for.

For your problem,

  • I think it's not a very complicated design problem. Some patterns might seem to fit, such as the Command pattern (commands: extractLinkFromPage, visitLinkAndParseTags), but IMO that would be overkill for this simple problem.
  • I would suggest simply hosting the logic in a utility class and using it from your calling program:
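 import java.util.List;

 class WebUtility {
     // Returns the links found on the page at remotePageAddress.
     public List<String> parseLinks(String remotePageAddress) {
         // Parse links
         throw new UnsupportedOperationException("not implemented yet");
     }

     // Returns the information gathered from the tags of interest on pageURL.
     public TagInfo extractTagInfo(String pageURL) {
         // Extract the tag information
         throw new UnsupportedOperationException("not implemented yet");
     }
 }

Here the TagInfo class is a POJO shaped to your requirements.

This class is stateless and can be used as a singleton. Optionally, you can make the constructor private and the methods static.

Once you have this, you can invoke parseLinks to get the links and then loop through that list, invoking extractTagInfo on each link to get its tag information.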
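
The flow described above could be sketched roughly as follows. This is only an illustration under several assumptions: WebUtilitySketch, gather, and the fetcher parameter are hypothetical names; the "tag information" is reduced to the page title; and regular expressions stand in for a real parser just to keep the example limited to the standard library. For real XHTML, an XML parser or an HTML library such as jsoup would be more robust. Pages are served from an in-memory map so the sketch runs without network access.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stateless helper: private constructor, static methods only.
final class WebUtilitySketch {
    // Simplified patterns; they assume well-formed, double-quoted attributes.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private WebUtilitySketch() { }

    // Returns every href value found in anchor tags, in document order.
    static List<String> parseLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the trimmed <title> text, or an empty string if there is none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Fetches the start page, then each linked page, and maps link -> title.
    // 'fetcher' is a hypothetical hook that turns a URL into page source;
    // it is assumed to return non-null source for every URL it is given.
    static Map<String, String> gather(String startUrl, Function<String, String> fetcher) {
        Map<String, String> titles = new LinkedHashMap<>();
        for (String link : parseLinks(fetcher.apply(startUrl))) {
            titles.put(link, extractTitle(fetcher.apply(link)));
        }
        return titles;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the network, so the sketch runs offline.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("start",
                "<html><body><a href=\"p1\">one</a> <a class=\"x\" href=\"p2\">two</a></body></html>");
        pages.put("p1", "<html><head><title>First</title></head></html>");
        pages.put("p2", "<html><head><title> Second </title></head></html>");
        System.out.println(gather("start", pages::get));
    }
}
```

Passing the fetcher in as a function keeps the parsing logic testable offline. In a real program it would be replaced by code that downloads the page, and relative links would need to be resolved against the start URL before being fetched.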