Creating a bot/crawler

Posted 2019-04-02 07:26

I would like to make a small bot in order to automatically and periodically browse a few partner websites. This would save several hours for a lot of employees here.

The bot must be able to:

  • connect to these websites, log itself in as a user on some of them, and access and parse a particular piece of information on each site;
  • be integrated into our website, with its settings (which user to use…) driven by data from our website, and eventually sum up the parsed information;
  • preferably do all of this on the client side, not on the server.

I tried Dart last month and loved it… I would like to do this in Dart.

But I am a bit lost: can I use a Document class object for each website I want to parse? Can this be done headless, or should I use the Chrome/Dartium API to control the web browser (I'd like to avoid this)?

I've been reading this thread: https://groups.google.com/a/dartlang.org/forum/?fromgroups=#!searchin/misc/crawler/misc/TkUYKZXjoEg/Lj5uoH3vPgIJ Is using https://github.com/dart-lang/html5lib a good idea for my case?

1 Answer

聊天终结者 · 2019-04-02 07:42

There are two parts to this.

  1. Get the page from the remote site.
  2. Read the page into a class that you can parse.

For the first part, if you are planning on running this client-side, you are likely to run into cross-site issues: your page, served from server X, cannot request pages from server Y unless the correct headers are set.

See CORS with Dart, how do I get it to work? and Dart application and cross domain policy; otherwise, the site in question needs to return the correct CORS headers.
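
For illustration, here is a minimal sketch of the kind of headers the partner site would need to return, assuming (purely hypothetically) that it were a Dart dart:io HttpServer; the origin value, port, and response body are placeholders:

import 'dart:io';

Future<void> main() async {
  // Hypothetical partner server; the point is only the CORS headers it returns.
  final server = await HttpServer.bind(InternetAddress.anyIPv4, 8080);
  await for (final HttpRequest request in server) {
    request.response.headers
      // Allow the browser page served from your site to read this response.
      ..set('Access-Control-Allow-Origin', 'https://your-site.example') // placeholder origin
      ..set('Access-Control-Allow-Credentials', 'true'); // only needed if cookies are sent
    request.response.write('<html><body>partner data</body></html>');
    await request.response.close();
  }
}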

Assuming that you can actually get the pages from the remote site client-side, you can use HttpRequest to retrieve the actual content:

import 'dart:html';

void main() {
  // Fetch the page as text, then process the response once the request completes.
  HttpRequest.getString("http://www.example.com").then((responseText) {
    // process the responseText
  });
}

You can also pass withCredentials: true so that cookies are sent along with the request. If the site has some custom login, then you will probably run into problems, as you will likely have to HTTP POST the username and password from your site to their server.
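
If you do control that login step, a hedged sketch of such a POST using HttpRequest.postFormData is shown below; the URL, field names, and credentials are placeholders, and the partner site still has to allow the cross-origin request:

import 'dart:html';

void main() {
  // Hypothetical login endpoint and form field names; adjust to the partner site.
  HttpRequest.postFormData(
    "https://partner.example/login", // placeholder URL
    {"username": "bot-user", "password": "secret"}, // placeholder credentials
    withCredentials: true, // so the session cookie is kept for later requests
  ).then((req) {
    print("Login returned HTTP ${req.status}");
  });
}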

This is where the second part comes in. You can process the HTML using the DocumentFragment.html(...) constructor, which gives you a nodes collection that you can iterate and recurse through. The example below shows this for a static block of HTML, but you could use the data returned from the HttpRequest above.

import 'dart:html';

void main() {
  var d = new DocumentFragment.html("""
    <html>
      <head></head>
      <body>Foo</body>
    </html>
  """);

  // Print the text content of the top-level nodes.
  d.nodes.forEach((node) => print(node.text)); // prints "Foo"
  // In the real world, use recursion to walk down the hierarchy.
}

I'm guessing (not having written a spider before) that you'll want to pull out specific tags at specific locations/depths to sum up as your results, and also add the URLs found in <a> hyperlinks to a queue that your bot will then crawl.
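
A rough sketch of both ideas, assuming the HTML has already been loaded into a DocumentFragment as above (the markup, selector, and queue handling are illustrative only):

import 'dart:html';

// Recursively walk a node and its children, printing the text at each depth.
void walk(Node node, [int depth = 0]) {
  print("${'  ' * depth}${node.text?.trim()}");
  for (final child in node.nodes) {
    walk(child, depth + 1);
  }
}

void main() {
  var d = new DocumentFragment.html("""
    <div><p>Foo</p><a href="page2.html">next page</a></div>
  """);

  // Recurse down the hierarchy instead of only looking at the top level.
  d.nodes.forEach(walk);

  // Collect hyperlink targets into a queue of URLs for the bot to visit next.
  var queue = <String>[];
  for (final a in d.querySelectorAll('a')) {
    var href = a.getAttribute('href');
    if (href != null) queue.add(href);
  }
  print(queue); // e.g. [page2.html]
}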
