Posted 2019-04-07 04:53
我想做一个坏孩纸
I want to know how to parse a robots.txt file in Java.
Is there any existing code or library for this?
There is also a new release of crawler-commons:
https://github.com/crawler-commons/crawler-commons
The library aims to implement functionality common to any web crawler, including a very handy robots.txt parser.
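If you only want to see what such a parser has to do, here is a minimal, dependency-free sketch. It is not the crawler-commons API (nor Heritrix's or jrobotx's); the class name SimpleRobotsTxt and its methods are hypothetical, made up for illustration. It assumes plain prefix matching with "longest match wins", merges the "*" group with any group naming your agent, and ignores wildcards, Crawl-delay and Sitemap lines, so prefer one of the libraries for real crawling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of any library mentioned here. */
final class SimpleRobotsTxt {
    private final List<String> disallowed = new ArrayList<>();
    private final List<String> allowed = new ArrayList<>();

    /** Collects Allow/Disallow prefixes from every group that names "*" or the given agent. */
    static SimpleRobotsTxt parse(String content, String agent) {
        SimpleRobotsTxt rules = new SimpleRobotsTxt();
        String agentLower = agent.toLowerCase(Locale.ROOT);
        boolean groupApplies = false;  // does the current group match our agent?
        boolean readingAgents = false; // inside a run of consecutive User-agent lines?
        for (String rawLine : content.split("\\r?\\n")) {
            // Strip trailing comments, then split "field: value".
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!readingAgents) { // first User-agent line starts a new group
                    groupApplies = false;
                    readingAgents = true;
                }
                String token = value.toLowerCase(Locale.ROOT);
                if (token.equals("*") || (!token.isEmpty() && agentLower.contains(token))) {
                    groupApplies = true;
                }
            } else {
                readingAgents = false;
                if (!groupApplies || value.isEmpty()) {
                    continue; // an empty Disallow restricts nothing
                }
                if (field.equals("disallow")) {
                    rules.disallowed.add(value);
                } else if (field.equals("allow")) {
                    rules.allowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Longest matching prefix wins; ties go to Allow; no matching rule means allowed. */
    boolean isAllowed(String path) {
        return longestPrefix(allowed, path) >= longestPrefix(disallowed, path);
    }

    private static int longestPrefix(List<String> prefixes, String path) {
        int best = -1;
        for (String prefix : prefixes) {
            if (path.startsWith(prefix) && prefix.length() > best) {
                best = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n";
        SimpleRobotsTxt rules = SimpleRobotsTxt.parse(robots, "mybot");
        System.out.println(rules.isAllowed("/private/secret.html"));
        System.out.println(rules.isAllowed("/private/public/page.html"));
    }
}
```

You would fetch /robots.txt yourself, pass its text and your crawler's agent name to parse, and call isAllowed with the path part of each URL before requesting it.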
Heritrix is an open-source web crawler written in Java. Looking through their javadoc, I see that they have a utility class Robotstxt for parsing the robots.txt file.
There's also the jrobotx library, hosted at SourceForge.
(Full disclosure: I spun off the code that forms that library.)