I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I stop the crawler from indexing special characters such as � and •?
Answer 1:
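An easy way to do this is to write a ParseFilter whose filter method does something like
ParseData pd = parse.get(URL);
String text = pd.getText();
// remove the unwanted chars, e.g. U+FFFD (replacement character) and U+2022 (bullet)
text = text.replaceAll("[\\uFFFD\\u2022]", "");
pd.setText(text);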
The filter is called on documents parsed by either JSoup or Tika. Have a look at the parse filters in the StormCrawler repository for examples.
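The "remove chars" step could be sketched as a small standalone helper, so it can be tested without a running topology. This is only an illustration: the class name TextCleaner, the method name clean, and the choice of characters to strip (U+FFFD and U+2022) are assumptions, not part of StormCrawler. Adjust the pattern to whatever characters show up in your index.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a StormCrawler class): strips unwanted characters
// from extracted text before it is indexed.
final class TextCleaner {

    // Assumed set of characters to drop: U+FFFD (replacement character,
    // usually the result of a decoding error) and U+2022 (bullet).
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");

    // Runs of spaces or tabs left behind once the characters are removed.
    private static final Pattern EXTRA_SPACES = Pattern.compile("[ \\t]{2,}");

    private TextCleaner() {
    }

    // Returns the text without the unwanted characters; null is passed through.
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        String stripped = UNWANTED.matcher(text).replaceAll("");
        // Collapse the gaps that removal leaves, then trim the ends.
        return EXTRA_SPACES.matcher(stripped).replaceAll(" ").trim();
    }
}
```

Inside the ParseFilter, the call would then be pd.setText(TextCleaner.clean(pd.getText())). Keeping the cleaning logic separate from the filter makes it easy to extend the pattern later without touching the crawler wiring.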