Exclude special characters from crawling

Posted 2019-08-31 12:41

Question:

I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I stop the crawler from crawling/indexing special characters such as � � � � � ��� �� � •?

Answer 1:

An easy way to do this is to write a ParseFilter whose filter method does something like:

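        // 'parse' is the ParseResult and 'URL' the document URL passed to the filter
        ParseData pd = parse.get(URL);
        String text = pd.getText();
        if (text != null) {
            // remove chars, e.g. U+FFFD (replacement character) and U+2022 (bullet)
            text = text.replace("\uFFFD", "").replace("\u2022", "");
            pd.setText(text);
        }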

This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples.
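
The character-removal step in the snippet above could be factored into a small standalone helper. The following is only a minimal sketch: the class name `SpecialCharCleaner` and its `clean` method are hypothetical and not part of StormCrawler, and the set of unwanted characters (U+FFFD and U+2022) is an assumption based on the characters shown in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of StormCrawler.
final class SpecialCharCleaner {
    // Assumed set of unwanted characters: U+FFFD (replacement character)
    // and U+2022 (bullet). Extend the character class as needed.
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");
    // Runs of two or more spaces that removal may leave behind.
    private static final Pattern SPACES = Pattern.compile(" {2,}");

    private SpecialCharCleaner() {}

    // Returns the text without the unwanted characters, with runs of
    // spaces collapsed and leading/trailing whitespace trimmed.
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        String stripped = UNWANTED.matcher(text).replaceAll("");
        return SPACES.matcher(stripped).replaceAll(" ").trim();
    }
}
```

Inside the ParseFilter, the call would then be `pd.setText(SpecialCharCleaner.clean(pd.getText()));`. In a default StormCrawler setup the custom filter presumably also has to be declared in the parse filter configuration (typically `parsefilters.json`) before it is invoked.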