“language_model_penalty_non_dict_word” has no effect

Posted 2019-05-04 00:33

Question:

I'm setting language_model_penalty_non_dict_word through a config file for Tesseract 3.01, but its value has no effect. I've tried multiple images and multiple values, and the output for each image is always the same. Another user has noticed the same thing in a comment on another question.

Edit: After looking inside the source, the variable language_model_penalty_non_dict_word is used only inside the function float LanguageModel::ComputeAdjustedPathCost.

However, this function is never called! It is referenced by only two functions: LanguageModel::UpdateBestChoice() and LanguageModel::AddViterbiStateEntry(). I placed breakpoints in both, and they weren't being hit either.

Answer 1:

After some debugging, I finally found the reason: the function Wordrec::SegSearch() wasn't being called, and it sits above LanguageModel::ComputeAdjustedPathCost() in the call graph.

From this code:

  if (enable_new_segsearch) {
    SegSearch(&chunks_record, word->best_choice,
              best_char_choices, word->raw_choice, state);
  } else {
    best_first_search(&chunks_record, best_char_choices, word,
                      state, fixpt, best_state);
  }
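To make the effect of that branch concrete, here is a toy model of it. This is only a sketch: ToyParams, ToySegSearchCost, ToyBestFirstCost and ToyWordCost are hypothetical names, and the cost formula is invented for illustration. None of it is Tesseract's real code; it only mirrors the shape of the if/else above.

```cpp
// Toy model of a flag-gated search path. All names and the cost formula
// are hypothetical; this is NOT how Tesseract computes path costs.
struct ToyParams {
  bool enable_new_segsearch = false;                  // off unless set
  double language_model_penalty_non_dict_word = 0.1;  // arbitrary value
};

// Stand-in for the new search path: the only place the penalty is read.
double ToySegSearchCost(const ToyParams& p, double base_cost, bool in_dict) {
  if (in_dict) return base_cost;
  return base_cost * (1.0 + p.language_model_penalty_non_dict_word);
}

// Stand-in for the old search path: it never looks at the penalty.
double ToyBestFirstCost(double base_cost) { return base_cost; }

// Mirrors the branch quoted above: the flag decides which path runs.
double ToyWordCost(const ToyParams& p, double base_cost, bool in_dict) {
  if (p.enable_new_segsearch) {
    return ToySegSearchCost(p, base_cost, in_dict);
  }
  return ToyBestFirstCost(base_cost);
}
```

With the flag off, ToyWordCost returns the same value for any penalty, which matches the symptom in the question. With the flag on, changing the penalty changes the cost of a non-dictionary word.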

So you need to set enable_new_segsearch in the config file:

enable_new_segsearch    1

language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3
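If you want to catch this mistake before running OCR, a small check over the config text might help. The sketch below is not part of Tesseract: ParseToyConfig and PenaltiesWouldBeIgnored are hypothetical helpers, the parser only handles plain "name value" lines like the ones above, and it treats only the value 1 as "on". It also assumes every language_model_* parameter depends on the new search path, which this thread only confirms for language_model_penalty_non_dict_word.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parse whitespace-separated "name value" lines.
// Blank or malformed lines are skipped; nothing else is handled.
std::map<std::string, std::string> ParseToyConfig(const std::string& text) {
  std::map<std::string, std::string> params;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string name, value;
    if (fields >> name >> value) params[name] = value;
  }
  return params;
}

// Hypothetical check: true when some language_model_* parameter is set
// but enable_new_segsearch is missing or not "1" (a simplification).
bool PenaltiesWouldBeIgnored(
    const std::map<std::string, std::string>& params) {
  bool has_lm_param = false;
  for (const auto& kv : params) {
    // rfind(prefix, 0) == 0 is a "starts with" test.
    if (kv.first.rfind("language_model_", 0) == 0) has_lm_param = true;
  }
  auto it = params.find("enable_new_segsearch");
  bool new_search_on = (it != params.end() && it->second == "1");
  return has_lm_param && !new_search_on;
}
```

Feeding it the three config lines above should make PenaltiesWouldBeIgnored return false. Dropping the enable_new_segsearch line makes it return true.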