Remove standard english language stop words in Sta

I am using Stanford Topic Modeling Toolbox 0.4.0 for LDA, I noticed that if I want to remove standard english language stop words, I can use a StopWordFilter("en") as the last step the tokenizer, but how do I use it?

import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
  StopWordFilter("en")                   // how to use it? it's not working.
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);

// define the model parameters
val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);

// Name of the output model folder to generate
val modelPath = file("lda-"+dataset.signature+"-"+params.signature);

// Trains the model: the model (and intermediate models) are written to the
// output folder.  If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainCVB0LDA(params, dataset, output=modelPath, maxIterations=1000);

// To use the Gibbs sampler for inference, instead use
// TrainGibbsLDA(params, dataset, output=modelPath, maxIterations=1500);

标签： scala lda stanford-nlp

3条回答

祖国的老花朵

2楼-- · 2019-06-05 23:14

If each line of the stopwords.txt (e.g. download from http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop) contains a stop word you can use

val text = {
  source ~>                              // read from the source file
  Column(3) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermStopListFilter(scala.io.Source.fromFile("stopwords.txt").getLines().toList) ~> 
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

0人赞添加讨论(0) 举报

相关推荐>>

3楼-- · 2019-06-05 23:29

I think it should be in val text:

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  StopWordFilter("en") ~>
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

You can also use your own stoplist, by putting the following instead of StopWordFilter("en") ~>

  TermStopListFilter(List("a", "and", "but", "if")) ~>

And just add within this list all the chosen stopwords.

0人赞添加讨论(0) 举报

甜甜的少女心

4楼-- · 2019-06-05 23:37

I'm not an expert but I guess you're supposed to do

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3) ~>              // take terms with >=3 characters
  new StopWordFilter("en")               // remove stop words for the english language
}

0人赞添加讨论(0) 举报

Remove standard english language stop words in Sta

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间