I'm using text classification with naive Bayes and countVectorizer to classify dialects. I read a research paper that the author has used a combination of :
bigrams + trigrams + word-marks vocabulary
He means by word-marks here, the words that are specific to a certain dialect.
How can I tweak those parameters in countVectorizer?
word marks
So those are examples of word marks, but it isn't what I have, because mine are arabic. So I translated them.
word_marks=['love', 'funny', 'happy', 'amazing']
Those are used to classify a text.
Also, in the this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn
There was this answer :
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
I couldn't understand the output, what does [1,1] mean here? and how was he able to use ngram with vocabulary? aren't both of them mutually exclusive?