I have a corpus of journal data with 15 observations of 3 variables (ID, title, abstract). Using RStudio, I read the data in from a .csv file (one line per observation). While performing some text mining operations with the tm package, I ran into trouble with the method stemCompletion: after applying it, the result for each stemmed line of the .csv is returned three times. All the other tm methods (e.g. stemDocument) produce only a single result. I'm wondering why this happens and how I can fix it.
I used the code below:
data.corpus <- Corpus(DataframeSource(data))
data.corpuscopy <- data.corpus
data.corpus <- tm_map(data.corpus, stemDocument)
data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy)
After applying stemDocument, each document yields a single result, e.g.:
"> data.corpus[[1]]
physic environ sourc innov investig attribut innov space
investig physic space intersect innov innov relev attribut physic space innov reflect chang natur innov technolog advanc servic mean chang argu develop innov space similar embodi divers set valu collabor open sustain use literatur review interview benchmark examin relationship physic environ innov literatur review interview underlin innov communic human centr process result five attribut innov space present collabor enabl modifi smart attract reflect provid perspect challeng support innov creation develop physic space add conceptu develop innov space outlin physic space innov servic"
After applying stemCompletion, the result appears three times:
"$`1`
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service"
Below is a sample that serves as a reproducible example:
A .csv file containing three observations of three variables:
ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations
And below is the stemming code that I used:
data <- read.csv2("Test.csv")
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])
corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
inspect(corpus[1:3])
It seems to me that the number of repetitions depends on the number of variables (columns) in the .csv, but I have no idea why.