How to do LDA in R
My task is to apply LDA to a dataset of Amazon reviews and get 50 topics.
I have extracted the review text into a vector and am now trying to apply LDA.
I have created the DTM:
    matrix <- create_matrix(dat, language="english", removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE)

    <<DocumentTermMatrix (documents: 100000, terms: 174632)>>
    Non-/sparse entries: 4096244/17459103756
    Sparsity           : 100%
    Maximal term length: 218
    Weighting          : term frequency (tf)
But when I try the following, I get an error:

    lda <- LDA(matrix, 30)
    Error in LDA(matrix, 30) : Each row of the input matrix needs to contain at least one non-zero entry
I searched for solutions and tried using slam:

    matrix1 <- rollup(matrix, 2, na.rm=TRUE, FUN = sum)

I am still getting the same error.
I am new to this. Can anyone help me, or suggest a reference where I can study this? It would be very helpful.
There are no empty rows in my original data; it contains only one column, which holds the reviews.
I have been assigned a similar task and am learning as I go. I have put together the code snippet below and am sharing it in the hope that it helps.
    library("topicmodels")
    library("tm")

    # `input` is not used yet; the sample documents are hard-coded in x
    func <- function(input) {
      x <- c("I eat broccoli and bananas.",
             "I ate banana and spinach smoothie breakfast.",
             "Chinchillas and kittens cute.",
             "My sister adopted kitten yesterday.",
             "Look at cute hamster munching on piece of broccoli.")

      # whole file lowercased
      # text <- tolower(x)
      # deleting common words from the text
      # text2 <- setdiff(text, stopwords("english"))
      # splitting the text into vectors, each vector a word
      # text3 <- strsplit(text2, " ")

      # generating structured text, i.e. a corpus
      docs <- Corpus(VectorSource(x))

      # creating content transformers, i.e. functions used to modify objects in R
      toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

      # removing special characters
      docs <- tm_map(docs, toSpace, "/")
      docs <- tm_map(docs, toSpace, "@")
      docs <- tm_map(docs, toSpace, "\\|")
      docs <- tm_map(docs, removeNumbers)
      # remove common English stopwords
      docs <- tm_map(docs, removeWords, stopwords("english"))
      # remove punctuation
      docs <- tm_map(docs, removePunctuation)
      # eliminate extra white space; stripWhitespace already handles tabs and
      # repeated spaces, so do not pass " " to removeWords (that would join
      # neighbouring words)
      docs <- tm_map(docs, stripWhitespace)

      # documents as rows, terms as columns: the layout LDA() expects
      dtm <- DocumentTermMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
      # print(dtm)
      freq <- colSums(as.matrix(dtm))
      print(names(freq))
      ord <- order(freq, decreasing = TRUE)
      write.csv(freq[ord], "word_freq.csv")

      # setting parameters for LDA
      burnin <- 4000
      iter <- 2000
      thin <- 500
      seed <- list(2003, 5, 63, 100001, 765)
      nstart <- 5
      best <- TRUE
      # number of topics
      k <- 3

      # docs to topics
      ldaout <- LDA(dtm, k, method = "Gibbs",
                    control = list(nstart = nstart, seed = seed, best = best,
                                   burnin = burnin, iter = iter, thin = thin))
      ldaout.topics <- as.matrix(topics(ldaout))
      write.csv(ldaout.topics, file = paste("ldagibbs", k, "docstotopics.csv"))
    }
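As for the error in the question itself: one likely cause (an assumption, since I cannot see the data) is that some reviews contain nothing but stopwords or very short words, so their rows in the DTM become all zeros after preprocessing even though the original text was not empty. As far as I can tell, rollup(matrix, 2, FUN = sum) only computes the per-document totals and does not drop anything, so the totals still have to be used to filter the matrix. Below is a minimal sketch of that filtering on a made-up toy corpus; the names reviews, row_totals, dtm_nonempty and kept_reviews are mine, not from the question.

```r
library(tm)
library(slam)
library(topicmodels)

# Hypothetical toy data: the second review consists only of stopwords,
# so it has no terms left once the DTM is built.
reviews <- c("great phone battery lasts long",
             "the and of it",
             "terrible battery phone died quickly",
             "great camera great screen")

dtm <- DocumentTermMatrix(Corpus(VectorSource(reviews)),
                          control = list(stopwords = TRUE))

# row_sums() from slam works on the sparse matrix directly,
# so the 100000 x 174632 matrix never has to be made dense.
row_totals <- row_sums(dtm)
empty_docs <- which(row_totals == 0)   # indices of documents with no terms

# Keep only documents with at least one term, and keep the text aligned.
dtm_nonempty <- dtm[row_totals > 0, ]
kept_reviews <- reviews[row_totals > 0]

# LDA() now receives a matrix without all-zero rows.
lda_fit <- LDA(dtm_nonempty, k = 2, control = list(seed = 1234))
```

Keeping empty_docs around is useful because the topic assignments returned by topics(lda_fit) refer to the filtered documents, not to the original row numbers.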