R: tm - Create a table/matrix of term association frequencies and add values to dendrogram
I have a corpus vector of short sentences (n > 50), e.g.:

    corpus <- c("looking in r","check whether milk sour or not", "random sentence dubious meaning")

I am able to print a dendrogram:

    fit <- hclust(d, method="ward")
    plot(fit, hang=-1)
    groups <- cutree(fit, k=nc)            # "k=" defines the number of clusters you are using
    rect.hclust(fit, k=nc, border="red")   # draw dendrogram with red borders around the 5 clusters

and a correlation matrix:

    cor_1 <- cor(as.matrix(dtms))
    corrplot(cor_1, method = "number")
As far as I have understood - please correct me here if I am wrong - findAssocs(), i.e. the correlation, checks whether two terms appear in the same document?

Goal: I don't want to see the correlation, but the frequency with which two terms appear in the same document, not necessarily adjacent to each other (so a BigramTokenizer won't work). Example: term A and term B appear together in 5 different documents across the corpus, regardless of the distance between them.

Ideally, I want to create a frequency matrix similar to the one above, and add the frequencies to the dendrogram if possible (akin to how pvclust() prints numbers).

Any ideas on how to achieve this?
I think you are asking how to get a co-occurrence matrix of terms, where each cell is the number of documents in which one term occurs together with another. You can accomplish this magic using the matrix cross-product of the transpose of the matrix with the matrix itself, after converting the matrix of document-term frequencies to Boolean values indicating whether a term occurred in a document.

(I've used the quanteda package here instead of tm, but a similar approach will work with a DocumentTermMatrix object from tm.)

    # create demonstration documents
    (txts <- c(paste(letters[c(1, 1:3)], collapse = " "),
               paste(letters[c(1, 3, 5)], collapse = " "),
               paste(letters[c(5, 6, 7)], collapse = " ")))
    ## [1] "a a b c" "a c e"   "e f g"

    # convert to a document-term matrix
    require(quanteda)
    dtm <- dfm(txts, verbose = FALSE)
    dtm
    ## Document-feature matrix of: 3 documents, 6 features.
    ## 3 x 6 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    a b c e f g
    ##   text1 2 1 1 0 0 0
    ##   text2 1 0 1 1 0 0
    ##   text3 0 0 0 1 1 1

    # convert to a matrix of co-occurrences rather than counts
    (dtm <- tf(dtm, "boolean"))
    ## Document-feature matrix of: 3 documents, 6 features.
    ## 3 x 6 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    a b c e f g
    ##   text1 1 1 1 0 0 0
    ##   text2 1 0 1 1 0 0
    ##   text3 0 0 0 1 1 1

    # now get the "feature in document" co-occurrence matrix
    t(dtm) %*% dtm
    ## 6 x 6 sparse Matrix of class "dgCMatrix"
    ##   a b c e f g
    ## a 2 1 2 1 . .
    ## b 1 1 1 . . .
    ## c 2 1 2 1 . .
    ## e 1 . 1 2 1 1
    ## f . . . 1 1 1
    ## g . . . 1 1 1

Note: This setup counts a term as "co-occurring" once with itself in each document in which it appears (e.g. b). If you want to change that, just replace the diagonal with the diagonal minus one.
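
For readers without quanteda installed, the same cross-product idea can be sketched in base R on an ordinary numeric matrix. This is only an illustration under stated assumptions: the `counts` matrix below is typed in by hand to mirror the demonstration documents, and the names `counts`, `incidence`, `cooc` and `cooc_adj` are made up for this example. With tm, a matrix of this shape would presumably come from `as.matrix()` on the DocumentTermMatrix, as in the question's `cor()` call.

```r
# Hand-typed document-term counts (rows = documents, columns = terms),
# mirroring the demonstration documents "a a b c", "a c e", "e f g".
counts <- matrix(
  c(2, 1, 1, 0, 0, 0,
    1, 0, 1, 1, 0, 0,
    0, 0, 0, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = paste0("text", 1:3),
                  features = c("a", "b", "c", "e", "f", "g"))
)

# Boolean incidence: 1 if the term occurs in the document at all, else 0
incidence <- (counts > 0) * 1

# Term-by-term co-occurrence: cell [i, j] is the number of documents
# that contain both term i and term j.
# crossprod(x) computes t(x) %*% x.
cooc <- crossprod(incidence)

# Optional: the diagonal adjustment described in the note (diagonal minus one)
cooc_adj <- cooc
diag(cooc_adj) <- diag(cooc_adj) - 1

cooc["a", "c"]   # 2: "a" and "c" appear together in text1 and text2
```

Because the counts are reduced to 0/1 before multiplying, a term repeated within one document (such as "a" in text1) still contributes only one co-occurrence for that document.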