R :: tm - Create a table/matrix of term association frequencies and add values to dendrogram


I have a corpus vector of short sentences (n > 50), e.g.:

corpus <- c("looking in r","check whether milk sour or not", "random sentence dubious meaning") 

I am able to print a dendrogram:

fit <- hclust(d, method="ward")
plot(fit, hang=-1)
groups <- cutree(fit, k=nc)           # "k=" defines the number of clusters you are using
rect.hclust(fit, k=nc, border="red")  # draw the dendrogram with red borders around the 5 clusters

and a correlation matrix:

cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")

As far as I have understood (please correct me here if I am wrong), findAssocs(), i.e. the correlation, checks whether two terms appear in the same document?

Goal: I don't want to see the correlation, but the frequency with which two terms appear in the same document, not necessarily adjacent to each other (so a BigramTokenizer won't work). Example: term A and term B appear together in 5 different documents across the corpus, regardless of the distance between them.

Ideally I want to create a frequency matrix similar to the one above, and add the frequencies to the dendrogram if possible (akin to pvclust(), which prints numbers).


Any ideas on how to achieve this?

I think you are asking how to get a co-occurrence matrix of terms, where the cells are the number of documents in which one term co-occurs with another. You can accomplish this magic using the matrix cross-product of the transpose of the matrix with itself, after converting the matrix of document-term frequencies into Boolean values indicating whether a term occurred in a document.

(I've used the quanteda package here instead of tm, but a similar approach will work with a DocumentTermMatrix object from tm.)

# create demonstration documents
(txts <- c(paste(letters[c(1, 1:3)], collapse = " "),
           paste(letters[c(1, 3, 5)], collapse = " "),
           paste(letters[c(5, 6, 7)], collapse = " ")))
## [1] "a a b c" "a c e"   "e f g"

# convert to a document-term matrix
# (recent quanteda releases expect tokens first: dfm(tokens(txts)))
library(quanteda)
dtm <- dfm(txts, verbose = FALSE)
dtm
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c e f g
##   text1 2 1 1 0 0 0
##   text2 1 0 1 1 0 0
##   text3 0 0 0 1 1 1

# convert to a matrix of co-occurrences rather than counts
# (recent quanteda releases: dfm_weight(dtm, scheme = "boolean"))
(dtm <- tf(dtm, "boolean"))
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c e f g
##   text1 1 1 1 0 0 0
##   text2 1 0 1 1 0 0
##   text3 0 0 0 1 1 1

# now get the "feature in document" co-occurrence matrix
t(dtm) %*% dtm
## 6 x 6 sparse Matrix of class "dgCMatrix"
##   a b c e f g
## a 2 1 2 1 . .
## b 1 1 1 . . .
## c 2 1 2 1 . .
## e 1 . 1 2 1 1
## f . . . 1 1 1
## g . . . 1 1 1
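For readers without quanteda, the same Boolean cross-product idea can be sketched in base R alone. This is only an illustration under assumptions: the documents in `docs` are made up for the example, the tokenisation is a plain split on spaces, and the names `tokens`, `terms`, `m` and `cooc` are hypothetical. With tm, something like `as.matrix(dtm) > 0` on a DocumentTermMatrix should give an equivalent Boolean matrix to feed into `crossprod()`.

```r
# Illustrative documents (hypothetical, not from the original post)
docs <- c("milk sour milk", "sour milk check", "check random sentence")

# Split each document on spaces and keep its unique terms
tokens <- lapply(strsplit(docs, " ", fixed = TRUE), unique)
terms  <- sort(unique(unlist(tokens)))

# Boolean document-term matrix: 1 if the term occurs in the document, else 0
m <- t(vapply(tokens,
              function(tk) as.integer(terms %in% tk),
              integer(length(terms))))
dimnames(m) <- list(paste0("doc", seq_along(docs)), terms)

# Cross-product t(m) %*% m: cell [i, j] counts the documents containing both terms
cooc <- crossprod(m)
cooc
```

Because repeated words within a document are collapsed before counting, "milk" appearing twice in the first document still contributes only one co-occurrence with "sour" for that document.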

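Note: this setup counts a term as "co-occurring" with itself once for every document in which it appears (e.g. b), so the diagonal holds each term's document frequency. If you want to change that, replace the diagonal with the diagonal minus one.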
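The diagonal adjustment from the note can be sketched as follows. The small matrix is hand-written for illustration (it mirrors the a, b, c corner of a co-occurrence matrix), and the names `cooc`, `adjusted` and `no_self` are hypothetical. Zeroing the diagonal is shown as an alternative only; whether that is what you want is an assumption about your use case.

```r
# Small hand-written co-occurrence matrix for three hypothetical terms
cooc <- matrix(c(2, 1, 2,
                 1, 1, 1,
                 2, 1, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Option from the note: subtract one from each diagonal entry
adjusted <- cooc
diag(adjusted) <- diag(adjusted) - 1

# Alternative (an assumption about what may be wanted): drop self-pairs entirely
no_self <- cooc
diag(no_self) <- 0

adjusted
no_self
```

Only the diagonal changes in either variant; the off-diagonal counts of documents shared by two different terms are left as they were.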

