You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it is fairly straightforward:
library(tm)
library(tau)

# Tokenizer: return every n-word phrase (default n = 3, i.e. trigrams) found in a document
tokenize_ngrams <- function(x, n = 3)
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
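As a quick check that the trigrams were picked up (a minimal sketch, reusing the corpus and matrix objects created above):

# List the trigram terms that ended up in the document-term matrix
Terms(matrix)

# Or print the matrix itself, with documents as rows and trigrams as columns
inspect(matrix)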
Here n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which simplifies things further.
library(RTextTools)

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
# ngramLength is the number of words per phrase, as above
matrix <- create_matrix(texts, ngramLength = 3)
This returns an object of class DocumentTermMatrix for use with package tm.
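Since the result is an ordinary DocumentTermMatrix, the usual tm helpers work on it; for example (a sketch, assuming the matrix produced by create_matrix above):

# Trigrams whose total count across the corpus is at least 2
findFreqTerms(matrix, lowfreq = 2)

# Convert to a plain numeric matrix of counts if you need one
as.matrix(matrix)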