I calculated document similarity using the tidytext and widyr packages, like this:
library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)  # provides pairwise_similarity()
# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  ungroup()
# closest books to each other
closest <- austen_words %>%
  pairwise_similarity(book, word, n) %>%
  arrange(desc(similarity))
closest
closest %>%
  filter(item1 == "Emma")
How is the similarity calculated in the pairwise_similarity function?
Some words appear in only one of the two documents. Are these words counted?
Or does the function ignore them and use only the words that are common to both documents?
If a word has similar tf-idf scores in both documents, is it treated as similar?
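
For reference while reading these questions: as far as I can tell from the widyr documentation, pairwise_similarity computes cosine similarity on whatever value column it is given (here the raw count n, not tf-idf). The base-R sketch below shows what that would mean on toy data. The two documents, their counts, and the helper name cosine_similarity are made up for illustration and are not part of widyr.

```r
# Toy word counts for two hypothetical documents (not the Austen data).
# A count of 0 means the word does not occur in that document.
doc_a <- c(love = 2, house = 1, walk = 0)
doc_b <- c(love = 1, house = 0, walk = 3)

# Hypothetical helper: cosine similarity is the dot product of the two
# count vectors divided by the product of their Euclidean norms.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim <- cosine_similarity(doc_a, doc_b)
print(sim)
```

Under this assumption, a word that occurs in only one document adds nothing to the dot product (its product with 0 is 0) but still enlarges that document's norm, so it lowers the similarity rather than being ignored.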
question from:
https://stackoverflow.com/questions/65947803/how-to-calculate-similarity-in-pairwise-similarity-function