2015-11-05

I am using DocumentTermMatrix in R to compute TF-IDF weights; it calculates the IDF using log base 2. The result is:

    Terms
Docs      blue    bright       sky       sun
   1 0.7924813 0.0000000 0.2924813 0.0000000
   2 0.0000000 0.2924813 0.0000000 0.2924813
   3 0.0000000 0.1949875 0.1949875 0.1949875

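However, when I do the calculation by hand, the results do not match. What I noticed is that R computes the IDF as log2(total number of documents / number of documents containing term t).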

Is there a way to change the log base from 2 to 10 in R? Please advise.
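For reference, the manual calculation can be sketched in base R. The original documents are not shown, so the term counts below are an assumption, inferred from the values in the table above (one occurrence per term, with document 3 containing three terms). The names `counts`, `tf`, `df`, and `tfidf` are illustrative only, and the term frequency is assumed to be normalized by document length.

```r
# Assumed term counts, inferred from the output table (not from the original corpus)
counts <- matrix(
  c(1, 0, 1, 0,
    0, 1, 0, 1,
    0, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = as.character(1:3),
                  Terms = c("blue", "bright", "sky", "sun"))
)

tf     <- counts / rowSums(counts)  # term frequency, normalized by document length
df     <- colSums(counts > 0)       # number of documents containing each term
n_docs <- nrow(counts)

# TF-IDF with a configurable log base: each column is scaled by its term's IDF
tfidf <- function(base) sweep(tf, 2, log(n_docs / df, base = base), `*`)

tfidf_log2  <- tfidf(2)
tfidf_log10 <- tfidf(10)

print(round(tfidf_log2, 7))   # base 2, for comparison with the table above
print(round(tfidf_log10, 7))  # base 10, for comparison with a hand calculation

# Changing the base only rescales every weight by the constant log10(2)
print(all.equal(tfidf_log10, tfidf_log2 * log10(2)))
```

Because log10(x) = log2(x) * log10(2), the two weightings differ only by a constant factor, so the relative ranking of terms is the same in either base.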

Answer

Try writing your own weighting function:

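library(tm)

# Modeled on tm's weightTfIdf, with the IDF computed in log base 10 instead of 2
weightTfIdf.log10 <- function(m, normalize = TRUE)
{
    # Work in term-document layout; transpose back at the end if needed
    isDTM <- inherits(m, "DocumentTermMatrix")
    if (isDTM)
        m <- t(m)
    if (normalize) {
        # Divide each count by the total number of terms in its document
        cs <- col_sums(m)
        if (any(cs == 0))
            warning("empty document(s): ",
                    paste(Docs(m)[cs == 0], collapse = " "))
        names(cs) <- seq_len(nDocs(m))
        m$v <- m$v / cs[m$j]
    }
    # Number of documents in which each term occurs
    rs <- row_sums(m > 0)
    if (any(rs == 0))
        warning("unreferenced term(s): ",
                paste(Terms(m)[rs == 0], collapse = " "))
    # IDF in base 10 (the only change from the base-2 version)
    lnrs <- log10(nDocs(m) / rs)
    lnrs[!is.finite(lnrs)] <- 0
    m <- m * lnrs
    attr(m, "weighting") <-
        c(sprintf("%s%s", "term frequency - inverse document frequency",
                  if (normalize) " (normalized)" else ""),
          "tf-idf")
    if (isDTM)
        t(m)
    else m
}
# Evaluate the function inside tm's namespace so helpers such as col_sums() resolve
environment(weightTfIdf.log10) <- environment(TermDocumentMatrix)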

dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf.log10)) 
as.matrix(dtm) 
#          Docs
# Terms              1          2          3
#   blue    0.23856063 0.00000000 0.00000000
#   bright  0.00000000 0.23856063 0.00000000
#   bright. 0.00000000 0.00000000 0.15904042
#   sky     0.08804563 0.00000000 0.05869709
#   sun     0.00000000 0.08804563 0.05869709