2015-11-20 41 views
1

R text mining: finding the frequency of character patterns

I am trying to find the frequency of character patterns (parts of words) in a large dataset.

For example, I have the following list in a CSV file:

  • applestrawberrylime
  • applegrapelime
  • pineapplemangoguava
  • kiwiguava
  • grapeapple
  • mixedberry
  • kiwiguavapineapple
  • limemixedberry

Is there a way to find the frequency of all of the character combinations? Something like:

  • appleberry
  • guava
  • applestrawberry
  • kiwiguava
  • grapeapple
  • straw
  • app
  • ap
  • wig
  • mem

Update: here is what I have so far for finding the frequency of all character patterns of length three in my data:

# all three-character combinations of the letters a-z
threecombo <- do.call(paste0, expand.grid(rep(list(letters), 3)))

# count how many entries of myData contain each pattern
threecompare <- sapply(threecombo, function(x) length(grep(x, myData)))

The code works the way I want it to, and I would like to repeat the above steps for longer character lengths (4, 5, 6, and so on), but it takes a while to run. Is there a better way to do this?
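To be concrete, the generalization I have in mind is just a parametrized version of the same brute-force search (a minimal sketch; count_patterns is a made-up helper name, and myData is the column of strings read from the CSV, as above):

# minimal sketch: generalize the brute-force search above to any pattern length n
count_patterns <- function(n, data) {
    combos <- do.call(paste0, expand.grid(rep(list(letters), n)))
    sapply(combos, function(x) sum(grepl(x, data, fixed = TRUE)))
}

fourcompare <- count_patterns(4, myData)  # 26^4 = 456,976 candidate patterns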

+1

Welcome to StackOverflow! Your question is interesting, but it is hard to answer. This site really works better when there is a clear question. In your case, you probably want to provide a link to a corpus, then show some code you have already tried with it, and then show the problem you are having with that code. For tips, see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example! –

+0

Thanks, I have updated my question with the progress I have made so far with my code. – user3709049

Answers

1

Your first question is a simple task for grep/grepl, and I see you have already incorporated that part of my answer into your revised question.

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava', 
      'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple', 
      'limemixedberry') 

patterns <- c('appleberry', 'guava', 'applestrawberry', 'kiwiguava', 
       'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go') 

# how often does each pattern occur in the set of docs? 
sapply(patterns, function(x) sum(grepl(x, docs))) 

If you want to check every possible pattern, you could search for every combination of letters (as you have started to do above), but that is obviously going the long way around.

One strategy is to count the frequency of only those patterns that actually occur. A document that is n characters long has 1 possible pattern of length n, 2 possible patterns of length n - 1, and so on. You can extract each of these and then tally them up.

all_patterns <- lapply(docs, function(x) { 

    # individual chars in this doc 
    chars <- unlist(strsplit(x, '')) 

    # unique possible sequence lengths 
    seqs <- sapply(1:nchar(x), seq) 

    # each sequence in each position 
    sapply(seqs, function(y) { 
     start_pos <- 0:(nchar(x) - max(y)) 
     sapply(start_pos, function(z) paste(chars[z + y], collapse='')) 
    }) 
}) 

unq_patterns <- unique(unlist(all_patterns)) 

# how often does each unique pattern occur in the set of docs? 
occur <- sapply(unq_patterns, function(x) sum(grepl(x, docs))) 

# top 25 most frequent patterns 
sort(occur, decreasing = T)[1:25]  

# e  i  a  l  p  r  m ap pp pl le app ppl 
# 7  7  6  6  5  5  5  5  5  5  5  5  5 
# ple appl pple apple  g  w  b  y ra be er rr 
# 5  5  5  5  5  3  3  3 3  3  3  3 

This works and runs quickly, but as the corpus of documents gets longer it could bog down (even in this simple example there are 625 unique patterns). You could use parallel processing for all the sapply/lapply calls, but still...
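If you do go the parallel route, a minimal sketch with the parallel package might look like this (assuming a Unix-alike system where mclapply can fork; on Windows you would need a PSOCK cluster and parLapply instead):

library(parallel)

n_cores <- max(1L, detectCores() - 1L)

# extract every substring of every doc, one doc per worker (same logic as above)
all_patterns <- mclapply(docs, function(x) {
    chars <- unlist(strsplit(x, ''))
    seqs <- sapply(1:nchar(x), seq)
    sapply(seqs, function(y) {
     start_pos <- 0:(nchar(x) - max(y))
     sapply(start_pos, function(z) paste(chars[z + y], collapse = ''))
    })
}, mc.cores = n_cores)

unq_patterns <- unique(unlist(all_patterns))

# count each unique pattern across the docs, also in parallel
# (fixed = TRUE because the patterns are literal substrings, not regexes)
occur <- setNames(
    unlist(mclapply(unq_patterns,
        function(x) sum(grepl(x, docs, fixed = TRUE)),
        mc.cores = n_cores)),
    unq_patterns)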

+0

This is close, but I am looking for something that will help me when I do not know all of the possible patterns/combinations. – user3709049

+0

What have you tried? Think about what input you could provide to get the expected result. Also, consider that for a large body of text, "all possible character combinations" will be enormous. – arvi1000
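To put a number on that (a quick aside, assuming the 26-letter alphabet used in the question's code):

# number of candidate patterns per pattern length, for a 26-letter alphabet
26^(1:8)
# 26^3 = 17,576; 26^6 = 308,915,776; 26^8 = 208,827,064,576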

+0

I just updated my question. I realize the dataset is huge, but I do not know what all the possible patterns are; I am trying to find the most common patterns. – user3709049

2

Since you are presumably looking for combinations of fruit flavors in a set of texts that also contain non-fruit words, I have written some documents similar to those in your example. I build a document-feature matrix with the quanteda package and then filter it to the ngrams that contain fruit words.

docs <- c("One flavor is apple strawberry lime.", 
      "Another flavor is apple grape lime.", 
      "Pineapple mango guava is our newest flavor.", 
      "There is also kiwi guava and grape apple.", 
      "Mixed berry was introduced last year.", 
      "Did you like kiwi guava pineapple?", 
      "Try the lime mixed berry.") 
flavorwords <- c("apple", "guava", "berry", "kiwi", "guava", "grape") 

require(quanteda) 
# form a document-feature matrix ignoring common stopwords + "like" 
# for ngrams, bigrams, trigrams 
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english"))) 
## Creating a dfm from a character vector ... 
## ... lowercasing 
## ... tokenizing 
## ... indexing documents: 7 documents 
## ... indexing features: 90 feature types 
## ... removed 47 features, from 176 supplied (glob) feature types 
## ... complete. 
## ... created a 7 x 40 sparse dfm 
## Elapsed time: 0.01 seconds. 
# select only those features containing flavorwords as regular expression 
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex") 
## kept 22 features, from 5 supplied (regex) feature types 
# show the features 
topfeatures(fruitDfm, nfeature(fruitDfm)) 
##    apple     guava     grape    pineapple     kiwi 
##     3      3      2      2      2 
##   kiwi_guava     berry   mixed_berry   strawberry  apple_strawberry 
##     2      2      2      1      1 
##  strawberry_lime apple_strawberry_lime   apple_grape   grape_lime  apple_grape_lime 
##     1      1      1      1      1 
##  pineapple_mango   mango_guava pineapple_mango_guava   grape_apple  guava_pineapple 
##     1      1      1      1      1 
## kiwi_guava_pineapple  lime_mixed_berry 
##     1      1 

Added:

If you are looking to match terms in documents where they are not separated by whitespace, you can form the ngrams with an empty-string concatenator and match them as below.

flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava", 
         "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple", 
         "limemixedberry") 

fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "") 
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat] 
fruitDfm 
# Document-feature matrix of: 7 documents, 8 features. 
# 7 x 8 sparse Matrix of class "dfmSparse" 
#  features 
# docs applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry 
# text1     1    0     0   0   0   0     0    0 
# text2     0    1     0   0   0   0     0    0 
# text3     0    0     1   0   0   0     0    0 
# text4     0    0     0   1   1   0     0    0 
# text5     0    0     0   0   0   1     0    0 
# text6     0    0     0   1   0   0     1    0 
# text7     0    0     0   0   0   1     0    1 

If your texts contain the flavor words concatenated together, then you can use:

unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi") 
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = ""))) 
[1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple" 
[4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape" 
+0

I like your answer, but how would it work if the terms stored in the documents are not separated by spaces? docs <- c("applestraberrylime", "kiwipineapple", "grapelemon") Also, what if you do not know all the possible flavor words? – user3709049

+0

Not sure exactly what you are asking, but I have tried to cover both possibilities in the updated answer. –