I am trying to build a predictive text algorithm using a simple backoff model, but I am struggling to create the word frequency table needed to compute the probability of the next word. So I need to build a list of ngrams with their frequencies: essentially, extract all the words from many sentences in R and turn them into an ngram frequency table.
The task is part of a course, so I can't share the data since it belongs to the school. The sample I'm working with is 10,000 sentences, each of varying length.
I have a working solution, but I know it is bad form because it rbinds inside a loop, and unsurprisingly it takes far too long.
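For context, the end goal looks roughly like this: once I have a table of bigram counts, I want conditional next-word probabilities out of it. A rough sketch of what I mean (the column names and numbers are made up; it just assumes a data.table of bigram counts already exists):

library(data.table)

# hypothetical bigram counts: first word, second word, number of occurrences
bigram_counts <- data.table(
  w1 = c("of", "of", "in", "in"),
  w2 = c("the", "a", "the", "this"),
  n  = c(50, 20, 30, 10)
)

# maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)
bigram_counts[, prob := n / sum(n), by = w1]

# most likely word following "of"
bigram_counts[w1 == "of"][order(-prob)][1]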
library(quanteda)
library(data.table)

# bigrams from the tokenised sentences (older quanteda tokenize/ngrams API)
ntimes2 <- ngrams(tokenize(sampNews, removePunct = TRUE, removeNumbers = TRUE,
                           removeTwitter = TRUE), n = 2)

# collect every ngram into one data frame -- this is the slow part
listwords <- function(input) {
  words <- data.frame(x = 0)
  for (i in 1:10102) {          # hard-coded length of the tokenised object
    words <- rbind(words, input[i])
  }
  words <<- words
}
listwords(ntimes2)
However, I don't know of another way to pull the individual words/ngrams back out of the tokenised list.
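I suspect the loop isn't needed at all and the frequency table could be built in one go, but I'm not sure this is right (untested sketch; it assumes ntimes2 behaves like a plain list of character vectors, one element per sentence):

# flatten the list of ngram vectors into one character vector,
# then count occurrences and sort by frequency
all_ngrams <- unlist(ntimes2, use.names = FALSE)
freq_table <- sort(table(all_ngrams), decreasing = TRUE)
head(freq_table)

# or keep it in data.table for later lookups
freq_dt <- data.table(ngram = all_ngrams)[, .N, by = ngram][order(-N)]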
I have also tried stylo's txt.to.words, but I can't control the splitting rule well enough to exclude every variation of punctuation. In particular, I want to stop apostrophes from splitting a word in two.
words<-txt.to.words(sampNews,splitting.rule = "[[:space]]|(?=[^,[:^punct:]])")
words<-txt.to.words(sampNews,splitting.rule = "(_| |,|?|#|@)")
This works, but only with a limited set of separators.
words<-txt.to.words(sampNews,splitting.rule = "(_|)")
strsplit does split out the words, but it returns a list of character vectors, which still seems to mean I have to loop over the data to pull it into one master list/data frame before I can build a frequency table from it.
words <- strsplit(sampNews, "[[:space:]]|(?=[^,'[:^punct:]])", perl = TRUE)
[[5]]
[1] "And" "when" "it's" "often" "difficult"
[6] "to" "predict" "a" "law's" "impact,"
[11] "legislators" "should" "think" "twice" "before"
[16] "carrying" "any" "bill" "." ""
[21] "Is" "it" "absolutely" "necessary" "?"
[26] "" "Is" "it" "an" "issue"
[31] "serious" "enough" "to" "merit" "their"
[[6]]
[1] "There" "was" "a" "certain" "amount"
[6] "of" "scoffing" "going" "around" "a"
I have tried sapply/lapply/rbindlist, but it is quite possible I'm not using them correctly, so please do suggest solutions that use those.
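For what it's worth, this is the direction I was trying to go with data.table, flattening the strsplit output instead of looping, though I'm not confident it's the idiomatic way (sketch only; it assumes words is the list returned by strsplit above):

# one row per token, then count how often each word occurs
word_dt   <- data.table(word = unlist(words, use.names = FALSE))
word_freq <- word_dt[word != "", .N, by = word][order(-N)]

# the same idea via rbindlist: one small data.table per sentence
word_dt2 <- rbindlist(lapply(words, function(w) data.table(word = w)))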
Any advice is much appreciated.
J
Adding a couple of lines of the data to give a feel for it:
sampNews[1:2]
[1] "He wasn't home alone, apparently."
[2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
> class(sampNews)
[1] "character"