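Creating bigrams in quanteda using a dictionary

I am trying to remove spelling mistakes from my text analysis, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle spelling mistakes so that they do not sneak into my bigrams and trigrams.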
ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.",
"New York City has raised a taxes: an income tax and a sales tax.")
ZcObj <- corpus(ZTestCorp1)
mydict <- dictionary(list("the"="the", "new"="new", "law"="law",
"capital"="capital", "gains"="gains", "tax"="tax",
"inheritance"="inheritance", "city"="city"))
Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ",
what = "fastestword",
toLower=TRUE, removeNumbers=TRUE,
removePunct=TRUE, removeSeparators=TRUE,
removeTwitter=TRUE, stem=FALSE,
ignoredFeatures=NULL,
language="english",
dictionary=mydict, valuetype="fixed")
wordsFreq1 <- colSums(sort(Zdfm1))
Current output:
> wordsFreq1
the   new   law   capital   gains   tax   inheritance   city
  0     0     0         0       0     0             0      0
Without the dictionary, the output is as follows:
> wordsFreq
tax and   the new   new law   law included   included a   a capital
      2         1         1              1            1           1
capital gains   gains tax   and an   an inheritance   inheritance tax   new york
            1           1        1                1                 1          1
york city   city has   has raised   raised a   a taxes   taxes an
        1          1            1          1         1          1
an income   income tax   and a   a sales   sales tax
        1            1       1         1           1
Expected bigrams:
The new
new law
law capital
capital gains
gains tax
tax inheritance
inheritance city
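P.S. I assumed that tokenization into n-grams happens after dictionary matching. Judging by the output, that does not seem to be the case.
On a side note, I tried to create my dictionary object as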
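To make the intended order of operations concrete, here is a minimal base R sketch (no quanteda calls) that first drops every token outside the word list and only then pairs adjacent survivors into bigrams. The helper names `tokenize_simple` and `make_bigrams` are hypothetical and exist only for illustration; this is not how quanteda implements `dfm()`.

```r
# Base R sketch: filter tokens against a word list first, then build bigrams.
texts <- c("The new law included a capital gains tax, and an inheritance tax.",
           "New York City has raised a taxes: an income tax and a sales tax.")
vocab <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace
tokenize_simple <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  strsplit(x, "[[:space:]]+")[[1]]
}

# Hypothetical helper: join each token with its right-hand neighbour
make_bigrams <- function(toks) {
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

# Step 1: keep only in-vocabulary tokens, per document
filtered <- lapply(texts, function(x) {
  toks <- tokenize_simple(x)
  toks[toks %in% vocab]
})

# Step 2: form bigrams from the surviving tokens
bigrams_filtered <- lapply(filtered, make_bigrams)
print(bigrams_filtered[[1]])
```

For the first text this yields "the new", "new law", "law capital", "capital gains", "gains tax", "tax inheritance" and "inheritance tax". The last one differs from the expected list above because each document is processed separately and repeated words are kept.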
mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains",
"tax", "inheritance", "city")))
but that did not work, so I had to use the approach above, which I think is inefficient.
UPDATE: Output based on Ken's solution:
> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2,
+ keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")))
Document-feature matrix of: 2 documents, 14 features.
2 x 14 sparse Matrix of class "dfmSparse"
       features
docs    the_new new_law law_included a_capital capital_gains gains_tax tax_and an_inheritance
  text1       1       1            1         1             1         1       1              1
  text2       0       0            0         0             0         0       1              0
       features
docs    inheritance_tax new_york york_city city_has income_tax sales_tax
  text1               1        0         0        0          0         0
  text2               0        1         1        1          1         1
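Thank you for your generous and detailed explanation. I am getting this error. Any ideas? `> (toksDict <- selectFeatures(toks, mydict, selection = "keep"))` Error in UseMethod("selectFeatures"): no applicable method for 'selectFeatures' applied to an object of class "c('tokenizedTexts', 'list')" – PeterV
Probably because the method for `selectFeatures()` was only extended in the latest (GitHub) version of quanteda, and you are using the CRAN version. Install from GitHub by following https://github.com/kbenoit/quanteda; the version as of today is 0.9.1-7. (The CRAN version will be updated in January 2016.) –
Thanks @Ken. This is great! I will install the latest CRAN package. In fact, I prefer the second solution you provided, because it takes stop words into account. That matters to me because I am working on a word prediction project. However, I am curious how it managed to pull in **New York**. I do not think New York is a stop word. I got this when I used the ngrams = 2 option. – PeterV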
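One reading that is consistent with the 14 features in the UPDATE output: a bigram appears to be kept whenever at least one of its two words is in the `keptFeatures` list, so `new_york` survives because of "new". The base R sketch below only reproduces that observation; it is an assumption drawn from the printed output, not a description of quanteda's internals, and `tokenize_simple` is a hypothetical helper.

```r
# Base R sketch: keep a bigram if EITHER of its words is in the keep list.
texts <- c("The new law included a capital gains tax, and an inheritance tax.",
           "New York City has raised a taxes: an income tax and a sales tax.")
vocab <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace
tokenize_simple <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  strsplit(x, "[[:space:]]+")[[1]]
}

kept <- lapply(texts, function(x) {
  toks   <- tokenize_simple(x)
  first  <- head(toks, -1)   # left word of each bigram
  second <- tail(toks, -1)   # right word of each bigram
  keep   <- first %in% vocab | second %in% vocab
  paste(first[keep], second[keep], sep = "_")
})

features <- unique(unlist(kept))
print(length(features))          # 14, the same count as the dfm above
print("new_york" %in% features)  # TRUE: kept because "new" is in the list
```

Under this reading, getting only bigrams made entirely of listed words would require filtering the tokens before the n-grams are formed, rather than filtering the finished n-grams.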