2015-12-26

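Creating bigrams in quanteda using a dictionary

I am trying to remove typos from my text analysis data, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle typos so that they do not sneak into my bigrams and trigrams.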

ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.", 
       "New York City has raised a taxes: an income tax and a sales tax.") 

ZcObj <- corpus(ZTestCorp1) 

mydict <- dictionary(list("the"="the", "new"="new", "law"="law", 
         "capital"="capital", "gains"="gains", "tax"="tax", 
         "inheritance"="inheritance", "city"="city")) 

Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ", 
     what = "fastestword", 
     toLower=TRUE, removeNumbers=TRUE, 
     removePunct=TRUE, removeSeparators=TRUE, 
     removeTwitter=TRUE, stem=FALSE, 
     ignoredFeatures=NULL, 
     language="english", 
     dictionary=mydict, valuetype="fixed") 

wordsFreq1 <- colSums(sort(Zdfm1)) 

Current output:

> wordsFreq1 
    the   new   law  capital  gains   tax inheritance  city 
     0   0   0   0   0   0   0   0 

Without the dictionary, the output is as follows:

> wordsFreq 
    tax and   the new   new law law included  included a  a capital 
      2    1    1    1    1    1 
capital gains  gains tax   and an an inheritance inheritance tax  new york 
      1    1    1    1    1    1 
    york city  city has  has raised  raised a   a taxes  taxes an 
      1    1    1    1    1    1 
    an income  income tax   and a   a sales  sales tax 
      1    1    1    1    1 

Expected bigrams:

The new 
new law 
law capital 
capital gains 
gains tax 
tax inheritance 
inheritance city 

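P.S. I had assumed that tokenization is done after dictionary matching, but judging from the results I see, that does not appear to be the case.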

On a side note, I tried to create my dictionary object as

mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains", 
         "tax", "inheritance", "city"))) 

but that did not work, so I had to use the approach above, which I think is inefficient.

UPDATE: output based on Ken's solution:

> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2, 
+    keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city"))) 
Document-feature matrix of: 2 documents, 14 features. 
2 x 14 sparse Matrix of class "dfmSparse" features 
docs the_new new_law law_included a_capital capital_gains gains_tax tax_and an_inheritance 
text1  1  1   1   1    1   1  1    1 
text2  0  0   0   0    0   0  1    0 
    features 
docs inheritance_tax new_york york_city city_has income_tax sales_tax 
text1    1  0   0  0   0   0 
text2    0  1   1  1   1   1 

Answer

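Updated 2017-12-21 for newer versions of quanteda

Glad to see you working with the package! I think there are two issues in what you are struggling with. The first is how to apply feature selection before forming n-grams. The second is how to define feature selection (using quanteda).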

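First issue: how to apply feature selection before forming n-grams. Here you have defined a dictionary to do this. (As I will show below, that is not necessary here.) You want to remove all terms that are not in your selection list, and then form bigrams. quanteda does not do this by default, because the result is not a standard form of "bigram", in which the words are collocated according to some window strictly defined by adjacency. In your expected results, for example, law capital is not a pair of adjacent terms, which is the usual definition of a bigram.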
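To make the adjacency point concrete, here is a minimal sketch in plain base R (it is not quanteda's implementation). It forms bigrams from a shortened version of the first text, once directly and once after selection. The helper `make_bigrams()` and the variable names are hypothetical, introduced only for this illustration.

```r
# Base R sketch only; this does not use or reproduce quanteda internals.
# make_bigrams() is a hypothetical helper: it pairs each token with its successor.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

tokens <- c("the", "new", "law", "included", "a", "capital", "gains", "tax")
keep   <- c("the", "new", "law", "capital", "gains", "tax")

# Standard bigrams: every pair is adjacent in the original text.
adjacent <- make_bigrams(tokens)

# Select first, then pair: "law capital" appears even though the two words
# were separated by "included a" in the original text.
selected_first <- make_bigrams(tokens[tokens %in% keep])

print(adjacent)
print(selected_first)
```

Only the second vector contains "law capital", which is why this kind of feature has to be requested explicitly rather than being produced by default.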

However, we can override this behaviour by building the document-feature matrix more "manually".

First, tokenize the texts.

# tokenize the original 
toks <- tokens(ZcObj, remove_punct = TRUE, remove_numbers = TRUE) %>% 
    tokens_tolower() 
toks 
## tokens object from 2 documents. 
## text1 : 
## [1] "the"   "new"   "law"   "included" "a"   "capital"  "gains"  "tax"   "and"   "an"   "inheritance" "tax"   
## 
## text2 : 
## [1] "new" "york" "city" "has" "raised" "a"  "taxes" "an"  "income" "tax" "and" "a"  "sales" "tax" 

Now we apply your dictionary mydict to the tokenized texts using tokens_select():

(toksDict <- tokens_select(toks, mydict, selection = "keep")) 
## tokens object from 2 documents. 
## text1 : 
## [1] "the"   "new"   "law"   "capital"  "gains"  "tax"   "inheritance" "tax"   
## 
## text2 : 
## [1] "new" "city" "tax" "tax" 

From this selected set of tokens, we can now form bigrams (or we could feed toksDict directly to dfm()):

(toks2 <- tokens_ngrams(toksDict, n = 2, concatenator = " ")) 
## tokens object from 2 documents. 
## text1 : 
## [1] "the new"   "new law"   "law capital"  "capital gains" "gains tax"  "tax inheritance" "inheritance tax" 
## 
## text2 : 
## [1] "new city" "city tax" "tax tax" 

# now create the dfm 
(myDfm2 <- dfm(toks2)) 
## Document-feature matrix of: 2 documents, 10 features. 
## 2 x 10 sparse Matrix of class "dfm" 
##  features 
## docs the new new law law capital capital gains gains tax tax inheritance inheritance tax new city city tax tax tax 
## text1  1  1   1    1   1    1    1  0  0  0 
## text2  0  0   0    0   0    0    0  1  1  1 
topfeatures(myDfm2) 
#  the new   new law  law capital capital gains  gains tax tax inheritance inheritance tax  new city  city tax   tax tax 
#   1    1    1    1    1    1    1    1    1    1 

The feature list is now very close to what you wanted.

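The second issue is why your dictionary approach seems inefficient. This is because you are creating a dictionary to perform feature selection, but not really using it as a dictionary: a dictionary in which each key has only itself as its value is not really a dictionary. Simply supply a character vector of the tokens to select instead, and it works fine, for example: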

(myDfm1 <- dfm(ZcObj, verbose = FALSE, 
       keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city"))) 
## Document-feature matrix of: 2 documents, 8 features. 
## 2 x 8 sparse Matrix of class "dfm" 
##  features 
## docs the new law capital gains tax inheritance city 
## text1 1 1 1  1  1 2   1 0 
## text2 0 1 0  0  0 2   0 1 
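As a rough illustration of the distinction, the following base R sketch (again not quanteda's implementation) contrasts a dictionary lookup, where several values are counted under one key, with plain selection by a character vector. The keys `taxation` and `place` and their values are hypothetical, chosen only for this example.

```r
# Base R sketch only; the dictionary keys and values below are hypothetical.
tokens <- c("new", "york", "city", "has", "raised", "a", "taxes",
            "an", "income", "tax", "and", "a", "sales", "tax")

# A dictionary in the usual sense: each key groups several values.
dict <- list(taxation = c("tax", "taxes"), place = c("york", "city"))

# Dictionary lookup: tokens matching any value are counted under the key.
dict_counts <- vapply(dict, function(values) sum(tokens %in% values), numeric(1))

# Plain selection: matching tokens are kept under their own names.
keep <- c("new", "city", "tax")
selected <- tokens[tokens %in% keep]

print(dict_counts)
print(selected)
```

When every key maps only to itself, the lookup adds nothing over the selection step, which is why a character vector is the simpler choice here.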

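Thank you for the generous and detailed explanation. I am getting this error. Any ideas? `> (toksDict <- selectFeatures(toks, mydict, selection = "keep")) Error in UseMethod("selectFeatures") : no applicable method for 'selectFeatures' applied to an object of class "c('tokenizedTexts', 'list')"` – PeterV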

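Probably because the method for `selectFeatures()` was only extended in the newest (GitHub) version of quanteda, and you are using the CRAN version. Install from GitHub by following https://github.com/kbenoit/quanteda; the version as of today is 0.9.1-7. (The CRAN version will be updated in January 2016.) –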

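Thanks @Ken. This is great! I will install the latest CRAN package. In fact, I like the second solution you provided, because it takes stop words into account. That matters to me because I am working on a word prediction project. However, I am curious how it managed to pull in **New York**. I don't think New York is a stop word. I got this when I used the ngrams = 2 option. – PeterV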