
Getting all the nouns from a book (.txt file) in R, and making a frequency table and wordcloud

I am trying to find all the nouns in a text file. Initially I converted the .epub to a .pdf file. Then I successfully converted the .pdf to a .txt file and deleted half of the text, since I only need to find the nouns from the second half of the book. I want to do this so that I can compute the frequency of the nouns and then determine the top ones.

I can build the frequency table from the raw text file just fine, without any conversion, and make the wordcloud and so on, but I can't seem to filter out only the nouns. Any ideas?

library(tm)         # for Corpus, DocumentTermMatrix, tm_map
library(ggplot2)    # for the frequency bar chart

cname <- file.path(".", "Desktop", "egypt", "pdf") 
mytxtfiles <- list.files(path = cname, pattern = "txt", full.names = TRUE) 

# nouns2 and nouns don't seem to work :O -- I've tried both ways.
# Note: nouns2 runs the regex over the file *paths*, not the file contents.
nouns2 <- regmatches(mytxtfiles, gregexpr("^([A-Z][a-z]+)+$", mytxtfiles, perl = TRUE)) 
nouns <- lapply(mytxtfiles, function(i) { 
  j <- paste0(scan(i, what = character()), collapse = " ") 
  regmatches(j, gregexpr("^([A-Z][a-z]+)+$", j, perl = TRUE))}) 

# transformation if filtering the nouns does not work
# (docs was not defined in the original snippet; built here from the txt files)
docs <- Corpus(DirSource(cname, pattern = "txt"))
docs <- tm_map(docs, removeWords, stopwords("english")) 

# working wordcloud and frequency data
dtm <- DocumentTermMatrix(docs) 
findFreqTerms(dtm, lowfreq = 100) 
findAssocs(dtm, "data", corlimit = 0.6) 
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE) 
wf <- data.frame(word = names(freq), freq = freq) 
p <- ggplot(subset(wf, freq > 500), aes(word, freq)) 
p <- p + geom_bar(stat = "identity") 
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
p
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal
wordcloud(names(freq), freq, min.freq = 100, colors = brewer.pal(6, "Dark2")) 

I tried nouns2 and nouns, but they return something like:

nouns2 
[[1]] 
character(0) 
[[2]] 
character(0) 
[[3]] 
character(0) 

Just a whim on the regex part: replace '^([A-Z][a-z]+)+$' with '\\b[A-Z][a-z]+\\b' – hwnd
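
As a sketch of that suggestion (my wording, not hwnd's code): the anchors '^...$' only match when the entire string is a single capitalized word, whereas '\\b' word boundaries match capitalized words anywhere, so the pattern is applied to the file contents rather than the file paths:

nouns <- lapply(mytxtfiles, function(i) { 
  j <- paste0(scan(i, what = character()), collapse = " ") 
  # \\b...\\b matches each capitalized word anywhere in the string
  regmatches(j, gregexpr("\\b[A-Z][a-z]+\\b", j, perl = TRUE))}) 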


Why on earth did you prepare the TXT file from a PDF? Unzipping the '.epub' would have yielded much better results. – Cylian
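
For what it's worth, an .epub is just a ZIP archive of XHTML chapter files, so (as a rough sketch, with "book.epub" a placeholder file name and assuming the xml2 package) the text can be pulled out directly:

library(xml2) 
files <- unzip("book.epub", exdir = tempdir())          # "book.epub" is a placeholder name
chapters <- grep("\\.x?html?$", files, value = TRUE)    # keep the XHTML chapter documents
book_text <- sapply(chapters, function(f) xml_text(read_html(f)))  # strip markup, keep text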

Answer


Here is an approach to finding all the nouns, using the qdap package. You can take it from here.

text <- "To further enhance our practice, the president was honored to have him join the firm, former commissioner and the first to institute patent reexaminations, bringing a wealth of experience and knowledge to the firm and our clients." 

library(qdap) 
pos.text <- pos(sentence) # tells the count and parts of speech in the text 

vec.tagged <- as.vector(pos.text[[2]]) # retains only the tagged terms in a vector 
vec.tagged.split <- str_split(vec.tagged$POStagged, "/") # breaks the vector apart at the "/" 
all.nouns <- str_extract(vec.tagged.split[[1]], "^NN .+") # identifies the nouns 
all.nouns <- str_replace(all.nouns, "NN\\s", "") # removes NN tag 
all.nouns 

[1] NA    NA    NA    NA    NA    "novak"   "druce"   
[8] "was"   NA    NA    NA    NA    NA    NA    
[15] NA    NA    NA    NA    NA    "commissioner" "and"   
[22] NA    NA    NA    NA    NA    "reexaminations" NA    
[29] NA    NA    "of"    NA    "and"   NA    "to"    
[36] NA    NA    "and"   NA    NA    NA 

Am I missing something here? The output seems to be mostly non-nouns. It misses most of the nouns, and includes some words that do not appear in the text (novak, druce). – dww
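
One likely fix (a sketch, assuming as in the answer above that pos(text)[[2]]$POStagged holds the word/TAG string): split on whitespace so each element is one word/TAG pair, then keep only the pairs whose tag starts with NN, rather than splitting on "/", which pairs each tag with the *next* word:

library(qdap) 
library(stringr) 
pos.text <- pos(text) 
tagged <- as.character(pos.text[[2]]$POStagged) 
pairs <- unlist(str_split(tagged, "\\s+"))          # one "word/TAG" element each
noun.pairs <- grep("/NN", pairs, value = TRUE)      # keeps NN, NNS, NNP, NNPS
all.nouns <- str_replace(noun.pairs, "/NN.*$", "")  # strip the tag, keep the word
all.nouns 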
