2012-01-11

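Dividing text in R with the tm package - identifying speakers

I am trying to identify the most frequently used words in congressional speeches, and I have to separate them by congressperson. I am just starting to learn R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?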

The text looks like this:

OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN 

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon. 
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings. 
[....] 

    STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER 
[....] 

I would like to be able to get these names, or to separate the text by speaker. I hope you can help me. Thank you very much.

Answers

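Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speakers' names.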

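If so, you might say that x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then use grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names, as in your example?).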

Have a look here for more on using these functions, and on text manipulation in R in general: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

UPDATE Here is a more complete answer...

#create object containing all text 
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN 

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon. 
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings. 

STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN 

I am trying to identify the most frequently used words in the 
congress speeches, and have to separate them by the congressperson. 
I am just starting to learn about R and the tm package. I have a code 
that can find the most frequent words, but what kind of a code can I 
use to automatically identify and store the speaker of the speech 

STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN 

Would it be correct to say that you want 
to split the file so you have one text object 
per speaker? And then use a regular expression 
to grab the speaker's name for each object? Then 
you can write a function to collect word frequencies, 
etc. on each object and put them in a table where the 
row or column names are the speaker's names.") 

# split the text wherever the marker "STATEMENT OF" occurs
y <- unlist(strsplit(x, "STATEMENT OF"))

# load the library containing a handy function
library(stringr)

# use word() to return the words in positions 3 to 4 of each string, which is
# where the first and last names are
# note that the first element of the character vector y has only one word, so
# it is skipped here; this function gives an error if there are not enough
# words in the string
z <- word(y[2:4], 3, 4)
z # have a look at the result...
# [1] "HERB KOHL,"  "BIG APPLE"  "LITTLE ORANGE,"

No doubt a regular expression wizard could come up with something quicker and neater!
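For what it's worth, here is one way a regex-based version might look in base R. It is only a sketch under some assumptions: every header sits on its own line, has the form "STATEMENT OF SENATOR <NAME>," (optionally preceded by "OPENING"), and the name contains no comma. The sample text `txt` and the pattern `header_re` are made up for illustration and are not from the original post.

```r
# Hypothetical sample text laid out like the transcript in the question
txt <- paste(
  "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN",
  "",
  "    The Chairman. Good afternoon to everybody.",
  "",
  "    STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER",
  "",
  "    Thank you, Mr. Chairman.",
  sep = "\n"
)

lines <- unlist(strsplit(txt, "\n", fixed = TRUE))

# Assumed header shape: optional "OPENING ", then "STATEMENT OF SENATOR <NAME>,"
header_re <- "^[[:space:]]*(OPENING )?STATEMENT OF SENATOR ([^,]+),.*$"
is_header <- grepl(header_re, lines)

# Capture group 2 is the name, however many words it has
speakers <- sub(header_re, "\\2", lines[is_header])

# Number each line by the most recent header seen, then drop the headers
block <- cumsum(is_header)
keep  <- !is_header & block > 0
body  <- split(lines[keep], factor(block[keep], levels = seq_along(speakers)))

# Collapse each speaker's lines into a single string
speeches <- vapply(body, function(b) {
  b <- trimws(b)
  paste(b[nzchar(b)], collapse = " ")
}, character(1))
names(speeches) <- speakers

speeches
```

Because the name is captured as "everything up to the first comma", this copes with two-word and three-word names alike, but it will misfire if a line of a speech itself happens to start with the same header wording.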

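Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech), and then create another object that combines the word frequency results with the names for further analysis.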
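A rough sketch of that last step in base R, assuming you already have a named character vector with one element per speaker. The vector `speeches`, the helper `word_freq()` and the result `freq_table` are hypothetical names used only for this illustration, and the tokenisation (lower-casing and splitting on anything that is not a letter or apostrophe) is a deliberately crude stand-in for what tm would do.

```r
# Hypothetical input: one string per speaker, named by speaker
speeches <- c(
  "HERB KOHL"     = "Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon.",
  "LITTLE ORANGE" = "Then you can write a function to collect word frequencies on each object."
)

# Count the words in one string: lower-case, split on non-letters, tabulate
word_freq <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]          # drop empty pieces from the split
  sort(table(words), decreasing = TRUE)
}

# One frequency table per speaker; lapply() keeps the speaker names
freqs <- lapply(speeches, word_freq)

# Combine everything into one long data frame: speaker, word, count
freq_table <- do.call(rbind, lapply(names(freqs), function(nm) {
  data.frame(
    speaker = rep(nm, length(freqs[[nm]])),
    word    = names(freqs[[nm]]),
    n       = as.integer(freqs[[nm]]),
    stringsAsFactors = FALSE
  )
}))

head(freq_table)
```

The long format makes it easy to filter by speaker or to reshape into a speaker-by-word table later on.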

Thanks, I think this might work. – appletree 2012-01-11 06:28:33

@appletree, I have expanded my answer; I hope that helps. I had a go at a regex solution but could not get it to work. Perhaps someone else will show us how it is done... – Ben 2012-01-11 07:40:32

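Here is how I would approach it using Ben's example (using qdap to parse the text and create a data frame, then converting it to a Corpus with 3 documents; note that qdap is designed for transcript data like this, and a Corpus may not be the best data format):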

library(qdap) 
dat <- unlist(strsplit(x, "\\n")) 

locs <- grep("STATEMENT OF ", dat) 
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2) 
dat[locs] <- "SPLIT_HERE" 
corp <- with(data.frame(person=nms, dialogue = 
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))), 
    df2tm_corpus(dialogue, person)) 

tm::inspect(corp) 

## A corpus with 3 text documents 
## 
## The metadata consists of 2 tag-value pairs and a data frame 
## Available tags are: 
## create_date creator 
## Available variables in the data frame are: 
## MetaID 
## 
## $`SENATOR BIG APPLE KOHL` 
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech 
## 
## $`SENATOR HERB KOHL` 
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon.  In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings. 
## 
## $`SENATOR LITTLE ORANGE` 
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.