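Would it be correct to say that you want to split the file so that you have one text object per speaker, and then use a regular expression to grab the speaker's name for each object? You could then write a function to collect word frequencies, etc. for each object and put them in a table whose row or column names are the speakers' names.

If so, call your text x, use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, and then use grep() or str_extract() to return the two or three words after SENATOR (do they always have only two names, as in your example?).

Have a look here for more on using these functions, and on text manipulation in R in general: http://en.wikibooks.org/wiki/R_Programming/Text_Processing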
UPDATE: Here is a more complete answer...
#create object containing all text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")
# split the text on the phrase "STATEMENT OF"
y <- unlist(strsplit(x, "STATEMENT OF"))
# load the library containing the handy word() function
library(stringr)
# use word() to return the words in positions 3 to 4 of each string,
# which is where the first and last names are
# note: y[1] has only one word, and word() gives an error if there are
# not enough words in the string, so it is skipped here
z <- word(y[2:4], 3, 4)
z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
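No doubt a regular expression wizard could come up with something quicker and neater!

Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then create another object that combines the word frequency results with the names for further analysis.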
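For what it's worth, one possible regex version is sketched below. It is only a sketch: it uses base R rather than stringr, the `headings` vector is a made-up stand-in for the pieces of y, and it assumes every heading has the form "SENATOR <NAME>, ..." with a comma right after the name.

```r
# Toy input (hypothetical), modelled on the pieces produced by strsplit() above
headings <- c(" SENATOR HERB KOHL, CHAIRMAN\nThe Chairman. Good afternoon.",
              " SENATOR BIG APPLE KOHL, CHAIRMAN\nI am trying to identify words.",
              " SENATOR LITTLE ORANGE, CHAIRMAN\nWould it be correct to say.")

# Match everything after "SENATOR " up to (not including) the first comma.
# perl = TRUE is needed for the lookbehind (?<=...).
m <- regexpr("(?<=SENATOR )[^,]+", headings, perl = TRUE)

# regmatches() returns only the matched substrings; elements without a
# match are dropped, so check lengths if some pieces may lack a heading.
speakers <- regmatches(headings, m)
speakers
# [1] "HERB KOHL"      "BIG APPLE KOHL" "LITTLE ORANGE"
```

Unlike picking words by position, this keeps a three-word name whole and drops the trailing comma, but it will break if a heading has no comma after the name.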
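A rough sketch of that last step, in base R. The `speeches` vector and the `word_freq()` helper are made up for illustration; in practice the texts would be the elements of y and the names would be the extracted speaker names. Note that the split pattern only keeps the letters a-z and apostrophes, so digits and accented characters are treated as separators.

```r
# Hypothetical per-speaker texts, named by speaker
speeches <- c("HERB KOHL" = "Good afternoon to everybody, and thank you very much. Thank you.",
              "LITTLE ORANGE" = "You can write a function to collect word frequencies.")

# word_freq() is an illustrative helper, not part of any package
word_freq <- function(text) {
  # lower-case, then split on runs of anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)  # counts, most frequent first
}

# one frequency table per speaker; the list keeps the speaker names
freqs <- lapply(speeches, word_freq)
freqs[["HERB KOHL"]]
```

A named list is used here because each speaker has a different vocabulary; to get a single table you would first need to align the tables on a common set of words.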
Source
2012-01-11 05:35:58
Ben
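Thanks, I think this might work. – appletree 2012-01-11 06:28:33
@appletree, I've expanded my answer, I hope that helps. I had a go at a regex solution but couldn't get it to work. Perhaps someone will show us how it's done... – Ben 2012-01-11 07:40:32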