2014-03-26

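Match two vectors of strings to a vector of text, limited by the distance between the two

I am trying to find the most efficient way to match two vectors of strings against a third string. I want to restrict the second match to within a limited number of words or characters of the first match.

Let's say I have a data frame of names like this: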

signers <- data.frame(
    first = 
     c("Benjamin","Thomas","Robert","George","Thomas","Jared","James","John","James","George","George","James","Edmund","George") , 
    last = 
     c("Franklin","Mifflin","Morris","Clymer","Fitzsimons","Ingersoll","Wilson","Blair","Madison","Washington","Mason","McClurg","Randolph","Wythe") 
    ) 

And I have some text like this:

text <- 
"A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams. 
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained." 

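I want to search for each name in the `signers` data frame and flag whether it appears in the text.

For Benjamin Franklin and George Wythe, the name appears in the text exactly. For Edmund Randolph, one word (10 characters, counting the surrounding spaces) separates his first and last names.

So what I am looking for is something like this: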

      first       last inparagraph
1  Benjamin   Franklin           1
2    Thomas    Mifflin
3    Robert     Morris
4    George     Clymer
5    Thomas Fitzsimons
6     Jared  Ingersoll
7     James     Wilson
8      John      Blair
9     James    Madison
10   George Washington
11   George      Mason
12    James    McClurg
13   Edmund   Randolph           1
14   George      Wythe           1

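I thought of using `lapply` to find where each first name occurs, but I am not sure how to then search only within a limited range around that first name.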

namesfinds <- lapply(signers$first , grep, text) 

Answers

2

Here is an option using regular expressions that allows up to three words or initials between the first and last names:

patterns <- paste0("(.*)(", signers$first, "(\\s+[[:alpha:].]+){,3}\\s+", signers$last, ")(.*)") 
signers$inparagraph <- ifelse(sapply(patterns, grepl, text), "1", "") 

Produces:

      first       last inparagraph
1  Benjamin   Franklin           1
2    Thomas    Mifflin
3    Robert     Morris
4    George     Clymer
5    Thomas Fitzsimons
6     Jared  Ingersoll
7     James     Wilson
8      John      Blair           1
9     James    Madison
10   George Washington
11   George      Mason
12    James    McClurg
13   Edmund   Randolph           1
14   George      Wythe           1

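Note that John Blair matches because I modified `text` to include him for testing purposes (see the data below). If you want to allow fewer intervening words, change `{,3}` to a lower number. Now, if you want to actually extract the matched names, you can do this: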

unname(sapply(patterns, gsub, "\\2", text))[sapply(patterns, grepl, text)] 
# [1] "Benjamin Franklin"  "John W. F. Blair"   "Edmund Jennings Randolph" 
# [4] "George Wythe"  
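
As a possible alternative for the extraction step, here is a minimal sketch that uses base R's `regexpr()` and `regmatches()` instead of `gsub()` with a back-reference. The shortened `text` and the four-name subset are assumptions made up for illustration, and `core` is simply the pattern above without the surrounding `(.*)` groups (with `{,3}` written as `{0,3}`).

```r
# Assumed sample data: a subset of the names and a shortened text
first <- c("Benjamin", "Thomas", "Edmund", "George")
last  <- c("Franklin", "Mifflin", "Randolph", "Wythe")
text  <- "including Benjamin Franklin, George Wythe and Edmund Jennings Randolph who abstained"

# Core pattern only: first name, up to three words or initials, last name
core <- paste0(first, "(\\s+[[:alpha:].]+){0,3}\\s+", last)

# regexpr() locates the first match of each pattern in the text;
# regmatches() extracts it (character(0) when a pattern does not match)
found <- lapply(core, function(p) regmatches(text, regexpr(p, text)))
names(found) <- paste(first, last)

# unlist() drops the empty results, leaving only the names that were found
matched <- unlist(found)
print(matched)
```

The names of `matched` are the signer names as listed, and the values are the strings actually found in the text, so a middle name such as "Jennings" shows up in the value but not in the name.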

Here is the `text` I used:

text <- 
    "A lot of people attended the Constitutional Convention in Philadephia, including Alexander Hamilton, Benjamin Franklin and John Adams. 
Not everyone who attended the convention ended up signing the Constitution, including George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained and John W. F. Blair ate cake" 
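
The word limit discussed above can also be turned into a parameter. The following is only a sketch: `name_in_text` and its `max_words` argument are hypothetical names introduced here, and the sample text is a shortened version of the one above.

```r
# Hypothetical helper: for each first/last pair, TRUE if the first name is
# followed by the last name with at most `max_words` intervening words or
# initials somewhere in `text`
name_in_text <- function(first, last, text, max_words = 3) {
  patterns <- paste0(first, "(\\s+[[:alpha:].]+){0,", max_words, "}\\s+", last)
  vapply(patterns, function(p) any(grepl(p, text)), logical(1),
         USE.NAMES = FALSE)
}

# Assumed sample data for illustration
sample_text <- "George Wythe, John F. Mercer and Edmund Jennings Randolph who abstained and John W. F. Blair ate cake"
first <- c("Edmund", "John", "John")
last  <- c("Randolph", "Blair", "Mercer")

up_to_three <- name_in_text(first, last, sample_text)
up_to_one   <- name_in_text(first, last, sample_text, max_words = 1)
print(up_to_three)
print(up_to_one)
```

With `max_words = 1`, "John W. F. Blair" should no longer count as a match because two initials separate the first and last names, while "Edmund Jennings Randolph" and "John F. Mercer" still do.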

I know it's been two years, but I really appreciate this answer! – MatthewR

@MatthewR, no problem, I appreciate the appreciation ;) – BrodieG

1

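It may not be pretty, but it seems to work. The trick I used is to paste a regular expression between the first and last names so that middle names are caught. It looks like it would work with any name. Hopefully it works on all of your data.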

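# Full names as they should read once matched
a <- paste(signers[,1], signers[,2])
# One pattern per signer: first name, any characters, last name
pst <- paste(signers$first, ".*", signers$last, sep = "")
# Keep the patterns that matched and turn ".*" back into a space
gg <- gsub("\\.\\*", " ", names(unlist(sapply(pst, grep, text))))
signers$inparagraph <- ifelse(a %in% gg, "1", "")
signers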
##       first       last inparagraph
## 1  Benjamin   Franklin           1
## 2    Thomas    Mifflin
## 3    Robert     Morris
## 4    George     Clymer
## 5    Thomas Fitzsimons
## 6     Jared  Ingersoll
## 7     James     Wilson
## 8      John      Blair
## 9     James    Madison
## 10   George Washington
## 11   George      Mason
## 12    James    McClurg
## 13   Edmund   Randolph           1
## 14   George      Wythe           1
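
Because `.*` places no limit on how far apart the two names may be, it can also pair a first name with an unrelated last name later in the text. If the limit should be expressed in characters, as the question suggests, one possible sketch is to replace `.*` with a bounded quantifier such as `.{1,10}`. The sample text and the deliberately wrong pairing "Benjamin Adams" are assumptions made up for illustration.

```r
# Assumed sample data; "Benjamin" + "Adams" is a deliberately wrong pairing
text  <- "Alexander Hamilton, Benjamin Franklin and John Adams attended; Edmund Jennings Randolph abstained."
first <- c("Benjamin", "Edmund", "Benjamin")
last  <- c("Franklin", "Randolph", "Adams")

# Unbounded gap (as in the code above) versus a gap of 1 to 10 characters
unbounded <- paste0(first, ".*", last)
bounded   <- paste0(first, ".{1,10}", last)

hit_unbounded <- vapply(unbounded, function(p) grepl(p, text), logical(1),
                        USE.NAMES = FALSE)
hit_bounded   <- vapply(bounded, function(p) grepl(p, text), logical(1),
                        USE.NAMES = FALSE)
print(hit_unbounded)
print(hit_bounded)
```

Here " Jennings " is exactly 10 characters, so "Edmund Jennings Randolph" still matches the bounded pattern, whereas the 19 characters between "Benjamin" and "Adams" exceed the limit.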