2011-08-10 227 views
7

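Extracting information from a string using regular expressions

This is a follow-up to, and a more complicated version of, this question: Extracting contents of a string within parentheses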

In that question, I had the following string -

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)" 

and I wanted to get a list of tuples of the form (actor, character) -

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')] 

To generalize the problem, I have a slightly more complicated string from which I need to extract the same information. The string is -

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)" 

I need to format this as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), 
('Stephen Root', ''), ('Laura Dern', 'Delilah')] 

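I know I can replace the filler words (with, and, &, etc.), but I can't quite figure out how to add a blank entry - '' - when there is no character name for an actor (in this case Stephen Root). What would be the best way to do this?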

Finally, I need to account for an actor having multiple roles, and build a tuple for each role the actor has. My final string is:

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)" 

and I need to build a list of tuples as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),  
('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')] 

Thank you.

@Michael: Thanks for the spelling edit. – David542

Is it really necessary to use a regular expression? – utdemir

No, it can be anything. Whatever works and is best. – David542

Answers

4
import re 
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

characters = splitre.split(credits) 
pairs = [] 
for character in characters: 
    if character: 
        match = matchre.match(character) 
        if match: 
            actor = match.group(1).strip() 
            if match.group(2): 
                parts = splitparts.split(match.group(2)) 
                for part in parts: 
                    pairs.append((actor, part)) 
            else: 
                pairs.append((actor, "")) 

print(pairs) 

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')] 
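
To see why the comma inside "(Gary, Brad)" does not split the string, it can help to run the splitting pattern on its own. This is only a small sketch that reuses splitre from above; the names sample and pieces are made up for the illustration:

```python
import re

# Same pattern as splitre above. A comma counts as a separator only when
# the lookahead (?![^()]*\)) succeeds, i.e. when no ")" can be reached
# from the comma without first crossing another "(" or ")".
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

sample = "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah)"

# ", with" yields an empty piece between the two separators, so drop empties.
pieces = [piece for piece in splitre.split(sample) if piece]
print(pieces)
# ['Glenn Howerton (Gary, Brad)', 'Stephen Root', 'Laura Dern (Delilah)']
```

Note that this relies on the parentheses being balanced and not nested; with nested parentheses the lookahead would no longer tell "inside" from "outside" reliably.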
0

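What you want is to identify sequences of words that start with a capital letter, plus some complications (IMHO you cannot assume that every name is made of Firstname Lastname; there is also Firstname Lastname Jr., Firstname M. Lastname, and other localized variations such as Jean-Claude van Damme, Louis da Silva, etc.).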

Now, this is likely to be overkill for the sample input you posted, but as I wrote above, I think things will get messy quickly, so I would tackle this with nltk.

Here is a very rough and not very well tested snippet, but it should do the job:

import nltk 
from nltk.chunk.regexp import RegexpParser 

_patterns = [ 
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'), # proper nouns 
    (r'^[(]$', 'O'), 
    (r'[,]', 'COMMA'), 
    (r'^[)]$', 'C'), 
    (r'.+', 'NN')         # nouns (default) 
] 

_grammar = """ 
     NAME: {<NNP> <COMMA> <NNP>} 
     NAME: {<NNP>+} 
     ROLE: {<O> <NAME>+ <C>} 
     """  
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)" 
tagger = nltk.RegexpTagger(_patterns)  
chunker = RegexpParser(_grammar) 
# pad parentheses and commas with spaces so that split() returns them as separate tokens
text = text.replace('(', ' ( ').replace(')', ' ) ').replace(',', ' , ') 
tokens = text.split() 
tagged_text = tagger.tag(tokens) 
tree = chunker.parse(tagged_text) 

for n in tree: 
    # NLTK 3 exposes the chunk name through label(); older releases used n.node
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']: 
        print(n) 

# output is: 
# (NAME Will/NNP Ferrell/NNP) 
# (ROLE (/O (NAME Nick/NNP Halsey/NNP))/C) 
# (NAME Rebecca/NNP Hall/NNP) 
# (ROLE (/O (NAME Samantha/NNP))/C) 
# (NAME Glenn/NNP Howerton/NNP) 
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP))/C) 
# (NAME Stephen/NNP Root/NNP) 
# (NAME Laura/NNP Dern/NNP) 
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP))/C) 

You then have to process the tagged output and put the names and roles into a list instead of printing them, but you get the picture.

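What we do here is a first pass in which we tag each token according to the regexes in _patterns, followed by a second pass that builds more complex chunks based on a simple grammar. You can make the grammar and the patterns as complicated as you like, for example to catch name variations, messy input, abbreviations, and so on.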
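
To make the two passes concrete without installing nltk, here is a rough standard-library sketch of the same idea. It is not the nltk API: the tag names, PATTERNS, tag() and build_pairs() are all invented for this illustration, and it only assumes input shaped like the example above (balanced, non-nested parentheses).

```python
import re

# Pass 1: tag every token with the first pattern that matches it.
# The tag set below is hypothetical, loosely mirroring _patterns above.
PATTERNS = [
    (re.compile(r"^\($"), "OPEN"),
    (re.compile(r"^\)$"), "CLOSE"),
    (re.compile(r"^,$"), "COMMA"),
    (re.compile(r"^[A-Z][\w.'-]*$"), "NNP"),  # capitalised word
    (re.compile(r".+"), "OTHER"),             # filler such as "with", "and"
]

def tag(tokens):
    return [(tok, next(label for rx, label in PATTERNS if rx.search(tok)))
            for tok in tokens]

# Pass 2: walk the tagged tokens and group them into (actor, role) pairs.
def build_pairs(tagged):
    pairs, words, actor, in_parens = [], [], None, False

    def flush_actor_without_role():
        if words:
            pairs.append((" ".join(words), ""))
            words.clear()

    for token, label in tagged:
        if label == "NNP" or (in_parens and label == "OTHER"):
            words.append(token)
        elif label == "OPEN":
            actor, in_parens = " ".join(words), True
            words.clear()
        elif label in ("COMMA", "CLOSE") and in_parens:
            pairs.append((actor, " ".join(words)))
            words.clear()
            in_parens = label != "CLOSE"
        else:
            # separator outside parentheses: any pending name has no role
            flush_actor_without_role()
    flush_actor_without_role()
    return pairs

text = ("Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), "
        "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)")
tokens = text.replace("(", " ( ").replace(")", " ) ").replace(",", " , ").split()
pairs = build_pairs(tag(tokens))
print(pairs)
```

The point of keeping the passes separate is that name quirks (initials, "Jr.", lowercase particles such as "van") only require changes to the tagging patterns, while the grouping logic stays the same.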

I think that doing this with a single regular expression would be painful for non-trivial input.

Otherwise, Tim's solution solves the problem nicely for the input you posted, and without the nltk dependency.

0

If you want a non-regex solution... (This assumes no nested parentheses.)

in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"  

in_list = [] 
is_in_paren = False 
item = {} 
next_string = '' 

index = 0 
while index < len(in_string): 
    char = in_string[index] 

    if in_string[index:].startswith(' and') and not is_in_paren: 
        # " and" outside parentheses ends an actor who has no role
        actor = next_string 
        if actor.startswith(' with '): 
            actor = actor[6:] 
        item['actor'] = actor 
        in_list.append(item) 
        item = {} 
        next_string = '' 
        index += 4  
    elif char == '(': 
        # everything collected so far is the actor name
        is_in_paren = True 
        item['actor'] = next_string 
        next_string = ''  
    elif char == ')': 
        # everything collected since "(" or "," is the last role
        is_in_paren = False 
        item['part'] = next_string 
        in_list.append(item) 
        item = {} 
        next_string = '' 
    elif char == ',': 
        if is_in_paren: 
            # one more role follows for the same actor
            item['part'] = next_string 
            next_string = '' 
            in_list.append(item) 
            item = item.copy() 
            item.pop('part') 
    else: 
        next_string = "%s%s" % (next_string, char) 

    index += 1 


out_list = [] 
for entry in in_list: 
    actor = entry.get('actor') 
    part = entry.get('part') 

    if part is None: 
        part = '' 

    out_list.append((actor.strip(), part.strip())) 

print(out_list) 

Output: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

1

Tim Pietzcker's solution can be simplified to (note that the patterns have been modified):

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
# the lookbehind (?<! ) keeps trailing spaces out of group 1 
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

pairs = [] 
for character in splitre.split(credits): 
    gr = matchre.match(character).groups('') 
    for part in splitparts.split(gr[1]): 
        pairs.append((gr[0], part)) 

print(pairs) 

And then to:

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
# the lookbehind (?<! ) keeps trailing spaces out of group 1 
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

gen = (matchre.match(character).groups('') for character in splitre.split(credits)) 

pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])] 

print(pp) 

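The trick is to call groups('') with the argument '', so that a group that did not take part in the match comes back as an empty string instead of None.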
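
For illustration, a minimal sketch of what the default argument changes. It uses the simpler matchre pattern from Tim's answer rather than the modified one, and the names m, default_groups, filled_groups and roles exist only for this example:

```python
import re

matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
m = matchre.match("Stephen Root")  # no parentheses, so group 2 does not participate

default_groups = m.groups()    # unmatched groups are reported as None
filled_groups = m.groups('')   # unmatched groups are reported as ''

print(default_groups)  # ('Stephen Root', None)
print(filled_groups)   # ('Stephen Root', '')

# Splitting an empty string still yields one (empty) element, which is why
# the loop produces exactly one tuple with an empty role for this actor.
roles = re.split(r"\s*,\s*", filled_groups[1])
print(roles)           # ['']
```

Because the missing role is already a string, the same splitting step can run for every actor and no separate if/else branch is needed.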