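Using regex to extract information from a string

This is a follow-up to, and a more complicated version of, this question: Extracting contents of a string within parentheses.

In that question, I had the following string -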

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

and I wanted to get a list of tuples in the form (actor, character) -

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]

To generalize the problem, I have a slightly more complicated string from which I need to extract the same information. The string is -

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"

I need to format this as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), 
('Stephen Root',''), ('Laura Dern', 'Delilah')]

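I know I can replace the filler words (with, and, &, etc.), but I don't know how to add a blank entry - '' - when there is no character name for an actor (in this case Stephen Root). What would be the best way to do this?

Finally, I need to account for an actor having multiple roles, and build a tuple for each role the actor has. My final string is: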

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)"

and I need to build a list of tuples as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),  
('Glenn Howerton', 'Brad'), ('Stephen Root',''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Thank you.

Source

2011-08-10 David542

@Michael: thanks for the spelling edit. – David542

Is it really necessary to use regular expressions? – utdemir

No, it can be anything. Whatever works best. – David542

import re 
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

characters = splitre.split(credits) 
pairs = [] 
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
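
To see what the splitting step does on its own, here is a minimal sketch that applies the same splitre pattern to a shortened sample string (the `sample` and `pieces` names are only for illustration). The negative lookahead rejects a comma when a ")" follows it before any "(", which is how commas inside parentheses are left alone.

```python
import re

# Same pattern as splitre above. The lookahead (?![^()]*\)) fails when a ")"
# can be reached without crossing another parenthesis, i.e. when the comma
# sits inside a pair of parentheses.
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# Illustrative input, shortened from the string in the question.
sample = "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah)"

pieces = splitre.split(sample)
print(pieces)
# ['Glenn Howerton (Gary, Brad)', '', 'Stephen Root', 'Laura Dern (Delilah)']
```

The empty string between the ", " and "with" separators is the reason the loop above skips falsy items with `if character:`.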

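Source

2011-08-10 13:14:14

What you want is to identify sequences of words starting with a capital letter, plus some complications (IMHO you cannot assume that every name is made of Name Surname; there is also Name Surname Jr., Name M. Surname, or other localized variants such as Jean-Claude van Damme, Louis da Silva, etc.).

Now, this is likely to be overkill for the sample input you posted, but as I wrote above, I assume things will soon get messy, so I would tackle this with nltk.

Here is a very crude and not very well tested snippet, but it should do the job: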

import nltk 
from nltk.chunk.regexp import RegexpParser 

_patterns = [ 
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'), # proper nouns 
    (r'^[(]$', 'O'), 
    (r'[,]', 'COMMA'), 
    (r'^[)]$', 'C'), 
    (r'.+', 'NN')         # nouns (default) 
] 

_grammar = """ 
     NAME: {<NNP> <COMMA> <NNP>} 
     NAME: {<NNP>+} 
     ROLE: {<O> <NAME>+ <C>} 
     """  
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)" 
tagger = nltk.RegexpTagger(_patterns)  
chunker = RegexpParser(_grammar) 
text = text.replace('(', ' ( ').replace(')', ' ) ').replace(',', ' , ')  # pad punctuation so split() yields it as separate tokens
tokens = text.split() 
tagged_text = tagger.tag(tokens) 
tree = chunker.parse(tagged_text) 

for n in tree: 
    # NLTK 3 exposes the chunk label via label(); older releases used the .node attribute 
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']: 
        print(n) 

# output is: 
# (NAME Will/NNP Ferrell/NNP) 
# (ROLE (/O (NAME Nick/NNP Halsey/NNP))/C) 
# (NAME Rebecca/NNP Hall/NNP) 
# (ROLE (/O (NAME Samantha/NNP))/C) 
# (NAME Glenn/NNP Howerton/NNP) 
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP))/C) 
# (NAME Stephen/NNP Root/NNP) 
# (NAME Laura/NNP Dern/NNP) 
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP))/C)

You then have to process the tagged output and put the names and roles into a list instead of printing them, but you get the picture.
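
A minimal sketch of that post-processing step, kept free of nltk so it runs on its own. It assumes the chunker output has already been flattened into (label, tokens) pairs; the `chunks` list and the `pair_up` helper are hypothetical stand-ins for illustration, not part of the nltk API.

```python
def pair_up(chunks):
    """Turn a flat sequence of (label, tokens) chunks into (actor, role) tuples.

    A NAME chunk starts a new actor; a ROLE chunk right after it supplies that
    actor's roles. An actor without a ROLE chunk gets an empty role.
    """
    pairs = []
    actor = None  # actor seen but not yet paired with a role
    for label, tokens in chunks:
        if label == 'NAME':
            if actor is not None:
                pairs.append((actor, ''))  # previous actor had no role
            actor = ' '.join(tokens)
        elif label == 'ROLE' and actor is not None:
            # Drop the parenthesis tokens, then split the role names on commas.
            inner = ' '.join(t for t in tokens if t not in ('(', ')'))
            for role in inner.split(','):
                pairs.append((actor, role.strip()))
            actor = None
    if actor is not None:
        pairs.append((actor, ''))  # trailing actor with no role
    return pairs


# Hypothetical flattened chunker output for part of the sample text.
chunks = [
    ('NAME', ['Glenn', 'Howerton']),
    ('ROLE', ['(', 'Gary', ',', 'Brad', ')']),
    ('NAME', ['Stephen', 'Root']),
    ('NAME', ['Laura', 'Dern']),
    ('ROLE', ['(', 'Delilah', ',', 'Stacy', ')']),
]

print(pair_up(chunks))
# [('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''),
#  ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
```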

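What we are doing here is a first pass in which we tag each token according to the regexes in _patterns, and then a second pass that builds more complex chunks based on a simple grammar. You can make the grammar and the patterns as complicated as you like, e.g. to catch name variants, messy input, abbreviations, and so on.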
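
The first pass can be imitated in plain Python to see which tag each token receives. This is only a sketch under the assumption that the first matching pattern wins, which is my understanding of how nltk's RegexpTagger behaves; the `tag_tokens` helper and the shortened `text` are hypothetical and not part of nltk.

```python
import re

# Same (regex, tag) pairs as _patterns above, tried in order.
patterns = [
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'),  # proper nouns
    (r'^[(]$', 'O'),
    (r'[,]', 'COMMA'),
    (r'^[)]$', 'C'),
    (r'.+', 'NN'),                                  # nouns (default)
]


def tag_tokens(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in patterns:
            if re.match(regex, token):
                tagged.append((token, tag))
                break
    return tagged


# Illustrative input, shortened from the sample text.
text = "Glenn Howerton (Gary, Brad), with Stephen Root"
# Pad punctuation with spaces so that split() returns it as separate tokens.
text = text.replace('(', ' ( ').replace(')', ' ) ').replace(',', ' , ')

tagged = tag_tokens(text.split())
print(tagged)
# [('Glenn', 'NNP'), ('Howerton', 'NNP'), ('(', 'O'), ('Gary', 'NNP'),
#  (',', 'COMMA'), ('Brad', 'NNP'), (')', 'C'), (',', 'COMMA'),
#  ('with', 'NN'), ('Stephen', 'NNP'), ('Root', 'NNP')]
```

Filler words such as "with" fall through to the default NN tag, so they never become part of a NAME chunk.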

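I think doing this with a single regex would be a pain for non-trivial input.

Otherwise, Tim's solution solves the problem nicely for the input you posted, and without the nltk dependency.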

Source

2011-08-10 13:49:00

If you want a non-regex solution... (Assuming no nested parentheses.)

in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"  

in_list = [] 
is_in_paren = False 
item = {} 
next_string = '' 

index = 0 
while index < len(in_string):
    char = in_string[index]

    if in_string[index:].startswith(' and') and not is_in_paren:
        actor = next_string
        if actor.startswith(' with '):
            actor = actor[6:]
        item['actor'] = actor
        in_list.append(item)
        item = {}
        next_string = ''
        index += 4
    elif char == '(':
        is_in_paren = True
        item['actor'] = next_string
        next_string = ''
    elif char == ')':
        is_in_paren = False
        item['part'] = next_string
        in_list.append(item)
        item = {}
        next_string = ''
    elif char == ',':
        if is_in_paren:
            item['part'] = next_string
            next_string = ''
            in_list.append(item)
            item = item.copy()
            item.pop('part')
    else:
        next_string = "%s%s" % (next_string, char)

    index += 1


out_list = []
for entry in in_list:  # renamed from "dict" to avoid shadowing the built-in
    actor = entry.get('actor')
    part = entry.get('part')

    if part is None:
        part = ''

    out_list.append((actor.strip(), part.strip()))

print(out_list)

Output: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Source

2011-08-10 17:10:36 jcfollower

Tim Pietzcker's solution can be simplified to (note that the patterns have been modified):

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

pairs = [] 
for character in splitre.split(credits): 
    gr = matchre.match(character).groups('') 
    for part in splitparts.split(gr[1]): 
        pairs.append((gr[0], part)) 

print(pairs)

And then to:

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

gen = (matchre.match(character).groups('') for character in splitre.split(credits)) 

pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])] 

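print(pp)

The trick is to call groups('') with the argument '', so that a group that did not take part in the match comes back as an empty string instead of None.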
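
A short illustration of that trick, using a deliberately simplified pattern (the `pattern`, `m`, and `m2` names are only for this sketch and are not the matchre from the answer):

```python
import re

# Simplified pattern: a two-word name, optionally followed by "(...)".
pattern = re.compile(r"(\w+ \w+)(?: \(([^)]*)\))?")

m = pattern.match("Stephen Root")
print(m.groups())    # ('Stephen Root', None): the optional group did not match
print(m.groups(''))  # ('Stephen Root', ''): None is replaced by the default ''

m2 = pattern.match("Laura Dern (Delilah)")
print(m2.groups(''))  # ('Laura Dern', 'Delilah')

# Splitting an empty string still yields one (empty) part, so an actor
# without a role produces exactly one (actor, '') pair in the loop.
print(re.split(r"\s*,\s*", ''))  # ['']
```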

Source

2011-08-10 22:25:51 eyquem
