How to extract chunks from a BIO-chunked sentence? - Python

Given an input sentence that has BIO chunk tags:

[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

I need to extract the relevant phrases. For example, to extract 'NP' I need the runs of tuples tagged B-NP and I-NP.

[OUT]：

[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

(Note: the numbers in the extracted tuples are the token indexes.)

I have tried extracting them with the following code:

def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-' + chunk_type in pos:  # Append the word to the current_chunk.
            current_chunk.append(word)
            current_chunk_position.append(idx)
        else:
            if current_chunk:  # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk)
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str
                current_chunk = []
                current_chunk_position = []
    if current_chunk:  # Flush the last chunk.
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print(list(extract_chunks(tagged_sent, chunk_type='NP')))

But when I have adjacent chunks of the same type:

tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')] 

print(list(extract_chunks(tagged_sent, chunk_type='NP')))

it outputs this:

[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]

instead of the desired:

[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]

How can the code above be fixed to do this?

Apart from fixing the code above, is there a better way to extract the desired chunks of a specific chunk_type?

Source

2015-09-01 alvas

def extract_chunks(tagged_sent, chunk_type):
    grp1, grp2, chunk_type = [], [], "-" + chunk_type
    for ind, (s, tp) in enumerate(tagged_sent):
        if tp.endswith(chunk_type):
            if not tp.startswith("B"):
                # I- tag: extend the current chunk.
                grp2.append(str(ind))
                grp1.append(s)
            else:
                # B- tag: flush the previous chunk and start a new one.
                if grp1:
                    yield " ".join(grp1), "-".join(grp2)
                grp1, grp2 = [s], [str(ind)]
    # Flush the last chunk, if there is one.
    if grp1:
        yield " ".join(grp1), "-".join(grp2)

Output:

In [2]: l = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), 
    ...:    ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')] 

In [3]: list(extract_chunks(l, "NP")) 
Out[3]: 
[('The Mitsubishi Electric Company', '0-1-2-3'), 
('Managing Director', '4-5'), 
('ramen', '7')] 

In [4]: l = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')] 

In [5]: list(extract_chunks(l, "NP")) 
Out[5]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
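
The generator above splits adjacent chunks because it looks at the B- prefix, not only at the chunk type. A minimal sketch of the difference between the two tests, assuming well-formed BIO tags (the variable names are made up for this illustration):

```python
# Tags from the example sentence with two adjacent NP chunks.
tags = ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'B-VP', 'B-NP']

# The substring test used in the question is true for both B-NP and I-NP,
# so it cannot tell where one NP chunk ends and the next one begins.
in_chunk = ['-NP' in tag for tag in tags]

# Comparing against the full B- tag marks the first token of every NP chunk.
starts_chunk = [tag == 'B-NP' for tag in tags]

print(in_chunk)
print(starts_chunk)
```

Tokens 0 to 5 all pass the substring test, which is why the original code merges them into one chunk; only tokens 0, 4 and 7 start a new NP chunk.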

Source

2015-09-04 11:37:39

I would do it like this:

import re

def extract_chunks(tagged_sent, chunk_type):
    # Compile the expression we want to match.
    regex = re.compile(chunk_type)

    # Keep the matching tokens in a dictionary keyed by token index.
    first_step = {index_: tag[0] for index_, tag in enumerate(tagged_sent) if regex.search(tag[1])}

    # Build a list of [chunk_text, positions] lists following the output format.
    second_step = []
    for key_ in sorted(first_step):
        # Extend the last chunk only if this index directly follows it and the
        # tag is not a B- tag, since a B- tag always starts a new chunk.
        is_consecutive = second_step and int(second_step[-1][1].split('-')[-1]) == key_ - 1
        if is_consecutive and not tagged_sent[key_][1].startswith('B'):
            second_step[-1][0] += ' {0}'.format(first_step[key_])
            second_step[-1][1] += '-{0}'.format(key_)
        else:
            second_step.append([first_step[key_], str(key_)])

    # Build the output in its final format.
    return [tuple(item) for item in second_step]

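You could adapt it into a generator instead of building the whole output in memory as I do here, and refactor it for better performance (I'm in a hurry, so the code is far from optimal).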
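
One possible shape for the generator adaptation mentioned above, as a rough sketch rather than a tuned implementation. The name `iter_chunks` is made up for this example, and it assumes tags of the form `B-X`, `I-X` or `O`:

```python
def iter_chunks(tagged_sent, chunk_type):
    """Yield (chunk_text, positions) pairs lazily, one chunk at a time."""
    words, positions = [], []
    for idx, (word, tag) in enumerate(tagged_sent):
        # 'B-NP' -> ('B', '-', 'NP'); 'O' -> ('O', '', '').
        boundary, _, label = tag.partition('-')
        inside = label == chunk_type
        # Close the open chunk on a non-matching token or on a new B- tag.
        if words and (not inside or boundary == 'B'):
            yield ' '.join(words), '-'.join(positions)
            words, positions = [], []
        if inside:
            words.append(word)
            positions.append(str(idx))
    # Flush the last chunk if the sentence ends inside one.
    if words:
        yield ' '.join(words), '-'.join(positions)


tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
chunks = list(iter_chunks(tagged_sent, 'NP'))
print(chunks)
```

Because it yields as it goes, a caller that only needs the first chunk can call `next(iter_chunks(tagged_sent, 'NP'))` without processing the rest of the sentence.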

Hope it helps!

Source

2015-09-01 14:56:17 maccinza

Try this. It extracts chunks of the given type along with the indexes of their words, and works for any chunk type.

def extract_chunks(tagged_sent, chunk_type='NP'):
    out_sen = []
    for idx, word_pos in enumerate(tagged_sent):
        word, bio = word_pos
        # Split 'B-NP' into boundary 'B' and tag 'NP'; 'O' has no boundary.
        boundary, tag = bio.split("-", 1) if "-" in bio else ('', 'O')
        if tag != chunk_type:
            continue
        if boundary == "I" and out_sen:
            # Continue the most recent chunk.
            out_sen[-1][0] += " " + word
            out_sen[-1][-1] += "-" + str(idx)
        else:
            # 'B' (or an 'I' with no earlier chunk) starts a new chunk.
            out_sen.append([word, str(idx)])
    return out_sen

Demo:

>>> tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')] 
>>> output_sent = extract_chunks(tagged_sent) 
>>> print(list(map(tuple, output_sent)))
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
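
If chunks of every type are wanted in a single pass, the same idea can be extended. This is only a sketch under the assumption that tags look like `B-X`, `I-X` or `O`; the function name `extract_all_chunks` and its dictionary return format are made up for this example:

```python
from collections import defaultdict


def extract_all_chunks(tagged_sent):
    """Group chunks by type: {'NP': [(text, positions), ...], ...}."""
    chunks = defaultdict(list)
    last_label = None  # chunk type of the previous token, None after 'O'
    for idx, (word, bio) in enumerate(tagged_sent):
        boundary, _, label = bio.partition('-')
        if not label:
            # 'O' (outside) tokens belong to no chunk.
            last_label = None
            continue
        if boundary == 'I' and label == last_label:
            # Continue the chunk opened by the preceding token(s).
            text, positions = chunks[label][-1]
            chunks[label][-1] = (text + ' ' + word, positions + '-' + str(idx))
        else:
            # A B- tag, or an I- tag with nothing to continue, opens a chunk.
            chunks[label].append((word, str(idx)))
        last_label = label
    return dict(chunks)


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
all_chunks = extract_all_chunks(tagged_sent)
print(all_chunks['NP'])
```

Unlike the function above, an I- tag here only continues a chunk when the immediately preceding token had the same chunk type, so a stray I- tag starts its own chunk instead of attaching to a distant one.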

Source

2015-09-04 07:29:03 Riyaz
