2017-05-16

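I have a large amount of product description data and need to separate product names and intent from the descriptions. After tagging the text with POS tags, I found that pulling out the NNP-tagged words helps with further cleaning. How can I filter specific POS tags from a list into separate lists?

I have data like the following. I want to keep only the words tagged NNP, grouped into one list per sentence, but I have not been able to do that.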

data = [[('User', 'NNP'), 
    ('is', 'VBZ'), 
    ('not', 'RB'), 
    ('able', 'JJ'), 
    ('to', 'TO'), 
    ('order', 'NN'), 
    ('products', 'NNS'), 
    ('from', 'IN'), 
    ('iShopCatalog', 'NN'), 
    ('Coala', 'NNP'), 
    ('excluding', 'VBG'), 
    ('articles', 'NNS'), 
    ('from', 'IN'), 
    ('VWR', 'NNP')], 
[('Arfter', 'NNP'), 
    ('transferring', 'VBG'), 
    ('the', 'DT'), 
    ('articles', 'NNS'), 
    ('from', 'IN'), 
    ('COALA', 'NNP'), 
    ('to', 'TO'), 
    ('SRM', 'VB'), 
    ('the', 'DT'), 
    ('Category', 'NNP'), 
    ('S9901', 'NNP'), 
    ('Dummy', 'NNP'), 
    ('is', 'VBZ'), 
    ('maintained', 'VBN')], 
[('Due', 'JJ'), 
    ('to', 'TO'), 
    ('this', 'DT'), 
    ('the', 'DT'), 
    ('user', 'NN'), 
    ('is', 'VBZ'), 
    ('not', 'RB'), 
    ('able', 'JJ'), 
    ('to', 'TO'), 
    ('order', 'NN'), 
    ('the', 'DT'), 
    ('product', 'NN')], 
[('All', 'DT'), 
    ('other', 'JJ'), 
    ('users', 'NNS'), 
    ('can', 'MD'), 
    ('order', 'NN'), 
    ('these', 'DT'), 
    ('articles', 'NNS')], 
[('She', 'PRP'), 
    ('can', 'MD'), 
    ('order', 'NN'), 
    ('other', 'JJ'), 
    ('products', 'NNS'), 
    ('from', 'IN'), 
    ('a', 'DT'), 
    ('POETcatalog', 'NNP'), 
    ('without', 'IN'), 
    ('any', 'DT'), 
    ('problems', 'NNS')], 
[('Furtheremore', 'IN'), 
    ('she', 'PRP'), 
    ('is', 'VBZ'), 
    ('able', 'JJ'), 
    ('to', 'TO'), 
    ('order', 'NN'), 
    ('products', 'NNS'), 
    ('from', 'IN'), 
    ('the', 'DT'), 
    ('Vendor', 'NNP'), 
    ('VWR', 'NNP'), 
    ('through', 'IN'), 
    ('COALA', 'NNP')], 
[('But', 'CC'), 
    ('articles', 'NNS'), 
    ('from', 'IN'), 
    ('all', 'DT'), 
    ('other', 'JJ'), 
    ('suppliers', 'NNS'), 
    ('are', 'VBP'), 
    ('not', 'RB'), 
    ('orderable', 'JJ')], 
[('I', 'PRP'), 
    ('already', 'RB'), 
    ('spoke', 'VBD'), 
    ('to', 'TO'), 
    ('anic', 'VB'), 
    ('who', 'WP'), 
    ('maintain', 'VBP'), 
    ('the', 'DT'), 
    ('catalog', 'NN'), 
    ('COALA', 'NNP'), 
    ('and', 'CC'), 
    ('they', 'PRP'), 
    ('said', 'VBD'), 
    ('that', 'IN'), 
    ('the', 'DT'), 
    ('reason', 'NN'), 
    ('should', 'MD'), 
    ('be', 'VB'), 
    ('the', 'DT'), 
    ('assignment', 'NN'), 
    ('of', 'IN'), 
    ('the', 'DT'), 
    ('plant', 'NN')], 
[('User', 'NNP'), 
    ('is', 'VBZ'), 
    ('a', 'DT'), 
    ('assinged', 'JJ'), 
    ('to', 'TO'), 
    ('Universitaet', 'NNP'), 
    ('Regensburg', 'NNP'), 
    ('in', 'IN'), 
    ('Scout', 'NNP'), 
    ('but', 'CC'), 
    ('in', 'IN'), 
    ('P17', 'NNP'), 
    ('table', 'NN'), 
    ('YESRMCDMUSER01', 'NNP'), 
    ('she', 'PRP'), 
    ('is', 'VBZ'), 
    ('assigned', 'VBN'), 
    ('to', 'TO'), 
    ('company', 'NN'), 
    ('001500', 'CD'), 
    ('Merck', 'NNP'), 
    ('KGaA', 'NNP')], 
[('Please', 'NNP'), 
    ('find', 'VB'), 
    ('attached', 'JJ'), 
    ('some', 'DT'), 
    ('screenshots', 'NNS')]] 

I wrote the following code:

def prodname(a):
    p = []  # a single shared list, so sentence boundaries are lost
    for i in a:
        for j in range(len(i)):
            if i[j][1] == 'NNP':
                p.append(i[j][0])
    return p

This gives the following output:

['User', 
    'Coala', 
    'VWR', 
    'Arfter', 
    'COALA', 
    'Category', 
    'S9901', 
    'Dummy', 
    'POETcatalog', 
    'Vendor', 
    'VWR', 
    'COALA', 
    'COALA', 
    'User', 
    'Universitaet', 
    'Regensburg', 
    'Scout', 
    'P17', 
    'YESRMCDMUSER01', 
    'Merck', 
    'KGaA', 
    'Please'] 

The output I want is:

[['User', 'Coala', 'VWR'],
 ['Arfter', 'COALA', 'Category', 'S9901', 'Dummy'],
 [],
 [],
 ['POETcatalog'],
 ['Vendor', 'VWR', 'COALA'],
 [],
 ['COALA'],
 ['User', 'Universitaet', 'Regensburg', 'Scout', 'P17', 'YESRMCDMUSER01', 'Merck', 'KGaA'],
 ['Please']]

I also tried creating [[] for i in range(len(data))] and appending to the respective lists, but could not get it to work.

Answers

1

You can use this list comprehension:

[[j[0] for j in i if j[-1]=="NNP"] for i in data] 

Output:

[['User', 'Coala', 'VWR'], ['Arfter', 'COALA', 'Category', 'S9901', 'Dummy'], [], [], ['POETcatalog'], ['Vendor', 'VWR', 'COALA'], [], ['COALA'], ['User', 'Universitaet', 'Regensburg', 'Scout', 'P17', 'YESRMCDMUSER01', 'Merck', 'KGaA'], ['Please']] 
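
For readers who find the comprehension dense, the same grouping can be written as an explicit loop that starts a fresh list for each sentence, which is the step the `prodname` function in the question appears to be missing. This is only a sketch: the function name `nnp_per_sentence` and the small `sample` data are made up for illustration.

```python
def nnp_per_sentence(tagged_sentences, tag="NNP"):
    """Return one list of words per sentence, keeping only words tagged `tag`."""
    result = []
    for sentence in tagged_sentences:
        kept = []  # fresh list for each sentence, so results stay grouped
        for word, pos in sentence:
            if pos == tag:
                kept.append(word)
        result.append(kept)  # appended even when empty, so indexes still line up
    return result


# Hypothetical two-sentence sample in the same (word, tag) format as `data`.
sample = [
    [("User", "NNP"), ("is", "VBZ"), ("VWR", "NNP")],
    [("Due", "JJ"), ("to", "TO"), ("this", "DT")],
]
print(nnp_per_sentence(sample))  # [['User', 'VWR'], []]
```

Appending `kept` even when it is empty keeps the result aligned with the input, so the n-th output list always belongs to the n-th sentence.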

Thanks. That works. – sharathchandramandadi

0

A list comprehension is the way to go, but @McGrady's answer may be a little hard to read.

Here is a more readable version:

document = [[('User', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'), ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'), ('VWR', 'NNP')], [('Arfter', 'NNP'), ('transferring', 'VBG'), ('the', 'DT'), ('articles', 'NNS'), ('from', 'IN'), ('COALA', 'NNP'), ('to', 'TO'), ('SRM', 'VB'), ('the', 'DT'), ('Category', 'NNP'), ('S9901', 'NNP'), ('Dummy', 'NNP'), ('is', 'VBZ'), ('maintained', 'VBN')], [('Due', 'JJ'), ('to', 'TO'), ('this', 'DT'), ('the', 'DT'), ('user', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('the', 'DT'), ('product', 'NN')], [('All', 'DT'), ('other', 'JJ'), ('users', 'NNS'), ('can', 'MD'), ('order', 'NN'), ('these', 'DT'), ('articles', 'NNS')], [('She', 'PRP'), ('can', 'MD'), ('order', 'NN'), ('other', 'JJ'), ('products', 'NNS'), ('from', 'IN'), ('a', 'DT'), ('POETcatalog', 'NNP'), ('without', 'IN'), ('any', 'DT'), ('problems', 'NNS')], [('Furtheremore', 'IN'), ('she', 'PRP'), ('is', 'VBZ'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('Vendor', 'NNP'), ('VWR', 'NNP'), ('through', 'IN'), ('COALA', 'NNP')], [('But', 'CC'), ('articles', 'NNS'), ('from', 'IN'), ('all', 'DT'), ('other', 'JJ'), ('suppliers', 'NNS'), ('are', 'VBP'), ('not', 'RB'), ('orderable', 'JJ')], [('I', 'PRP'), ('already', 'RB'), ('spoke', 'VBD'), ('to', 'TO'), ('anic', 'VB'), ('who', 'WP'), ('maintain', 'VBP'), ('the', 'DT'), ('catalog', 'NN'), ('COALA', 'NNP'), ('and', 'CC'), ('they', 'PRP'), ('said', 'VBD'), ('that', 'IN'), ('the', 'DT'), ('reason', 'NN'), ('should', 'MD'), ('be', 'VB'), ('the', 'DT'), ('assignment', 'NN'), ('of', 'IN'), ('the', 'DT'), ('plant', 'NN')], [('User', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('assinged', 'JJ'), ('to', 'TO'), ('Universitaet', 'NNP'), ('Regensburg', 'NNP'), ('in', 'IN'), ('Scout', 'NNP'), ('but', 'CC'), ('in', 'IN'), ('P17', 
'NNP'), ('table', 'NN'), ('YESRMCDMUSER01', 'NNP'), ('she', 'PRP'), ('is', 'VBZ'), ('assigned', 'VBN'), ('to', 'TO'), ('company', 'NN'), ('001500', 'CD'), ('Merck', 'NNP'), ('KGaA', 'NNP')], [('Please', 'NNP'), ('find', 'VB'), ('attached', 'JJ'), ('some', 'DT'), ('screenshots', 'NNS')]] 
output = [[word for word, pos in sentence if pos=='NNP'] for sentence in document] 
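
If more than one tag needs to be kept, the same comprehension can test membership in a set instead of comparing against a single string. A minimal sketch, assuming the Penn Treebank tags `NNP` and `NNPS`; the helper name `filter_tags` and the `sample` data are hypothetical and not part of the original answer.

```python
def filter_tags(document, wanted):
    """Keep, per sentence, only the words whose POS tag is in `wanted`."""
    wanted = set(wanted)  # set membership keeps the check cheap
    return [[word for word, pos in sentence if pos in wanted]
            for sentence in document]


# Hypothetical sample in the same (word, tag) format as `document`.
sample = [
    [("Vendor", "NNP"), ("ships", "VBZ"), ("Americans", "NNPS")],
    [("All", "DT"), ("other", "JJ"), ("users", "NNS")],
]
print(filter_tags(sample, {"NNP", "NNPS"}))  # [['Vendor', 'Americans'], []]
```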

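If you like cleaner code and can wrap your head around nested list comprehensions (see https://stackoverflow.com/a/3633145/610569), you can also write it with two for clauses in a single comprehension. Note that this version returns one flat list of NNP words rather than one list per sentence: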

output = [word for sentence in document for word, pos in sentence if pos=='NNP']
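
To make the difference between the two comprehensions concrete, the sketch below runs both on a small made-up `sample`. The nested form keeps one list per sentence, while the single-level form with two `for` clauses yields one flat list.

```python
# Hypothetical sample in the same (word, tag) format as `document`.
sample = [
    [("User", "NNP"), ("is", "VBZ"), ("VWR", "NNP")],
    [("Due", "JJ"), ("to", "TO")],
    [("COALA", "NNP")],
]

# Comprehension inside a comprehension: one inner list per sentence.
nested = [[word for word, pos in sentence if pos == "NNP"] for sentence in sample]

# One comprehension with two `for` clauses: sentence boundaries are dropped.
flat = [word for sentence in sample for word, pos in sentence if pos == "NNP"]

print(nested)  # [['User', 'VWR'], [], ['COALA']]
print(flat)    # ['User', 'VWR', 'COALA']
```

Only the nested form matches the per-sentence output requested in the question.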