How do I fix this regular expression to capture specific characters of a string?

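I have a very_largeString that contains a list of words along with some ids. I want to extract all the words whose ids start with NC and AQ when they occur consecutively, and print the rest of the id as well. For example: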

very_largeString= ''' Hola hola I 1 
compis compis NCMS000 0.500006 
! ! Fat 1 

esta este DD0FS0 0.986779 
y y CC 0.999962 
es ser VSIP3S0 1 
que que CS 0.437483 
es ser VSIP3S0 1 
muy muy RG 1 
sencilla sencillo AQ0FS0 1 
de de SPS00 0.999984 
utilizar utilizar VMN0000 1 
, , Fc 1 
que que CS 0.437483 
si si CS 0.99954 
nos nos PP1CP000 0.935743 
ponen poner VMIP3P0 1 
facilidad facilidad NCFS000 1 
con con SPS00 1 
las el DA0FP0 0.970954 
tareas tarea NCFP000 1 
de de SPS00 0.999984 
la el DA0FS0 0.972269 
casa casa NCFS000 0.979058 
pues pues CS 0.998047 
mejor mejor AQ0CS0 0.873665 
que que PR0CN000 0.562517 
mejor mejor AQ0CS0 0.873665 
, , Fc 1 
pero pero CC 0.999764 
tan tan RG 1 
antigua antiguo AQ0FS0 0.953488 
que que CS 0.437483 
según según SPS00 0.995943 
mi mi DP1CSS 0.999101 
madre madre NCFS000 1 
era ser VSII1S0 0.491262 
de de SPS00 0.999984 
carga carga NCFS000 0.952569 
superior superior AQ0CS0 0.992424 
'''

This would be the desired output, since these ids have NC and AQ at the beginning:

[('carga', 'NCFS000', 'superior', 'AQ0CS0'), ('carga', 'NCFS000', 'frontal', 'AQ0CS0')]

How can I fix my regular expression so that it extracts all the words whose ids start with NC and AQ? This is what I have tried so far:

import re

regex_ = re.findall(r'^(\w+)\s\w+\s(NCFS000)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', very_largeString, re.M)

print(regex_)

The output is just the words and their associated ids, for example:

[('word','id'),('word','id')]


2014-10-27 john doe

Well, your expected output will not match the actual output. I'm guessing you didn't list the other combinations in your output? – hwnd 2014-10-27 20:01:24

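I only want to focus on the words that have NC and AQ as ids and that appear one right after the other (i.e. no blank lines and no other words and ids in between) – 2014-10-27 20:03:38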

from pprint import pprint
import re

result = re.findall(r'''
    ^                 # Align to beginning of a line
    (\S+)\s+          # Grab first word
    \S+\s+            # Don't care about 2nd word
    (NC\S+)\s+        # 3rd word must start with NC
    \S+[ \t]*\n       # End of first line (trailing spaces allowed)
    ^                 # Next line is identical in form
    (\S+)\s+          # to the first line
    \S+\s+
    (AQ\S+)\s+        # except 3rd word must start with AQ
    \S+[ \t]*\n
''', very_largeString, re.M | re.X)  # Multi-line, verbose
pprint(result)
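
The same pairing can also be done without one large regular expression, by splitting each line into its fields and comparing neighbouring lines. This is only a sketch: the helper name `nc_aq_pairs` is made up for illustration, the sample is a short excerpt of the question's data, and it assumes every data line has exactly four whitespace-separated fields (word, lemma, id, probability).

```python
def nc_aq_pairs(text):
    """Return (word, id, next_word, next_id) for each NC line directly followed by an AQ line."""
    # Keep one entry per physical line so that blank lines break adjacency.
    rows = [line.split() for line in text.splitlines()]
    pairs = []
    for first, second in zip(rows, rows[1:]):
        # Skip blank or malformed lines (assumption: valid lines have 4 fields).
        if len(first) != 4 or len(second) != 4:
            continue
        if first[2].startswith("NC") and second[2].startswith("AQ"):
            pairs.append((first[0], first[2], second[0], second[2]))
    return pairs


# Short excerpt in the same four-column layout as the question's data.
sample = """de de SPS00 0.999984
carga carga NCFS000 0.952569
superior superior AQ0CS0 0.992424
casa casa NCFS000 0.979058
pues pues CS 0.998047
"""

print(nc_aq_pairs(sample))
```

Splitting on whitespace also sidesteps trailing spaces at the end of a line, which can otherwise make a line-anchored pattern miss.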


2014-10-27 20:10:18

This is the output: [('carga', 'NCFS000', 'superior', 'AQ0CS0'), ('punto', 'NCMS000', 'medio', 'AQ0MS0'), ('color', 'NCMS000', 'blanco', 'AQ0MS0'), ('carga', 'NCFS000', 'frontal', 'AQ0CS0'), ('ruido', 'NCMS000', 'jeje', 'NCMS000')]. The problem is the last tuple, ('ruido', 'NCMS000', 'jeje', 'NCMS000'): I don't want to get back the same id twice (i.e. NCMS000, NCMS000); I only want NC followed by AQ. – 2014-11-03 20:35:09

Did you apply the obvious fix? See the recent edit to my answer. – 2014-11-04 00:49:33
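
For readers following this thread, the effect of requiring an NC prefix on the first id and an AQ prefix on the second can be sketched as below. The sample lines are invented to mimic the reported case (the lemma and probability columns are placeholders), and the pattern is one possible reading of the fix, not necessarily the exact edit made to the answer.

```python
import re

# Invented sample: an NC line followed by an AQ line, then two NC lines in a row.
sample = (
    "punto punto NCMS000 1\n"
    "medio medio AQ0MS0 1\n"
    "ruido ruido NCMS000 1\n"
    "jeje jeje NCMS000 1\n"
)

# The first id must start with NC and the id on the very next line with AQ.
pattern = re.compile(
    r"^(\S+)[ \t]+\S+[ \t]+(NC\S*)[ \t]+\S+[ \t]*\n"
    r"(\S+)[ \t]+\S+[ \t]+(AQ\S*)[ \t]+\S+[ \t]*$",
    re.MULTILINE,
)

print(pattern.findall(sample))
```

Because the second id is anchored to the AQ prefix, the ruido/jeje pair, where both ids start with NC, is not returned.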

Thanks @Rob! – 2014-11-04 06:56:54

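My guess is that you're trying to do some NLP (natural language processing) and want to extract pairs made up of a noun and a qualifier from a Spanish corpus. There are already tools for these tasks.

I suggest you take a look at the Python Natural Language Toolkit (NLTK).

I also have to say that this is not a trivial task, since you are performing these operations on a corpus of completely natural text. I think you should explain what you intend to do; the solution you are trying to reach may not be the best one for the actual problem.

Help us to help you.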


2014-10-27 20:17:06
