Using grep to find strings in a specific context

-1

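I want to find and extract, from a large file, all the words that are surrounded by a specific context. Every line in the file looks like the one below, but with a different word between > and <\w>: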

<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">and<\w>

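I only want the output to be "and". So basically I want to extract every string (words, punctuation and numbers) that appears in the context >xxx<\w>. I have tried many different options with grep and regular expressions, but I either get everything on the line, or the > and <\w> delimiters come along with the match. For the whole file, I want the output to look like this: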

and 
we 
appreciate 
this 
very 
much 
.

and so on...

Source

2017-04-05 S.H

Please add the input text and the expected output – RomanPerekhrest

Sorry, for some reason it did not show up when I first posted –

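"I only want the output to be 'and'" is not enough to explain what you are trying to achieve. Please give us an example of the output you expect. Otherwise, my suggestion is to use this code: 'echo "and"' – sadmicrowave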

You can use a pattern like this. It matches anything between > and <\w>.

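import re
# input_string is assumed to hold the text to search (for example, the file contents)
pat = re.compile(r'>(.*?)<\\w>')  # capture the text between '>' and a literal '<\w>'
pat.findall(input_string)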
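Since the question mentions a large file, here is a minimal sketch of how the same pattern could be applied line by line instead of loading everything into one string. The helper name `extract_words`, the shortened sample lines and the file name `corpus.txt` are illustrative assumptions, not part of the original post.

```python
import re

# Same idea as the pattern above: capture whatever sits between
# the first '>' on the line and a literal '<\w>'.
PAT = re.compile(r'>(.*?)<\\w>')

def extract_words(lines):
    """Yield the captured token from every line that contains one."""
    for line in lines:
        match = PAT.search(line)
        if match:
            yield match.group(1)

# Shortened sample lines (hypothetical; the real lines carry more attributes).
sample = [
    '<="UO" lemma="|" ref="05" dephead="04" deprel="ET">and<\\w>\n',
    '<="UO" lemma="|" ref="05" dephead="04" deprel="ET">.<\\w>\n',
]
print(list(extract_words(sample)))

# For a real file (hypothetical name), iterating over the file object
# keeps only one line in memory at a time:
# with open("corpus.txt", encoding="utf-8") as fh:
#     for word in extract_words(fh):
#         print(word)
```

This assumes each line carries exactly one token and that no attribute value contains a '>' character; if either assumption does not hold, the pattern would need a tighter anchor.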

Source

2017-04-05 14:38:55

Your pattern will not exclude the ">" and "<\w>" characters from the desired result – sadmicrowave

OK. Given an input file with the values below (I hope I have understood your use case):

<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">and<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">we<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">appreciate<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">this<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">very<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">much<\w> 
<="UO" lemma="|" lex="|" sense="|" prefix="|" suffix="|" compwf="|" complemgram="|" ref="05" dephead="04" deprel="ET">.<\w>

The following Python regular expression should work for you:

>>> import re 
>>> pat = re.compile(r'(?<=">)(.*)(?=<\\w>)') 
>>> pat.findall(input_string) 
['and', 'we', 'appreciate', 'this', 'very', 'much', '.']
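Tokens that are punctuation, or that are themselves a '>' character, are the edge cases worth checking. The following is only a sketch of one way to handle them, assuming every tag ends with '">' (the closing quote of the last attribute) and every line ends with '<\w>' followed at most by spaces or tabs. The shortened sample input is hypothetical.

```python
import re

# Anchor on '">' at the end of the tag and on a literal '<\w>' at the end
# of the line, so a '>' inside the token itself is kept in the capture.
PAT = re.compile(r'">(.*)<\\w>[ \t]*$', re.MULTILINE)

# Shortened sample input (hypothetical; real lines carry more attributes).
input_string = (
    '<="UO" lemma="|" deprel="ET">and<\\w> \n'
    '<="UO" lemma="|" deprel="ET">.<\\w> \n'
    '<="UO" lemma="|" deprel="ET">><\\w>\n'
)

tokens = PAT.findall(input_string)
print(tokens)
```

With `re.MULTILINE`, `$` matches at the end of each line, so the greedy `(.*)` cannot run past the `<\w>` that closes that line.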

Source

2017-04-05 14:46:36 sadmicrowave

Your pattern will fail on punctuation. Like the '.' at the end –

You are right, I have updated my regular expression – sadmicrowave

What if there is a '>' in the middle? Like '>><\w>' –
