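I wrote the following Unix command and regular expression to remove the XML tags from a text corpus.
I want to extract only the English segments as plain text, without the XML, and put them in a file named "file.txt".
The code below removes only the opening <seg> tag, but it keeps the closing </seg> tag at the end. See the input and output below to understand my problem.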
cat uncorpora_plain.txt |grep -a1 '<tuv xml:lang="EN">' |grep '<seg>' |perl -pe 's/\<seg>\b/''/'
Part of the text before extraction:
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, on 4 December 2000, on
the recommendation of the Committee (A/55/602/Add.2 and Corr.1,
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as
follows:</seg>
Output after running the Unix command:
Adopted at the 81st plenary meeting, on 4 December 2000, on the
recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94),
by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg>
Any help would be greatly appreciated!
Not sure whether this is what you want: 'sed's/ \ | \ | <\/seg> // g'file.xml' – archemiro
I have to use this command (grep -a1) –
Why do you have to use that command if another one achieves the result you want with less effort and more clearly? – ghoti
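
For reference, the substitution in the question only ever matches the opening <seg> tag, so nothing in the pipeline touches </seg>. Below is a minimal sketch of one way to strip both tags while keeping the grep-based selection. It rests on two assumptions that the question does not confirm: that each <seg>...</seg> pair sits on a single line of the corpus, and that "-a1" was meant as one line of trailing context (written here as -A 1). The scratch directory, sample.txt, the </tuv> lines, and the FR entry are made up for the demonstration; for the real corpus, point the pipeline at uncorpora_plain.txt and redirect to file.txt.

```shell
#!/bin/sh
# Build a tiny, made-up sample in a scratch directory (hypothetical layout).
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
<tuv xml:lang="EN">
<seg>Adopted at the 81st plenary meeting, on 4 December 2000, on the recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg>
</tuv>
<tuv xml:lang="FR">
<seg>(non-English segment)</seg>
</tuv>
EOF

# 1. grep -A 1 keeps each English <tuv> line plus the line that follows it.
# 2. The second grep keeps only the <seg> lines from that selection.
# 3. sed removes the opening tag and then the closing tag.
grep -A 1 '<tuv xml:lang="EN">' "$workdir/sample.txt" \
  | grep '<seg>' \
  | sed -e 's/<seg>//' -e 's/<\/seg>//' > "$workdir/file.txt"

cat "$workdir/file.txt"
```

If a segment can span several lines in the real file, this line-based approach will not be enough, and a tool that reads the whole <seg> element at once would be the safer choice.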