2017-09-15 42 views
0

I wrote the following Unix command and regular expression for a text corpus. The command and regular expression remove XML tags from the corpus.

I want to extract only the English segments, without the XML, and put them in a file named "file.txt".

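The following code removes only <seg>, but it leaves the closing XML tag </seg> at the end. See the input and output below to understand my problem.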

cat uncorpora_plain.txt |grep -a1 '<tuv xml:lang="EN">' |grep '<seg>' |perl -pe 's/\<seg>\b/''/' 

Part of the text before extraction:

<tuv xml:lang="EN"> 
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on 
    the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
    para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
    follows:</seg> 

Output after running the Unix command:

Adopted at the 81st plenary meeting, on 4 December 2000, on the 
recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94), 
by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg> 

Your help would be much appreciated!

+0

Not sure if this is what you want: 'sed's/ \ | \ | <\/seg> // g'file.xml' – archemiro

+0

I have to use this command (grep -a1) –

+0

Why do you have to use that command, if another command achieves the result you want with less effort and in a clearer way? – ghoti
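For reference, the leftover </seg> in the question's output is consistent with the last stage of the pipeline, whose substitution only matches the opening <seg> tag. A minimal sketch of a stage that drops both tags is shown here, written with sed rather than perl; the function name `strip_seg` is made up for illustration, and it assumes the stage is fed lines that have already been filtered down to English segments.

```shell
# Hypothetical helper: remove the opening and the closing seg tag from
# each input line and leave everything else (including indentation) as is.
strip_seg() {
    sed -e 's/<seg>//g' -e 's/<\/seg>//g'
}

# Possible use in place of the perl stage of the question's pipeline:
#   cat uncorpora_plain.txt | grep -a1 '<tuv xml:lang="EN">' | grep '<seg>' | strip_seg
```

This only changes which tags are deleted; which lines reach that stage is still decided by the two grep calls.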

Answers

1
sed -e 's/<[^>]*>//g' file.xml 

This should work.

+0

It works and removes all the tags, but it keeps all the text in the other languages. –

+0

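@BeautifulMind, what other languages? Given the sample data you provided, this solution appears to produce exactly what your question says you are looking for. There are no opening or closing tags in its output. It looks fine to me. – ghoti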

+0

The corpus contains 6 languages. I want to remove all the XML tags and the other languages, and keep only the English segments. –
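One way to combine the tag-stripping sed above with the "English only" requirement is to restrict it to an address range first. The following is only a sketch: the function name `extract_en` is made up, and it assumes the corpus is laid out like the sample in the question, with `<tuv xml:lang="EN">` on its own line followed by its <seg>...</seg> block.

```shell
# Hypothetical helper: print only the text of English segments.
# Assumption: the EN <tuv> opening tag is on its own line and the next
# </seg> that follows belongs to it (as in the question's sample).
extract_en() {
    # Stage 1: keep only the lines from the EN <tuv> tag up to its </seg>.
    sed -n '/<tuv xml:lang="EN">/,/<\/seg>/p' "$1" |
        # Stage 2: strip tags, trim whitespace, drop lines left empty.
        sed -e 's/<[^>]*>//g' \
            -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e '/^$/d'
}

# Possible use: extract_en uncorpora_plain.txt > file.txt
```

If a <tuv> tag and its <seg> ever share a line, the range would run on to the next </seg>, so this is not a substitute for a real XML parser.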

1

I'll repeat the well-worn rule: don't parse XML/HTML with awk/sed/grep; use a proper parser.

xmlstarlet is one of them.

A valid XML sample:

<root> 
<tuv xml:lang="EN"> 
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on 
    the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
    para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
    follows:</seg> 
</tuv> 
<tuv xml:lang="UA"> 
    <seg>УкраÏна - унікальна країна, 
    багата талановитими людьми ...</seg> 
</tuv> 
</root> 

The command:

xmlstarlet sel -t -v "//tuv[@xml:lang='EN']//seg" -n input.xml > uncorpus.eng.txt 

Contents of uncorpus.eng.txt:

Adopted at the 81st plenary meeting, on 4 December 2000, on 
    the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
    para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
    follows: 
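
The output above keeps the line breaks and indentation of the source. If one line per segment is preferred, a variation that should work is to match each English seg and print its `normalize-space()` value. This is a sketch under the same assumption as above (well-formed XML input); the function name `extract_en_xml` is made up for illustration.

```shell
# Hypothetical helper: print each English <seg> on a single line.
# -m iterates over the matching nodes, -v prints the XPath expression,
# -n adds a newline after each match. normalize-space() is XPath 1.0.
extract_en_xml() {
    xmlstarlet sel -t \
        -m "//tuv[@xml:lang='EN']//seg" \
        -v "normalize-space(.)" -n "$1"
}

# Possible use: extract_en_xml input.xml > uncorpus.eng.txt
```

Whether to collapse whitespace depends on whether the original line breaks matter for later processing.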
0

It sounds like this is what you're asking for (using GNU awk for multi-char RS):

awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")' 

but without testable sample input/output that's a guess. FWIW, here it is run against @RomanPerekhrest's input:

$ cat file 
<root> 
<tuv xml:lang="EN"> 
    <seg>Adopted at the 81st plenary meeting, on 4 December 2000, on 
    the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
    para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
    follows:</seg> 
</tuv> 
<tuv xml:lang="UA"> 
    <seg>УкраÏна - унікальна країна, 
    багата талановитими людьми ...</seg> 
</tuv> 
</root> 

$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,"")' file 
Adopted at the 81st plenary meeting, on 4 December 2000, on 
    the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
    para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
    follows: 

and if you want to get rid of the white space at the start of each line:

$ awk -v RS='</seg>' 'sub(/.*<tuv\s+xml:lang="EN">\s*<seg>/,""){ gsub(/\n[[:blank:]]*/,"\n"); print}' file 
Adopted at the 81st plenary meeting, on 4 December 2000, on 
the recommendation of the Committee (A/55/602/Add.2 and Corr.1, 
para. 94), by a recorded vote of 106 to 1, with 67 abstentions, as 
follows:
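
The commands above depend on GNU awk (multi-char RS and `\s`). Where only a POSIX awk is available, roughly the same result can be sketched with two flags instead of a custom record separator. The function name `extract_en_awk` and the flag names are made up, and the sketch assumes the `<tuv xml:lang="EN">` tag appears on or before the line holding its <seg>.

```shell
# Hypothetical helper: line-based state machine, no GNU extensions.
extract_en_awk() {
    awk '
        /<tuv xml:lang="EN">/ { in_en = 1 }    # entered an English unit
        in_en && /<seg>/      { in_seg = 1 }   # its segment starts here
        in_seg {
            line = $0
            gsub(/<[^>]*>/, "", line)          # strip any tags
            sub(/^[ \t]+/, "", line)           # trim leading blanks
            sub(/[ \t]+$/, "", line)           # trim trailing blanks
            if (line != "") print line
        }
        /<\/seg>/ { in_seg = 0; in_en = 0 }    # segment finished
    ' "$1"
}

# Possible use: extract_en_awk file > file.txt
```

Because the reset rule comes after the print rule, the line carrying </seg> is still printed before the flags are cleared.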