How do I cut an HTML tag and its content out of a very large multi-line text file using perl, sed or awk?

I want to transform this text (removing <math>.*?</math>) using sed, awk or Perl:

{| 
|- 
| colspan="2"| 
: <math> 
[\underbrace{\color{Red}4,2}_{4 > 2},5,1,7] \rightarrow 
[2,\underbrace{\color{OliveGreen}4,5}_{4 < 5},1,7] \rightarrow 
[2,4,\underbrace{\color{Red}5,1}_{5 > 1},7] \rightarrow 
[2,4,1,\underbrace{\color{OliveGreen}5,7}_{5 < 7}] 
</math> 
|- 
| 
: <math> 
[\underbrace{\color{OliveGreen}2,4}_{2 < 4},1,5,{\color{Blue}7}] \rightarrow 
[2,\underbrace{\color{Red}4,1}_{4 > 1},5,{\color{Blue}7}] \rightarrow 
[2,1,\underbrace{\color{OliveGreen}4,5}_{4 < 5},{\color{Blue}7}] 
</math> 
: <math> 
[\underbrace{\color{Red}2,1}_{2 > 1},4,{\color{Blue}5},{\color{Blue}7}] \rightarrow 
[1,\underbrace{\color{OliveGreen}2,4}_{2 < 4},{\color{Blue}5},{\color{Blue}7}] 
</math> 
: <math> 
[\underbrace{\color{OliveGreen}1,2}_{1 < 2},{\color{Blue}4},{\color{Blue}5},{\color{Blue}7}] 
</math> 
|}

into this text (forgive me if I removed too much; what should be removed is <math>.*?</math>):

{| 
|- 
| colspan="2"| 
: 
|- 
| 
: 
: 
: 
|}

I have read about 20 pages and tested 10 scripts, but with no good result. The best I have managed is:

cat dirt-math.txt | awk '/<math>/{cut=1; print;}/<\/math>/{cut=0}!cut'

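It does not work correctly, though, because it leaves <math></math> behind. That is not too bad, but I do not know awk well enough to improve it further.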

Source

2015-10-10 Chameleon

If all of your data is this well formatted, your solution is very close. I modified it only slightly.

In awk:

sub(/<math>.*/, "") {print; cut=1} 
/<\/math>/   {cut=0; next} 
!cut

Source

2015-10-10 12:45:24

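Thanks for the suggestion. I do not know awk, so your suggestion lets me take the next step, which is very hard for me. – Chameleon

It does not work, because ' $' is left over. – Chameleon

I will check again, because I made a mistake :) – Chameleon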

This should do it:

perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt

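-p says we are running a sed-like read-line/print-line loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line", that is, on the file). In the substitution, ! is only the delimiter, the s flag lets . match newlines, and the g flag removes every occurrence.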

If your text file is too large to fit into memory, you can try this:

perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt

Or (slightly more readable):

perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt

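This is really a small state machine. $cut records whether we are inside an unclosed <math> tag (and therefore need to cut input). If so, we check whether we can find and remove a closing </math>. If we can, we are done cutting (we found a closing </math> tag); otherwise we overwrite the current line with the empty string ($_ = ""; this is the actual cutting). If we are not cutting after that (we deliberately do not use else here, so that a line such as ... </math> not math text <math> ... is handled), we try to remove <math>... from the input. If that succeeds, we have just seen an opening <math> tag and need to start cutting.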
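The same state machine can also be written in awk, which the question asks about. What follows is only a sketch and not part of the original answer: the function name strip_math and the variables out, rest and touched are made up for illustration, and it assumes the tags are spelled exactly <math> and </math> (no attributes, no nesting). It reads one line at a time, so memory use does not grow with the file size, and it additionally drops lines that held nothing but math content.

```shell
#!/bin/sh
# strip_math: copy stdin to stdout with every <math>...</math> span removed,
# including spans that cross line boundaries. Illustrative sketch only.
strip_math() {
    awk '
    {
        out = ""        # text of this line that survives
        rest = $0       # part of the line not yet examined
        touched = cut   # 1 if some part of a math span lies on this line
        while (rest != "") {
            if (cut) {
                i = index(rest, "</math>")
                if (i == 0) { rest = ""; break }   # span continues on next line
                rest = substr(rest, i + 7)         # skip past the closing tag
                cut = 0
            } else {
                i = index(rest, "<math>")
                if (i == 0) { out = out rest; rest = ""; break }
                out = out substr(rest, 1, i - 1)   # keep text before the tag
                rest = substr(rest, i + 6)         # skip past the opening tag
                cut = 1
                touched = 1
            }
        }
        # Lines untouched by math are printed as they are; touched lines are
        # printed only if something other than blanks is left.
        if (!touched || out ~ /[^ \t]/) print out
    }'
}
```

With the function defined in the current shell, it could be run as strip_math < dirt-math.txt > clean.txt, where clean.txt is a hypothetical output file name.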

Source 2015-10-10 09:44:03 melpomene

Does not work for large files: 'Out of memory!' – Chameleon

@Chameleon ... how large is your HTML file? – melpomene

The whole Polish Wikipedia. It is about 1 GB packed, so I would estimate about 100 GB. – Chameleon

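This can also be done with the .. flip-flop (not range) operator, without holding the whole file in memory, while also removing <math> from the starting line: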

perl -wlne 'unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}' your-file
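awk has a comparable construct, the range pattern /start/,/end/, which, like Perl's .., also tests the end pattern on the line that opened the range. Below is a minimal sketch, under the assumption that, as in the sample data, nothing worth keeping follows <math> on its line and </math> sits on a line of its own. The function name strip_math_lines is hypothetical.

```shell
#!/bin/sh
# strip_math_lines: line-oriented removal of <math> blocks with an awk
# range pattern. Illustrative sketch; assumes one tag per line.
strip_math_lines() {
    awk '
    # Applies to every line from one containing <math> through the next
    # one containing </math>.
    /<math>/, /<\/math>/ {
        # On a line that contains <math>, keep what precedes the tag;
        # all other lines in the range are dropped.
        if (sub(/<math>.*/, "")) print
        next
    }
    { print }'
}
```

Text that follows </math> on the same line is lost with this approach, so it suits only input shaped like the example in the question.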

Source 2015-10-10 11:15:51

Nice idea, I will check this. – Chameleon

It fails on a different file. – Chameleon

Please update the file format. –

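This is not quite a one-liner, but it does what you want. As always, there are many ways to do this. Here I use '|' as the record separator and ':' as the field separator. That lets me iterate over the fields of every record that contains math and print only the fields that do not contain <math></math>.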

BEGIN {RS="|";FS=":";ORS=""} 

/math/ { 
    for (i=1;i<=NF;i++) { 
     if ($i ~ /math/) {print ":\n"} 
     else {print $i} 
    } 
    print "|";next; 
} 

/^\}/ { 
    print "}"; 
    next; 
} 

{ 
    print $0"|" 
} 

END {print "\n"}

Source 2015-10-10 12:13:58 jayant

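Do you assume that ':' always appears? What about '+++'? – Chameleon

@Chameleon I am assuming that between ':' and ':', or between ':' and '|', if there is math, the whole field is ignored, so ': +++ ... :' will become '::' – jayant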
