Delete an HTML tag if it contains some text

I want to delete an entire div if its content matches a certain string. For example:

<div> 
some text here 
if this text is matched, remove whole div 
some other text 
</div>

I have to do this on many files, so I am looking for a Linux command such as sed.

Thanks for looking into this.

Source

2011-04-22 Amol

Yeah, don't use regular expressions on HTML; it will go badly: http://stackoverflow.com/a/1732454/928098 – 2012-04-30 01:21:40

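There may be a better way to do this, but what I have done in the past is:

1) Strip out the newlines (because matching across lines is hard at best, and backtracking is even worse)

2) Parse

3) Put the newlines back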

cat /tmp/data | tr "\n" "@" | sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' | tr "@" "\n"

This assumes that "@" does not appear anywhere in the file.
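If GNU sed is available, one way to avoid choosing a placeholder character is its -z option, which splits input on NUL bytes instead of newlines, so an ordinary text file reaches the script as a single record. The sketch below is a swapped-in variant of the pipeline above, not the original command. The function name strip_divs is made up for illustration, and like the original it only handles a div that contains no nested tags.

```shell
#!/bin/sh
# strip_divs is a hypothetical helper name; it requires GNU sed for -z.
# Reads HTML on stdin and writes it to stdout without the <div> blocks
# whose text contains "some text here".
strip_divs() {
    # -z: records end at NUL, so newlines are ordinary characters and
    # [^<]* can span several lines. Nested tags inside the div are not handled.
    sed -z 's/<div>[^<]*some text here[^<]*<\/div>//g'
}

# Example: only the first div is removed.
printf '%s\n' 'before' '<div>' 'some text here' '</div>' \
              '<div>' 'other text' '</div>' 'after' | strip_divs
```

To run it over many files, a loop such as `for f in *.html; do strip_divs < "$f" > "$f.new"; done` writes each result next to the original instead of overwriting it.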

Source

2011-04-22 17:02:36 drysdam

Yeah, don't use regular expressions on HTML; it will go badly: http://stackoverflow.com/a/1732454/928098 – 2012-04-30 01:21:04

If I understand your question correctly, this can be done with a single sed command:

sed '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' file.txt

Test

Say this is your file.txt:

a. no-div text 

<DIV> 

some text here 
1. if this text is matched, remove whole DIV 
some other text -- WILL MATCH 
</div> 

<div> 
awesome text here 
2. if this text is matched, remove whole DIV 
this will NOT be matched 
</div> 

b. no-div text 

<Div> 
another text here 
3. if this text is matched, remove whole DIV 
and this too will NOT be matched 
</Div> 

<div> 
Some TEXT Here 
4. if this text is matched, remove whole DIV 
foo bar foo bar - WILL MATCH 
</DIV> 

c. no-div text

Now when I run the sed command above, it gives this output:

a. no-div text 


<div> 
awesome text here 
2. if this text is matched, remove whole DIV 
this will NOT be matched 
</div> 

b. no-div text 

<Div> 
another text here 
3. if this text is matched, remove whole DIV 
and this too will NOT be matched 
</Div> 


c. no-div text

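As you can verify from the output above, the div blocks in which the pattern some text here matched between the div tags have been removed completely.

PS: I am doing a case-insensitive search here. Let me know if you do not need that behavior; I would only have to remove the I switch from the sed command above.

Source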
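For readers who want to see how a command like this is put together, here is a spread-out sketch of the same idea. It is a simplified variant rather than the exact one-liner above: it skips the hold-space steps and simply keeps appending lines until the closing tag arrives. The function name delete_matching_divs is made up for illustration, and the I address flag and \b are GNU sed extensions.

```shell
#!/bin/sh
# delete_matching_divs is a hypothetical helper name; it requires GNU sed.
# Usage: delete_matching_divs file.txt   (or pipe input on stdin)
delete_matching_divs() {
    sed '
    # Start collecting when a line contains <div> (I = ignore case).
    /<div>/I{
        :a
        # Until the pattern space holds </div>, append the next line and loop.
        /<\/div>/I!{
            N
            ba
        }
        # The whole block is now in the pattern space: delete it if it
        # contains the phrase as whole words, otherwise let it print.
        /\bsome text here\b/Id
    }' "$@"
}

# Example: the first block is dropped, "awesome text here" is kept.
printf '%s\n' '<DIV>' 'some text here' '</div>' \
              '<div>' 'awesome text here' '</div>' | delete_matching_divs
```

Because the function only writes to stdout, applying it to many files could look like `for f in *.html; do delete_matching_divs "$f" > "$f.tmp" && mv "$f.tmp" "$f"; done`, which replaces each file only if sed succeeded.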

2011-04-23 06:15:22 anubhava

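Hi @anubhava, your code looks great; could you explain it? For example, the :A command – 2013-03-12 07:39:36

You can use ed instead of sed. The ed command reads the entire file into memory and edits the file in place (i.e., there is no safety backup).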

htmlstr=' 
<see file.txt in answer by anubhava> 
' 
matchstr='[sS][oO][mM][eE]\ [tT][eE][xX][tT]\ [hH][eE][rR][eE]' 
divstr='[dD][iI][vV]' 
# for in-place file editing use "ed -s file" and replace ",p" with "w" 
# cf. http://wiki.bash-hackers.org/howto/edit-ed 
cat <<-EOF | sed -e 's/^ *//' -e 's/ *$//' -e '/^ *#/d' | ed -s <(echo "$htmlstr") 
    H 
    # ?re? The previous line containing the regular expression re. (see man ed) 
    # '[[:<:]]' and '[[:>:]]' match the null string at the beginning and end of a word respectively. (see man re_format) 
    #,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?,/<\/${divstr}>/d 
    ,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?+0,/<\/${divstr}>/+0d 
    ,p 
    q 
EOF

Source

2011-04-24 15:12:15 jeff
