2014-06-24 46 views
1

How to find text and wrap it in certain tags

Given strings like

"<p> >this line starts with an arrow <br /> this line does not </p>" 

"<p> >this line starts with an arrow </p> <p> this line does not </p>" 

how can I find the lines that start with an arrow and surround them with a div, so that the first string becomes:

"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"
+1

How do you define a "line"?

+1

Go with [@alexcxe's](http://stackoverflow.com/a/24391725/2461379) answer, because, uh... [I'll just leave this here...](http://stackoverflow.com/a/1732454/2461379)

Answers

6

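Since it is HTML you are parsing, use a tool built for the job: an HTML parser such as BeautifulSoup.

Use find_all() to find all the text nodes that start with >, and wrap() them in a new div tag: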

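from bs4 import BeautifulSoup

data = "<p> >this line starts with an arrow <br /> this line does not </p>"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")

# Wrap every text node whose stripped content starts with '>' in a new <div>
for item in soup.find_all(string=lambda s: s.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))

print(soup.prettify())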

Prints:

<p> 
    <div> 
    >this line starts with an arrow 
    </div> 
    <br/> 
    this line does not 
</p> 
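
If BeautifulSoup is not installed, the same idea (let a parser find the text nodes, then wrap the ones that start with >) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code above: the names ArrowWrapper and wrap_arrow_text are made up for illustration, and the sketch assumes simple markup, since comments and doctype declarations are dropped and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class ArrowWrapper(HTMLParser):
    """Re-emit HTML, wrapping text nodes that start with '>' in a <div>."""

    def __init__(self):
        # Leave entities such as &amp; unconverted so they are echoed as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. "<br />"

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

    def handle_data(self, data):
        text = data.lstrip()
        if text.startswith(">"):
            # Keep the leading whitespace outside the new <div>.
            indent = data[: len(data) - len(text)]
            data = indent + "<div> " + text + "</div> "
        self.out.append(data)


def wrap_arrow_text(html):
    """Hypothetical helper: return html with arrow text wrapped in <div>."""
    parser = ArrowWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(wrap_arrow_text("<p> >this line starts with an arrow </p> <p> this line does not </p>"))
```

Unlike prettify(), this keeps the original spacing, so the result stays a single-line string in the same shape as the one asked for in the question.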
+0

+1. When I clicked on this question, I was really worried there would be something about regular expressions...

3

You can try the regex pattern >\s+(>.*?)<.

import re

# Raw string: no double backslashes needed; + and *? replace {1,} and {0,}?
regex = re.compile(r">\s+(>.*?)<")
test_string = ""  # fill this in
matches = regex.findall(test_string)
# matches is the list of captured groups (the text starting with '>')

Then replace the matched group with <div> matched_group </div>. The pattern looks for anything between "> >" and "<".
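
The snippet above only collects the matches. One way the replacement step might look is sketched below, assuming re.sub with the surrounding > and < captured too, so they can be written back unchanged. The names ARROW_TEXT and wrap_with_regex are illustrative only, and since . does not match newlines, this only handles arrow text that sits on a single line.

```python
import re

# Same pattern as above, with the leading '>' + whitespace and the trailing
# '<' captured as groups 1 and 3 so they survive the substitution.
ARROW_TEXT = re.compile(r"(>\s+)(>.*?)(<)")


def wrap_with_regex(html):
    """Hypothetical helper: wrap the arrow text (group 2) in <div> ... </div>."""
    return ARROW_TEXT.sub(r"\1<div> \2</div> \3", html)


print(wrap_with_regex("<p> >this line starts with an arrow <br /> this line does not </p>"))
```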

Here it is on debuggex.

1

You could try this regex,

>(\w[^<]*) 

DEMO

The Python code would look like this (the variable is named s so that it does not shadow the built-in str),

>>> import re 
>>> s = '"<p> >this line starts with an arrow <br /> this line does not </p>"' 
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", s) 
>>> m 
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'