2016-02-06 136 views
-1

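Python regex not returning what I'm looking for

I'm scraping a website and want to get the content inside a particular tag. The tag whose content I want is: <pre class="js-tab-content"></pre>

Here is my code: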

import re
import urllib.request

# url is defined elsewhere in the program
request = urllib.request.Request(url=url)
response = urllib.request.urlopen(request)
content = response.read().decode()

tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content)

print(tab)

When I print tab, I get an empty list: []

Here is the content I'm looking for:

.... <pre class="js-tab-content"><i></i><span>Em</span>    <span>D</span>    <span>Em</span>    <span>D</span> 

Lift Mac Cahir Og your face, brooding o'er the old disgrace

    <span>Em</span>     <span>D</span>      <span>G</span>-<span>D</span>-<span>Em</span>  

That black Fitzwilliam stormed your place and drove you to the Fern. 

<span>Em</span>    <span>D</span>   <span>Em</span>       <span>D</span> 

Gray said victory was sure, soon the firebrand he'd secure 

<span>Em</span>    <span>D</span>   <span>G</span>-<span>D</span>-<span>Em</span> 

Until he met at Glenmalure, Feach Mac Hugh O'Byrne 



Chorus: 

<span>G</span>        <span>D</span> 

Curse and swear, Lord Kildare, Feach will do what Feach will dare 

<span>G</span>        <span>G</span>-<span>D</span>-<span>Em</span> 

Now Fitzwilliam have a care, fallen is your star low 

<span>G</span>          <span>D</span> 

Up with halbert, out with sword, on we go for by the Lord 

<span>G</span>        <span>G</span>-<span>D</span>-<span>Em</span> 

Feach Mac Hugh has given his word: Follow me up to Carlow 



From Tassagart ____to Clonmore flows a stream of Saxon Gore 

Great is Rory Og O'More at sending loons to Hades. 

White is sick and Lane is fled, now for black Fitzwilliams head 

We'll send it over, dripping red, to Liza and her ladies 



See the swords of Glen Imayle flashing o'er the English Pale 

See all the children of the Gael, beneath O'Byrne's banners 

Rooster of the fighting stock, would you let an Saxon cock 

Crow out upon an Irish rock, fly up and teach him manners 

</pre> .... 

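I don't understand why this returns an empty list instead of a list containing the content as a string.

I searched the internet for about half an hour and couldn't find anything that helped.

Sorry if this is a stupid question with an obvious answer!

Anyway, thanks in advance!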

+2

Don't use regular expressions to parse HTML. See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 – bgporter

+2

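OK, there are two fairly obvious things here: 1) parsing HTML with regular expressions is a bad idea, and 2) '.' does not match newline characters by default in Python regular expressions (add 'flags=re.S'). One less obvious thing: lazy dot matching patterns are known to slow your application down when they have to match large amounts of text, so I suggest using BeautifulSoup or any other HTML parsing library for Python. –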

+0

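And... that solved my problem. Wow, I hadn't realized that! I think I understand why regular expressions are a poor fit for HTML. I also know that the tag has no other attributes or anything like that, and that the tags inside it don't matter. – David

Answers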

2
tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content, re.S) 

re.S is needed so that . also matches newline characters.
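To make the difference visible, here is a small sketch comparing the original call with the fixed one. The `content` string is a made-up stand-in for the downloaded page, not the real response.

```python
import re

# Hypothetical sample standing in for the downloaded page.
content = '<pre class="js-tab-content"><span>Em</span>\nLift Mac Cahir Og your face\n</pre>'

pattern = r'<pre class="js-tab-content">(.*?)</pre>'

# Without re.S, "." cannot cross a newline, so the pattern never reaches </pre>.
without_flag = re.findall(pattern, content)

# With re.S (an alias of re.DOTALL), "." also matches "\n".
with_flag = re.findall(pattern, content, re.S)

print(without_flag)
print(with_flag)
```

The first call prints an empty list, which is the behaviour described in the question; the second returns one string containing everything between the tags, newlines included.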

+0

Thanks, that works perfectly! What does re.M do? – David

+0

Still, using '.*?' is not a good idea. You should unroll it, and then you don't even need 're.S'. –
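The comment does not spell out the unrolled pattern, so the following is only one possible reading of it, using a made-up `content` string. The idea is to replace the lazy dot with runs of non-`<` characters, plus any `<` that does not start `</pre>`. Because a negated character class matches newlines, no flag is needed.

```python
import re

# Hypothetical sample; the second </pre> checks that matching stops at the first one.
content = ('<pre class="js-tab-content"><i></i><span>Em</span>\n'
           'Lift Mac Cahir Og your face\n</pre> trailing </pre>')

# Unrolled replacement for (.*?)</pre>:
#   [^<]*                  a run of characters that cannot start a tag
#   (?:<(?!/pre>)[^<]*)*   a "<" that does not begin "</pre>", then another run
unrolled = re.compile(
    r'<pre class="js-tab-content">([^<]*(?:<(?!/pre>)[^<]*)*)</pre>'
)

tabs = unrolled.findall(content)

# The lazy version with re.S, for comparison.
lazy = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content, re.S)

print(tabs == lazy)
```

Both versions capture the same text on this sample; the unrolled one simply avoids testing the rest of the pattern after every single character.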

+0

@AndreaCorbellini, fixed. It was force of habit: I like to use re.M to signal that the input is meant to span multiple lines. It isn't needed here. – xfx
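For the re.M question, a short sketch on a made-up two-line string may help: re.M only changes where `^` and `$` can match, and has no effect on `.`.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line input

# With re.M, "^" matches at the start of every line.
line_starts = re.findall(r'^\w+', text, re.M)

# Without it, "^" matches only at the start of the whole string.
string_start = re.findall(r'^\w+', text)

# re.M alone does not let "." cross a newline; re.S does.
dot_with_m = re.findall(r'first.*second', text, re.M)
dot_with_s = re.findall(r'first.*second', text, re.S)

print(line_starts, string_start, dot_with_m, dot_with_s)
```

So for the pattern in this question, re.S is the flag that matters; re.M would make no difference because the pattern contains no `^` or `$`.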

5

OK, to add to the comments, here is how you would use BeautifulSoup with its HTML parser to extract the text of the pre element in this case:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(content, "html.parser") 
print(soup.find("pre", class_="js-tab-content").get_text()) 
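If installing bs4 is not an option, something similar can be sketched with the standard library's `html.parser`. This is a different technique from the BeautifulSoup call above, not part of the answer: `PreTextExtractor` is a hypothetical helper name, the `content` string is made up, and the class check assumes the attribute value is exactly `js-tab-content`.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collects text inside <pre class="js-tab-content">."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are between <pre ...> and </pre>
        self.parts = []      # text fragments found inside the element

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "pre" and dict(attrs).get("class") == "js-tab-content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.inside = False

    def handle_data(self, data):
        # Nested tags such as <span> are skipped; only their text is kept.
        if self.inside:
            self.parts.append(data)


# Hypothetical sample standing in for the downloaded page.
content = ('<pre class="js-tab-content"><i></i><span>Em</span>    <span>D</span>\n'
           'Lift Mac Cahir Og your face\n</pre>')

parser = PreTextExtractor()
parser.feed(content)
parser.close()
text = "".join(parser.parts)
print(text)
```

Like `get_text()`, this drops the inner tags and keeps only the text, which matches what the asker said they need.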
+0

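Thanks for your help. I decided to go with xfx's answer, because I'll also be using re elsewhere in my program on text that doesn't contain HTML. Thanks again anyway! – David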