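I'm scraping a website and want to get the content inside a specific tag. The tag whose content I want is: <pre class="js-tab-content"></pre>
My Python regular expression doesn't return what I'm looking for.
Here is my code: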
import re
import urllib.request

request = urllib.request.Request(url=url)
response = urllib.request.urlopen(request)
content = response.read().decode()
tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content)
print(tab)
When I print tab, I get an empty list: []
Here is the content I'm looking for:
.... <pre class="js-tab-content"><i></i><span>Em</span> <span>D</span> <span>Em</span> <span>D</span>
Lift Mac Cahir Og your face, brooding o'er the old disgrace
<span>Em</span> <span>D</span> <span>G</span>-<span>D</span>-<span>Em</span>
That black Fitzwilliam stormed your place and drove you to the Fern.
<span>Em</span> <span>D</span> <span>Em</span> <span>D</span>
Gray said victory was sure, soon the firebrand he'd secure
<span>Em</span> <span>D</span> <span>G</span>-<span>D</span>-<span>Em</span>
Until he met at Glenmalure, Feach Mac Hugh O'Byrne
Chorus:
<span>G</span> <span>D</span>
Curse and swear, Lord Kildare, Feach will do what Feach will dare
<span>G</span> <span>G</span>-<span>D</span>-<span>Em</span>
Now Fitzwilliam have a care, fallen is your star low
<span>G</span> <span>D</span>
Up with halbert, out with sword, on we go for by the Lord
<span>G</span> <span>G</span>-<span>D</span>-<span>Em</span>
Feach Mac Hugh has given his word: Follow me up to Carlow
From Tassagart ____to Clonmore flows a stream of Saxon Gore
Great is Rory Og O'More at sending loons to Hades.
White is sick and Lane is fled, now for black Fitzwilliams head
We'll send it over, dripping red, to Liza and her ladies
See the swords of Glen Imayle flashing o'er the English Pale
See all the children of the Gael, beneath O'Byrne's banners
Rooster of the fighting stock, would you let an Saxon cock
Crow out upon an Irish rock, fly up and teach him manners
</pre> ....
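I don't understand why this returns an empty list instead of a list containing the string from the content.
I've looked around the internet for about half an hour and couldn't find anything that helps.
Sorry if I seem stupid and the answer is obvious!
Anyway, thanks in advance!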
Don't use regular expressions to parse HTML. See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 – bgporter
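OK, there are two obvious things here: 1) parsing HTML with a regex is a bad idea, and 2) '.' does not match newlines by default in Python regular expressions (add 'flags=re.S'). One thing that is less obvious: lazy dot matching is known to slow down your application when it has to match large amounts of text, so I suggest using BeautifulSoup or any other HTML parsing library for Python. –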
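To illustrate the point about '.' and newlines, here is a minimal sketch. The `content` string is a shortened, hypothetical stand-in for the downloaded page (the real value would come from `response.read().decode()`); the pattern is the one from the question.

```python
import re

# Hypothetical sample standing in for the downloaded page.
content = (
    '<div><pre class="js-tab-content"><i></i><span>Em</span> <span>D</span>\n'
    "Lift Mac Cahir Og your face\n"
    "</pre></div>"
)

pattern = r'<pre class="js-tab-content">(.*?)</pre>'

# Without a flag, "." stops at "\n", so a <pre> spanning several lines never matches.
without_flag = re.findall(pattern, content)

# With re.S (an alias of re.DOTALL), "." also matches newlines.
with_flag = re.findall(pattern, content, flags=re.S)

print(without_flag)    # empty list, as in the question
print(len(with_flag))  # one captured block
```

The only change from the question's code is the `flags=re.S` argument; the captured text still contains the inner `<span>` tags.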
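And... that solved my problem. Wow, I didn't realize that! I think I understand now why regular expressions are a bad fit for HTML. In my case I also know that the tag has no other attributes or anything like that, and the tags inside it don't matter. – David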
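For readers who would rather avoid a regex entirely, here is a rough sketch of a parser-based alternative using only the standard library's `html.parser`. The class name `PreTextExtractor` and the `sample` string are made up for illustration, and the sketch assumes, as stated above, that the `<pre>` tag carries exactly `class="js-tab-content"` and that the inner tags can be discarded.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text inside <pre class="js-tab-content">, dropping inner tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target <pre>
        self.chunks = []  # text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested <pre> tags so the matching close tag ends the capture.
            if tag == "pre":
                self.depth += 1
        elif tag == "pre" and ("class", "js-tab-content") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample standing in for the downloaded page.
sample = (
    '<div><pre class="js-tab-content"><i></i><span>Em</span> <span>D</span>\n'
    "Lift Mac Cahir Og your face\n"
    "</pre></div>"
)

parser = PreTextExtractor()
parser.feed(sample)
parser.close()
text = "".join(parser.chunks)
print(text)
```

Here the chord names and lyrics come back as plain text with the `<span>` and `<i>` markup removed. A third-party library such as BeautifulSoup would do the same with less code.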