Python regular expression not returning any results?

I am using Python 3 to scrape a website and print a value. Here is the code:

import urllib.request 
import re 

url = "http://in.finance.yahoo.com/q?s=spy" 
hfile = urllib.request.urlopen(url) 
htext = hfile.read().decode('utf-8') 
regex = '<span id="yfs_l84_SPY">(.+?)</span>' 
code = re.compile(regex) 
price = re.findall(code,htext) 
print (price)

When I run this snippet it prints an empty list, i.e. [], but I am expecting a value such as 483.33.

What am I doing wrong? Please help.


2013-10-28 Shivamshaz

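Please _please_ don't use regular expressions to parse HTML. Use (*gasp!*) an HTML parser. –

Matt, why can't we use regular expressions? Newbie question. – Shivamshaz

It's not that you can't use it, it's just that there are WAAAAY better premade, built-in tools for parsing this type of thing. Check out [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?lq=1) post. – TehTris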

You are not using the regular expression correctly. There are two ways to do this:

regex = '<span id="yfs_l84_spy">(.+?)</span>' 
code = re.compile(regex) 
price = code.findall(htext)

regex = '<span id="yfs_l84_spy">(.+?)</span>' 
price = re.findall(regex, htext)

Note that Python's regular expression library does some caching internally, so precompiling the pattern has only a limited effect.
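For what it's worth, a quick offline check suggests that the call style alone probably does not explain an empty result. The sketch below assumes a made-up one-line snippet (`sample_html`) standing in for the real page, whose markup may differ, and compares the two forms above with the `re.findall(compiled, text)` form from the question.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

pattern = '<span id="yfs_l84_spy">(.+?)</span>'
compiled = re.compile(pattern)

# 1) Method on the compiled pattern object.
via_method = compiled.findall(sample_html)

# 2) Module-level function with the pattern as a string.
via_module = re.findall(pattern, sample_html)

# 3) Module-level function given an already compiled pattern
#    (the form used in the question).
via_mixed = re.findall(compiled, sample_html)

print(via_method, via_module, via_mixed)
```

All three calls go through the same matching engine, so if one of them returns an empty list on a given page, the others most likely will too.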


2013-10-28 19:37:55 Wolph

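I tried both methods, but the result is still the same: an empty list. – Shivamshaz

In that case your regular expression simply doesn't match. That could be caused by almost anything: an extra space, uppercase versus lowercase letters, content spanning multiple lines, etc. – Wolph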
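Two of the causes mentioned here, letter case and line breaks, can be reproduced offline. This is only a sketch: `sample_html` and `multiline_html` are made-up strings, not the real page, and `re.IGNORECASE` / `re.DOTALL` are shown as possible workarounds rather than a recommended fix.

```python
import re

# Made-up markup for illustration; the real page may look different.
sample_html = '<span id="yfs_l84_spy">176.12</span>'

# The id in the pattern is upper case, the id in the markup is lower case.
upper = re.findall('<span id="yfs_l84_SPY">(.+?)</span>', sample_html)
lower = re.findall('<span id="yfs_l84_spy">(.+?)</span>', sample_html)
ignore_case = re.findall('<span id="yfs_l84_SPY">(.+?)</span>',
                         sample_html, re.IGNORECASE)

# By default "." does not match a newline, so a value on its own line is missed.
multiline_html = '<span id="yfs_l84_spy">\n176.12\n</span>'
no_dotall = re.findall('<span id="yfs_l84_spy">(.+?)</span>', multiline_html)
with_dotall = [s.strip() for s in
               re.findall('<span id="yfs_l84_spy">(.+?)</span>',
                          multiline_html, re.DOTALL)]

print(upper, lower, ignore_case)   # [] ['176.12'] ['176.12']
print(no_dotall, with_dotall)      # [] ['176.12']
```

Printing a slice of `htext` around the expected id is usually the quickest way to see which of these applies to the page actually downloaded.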

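I have to advise against using regular expressions to parse HTML, because HTML is not a regular language. Yes, you can use one here. It's just not a good habit.

The biggest problem I think you're running into is that the real id of the span you're looking for on the page is yfs_l84_spy. Note the case.

That said, here is a quick implementation in BeautifulSoup.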

import urllib.request 
from bs4 import BeautifulSoup 

url = "http://in.finance.yahoo.com/q?s=spy" 
hfile = urllib.request.urlopen(url) 
htext = hfile.read().decode('utf-8') 
soup = BeautifulSoup(htext, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
soup.find('span',id="yfs_l84_spy") 
Out[18]: <span id="yfs_l84_spy">176.12</span>

And to get at the number:

found_tag = soup.find('span',id="yfs_l84_spy") #tag is a bs4 Tag object 
found_tag.next #get next (i.e. only) element of the tag 
Out[36]: '176.12'
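If installing bs4 is not an option, the same "use a parser, not a regex" idea can be sketched with the standard library's `html.parser`. This is a different technique from the BeautifulSoup code above, not a drop-in equivalent: `SpanTextParser`, `span_text`, and `sample_html` are hypothetical names and data made up for illustration, and the sketch assumes the target span contains no nested `<span>` tags.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text of the first <span> with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # currently inside the target span
        self.done = False     # target span already closed
        self.chunks = []      # text fragments found inside the span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and not self.done and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification: assumes no nested <span> inside the target span.
        if tag == "span" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def span_text(html, target_id):
    """Return the stripped text of the span, or None if it was not found or empty."""
    parser = SpanTextParser(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() or None


# Made-up markup standing in for the downloaded page.
sample_html = ('<html><body><span id="other">x</span>'
               '<span id="yfs_l84_spy">176.12</span></body></html>')

price = float(span_text(sample_html, "yfs_l84_spy"))
print(price)
```

Returning None for a missing id makes the "span not found" case explicit, where the regex approach would just produce an empty list.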


2013-10-28 19:52:37 roippi

Thanks for the suggestion. I'll switch :) – Shivamshaz
