Extracting specific lines from a website

</span> 
        <div class="clearB paddingT5px"></div> 
        <small> 
         10/12/2015 5:49:00 PM - Seeking Alpha 
        </small> 
        <div class="clearB paddingT10px"></div>

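Suppose I have the source code of a website, part of which looks like this. I am trying to get the lines between <small> and </small>. There are many such lines throughout the page, each enclosed between <small> and </small>. I want to extract all of the lines that fall between <small> and </small>.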

I tried using a regular expression, which looks like this:

regex = '<small>(.+?)</small>' 
datestamp = re.compile(regex) 
urls = re.findall(datestamp, htmltext)

This only returns an empty result. Please advise.


2015-10-13 M PAUL

Why are you trying to parse HTML with a regular expression? Use an HTML parser! – jonrsharpe

Try (.+). Your regex is 'lazy'. – Noxeus

BeautifulSoup's select or find_all methods are more efficient – mmachine

Here are two ways to handle this:

First, using a regular expression, which is not recommended:

import re 

html = """</span> 
    <div class="clearB paddingT5px"></div> 
    <small> 
     10/12/2015 5:49:00 PM - Seeking Alpha 
    </small> 
    <div class="clearB paddingT10px"></div>""" 

# A raw string avoids needless escapes; re.S lets "." cross newlines inside the tag. 
for item in re.findall(r'<small>\s*(.*?)\s*</small>', html, re.I | re.S): 
    print('"{}"'.format(item))
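
As a side note on why the pattern in the question matched nothing: by default "." does not match a newline, and in the sample the text sits on its own line between the tags, so (.+?) can never reach </small>. The pattern above sidesteps this because \s* absorbs the line breaks around the text. The following is a minimal sketch of the difference, using a shortened copy of the sample; the variable names no_flag and with_flag are just for illustration.

```python
import re

# Same shape as the snippet in the question: the text sits on its own line.
html = """<small>
 10/12/2015 5:49:00 PM - Seeking Alpha
</small>"""

# Without re.DOTALL, "." stops at newlines, so nothing can span the tag pair.
no_flag = re.findall(r'<small>(.+?)</small>', html)

# With re.DOTALL, "." also matches newlines; strip() trims the padding.
with_flag = [m.strip() for m in re.findall(r'<small>(.+?)</small>', html, re.DOTALL)]

print(no_flag)    # []
print(with_flag)  # ['10/12/2015 5:49:00 PM - Seeking Alpha']
```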

Second, using something like BeautifulSoup to parse the HTML for you:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, "html.parser") 
for item in soup.find_all("small"): 
    print('"{}"'.format(item.text.strip()))

This gives the following output:

"10/12/2015 5:49:00 PM - Seeking Alpha"


2015-10-13 10:23:50
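
If installing BeautifulSoup is not an option, the parser-based approach can probably be approximated with the standard library's html.parser module. This is only a sketch: the class name SmallTextCollector and its attributes are made up for illustration, and it assumes <small> elements are not nested inside one another.

```python
from html.parser import HTMLParser


class SmallTextCollector(HTMLParser):  # hypothetical helper, not part of any library
    """Collect the stripped text found inside each <small> element."""

    def __init__(self):
        super().__init__()
        self.in_small = False
        self.parts = []    # text chunks of the <small> currently open
        self.results = []  # one entry per completed <small> element

    def handle_starttag(self, tag, attrs):
        if tag == "small":
            self.in_small = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "small" and self.in_small:
            self.results.append("".join(self.parts).strip())
            self.in_small = False

    def handle_data(self, data):
        if self.in_small:
            self.parts.append(data)


html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
     10/12/2015 5:49:00 PM - Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

collector = SmallTextCollector()
collector.feed(html)
collector.close()
print(collector.results)
```

Every <small> element on the page adds one entry to collector.results, so the same loop works for a full page fetched from the site.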

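Here is an approach using xml.etree. With it you can fetch a page with urllib.request (urllib2 in Python 2) and pull out whichever tags you want, provided the page is well-formed XML, like this:

import urllib.request 
from xml.etree import ElementTree 

url = "whateverwebpageyouarelookingin"  # placeholder: put the page URL here 
request = urllib.request.Request(url, headers={"Accept": "application/xml"}) 
with urllib.request.urlopen(request) as u: 
    tree = ElementTree.parse(u)  # raises ParseError if the page is not well-formed XML 
rootElem = tree.getroot() 
# ".//small" searches the whole tree; plain "small" matches only children of the root 
yourdata = rootElem.findall(".//small") 
print([elem.text for elem in yourdata])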
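
One detail worth knowing about ElementTree: a bare tag name passed to findall matches only direct children of the element it is called on, while the path ".//small" searches descendants at any depth. Below is a small offline sketch of that difference. The document string is a made-up, well-formed stand-in for a real page, not something fetched from the site.

```python
from xml.etree import ElementTree

# Made-up, well-formed document standing in for a real page.
doc = """<html><body>
  <div><small> 10/12/2015 5:49:00 PM - Seeking Alpha </small></div>
</body></html>"""

root = ElementTree.fromstring(doc)

direct = root.findall("small")       # only direct children of <html>
anywhere = root.findall(".//small")  # descendants at any depth

texts = [el.text.strip() for el in anywhere]
print(len(direct))  # 0
print(texts)        # ['10/12/2015 5:49:00 PM - Seeking Alpha']
```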


2015-10-13 10:24:09 Amazingred
