Reading a web page with Python

I am trying to read and process a web page in Python that contains lines like the one below:

   <div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">

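For now I am only interested in the artist name (AC/DC) and the album name (Live). I can read and print them with libxml2dom, but I cannot figure out how to tell the links apart, because the node value of every link is None.

One obvious approach is to read the input one line at a time, but is there a smarter way to process this HTML file, so that I end up with two separate lists in which each index matches the other, or with some structure that holds this information?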

import urllib
import libxml2dom

def collect_text(node):
    "A function which collects text inside 'node', returning that text."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Python 2 code: urllib.urlopen also accepts a local file path.
f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
f.close()

doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for link in links:
    print "--\nNode ", link.childNodes
    # localName is always "a" here; the class attribute is what
    # distinguishes artist links from album links.
    if link.getAttribute("class") == "artist":
        print "artist"
    print collect_text(link).encode('utf-8')

2010-08-09 Makis

Can you show us your current code? Perhaps you need to explicitly reference the anchor's firstChild (the text node)? – 2010-08-09 15:14:21
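To illustrate the point about firstChild, here is a minimal sketch. It uses the standard library's xml.dom.minidom as a stand-in for libxml2dom, on the assumption that both follow the W3C DOM here: an element node has no value of its own, and its text lives in a child text node. The shortened markup is only a sample, and the code targets Python 3.

```python
from xml.dom import minidom

# A cut-down, well-formed version of one table cell from the question.
doc = minidom.parseString(
    '<td class="or_q_artist">'
    '<a class="artist" href="http://rateyourmusic.com/artist/ac_dc">AC/DC</a>'
    '</td>'
)
link = doc.getElementsByTagName("a")[0]

# Element nodes carry no value, which is why every link printed None.
element_value = link.nodeValue

# The visible text is stored in the anchor's first child, a text node.
text_value = link.firstChild.nodeValue

# The class attribute tells artist links and album links apart.
link_class = link.getAttribute("class")

print(element_value, text_value, link_class)
```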

I don't see anything wrong with reading the input one line at a time. – katrielalex 2010-08-09 15:25:26

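Just a note in case your for loop can run many times: creating new strings is expensive as hell (they are immutable, so you end up creating a new object every time), and you do it on every iteration. It is better to append to a list and then ''.join() the list after the loop. It can give a dramatic speedup. – Daenyth 2010-08-09 20:21:11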
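The list-and-join suggestion in the comment above could look roughly like this. It is a sketch in Python 3 that again uses xml.dom.minidom in place of libxml2dom; the helper name _gather is made up for this example.

```python
from xml.dom import minidom


def _gather(node, parts):
    """Append every text fragment found beneath node to the list parts."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.nodeValue)
        else:
            _gather(child, parts)


def collect_text(node):
    """Return all text inside node, building the string only once."""
    parts = []
    _gather(node, parts)
    return "".join(parts)


doc = minidom.parseString("<a>AC/<b>DC</b> Live</a>")
collected = collect_text(doc.documentElement)
print(collected)
```

All fragments go into a single shared list, so the final string is built by one join call instead of one concatenation per node.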

With such a small snippet of HTML I can't tell whether this will work on the complete page, but here is how to extract "AC/DC" and "Live" using lxml.etree and XPath.

>>> from lxml import etree 
>>> doc = etree.HTML("""<html> 
... <head></head> 
... <body> 
... <tr> 
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td> 
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td> 
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td> 
... <td class="or_q_tags_td"> 
... </tr> 
... </body> 
... </html> 
... """) 
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()') 
['AC/DC', 'Live']
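The XPath result above is one flat list. If two parallel lists are wanted, as the question asks, one possible sketch follows. Note that it swaps lxml for the standard library's html.parser, purely to avoid a third-party dependency, and targets Python 3. The class name CollectionParser is invented for this example, and the indexes only line up under the assumption that every row contains exactly one artist link and one album link.

```python
from html.parser import HTMLParser


class CollectionParser(HTMLParser):
    """Collect the text of <a class="artist"> and <a class="album"> links."""

    def __init__(self):
        super().__init__()
        self.artists = []
        self.albums = []
        self._target = None  # list receiving the text of the current link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        css_class = dict(attrs).get("class")
        # Pick the list by class; any other link is ignored (None).
        self._target = {"artist": self.artists, "album": self.albums}.get(css_class)
        if self._target is not None:
            self._target.append("")

    def handle_data(self, data):
        if self._target is not None:
            self._target[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._target = None


sample = (
    '<table><tr>'
    '<td class="or_q_artist"><a title="[Artist916]" '
    'href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>'
    '<td class="or_q_album"><a title="[Album374717]" '
    'href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" '
    'class="album">Live</a></td>'
    '</tr></table>'
)

parser = CollectionParser()
parser.feed(sample)
parser.close()

# Assumes one artist link and one album link per row, so indexes match.
pairs = list(zip(parser.artists, parser.albums))
print(pairs)
```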

2010-08-09 16:19:45 MattH

You can find the complete file at http://rateyourmusic.com/collection_p/Makis/oo, but you cannot read it directly from the site, because they seem to block access by scripts. – Makis 2010-08-09 19:00:44

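You can't read it directly because you need to be logged in to read it. In other words, nobody else can read it unless you post your username and password. If you have any phishing sites, you should post your username and password. – aaronasterling 2010-08-09 19:33:14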

Ouch, I didn't check that. You can view anyone's collection, but you cannot open the printable page (which contains all the albums on a single page). – Makis 2010-08-10 17:31:23

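See whether you can solve the problem with jQuery-style DOM/CSS selectors, as you would in JavaScript, to get at the elements and text you want.
If you can get hold of a copy of BeautifulSoup for Python, you should be up and running within minutes.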

2010-08-09 20:15:47 dhruvbird
