Python HTML - 通過屬性獲取元素

我經常閱讀音樂網站，它有一個部分，用戶可以發佈自己的虛構音樂相關故事。有一個91部分系列（寫在一段時間，部分上傳），總是遵循以下約定： http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html。Python HTML - 通過屬性獲取元素

我希望能夠從每個部分獲得格式化的文本並將其放入一個html文件中。

方便地，有一個打印版本的鏈接，正確格式爲我的目的。我所要做的就是編寫一個腳本來下載所有的部分，然後將它們轉儲到文件中。不難。

不幸的是，印刷版的網址爲： www.ultimate-guitar.com/print.php?what=article & ID = 95932

知道什麼文章對應的唯一方法是什麼ID字段是查看原始文章中某個輸入標籤的值屬性。

我想要做的是這樣的：

Go to each page, incrementng through the varying numbers. 

Find the <input> tag with attribute 'name="rowid"' and get the number in it's 'value=' attribute. 

Go to www.ultimate-guitar.com/print.php?what=article&id=<value>. 
Append everything (minus <html><head> and <body> to a html file. 

Rinse and repeat.

這可能嗎？ python是正確的語言嗎？另外，我應該使用什麼dom/html/xml庫？

感謝您的任何幫助。

來源

2012-02-26 Matt Hood

你實際上可以在javascript/jquery中做到這一點，沒有太多的麻煩。 javascripty-pseudocode，附加到空文檔：

for(var pageNum = 1; i<= 91; i++) { 
    $.ajax({ 
     url: url + pageNum, 
     async: false, 
     success: function() { 
      var printId = $('input[name="rowid"]').val(); 
      $.ajax({ 
       url: printUrl + printId, 
       async: false, 
       success: function(data) { 
        $('body').append($(data).find('body').contents()); 
       } 
      }); 
     } 
    }); 
}

加載完成後，您可以將生成的HTML保存到文件中。

來源

2012-02-26 03:19:15 beerbajay

這將被視爲跨域，並且不適用於瀏覽器安全目的 – Vigrond 2012-02-26 03:51:59

正確。它將作爲一個修改一點點的油門猴腳本。 – beerbajay 2012-02-26 05:04:34

隨着LXML和的urllib2：

import lxml.html 
import urllib2 

#implement the logic to download each page, with HTML strings in a sequence named pages 
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s" 

for page in pages: 
    html = lxml.html.fromstring(page) 
    ID = html.find(".//input[@name='rowid']").value 
    article = urllib2.urlopen(url % ID).read() 
    article_html = lxml.html.fromstring(article) 
    with open(ID + ".html", "w") as html_file: 
     html_file.write(article_html.find(".//body").text_content())

編輯：在運行此，似乎有可能在一些頁面Unicode字符。解決此問題的一種方法是執行article = article.encode("ascii", "ignore")或將.read（）後面的encode方法強制爲ASCII並忽略Unicode，儘管這是一個懶惰的修復。

這是假設你只想要正文標籤內所有內容的文本內容。這將在Python文件的本地目錄中以storyID.html格式（如「95932.html」）保存文件。如果您喜歡，請更改保存語義。

來源

2012-02-26 03:46:04 Anorov

Python HTML - 通過屬性獲取元素

回答

相關問題