Automatically extracting feed links (Atom, RSS, etc.) from web pages

I have a huge list of URLs, and my task is to feed them to a Python script that should spit out the feed URLs, if there are any. Is there an API, library, or piece of code that can help?

2011-10-25 Max

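I second waffle paradox in recommending Beautiful Soup to parse the HTML and then get the <link rel="alternate"> tags, which is where the feeds are referenced. The code I usually use: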

from bs4 import BeautifulSoup


def detect_feeds_in_HTML(input_stream):
    """Examine an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check that we were really given an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) from the input stream
    html = BeautifulSoup(input_stream.read(), "html.parser")
    # find all <link> tags whose rel attribute is "alternate"
    feed_urls = html.find_all("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # only keep links that actually carry a URL
        if url:
            # feed_type is None when the tag has no type attribute
            result.append((url, feed_link.get("type", None)))
    return result


2011-10-25 07:20:14 PhilS

I don't know of any existing library, but Atom or RSS feeds are usually indicated with <link> tags in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">

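A straightforward way would be to download these URLs, parse them with an HTML parser such as lxml.html, and get the href attribute of the relevant <link> tags.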
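As a rough illustration of that idea, here is a minimal sketch that uses the standard library's html.parser in place of lxml.html, so it runs without extra dependencies. The names FeedLinkParser, find_feed_links and FEED_TYPES are made up for this example, and the set of MIME types is an assumption you may want to extend. Downloading the page is left out; the function takes HTML text you have already fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds here; this set is an assumption, extend as needed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that declare a feed type."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; valueless attributes are None
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").strip().lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # resolve relative hrefs against the URL of the page
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_links(html, base_url=""):
    """Return the feed URLs referenced in the given HTML text, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Passing the page's own URL as base_url matters in practice, because many sites reference their feed with a relative href such as /rss.xml.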


2011-10-25 03:23:49 Avaris

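Depending on how well-formed the information in these feeds is (for example: are all the links in the form http://.../? Do you know whether they will all be in href attributes or link tags? Do all the links in the feeds go to other feeds?), I'd recommend anything from a simple regex to a full parsing module to extract the links from the feeds. As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only get you so far, though, especially in the case mentioned above: if you can't guarantee that all the links in the data point to other feeds, you will have to do some additional crawling and probing on your own.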
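To make the "simple regex" and "probing" ends of that spectrum a bit more concrete, here is a small sketch. Both helper names (candidate_urls and looks_like_feed) are hypothetical, the URL pattern is deliberately loose, and the feed check is only a heuristic that looks for an RSS, Atom or RDF root element near the start of a document you have already downloaded.

```python
import re

# Root elements of RSS 2.0, Atom, and RSS 1.0 (RDF) documents.
_FEED_ROOT = re.compile(r"<\s*(rss|feed|rdf:RDF)\b", re.IGNORECASE)
# Loose pattern for absolute http(s) URLs in raw text; not a full URL grammar.
_URL = re.compile(r"https?://[^\s\"'<>]+")


def candidate_urls(text):
    """Return every absolute http(s) URL found in the text, in order of appearance."""
    return _URL.findall(text)


def looks_like_feed(body):
    """Heuristically decide whether a downloaded document is a feed.

    Only the first 1000 characters are inspected, since the root element
    of a feed appears right after the XML declaration.
    """
    return bool(_FEED_ROOT.search(body[:1000]))
```

One possible workflow would be to collect candidates with candidate_urls, download each one, and keep only those for which looks_like_feed returns True.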


2011-10-25 03:27:53

There is feedfinder:

>>> import feedfinder 
>>> 
>>> feedfinder.feed('scripting.com') 
'http://scripting.com/rss.xml' 
>>> 
>>> feedfinder.feeds('delong.typepad.com/sdj/')
['http://delong.typepad.com/sdj/atom.xml', 
'http://delong.typepad.com/sdj/index.rdf', 
'http://delong.typepad.com/sdj/rss.xml'] 
>>>


2013-03-22 08:46:08

feedfinder is no longer maintained, but there is now feedfinder2 (https://pypi.python.org/pypi/feedfinder2). – Scarabee
