Extracting information from a web page using lxml or ???

# Import the Python modules
import urllib.request
import lxml
import mechanize
import sys

# Open a connection to the URL
try:
    URL = urllib.request.urlopen("http://...")
except OSError:
    print("Connection to the URL failed")
    sys.exit(1)

# Read the source code of the URL
URL_quellcode = URL.readlines()

# Close the connection to the URL
URL.close()

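So far so good: I can open the URL and read its source. Now I want to check several possible patterns and extract something from them.

Possibility 1: <p class="author-name">Some Name</p>
Possibility 2: <a rel="author">Some Name</a>

I want to extract the author's name. My logic is as follows:

Check all elements for the class "author-name"; if one is found, give me the text inside the tag. If none is found, check for rel="author"; if one is found, give me the text inside the tag. Otherwise, print "No author found".

How should I do this? I can use regular expressions, lxml, or anything else. What is the most elegant way?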


2014-10-06 eLudium

Use BeautifulSoup:

from bs4 import BeautifulSoup

document_a = """
<html>
    <body>
        <p class="author-name">Some Name</p>
    </body>
</html>
"""

document_b = """
<html>
    <body>
        <p rel="author-name">Some Name</p>
    </body>
</html>
"""


def get_author(document):
    # Parse the document that was passed in, not a fixed sample
    soup = BeautifulSoup(document, "html.parser")
    # First choice: any element with class="author-name"
    p = soup.find(class_="author-name")
    if not p:
        # Fallback: any element with rel="author-name"
        p = soup.find(rel="author-name")
        if not p:
            return "No Author Found"
    return p.text


print("author in first document:", get_author(document_a))
print("author in second document:", get_author(document_b))

Result:

author in first document: Some Name 
author in second document: Some Name
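
For comparison, here is a rough sketch of the lookup order from the question (class "author-name" first, then rel="author", then a "not found" message) using only the standard library's html.parser instead of BeautifulSoup or lxml. The names AuthorParser, find_author and VOID_TAGS are made up for this illustration, and the parser is deliberately simple: it assumes reasonably well-formed markup and is not a replacement for a real HTML library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot wrap any text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AuthorParser(HTMLParser):
    """Collect the text of class="author-name" and rel="author" elements."""

    def __init__(self):
        super().__init__()
        self.by_class = []    # text of elements with class="author-name"
        self.by_rel = []      # text of elements with rel="author"
        self._target = None   # list currently receiving text, if any
        self._tag = None      # tag name that opened the current match
        self._depth = 0       # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        if self._target is not None:
            # Already inside a match: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        # Both attributes may hold several space-separated tokens.
        classes = (attributes.get("class") or "").split()
        rels = (attributes.get("rel") or "").split()
        if "author-name" in classes:
            self._target = self.by_class
        elif "author" in rels:
            self._target = self.by_rel
        else:
            return
        self._tag = tag
        self._depth = 1
        self._target.append("")

    def handle_data(self, data):
        if self._target is not None:
            self._target[-1] += data

    def handle_endtag(self, tag):
        if self._target is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._target = None
                self._tag = None


def find_author(html):
    """Return the author text, preferring class matches over rel matches."""
    parser = AuthorParser()
    parser.feed(html)
    parser.close()
    for candidates in (parser.by_class, parser.by_rel):
        for text in candidates:
            if text.strip():
                return text.strip()
    return "No Author Found"
```

Calling find_author('<a rel="author">Some Name</a>') falls through to the rel check, because no element carries the class. The class match wins whenever both are present, regardless of which one appears first in the page.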


2014-10-06 13:25:00 Kevin

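Awesome, works like a charm. I'm getting started with BS now, and it's really interesting. Anyway, I was wondering how this would work with an unknown number of URLs. I'll be loading them from a .txt file, so I can't do it like document_a, _b, _c and so on. Basically, the output would be URL, author name pairs printed as a list. – eLudium 2014-10-06 15:10:22

In that case, you'd need to do something like 'print([(url, get_author(get_document(url))) for url in my_file])'. You'll have to write a 'get_document' function that retrieves the HTML data from a given URL. – Kevin 2014-10-06 15:30:48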
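
One possible shape for that loop, as a sketch only. It assumes Python 3 and a text file with one URL per line. get_document is the name suggested in the comment above, while clean_urls and collect_authors are hypothetical helpers invented for this example. The extractor is passed in as a parameter, so the answer's get_author (or any other function that takes HTML and returns a name) can be plugged in.

```python
import urllib.request


def clean_urls(lines):
    """Strip whitespace and drop blank lines from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def get_document(url, timeout=10):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect_authors(urls, extract, fetch=get_document):
    """Return a list of (url, author) pairs; failed downloads give None."""
    results = []
    for url in urls:
        try:
            html = fetch(url)
        except (OSError, ValueError):
            # URLError is an OSError; ValueError covers malformed URLs.
            results.append((url, None))
            continue
        results.append((url, extract(html)))
    return results
```

With a hypothetical urls.txt, usage could look like opening the file, passing the file object to clean_urls, and then printing collect_authors(urls, get_author). Keeping the download step behind the fetch parameter also makes it easy to try the loop with canned HTML before hitting real sites.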
