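Scraping with Python?

I would like to get all the index words and their definitions from here. Is it possible to scrape web content with Python?

Exploring with Firebug shows that the following URL returns the content I want, including both the index entry and its definition, for 'a'.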

http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined

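Which modules should I use? Are there any tutorials available?

I don't know how many words are indexed in the dictionary. I am an absolute beginner at programming.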

2011-04-12 SAKAMOTO

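Thank you for the quick reply. But I am a real beginner, so I can't figure out how to get all the content, including the definitions from A to Z, from an unknown number of URLs. – SAKAMOTO 2011-04-12 12:58:13

[Scraping html with Python or ...](http://stackoverflow.com/questions/2181708/scraping-html-with-python-or) – 2011-04-12 13:00:33

Could you please refine your question and describe exactly what you are trying to scrape, in terms of input and expected output? – 2011-04-12 13:36:55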

You should use urllib2 to fetch the URL contents and BeautifulSoup to parse the HTML/XML.

Example - retrieving all the questions from the StackOverflow.com home page:

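import urllib2
from BeautifulSoup import BeautifulSoup

# Download the home page and parse it (Python 2, BeautifulSoup 3)
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)

# Calling the soup finds every <h3> tag; print the contents of each one
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print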

This code sample is adapted from the BeautifulSoup documentation.
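For anyone on Python 3, where urllib2 has become urllib.request, here is a rough sketch of the same idea (collecting the text of every h3 tag) using only the standard library's html.parser. The class name, the helper function and the sample markup are all made up for illustration, and the page is an inline string rather than a real download, so treat it as a starting point rather than a finished scraper.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text inside every <h3> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._depth = 0      # greater than 0 while inside an <h3>
        self._parts = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.headings.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Text of nested tags (links, bold, ...) is kept as well.
        if self._depth > 0:
            self._parts.append(data)


def extract_h3(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h3><a href="/q/1">Scraping with Python?</a></h3>
  <p>not a heading</p>
  <h3>Parsing <b>XML</b> responses</h3>
</body></html>
"""

print(extract_h3(SAMPLE_HTML))
```

With a real page you would pass the decoded response body to extract_h3 instead of SAMPLE_HTML.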

2011-04-12 12:48:45

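+1 because you gave a nice example. BeautifulSoup kicks ass ;) – 2011-04-12 12:58:11

@dasWeezul Thanks! It is certainly a good package, but it has serious unicode problems. – 2011-04-12 12:58:53

Note, though, that Beautiful Soup is currently unmaintained. I think 'lxml.html' can parse HTML, though possibly with a less friendly API. (I have never used Beautiful Soup, and I haven't used 'lxml' either, so I'm not sure.) – 2011-04-12 12:59:55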

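You can use the built-in urllib or urllib2 to fetch data from the web, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can handle just about anything. http://www.crummy.com/software/BeautifulSoup/

The documentation is structured like a tutorial, more or less: http://www.crummy.com/software/BeautifulSoup/documentation.html

In your case, you will probably need to use wildcards to see all the entries in the dictionary. You could do something like this: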

import urllib2

def getArticles(query, start_index, count):
    # Search request for entries matching the (wildcard) query
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
          (query, start_index, count)).read()

    # TODO:
    # parse the xml here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each article.
    # article = (article_name, article_id)
    articles = []  # fill with (article_name, article_id) tuples parsed from xml

    return articles

def getArticleContent(article_id):
    # article_id must be an integer because of the %d placeholder
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
          'acti=xart&arid=%d&sphra=undefined' % article_id).read()

    # TODO: parse xml; until then the raw response is returned
    parsed_article = xml
    return parsed_article
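To give an idea of what the parsing TODO could look like, here is a minimal Python 3 sketch using xml.etree.ElementTree. I have not checked what the server actually returns, so the element name "entry", the attribute "id", the sample document and the function name parse_articles are all assumptions; look at a real response and substitute the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical search response: the element and attribute names, the
# entries and the ids are invented for illustration only.
SAMPLE_SEARCH_XML = """
<results>
  <entry id="101">ana</entry>
  <entry id="102">anaka</entry>
</results>
"""


def parse_articles(xml_text):
    """Return a list of (article_name, article_id) tuples."""
    root = ET.fromstring(xml_text.strip())
    articles = []
    for entry in root.iter("entry"):      # assumed element name
        name = (entry.text or "").strip()
        article_id = int(entry.get("id"))  # assumed attribute name
        articles.append((name, article_id))
    return articles


print(parse_articles(SAMPLE_SEARCH_XML))
```

Converting the id to an int here keeps it compatible with the %d placeholder used when requesting the article content.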

Now you can loop over the results. For example, to get all the articles starting with 'ana', use the wildcard 'ana*' and loop until you get no more results:

query = 'ana*'
article_dict = {}
i = 0
while True:
    # Ask for the next batch of 100 entries
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break  # an empty batch means there is nothing left

    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)

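Once this is done, you will have a dictionary holding the content of all the articles, keyed by name. I have omitted the parsing itself, but it is quite simple in this case, since everything is XML. You may not even need BeautifulSoup (although it is still handy and easy to use for XML).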
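If you want to convince yourself that the batching loop stops when it should before pointing it at the real server, something along these lines may help. It is only a sketch in Python 3: fetch_all and fake_get_articles are hypothetical names, and the fake function serves slices of a made-up list in place of getArticles.

```python
def fetch_all(get_page, query, page_size=100):
    """Call get_page(query, start, count) until it returns an empty list."""
    results = []
    start = 0
    while True:
        page = get_page(query, start, page_size)
        if not page:
            break  # empty batch: nothing left to fetch
        results.extend(page)
        start += page_size
    return results


# Made-up data: 250 (name, id) pairs standing in for search results.
FAKE_ENTRIES = [("ana%03d" % n, n) for n in range(250)]


def fake_get_articles(query, start_index, count):
    # Stand-in for getArticles; ignores the query and never touches the network.
    return FAKE_ENTRIES[start_index:start_index + count]


articles = fetch_all(fake_get_articles, "ana*")
print(len(articles))
```

Passing the page-fetching function in as an argument means the same loop can later be called with the real getArticles without any changes.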

2011-04-12 13:21:05

+1 for the tailor-made answer. – 2011-04-12 13:23:44

I really appreciate your kind instruction. Thank you very much. – SAKAMOTO 2011-04-12 13:38:12