Determine the number of pages on a website with Python

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

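The reference part of the URL contains the following information:

A7 == the parliament (currently the seventh parliament; the previous one was A6, and so on)

2010 == the year

0001 == the document number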
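To make the structure of the reference concrete, here is a minimal sketch that builds and splits such a reference string. The helper names `build_reference` and `parse_reference` are hypothetical, and the pattern is only assumed from the single example `A7-2010-0001` above.

```python
import re

# Pattern assumed from the example reference "A7-2010-0001".
REFERENCE_RE = re.compile(r"^A(\d+)-(\d{4})-(\d{4})$")


def build_reference(parliament, year, number):
    """Return a reference such as 'A7-2010-0001' (hypothetical helper)."""
    return "A%d-%d-%04d" % (parliament, year, number)


def parse_reference(reference):
    """Split a reference into (parliament, year, document number)."""
    match = REFERENCE_RE.match(reference)
    if match is None:
        raise ValueError("unexpected reference format: %r" % reference)
    parliament, year, number = match.groups()
    return int(parliament), int(year), int(number)
```

For example, `build_reference(7, 2010, 1)` gives the reference used in the link above.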

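For each year and parliament, I would like to determine the number of documents on the website. The numbering is complicated: for 2010, for example, numbers 186, 195 and 196 have empty pages, while the highest number is 214. Ideally, the output should be a vector containing all document numbers, excluding the missing ones.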
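As an illustration of the output I have in mind, here is the 2010 case written out by hand. It uses only the numbers mentioned above and assumes, purely for the sake of the example, that every other number up to 214 exists.

```python
# Numbers taken from the 2010 example; that all other numbers exist is an assumption.
MAX_NUMBER = 214
EMPTY_PAGES = {186, 195, 196}

# The wanted "vector": every document number that is not an empty page.
document_numbers = [n for n in range(1, MAX_NUMBER + 1) if n not in EMPTY_PAGES]

print(len(document_numbers))  # number of documents found
print(document_numbers[-1])   # highest document number
```

The goal is to produce this list automatically instead of typing in the empty pages by hand.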

Can anyone tell me whether this is possible in Python?

Best, Thomas

Source

2010-07-09 Thomas Jensen

Here is a solution, but it is a good idea to add a short pause between requests:

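import urllib.error
import urllib.request

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
MAX_RANGE = 300

for year in [2010, 2011]:
    for page in range(1, MAX_RANGE):
        try:
            with urllib.request.urlopen(URL_TEMPLATE % (year, page)) as f:
                text = f.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            # urlopen raises on HTTP error codes; the error page body is still readable
            text = err.read().decode("utf-8", errors="replace")
        # a missing document is served as a page titled "Application Error"
        if "<title>Application Error</title>" in text:
            print("year %d and page %.4d NOT found" % (year, page))
        else:
            print("year %d and page %.4d FOUND" % (year, page))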
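The pause mentioned above could look roughly like this. It is only a sketch: the function name `check_pages`, the `fetch` parameter and the one-second default delay are assumptions chosen for illustration, not something the site prescribes.

```python
import time

ERROR_MARKER = "<title>Application Error</title>"


def check_pages(urls, fetch, delay=1.0, sleep=time.sleep):
    """Return {url: True/False} telling whether each page exists.

    `fetch` is any callable that takes a URL and returns the page text;
    `delay` is the pause in seconds between two consecutive requests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # be polite: wait before every request but the first
        results[url] = ERROR_MARKER not in fetch(url)
    return results
```

Passing `fetch` in as a parameter keeps the download code separate from the throttling, so the loop can be tried out on canned pages before hitting the real server.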

Source

2010-07-09 06:18:30 zoli2k

Thank you very much, all the answers are very good examples! – 2010-07-12 10:58:51

First, make sure that scraping their website is legal.

Second, note that when a document does not exist, the HTML file contains:

<title>Application Error</title>
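A small helper makes this check reusable. It is a sketch under the assumption that the error page always carries exactly this title; the name `is_missing` is made up for the example.

```python
ERROR_TITLE = "<title>Application Error</title>"


def is_missing(html):
    """Return True if the page is the error page served for a missing document."""
    if isinstance(html, bytes):
        # Raw responses are bytes; decode leniently so odd characters cannot raise.
        html = html.decode("utf-8", errors="replace")
    return ERROR_TITLE in html
```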

Third, use urllib to loop over everything you want:

for p in range(1, 7):
    for y in range(2000, 2011):
        doc = 1
        while True:
            # use urllib to open the url: (root)+p+y+doc
            # if the HTML has the string "Application Error", break from the while
            doc += 1
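One caveat with breaking at the first error page: according to the question, 2010 has gaps at 186, 195 and 196 while documents continue up to 214, so such a loop would stop too early. A possible variant, sketched here with hypothetical names (`scan_documents`, `exists`, `max_gap`), only gives up after several consecutive misses. The threshold of 20 is an arbitrary assumption.

```python
def scan_documents(exists, max_gap=20, start=1):
    """Collect document numbers, stopping after `max_gap` consecutive misses.

    `exists` is any callable mapping a document number to True/False,
    for example a function that downloads the page and looks for the
    "Application Error" title.
    """
    found = []
    misses = 0
    doc = start
    while misses < max_gap:
        if exists(doc):
            found.append(doc)
            misses = 0  # a hit resets the gap counter
        else:
            misses += 1
        doc += 1
    return found
```

The returned list is the vector of existing document numbers the question asks for, and its length is the document count for that parliament and year.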

Source

2010-07-09 05:45:12 Escualo

Thanks, very helpful! The site is public (these are our elected representatives), so I don't think the legal side should be a problem. – 2010-07-12 10:59:47

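Here is a slightly more complete (but hacky) example that seems to work (using httplib2). I'm sure you can customise it for your specific needs.

I would also repeat Arrieta's warning to make sure that the site's owner doesn't mind you scraping its content.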

#!/usr/bin/env python3
import httplib2

# cache responses on disk in the ".cache" directory
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04d&language=EN"

for document in range(0, 9999):
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    # content is bytes, so search for a bytes marker
    if content.find(b"Application Error") == -1:
        print("Document %04d exists" % document)
        urllist.append(url)
        doclist.append(document)
    else:
        print("Document %04d doesn't exist" % document)
print("Parliament %s, year %d has %d documents" % (parliament, year, len(doclist)))
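To cover every parliament and year, as the question asks, the same check could be wrapped in a function and the results collected in a dictionary. The following is only a sketch: `count_documents` and its `exists` parameter are hypothetical names, and `exists` stands in for the httplib2 request and "Application Error" test used above.

```python
def count_documents(parliaments, years, exists, max_number=9999):
    """Return {(parliament, year): [document numbers that exist]}.

    `exists(parliament, year, number)` must return True when the document
    page is available; how it checks that (httplib2, urllib, ...) is up to
    the caller.
    """
    results = {}
    for parliament in parliaments:
        for year in years:
            results[(parliament, year)] = [
                number
                for number in range(1, max_number + 1)
                if exists(parliament, year, number)
            ]
    return results
```

With this in place, `len(results[("A7", 2010)])` would give the number of documents for the seventh parliament in 2010.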

Source

2010-07-09 06:13:39

Thank you very much, Jon, for the detailed answer. This is great for someone who is learning Python! – 2010-07-12 11:00:15
