Maybe you want something like this:
from urllib import urlopen
import re

pgno = 2
url = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode0%s" % str(pgno)
print url + '\n'

# Download the page.
sock = urlopen(url)
htmlcode = sock.read()
sock.close()

# Find the first __doPostBack link; rows are searched for from that offset on.
x = re.search('%;"><a href="javascript:__doPostBack', htmlcode).start()

# One table row: four <td> cells, each captured as a group.
pat = (r'\t\t\t\t<td style="width:\d+%;">(\d+)</td>'
       r'<td style="width:\d+%;">(.+?)</td>'
       r'<td style="width:\d+%;">(.+?)</td>'
       r'<td style="width:30%;">(.+?)</td>\r\n')
regx = re.compile(pat)
print '\n'.join(map(repr, regx.findall(htmlcode, x)))
Result:
http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode02
('110001', 'New Delhi', 'Delhi', 'Baroda House')
('110001', 'New Delhi', 'Delhi', 'Bengali Market')
('110001', 'New Delhi', 'Delhi', 'Bhagat Singh Market')
('110001', 'New Delhi', 'Delhi', 'Connaught Place')
('110001', 'New Delhi', 'Delhi', 'Constitution House')
('110001', 'New Delhi', 'Delhi', 'Election Commission')
('110001', 'New Delhi', 'Delhi', 'Janpath')
('110001', 'New Delhi', 'Delhi', 'Krishi Bhawan')
('110001', 'New Delhi', 'Delhi', 'Lady Harding Medical College')
('110001', 'New Delhi', 'Delhi', 'New Delhi Gpo')
('110001', 'New Delhi', 'Delhi', 'New Delhi Ho')
('110001', 'New Delhi', 'Delhi', 'North Avenue')
('110001', 'New Delhi', 'Delhi', 'Parliament House')
('110001', 'New Delhi', 'Delhi', 'Patiala House')
('110001', 'New Delhi', 'Delhi', 'Pragati Maidan')
('110001', 'New Delhi', 'Delhi', 'Rail Bhawan')
('110001', 'New Delhi', 'Delhi', 'Sansad Marg Hpo')
('110001', 'New Delhi', 'Delhi', 'Sansadiya Soudh')
('110001', 'New Delhi', 'Delhi', 'Secretariat North')
('110001', 'New Delhi', 'Delhi', 'Shastri Bhawan')
('110001', 'New Delhi', 'Delhi', 'Supreme Court')
('110002', 'New Delhi', 'Delhi', 'Rajghat Power House')
('110002', 'New Delhi', 'Delhi', 'Minto Road')
('110002', 'New Delhi', 'Delhi', 'Indraprastha Hpo')
('110002', 'New Delhi', 'Delhi', 'Darya Ganj')
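To see the pattern at work without touching the network, here is a minimal sketch that applies the same four-group regex to a small hand-written fragment. The fragment is a made-up approximation of two table rows (the real page's markup may differ), `extract_rows` is a hypothetical helper name, and the sketch is written for Python 3 rather than the Python 2 used above.

```python
import re

# Hypothetical fragment imitating two rows of the result table.
# It is NOT copied from the site; the real markup may differ.
SAMPLE = (
    '\t\t\t\t<td style="width:10%;">110001</td>'
    '<td style="width:30%;">New Delhi</td>'
    '<td style="width:30%;">Delhi</td>'
    '<td style="width:30%;">Janpath</td>\r\n'
    '\t\t\t\t<td style="width:10%;">110002</td>'
    '<td style="width:30%;">New Delhi</td>'
    '<td style="width:30%;">Delhi</td>'
    '<td style="width:30%;">Minto Road</td>\r\n'
)

# Same pattern as above, written with raw strings.
ROW_PATTERN = re.compile(
    r'\t\t\t\t<td style="width:\d+%;">(\d+)</td>'
    r'<td style="width:\d+%;">(.+?)</td>'
    r'<td style="width:\d+%;">(.+?)</td>'
    r'<td style="width:30%;">(.+?)</td>\r\n'
)


def extract_rows(html, start=0):
    """Return one 4-tuple per table row found at or after offset `start`."""
    return ROW_PATTERN.findall(html, start)


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the pattern has four groups, `findall` returns a list of 4-tuples, which is why each printed result line is a tuple.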
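I wrote this code after studying the structure of the HTML source with the following code (I think you will understand it without further explanation):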
from collections import defaultdict
from urllib2 import urlopen

pgno = 2
url = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode0%s" % str(pgno)
print url + '\n'

sock = urlopen(url)
htmlcode = sock.read()
sock.close()

# Show the lines with index 276 through 299, to locate the data rows.
li = htmlcode.splitlines(True)
print '\n'.join(str(i) + ' ' + repr(line) + '\n' for i, line in enumerate(li) if 275 < i < 300)

# Count how often each character occurs in the first 291 lines.
ch = ''.join(li[0:291])
didi = defaultdict(int)
for c in ch:
    didi[c] += 1

# List the characters of line 289 that occur fewer than 35 times in that span.
print '\n\n' + repr(li[289])
print '\n'.join('%r -> %s' % (c, didi[c]) for c in li[289] if didi[c] < 35)
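One way to read the counting step: characters that are rare in the part of the page before the data make good anchors for a search pattern. That reading is an interpretation, not something stated above. Here is a minimal Python 3 sketch of the idea using `collections.Counter` on a made-up string; `rare_chars` is a hypothetical helper name, and unlike the code above it reports each character only once.

```python
from collections import Counter


def rare_chars(text, target_line, threshold):
    """List (char, count) for chars of `target_line` seen fewer than
    `threshold` times in `text`, in order of first appearance."""
    counts = Counter(text)
    seen = []
    for c in target_line:
        if counts[c] < threshold and c not in seen:
            seen.append(c)
    return [(c, counts[c]) for c in seen]


# Made-up stand-in for the page text and for the line of interest.
text = 'aaaa\nbbbb\naab%;\n'
target = 'aab%;\n'

result = rare_chars(text, target, threshold=2)
for c, n in result:
    print('%r -> %s' % (c, n))
```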
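Now, the problem is that the same HTML is returned for every value of pgno. The site may be detecting that a program, rather than a browser, is connecting to fetch the data. This has to be handled with the tools in urllib2, but I have no experience with them.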
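One thing that might be worth trying is sending a browser-like User-Agent header. This is only a sketch under assumptions: it is not known whether the site checks that header, and the `__doPostBack` links in the page suggest the paging may instead be driven by a form POST, in which case this will not help. The sketch uses Python 3, where the urllib2 tools live in `urllib.request`; `build_request` and `fetch` are hypothetical helper names, and the User-Agent value is just an example.

```python
from urllib.request import Request, urlopen

BASE_URL = ("http://www.eximguru.com/traderesources/pincode.aspx"
            "?&GridInfo=Pincode0%s")


def build_request(pgno, user_agent="Mozilla/5.0"):
    """Build a request for page `pgno` with a browser-like User-Agent.

    Assumption: the server may treat the default Python User-Agent
    differently from a browser's. This is not confirmed.
    """
    return Request(BASE_URL % pgno, headers={"User-Agent": user_agent})


def fetch(pgno):
    """Download page `pgno` and return the raw bytes (needs network)."""
    with urlopen(build_request(pgno), timeout=30) as sock:
        return sock.read()


# Inspect the request without sending it.
req = build_request(2)
print(req.full_url)
print(req.get_header("User-agent"))
```

Comparing `fetch(1)` with `fetch(2)` would show whether the header makes any difference.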
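You can format code on Stack Overflow by selecting it and pressing Ctrl + K. – phihag
Format inline code by putting backticks (`) around it. –
Adding 'str(pgno)' should have worked. Try building the URL separately and printing it to see what you get. By the way, after '09', do you want '010', and then '099', '0100'? –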
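Following up on the last comment, a short sketch that builds the URL separately and prints it for a few page numbers, so the suffixes the template produces are visible. `page_url` is a hypothetical helper name, and whether the site expects these suffixes is not known.

```python
URL_TEMPLATE = ("http://www.eximguru.com/traderesources/pincode.aspx"
                "?&GridInfo=Pincode0%s")


def page_url(pgno):
    """Return the URL the template yields for page number `pgno`."""
    return URL_TEMPLATE % pgno


for pgno in (9, 10, 99, 100):
    print(page_url(pgno))
```

The literal `0` in the template is always kept, so page 10 yields `Pincode010` and page 100 yields `Pincode0100`.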