Unable to parse data from a web page correctly with BeautifulSoup. Below is the code I am using:
link1 = "https://www.codechef.com/status/" + sys.argv[1] + "?sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(link1)
s = response.read()
soup = BeautifulSoup(s)
l = soup.findAll('tr',{'class' : 'kol'})
The snippet above fetches the page whose URL is stored in the variable link1. An example page URL is https://www.codechef.com/status/CIELAB?sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO
The problem is that the variable l always ends up as an empty list, even though the table on the page does contain rows with the HTML markup I am searching for.
Please help me figure out what is going wrong.
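As a quick sanity check (just a sketch, reusing the same Python 2 / urllib2 setup as above), it may help to confirm that the raw HTML actually contains the class I am searching for before blaming the parser:

import sys
import urllib2

# Same URL as in the snippet above.
url = ("https://www.codechef.com/status/" + sys.argv[1] +
       "?sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO")
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(url).read()

# 0 means the rows are not in the response at all (wrong URL, redirect, or
# JavaScript-generated content); a positive count means the lookup itself fails.
print html.count('class="kol"')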
Edit
Full code:
from BeautifulSoup import BeautifulSoup
import urllib2
import os
import sys
import subprocess
import time
import HTMLParser
import requests
html_parser = HTMLParser.HTMLParser()
link = "https://www.codechef.com/status/"+sys.argv[1]+"?sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(link)
s = response.read()
soup = BeautifulSoup(s)
# Work out how many result pages there are from the "... of N" text in the
# pagination block; fall back to a single page if that fails.
try:
    l = soup.findAll('div', {'class': 'pageinfo'})
    for x in l:
        str_val = str(x.contents)
        pos = str_val.find('of')
        i = pos + 3
        x = 0
        while i < len(str_val):
            if str_val[i] >= str(0) and str_val[i] <= str(9):
                x = x*10 + int(str_val[i])
            i += 1
except:
    x = 1
print x

global lis
lis = list()
break_loop = 0

# Walk every result page and collect the submission ids of PHP solutions.
for i in range(0, x):
    print i
    if break_loop == 1:
        break
    if i == 0:
        link1 = link
    else:
        link1 = "https://www.codechef.com/status/"+sys.argv[1]+"?page="+str(i)+"&sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO"
    # opener = urllib2.build_opener()
    # opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # response = opener.open(link1)
    useragent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    req = requests.get(link1, headers={'User-Agent': useragent})
    # s = response.read()
    soup = BeautifulSoup(req.content)
    l = soup.findAll('tr', {'class': r'\"kol\"'})
    print l
    for val in l:
        # Language cell (the <td> with width 70).
        lang_val = val.find('td', {'width': '70'})
        lang = lang_val.renderContents().strip()
        print lang
        # Result cell (the <td> with width 51).
        try:
            data = val.find('td', {'width': '51'})
            data_val = data.span.contents
        except:
            break
        if lang != 'PHP':
            break_loop = 1
            break
        if len(data_val) > 1 and html_parser.unescape(data_val[2]) != '100':
            continue
        # Pull the numeric submission id out of the first cell of the row.
        str_val = str(val.td.contents)
        p = 0
        j = 0
        while p < len(str_val):
            if str_val[p] >= str(0) and str_val[p] <= str(9):
                j = j*10 + int(str_val[p])
            p += 1
        lis.insert(0, str(j))

# Feed every collected submission id through parse_data_final.py into <handle>_php/.
if len(lis) > 0:
    try:
        os.mkdir(sys.argv[1]+"_php")
    except:
        pass
    count = 1
    for data in lis:
        cmd = "python parse_data_final.py "+data+" > "+sys.argv[1]+"_php/"+sys.argv[1]+"_"+str(count)+".php"
        subprocess.call(cmd, shell=True)
        count += 1
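For reference, this is roughly how the same row lookup would look with bs4 (the beautifulsoup4 package) and a plain 'kol' class string. This is only a sketch assuming requests and beautifulsoup4 are installed, not something the script above already does:

import requests
from bs4 import BeautifulSoup  # bs4, not the old BeautifulSoup 3 module

# Sample handle from the example URL earlier; any handle can replace CIELAB.
link1 = ("https://www.codechef.com/status/CIELAB"
         "?sort_by=All&sorting_order=asc&language=29&status=15&handle=&Submit=GO")
useragent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/39.0.2171.95 Safari/537.36')
req = requests.get(link1, headers={'User-Agent': useragent})

soup = BeautifulSoup(req.content, 'html.parser')
# The class filter is the plain attribute value; r'\"kol\"' looks for the literal
# characters \"kol\" and will never match class="kol".
rows = soup.find_all('tr', {'class': 'kol'})
print len(rows)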
`l = soup.findAll('tr', {'class': r'\"kol\"'})` does not work. I still get an empty list.
@saqibns It works for me.. could you link your code? Also, which Python version do you have?
I'm using Python 2.7.