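I don't have much experience with web scraping. So far I am using Python with BeautifulSoup4
to scrape the Hacker News page. Are there any patterns for web scraping?
I just want to know whether there are patterns I should keep in mind before I start scraping. Right now the code looks really ugly, and I feel like a hack.
Code: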
import requests
from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    # Scraped rows, keyed by a running row counter
    page = {}
    td_count = 2
    data_count = 0

    def handle(self, *args, **options):
        # Scrape listing pages 1 to 3
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print(self.page[1])

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        # The third table on the page holds the story rows
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            # Every third row: attach the subtext cell to the previous entry
            if self.td_count % 3 == 0:
                try:
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass
            title = x.find_all('td', 'title')
            if title:
                try:
                    # The second title cell contains the story link
                    self.page[self.data_count]['url'] = title[1].a
                    print(title[1].a)
                except IndexError:
                    print('Done page %s' % self.page_no)
            self.td_count += 1
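One pattern that may help with code like the above is to keep fetching and parsing apart, so the parser is a plain function that takes HTML and returns data, and can be tried on a saved page without touching the network. The following is only a rough sketch under assumptions: the names `fetch_page`, `parse_stories` and `scrape` are made up for illustration, and the `td` classes `title` and `subtext` are taken from the code above rather than checked against the live page.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com/news'  # same URL as in the question


def fetch_page(page_no):
    """Download one listing page and return its HTML as text."""
    response = requests.get(BASE_URL, params={'p': page_no}, timeout=10)
    response.raise_for_status()
    return response.text


def parse_stories(html):
    """Turn listing HTML into a list of {'title', 'url', 'subtext'} dicts."""
    soup = BeautifulSoup(html, 'html.parser')
    stories = []
    # Walk the cells in document order instead of counting rows modulo 3.
    for cell in soup.find_all('td'):
        classes = cell.get('class') or []
        if 'title' in classes and cell.a is not None:
            # Assumption: a td.title cell that contains a link starts a story.
            stories.append({'title': cell.a.get_text(strip=True),
                            'url': cell.a.get('href'),
                            'subtext': None})
        elif 'subtext' in classes and stories:
            # Assumption: td.subtext belongs to the most recent story.
            stories[-1]['subtext'] = cell.get_text(' ', strip=True)
    return stories


def scrape(pages):
    """Fetch and parse several pages, returning one flat list of stories."""
    stories = []
    for page_no in pages:
        stories.extend(parse_stories(fetch_page(page_no)))
    return stories
```

With this split, `handle` would only need to call something like `scrape(range(1, 4))`, and the class-level counters go away. Note that any other link inside a `td.title` cell, such as a pagination link, would also be picked up and would need to be filtered out.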
Why not try [scrapy](http://scrapy.org/)? – liushuaikobe
If I'm not mistaken, it's not MIT-licensed? I'll give it a try –