I've recently been using Scrapy to scrape ZoomInfo, and I tested it with the following URL:

http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile

But in the terminal the URL somehow changes to this:

[scrapy] DEBUG: Crawled (200) <GET http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile>

To deal with the escaped fragment I already added AJAXCRAWL_ENABLED = True to settings.py, but the URL still contains _escaped_fragment_. I suspect I am not reaching the page I actually want.
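For context, this rewriting is what Scrapy's AjaxCrawlMiddleware (controlled by AJAXCRAWL_ENABLED) does, following Google's old AJAX-crawling convention: everything after `#!` is percent-encoded and moved into an `_escaped_fragment_` query parameter so the server, rather than client-side JavaScript, can render the page. A minimal sketch of that transformation (plain Python 3, using only the URL from the question, not Scrapy's actual internals):

```python
from urllib.parse import quote

url = ('http://subscriber.zoominfo.com/zoominfo/'
       '#!search/profile/person?personId=521850874&targetid=profile')

# Split off the hash-bang fragment and percent-encode it. safe='' means
# '/', '?', '=' and '&' are all escaped too, because inside a query
# value they are data, not URL structure.
base, _, fragment = url.partition('#!')
escaped = base + '?_escaped_fragment_=' + quote(fragment, safe='')
print(escaped)
# http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile
```

This produces exactly the URL shown in the DEBUG line above, which suggests the middleware is doing its job rather than mangling the request.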
The spider.py code is as follows:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from tutorial.items import TutorialItem
from scrapy.spiders.init import InitSpider


class LoginSpider(InitSpider):
    name = 'zoominfo'
    login_page = 'https://www.zoominfo.com/login'
    start_urls = [
        'http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile',
    ]
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0",
    }

    def init_request(self):
        # Log in before crawling start_urls
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        print "Preparing Login"
        return FormRequest.from_response(
            response,
            headers=self.headers,
            formdata={
                'task': 'save',
                'redirect': 'http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile',
                'username': username,  # defined elsewhere
                'password': password,  # defined elsewhere
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login unsuccessful")
        else:
            self.log("Login successful")
            self.initialized()
            return Request(url='http://subscriber.zoominfo.com/zoominfo/', callback=self.parse)

    def parse(self, response):
        base_url = 'http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile'
        sel = Selector(response)
        item = TutorialItem()
        divs = sel.xpath("//div[3]//p").extract()
        item['title'] = sel.xpath("//div[3]")
        print divs
        request = Request(base_url, callback=self.parse)
        yield request
Thanks to anyone who can give me a hint.
So you mean I have already reached the correct page I wanted? But then why does the URL in the terminal differ from the one I requested, with '/' becoming '%2F' and '?' becoming '%3F'? –
@PeterTsung Because all of the sensitive characters after the '#!' get escaped, since they are not part of the URL itself; in other words, the meaning of those bits is not for your browser but for the website's server. – Granitosaurus
So you mean I have in fact reached the correct page I wanted, even though the URL in the terminal differs from the one posted in the code. –