2016-09-27 61 views

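Scrapy Post Data

I am moving from Python requests to Scrapy, and I want to make the request that is sent when you click the button at the bottom of an Instagram hashtag page.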

The cURL command is this:

curl "https://www.instagram.com/query/" -H "cookie: mid=VwBJIwAEAAGiVNY3epWm9pRgD9Ge; fbm_124024574287414=base_domain=.instagram.com; ig_pr=1; ig_vw=956; s_network=; fbsr_124024574287414=5HQEzU7XMqOLO4KeQMmSvyBcKsH2svemV1-nWIE4_iM.eyJhbGdvcml0aG0iOiJITUFDLVNIQTI1NiIsImNvZGUiOiJBUUQ0TnNLMjVCZmdvUFN4TjdfODNQaW81Z3U4MTNaZmZWVlNCcEdJNUdRWlczdmdfNGVXNXJyck5Sc3NXRFlSWjZiZEpWMU95V3hNUUcwSE9qMHItYlRiYk40VXpNZG5aLUJ5Zzk0VWZNSW1RZTd4R1JzTS1yaXRabmc0Z3FYNkpwbnF4b0VXajRPNEVGSDVoTXBCUFNHUGNHN0RHQ01uSjFLeXh1dllOc2cyaFpnSDFheVI0RUhMbE1nZGM4emVrNm9DXzdLa2s1TUoyYzhyYmEwWXo1VkI1bVVmS3NvLS11dXVxdjJlRmxFUHpYczVNQ3E1bW5BRk5IeWxxMG9veENQcXcwWUVLSnpsNnZSUzFReGUzQWpsQzFPU0cySU1QM0wwMGhUcnRraFF4ZEFhZElVMUtNNUw5VTRab2dlbjltdUFadkJjV0U3UUMxeTdibDRyTzhwWCIsImlzc3VlZF9hdCI6MTQ3NDkzODQ3MywidXNlcl9pZCI6IjEzNzc3ODgzNjkifQ; csrftoken=th33gPnvrsNS74reomY69ETfojX2avQ7" -H "origin: https://www.instagram.com" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.8" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36" -H "x-requested-with: XMLHttpRequest" -H "x-csrftoken: th33gPnvrsNS74reomY69ETfojX2avQ7" -H "x-instagram-ajax: 1" -H "content-type: application/x-www-form-urlencoded" -H "accept: */*" -H "referer: https://www.instagram.com/explore/tags/love/" -H "authority: www.instagram.com" --data 
"q=ig_hashtag(love)+"%"7B+media.after(J0HV-nGYwAAAF0HV-nGXAAAAFjgA"%"2C+10)+"%"7B"%"0A++count"%"2C"%"0A++nodes+"%"7B"%"0A++++caption"%"2C"%"0A++++code"%"2C"%"0A++++comments+"%"7B"%"0A++++++count"%"0A++++"%"7D"%"2C"%"0A++++comments_disabled"%"2C"%"0A++++date"%"2C"%"0A++++dimensions+"%"7B"%"0A++++++height"%"2C"%"0A++++++width"%"0A++++"%"7D"%"2C"%"0A++++display_src"%"2C"%"0A++++id"%"2C"%"0A++++is_video"%"2C"%"0A++++likes+"%"7B"%"0A++++++count"%"0A++++"%"7D"%"2C"%"0A++++owner+"%"7B"%"0A++++++id"%"0A++++"%"7D"%"2C"%"0A++++thumbnail_src"%"2C"%"0A++++video_views"%"0A++"%"7D"%"2C"%"0A++page_info"%"0A"%"7D"%"0A+"%"7D&ref=tags"%"3A"%"3Ashow" --compressed 

For the form data, I have tried two things:

body = response.xpath("//body") 
html = str(body.extract()) 
end_cursor = re.search(r"\"end\_cursor\"\: \"(.+?)\"", html).group(1) 

data = "q=ig_hashtag({})+%7B+media.after({}+10)+%7B%0A++count%2C%0A++nodes+%7B%0A++++caption%2C%0A++++code%2C%0A++++comments+%7B%0A++++++count%0A++++%7D%2C%0A++++comments_disabled%2C%0A++++date%2C%0A++++dimensions+%7B%0A++++++height%2C%0A++++++width%0A++++%7D%2C%0A++++display_src%2C%0A++++id%2C%0A++++is_video%2C%0A++++likes+%7B%0A++++++count%0A++++%7D%2C%0A++++owner+%7B%0A++++++id%0A++++%7D%2C%0A++++thumbnail_src%2C%0A++++video_views%0A++%7D%2C%0A++page_info%0A%7D%0A+%7D&ref=tags%3A%3Ashow".format(tag, end_cursor) 
url = 'https://www.instagram.com/query/' 

yield Request(url, body=data, method="POST", callback=self.parseHashtag) 

data = {"q" :"ig_hashtag({})+%7B+media.after({}+10)+%7B%0A++count%2C%0A++nodes+%7B%0A++++caption%2C%0A++++code%2C%0A++++comments+%7B%0A++++++count%0A++++%7D%2C%0A++++comments_disabled%2C%0A++++date%2C%0A++++dimensions+%7B%0A++++++height%2C%0A++++++width%0A++++%7D%2C%0A++++display_src%2C%0A++++id%2C%0A++++is_video%2C%0A++++likes+%7B%0A++++++count%0A++++%7D%2C%0A++++owner+%7B%0A++++++id%0A++++%7D%2C%0A++++thumbnail_src%2C%0A++++video_views%0A++%7D%2C%0A++page_info%0A%7D%0A+%7D&ref=tags%3A%3Ashow".format(tag, end_cursor)} 
yield FormRequest(url, formdata=data, callback=self.parseHashtag) 

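I get a 403 error, so I am evidently sending the data incorrectly. Am I formatting the data incorrectly, or am I making the POST call incorrectly? Those are my two ideas, but I am quite unsure. Any help would be greatly appreciated, thanks.

The URL is this - https://www.instagram.com/explore/tags/love/

Here is my git repository: https://github.com/Fuledbyramen/instagram_crawler/blob/master/instagram/spiders/instagram_spider.py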

Answers


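You seem to be missing the correct headers, or any headers for that matter.

Apart from the cookies, which Scrapy manages and fills in on its own, you should provide every header that you see in the network inspector.

You can easily extract the headers from the cURL string that the network inspector gives you: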

foo = '''-H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.8" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36" -H "x-requested-with: XMLHttpRequest" -H "x-csrftoken: th33gPnvrsNS74reomY69ETfojX2avQ7" -H "x-instagram-ajax: 1" -H "content-type: application/x-www-form-urlencoded" -H "accept: */*" -H "referer: https://www.instagram.com/explore/tags/love/" -H "authority: www.instagram.com"''' 
headers = [s.strip(' "').split(': ', 1) for s in foo.split('-H')]  # split once, so a value containing ': ' stays intact
headers = [h for h in headers if any(h)] 
headers = {k: v for k,v in headers} 

and you will get:

{'accept': '*/*', 
'accept-encoding': 'gzip, deflate, br', 
'accept-language': 'en-US,en;q=0.8', 
'authority': 'www.instagram.com', 
'content-type': 'application/x-www-form-urlencoded', 
'referer': 'https://www.instagram.com/explore/tags/love/', 
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36', 
'x-csrftoken': 'th33gPnvrsNS74reomY69ETfojX2avQ7', 
'x-instagram-ajax': '1', 
'x-requested-with': 'XMLHttpRequest'} 
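As an alternative to splitting on '-H' by hand, the same extraction can be sketched with the standard library's shlex module, which takes care of the shell quoting. The function name headers_from_curl is made up for this illustration, and the sample string is a shortened copy of the headers above.

```python
import shlex


def headers_from_curl(curl_fragment):
    """Collect the value of every -H option into a dict of header name -> value."""
    tokens = shlex.split(curl_fragment)  # strips the surrounding double quotes
    headers = {}
    # Walk over consecutive token pairs and keep those where the first is "-H".
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-H":
            name, _, content = value.partition(": ")
            headers[name.lower()] = content
    return headers


sample = (
    '-H "x-requested-with: XMLHttpRequest" '
    '-H "accept: */*" '
    '-H "referer: https://www.instagram.com/explore/tags/love/"'
)
parsed = headers_from_curl(sample)
print(parsed)
```

Using partition instead of split means a header value that itself contains ': ' is kept whole.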

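Some of these are entirely unnecessary: referer is mostly used for analytics, and accept-language, accept and accept-encoding are most likely ignored. The user agent is also managed by Scrapy.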

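So what you are left with is x-csrftoken, which might do nothing, but tokens like this are usually hidden somewhere in the HTML source; x-instagram-ajax, which looks like a static header that marks an AJAX request; and x-requested-with, which indicates the request type. Servers commonly check that last header as a guard against forged cross-site requests, so you should include it to avoid being blocked.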
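A minimal sketch of the trimmed-down request, assuming the three x- headers plus content-type are enough (which headers Instagram actually requires is not confirmed here). The helper name build_post_kwargs is hypothetical; it only assembles keyword arguments, so it runs without Scrapy installed.

```python
QUERY_URL = "https://www.instagram.com/query/"


def build_post_kwargs(csrf_token, body):
    """Return keyword arguments that could be passed to scrapy.Request(**kwargs)."""
    headers = {
        "x-csrftoken": csrf_token,            # token taken from the page or its cookies
        "x-instagram-ajax": "1",              # static value seen in the cURL capture
        "x-requested-with": "XMLHttpRequest",
        # Assumption: needed so the server parses the raw body as form data.
        "content-type": "application/x-www-form-urlencoded",
    }
    return {"url": QUERY_URL, "method": "POST", "headers": headers, "body": body}


kwargs = build_post_kwargs("th33gPnvrsNS74reomY69ETfojX2avQ7", "ref=tags%3A%3Ashow")
print(kwargs["headers"])
```

Inside a spider callback this would be used as yield Request(callback=self.parseHashtag, **kwargs), with the full q=... string as the body.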

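Edit: I tried the site, and you can really just make a GET request with the body sent as URL parameters. Right-click the request in the network inspector and click "copy location with parameters", which automatically converts the dict-like body data into URL parameters.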

https://www.instagram.com/query/?q=ig_hashtag(scrapy)%20%7B%20media.after(J0HV-vvswAAAF0HV-Qp7AAAAFiYA%2C%2016)%20%7B%0A%20%20count%2C%0A%20%20nodes%20%7B%0A%20%20%20%20caption%2C%0A%20%20%20%20code%2C%0A%20%20%20%20comments%20%7B%0A%20%20%20%20%20%20count%0A%20%20%20%20%7D%2C%0A%20%20%20%20comments_disabled%2C%0A%20%20%20%20date%2C%0A%20%20%20%20dimensions%20%7B%0A%20%20%20%20%20%20height%2C%0A%20%20%20%20%20%20width%0A%20%20%20%20%7D%2C%0A%20%20%20%20display_src%2C%0A%20%20%20%20id%2C%0A%20%20%20%20is_video%2C%0A%20%20%20%20likes%20%7B%0A%20%20%20%20%20%20count%0A%20%20%20%20%7D%2C%0A%20%20%20%20owner%20%7B%0A%20%20%20%20%20%20id%0A%20%20%20%20%7D%2C%0A%20%20%20%20thumbnail_src%2C%0A%20%20%20%20video_views%0A%20%20%7D%2C%0A%20%20page_info%0A%7D%0A%20%7D&ref=tags%3A%3Ashow
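A URL of this shape can also be assembled in code rather than copied from the inspector. The sketch below uses urllib.parse.urlencode with quote_via=quote so that spaces become %20, as in the URL above. The field list inside nodes is shortened for illustration, and QUERY_TEMPLATE and build_query_url are made-up names.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Doubled braces are literal braces in str.format; field list shortened on purpose.
QUERY_TEMPLATE = (
    "ig_hashtag({tag}) {{ media.after({cursor}, {count}) {{\n"
    "  count,\n"
    "  nodes {{\n"
    "    caption,\n"
    "    code\n"
    "  }},\n"
    "  page_info\n"
    "}}\n"
    " }}"
)


def build_query_url(tag, cursor, count=16):
    """Build the GET URL for the next page of a hashtag feed."""
    query = QUERY_TEMPLATE.format(tag=tag, cursor=cursor, count=count)
    params = {"q": query, "ref": "tags::show"}
    # safe="()" keeps the parentheses literal, matching the copied URL.
    encoded = urlencode(params, safe="()", quote_via=quote)
    return "https://www.instagram.com/query/?" + encoded


url = build_query_url("scrapy", "J0HV-vvswAAAF0HV-Qp7AAAAFiYA")
print(url)
```

In a spider, the resulting URL can be passed to an ordinary Request, since GET is the default method.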


Oh, I actually did add headers, I was just wondering how to send the data. This is where I added some of them: csrf = re.search(r"csrftoken\=(.+?)\;", str(response.headers)).group(1) response.headers['x-csrftoken'] = csrf – Fuledbyramen


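FormRequest is pretty much just a Request with method='POST' that converts the values in formdata into the request body. So a plain Request(method='POST') with the correct headers and body is enough. You can also simplify your header search by simply using response.headers.get('csrftoken'), since the Scrapy Response already formats everything nicely for you. – Granitosaurus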
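To illustrate the formdata-to-body conversion, here is a rough sketch using urllib.parse.urlencode as a stand-in for the encoding step; Scrapy does its own encoding, which may differ in which characters it leaves literal. The point is that formdata should hold decoded values, with q and ref as separate keys: a string that is already percent-encoded gets encoded a second time. The query is shortened for illustration. Note also that the format strings in the question read media.after({}+10), whereas the captured cURL body has %2C between the cursor and the count.

```python
from urllib.parse import urlencode

# Decoded form fields (field list shortened for illustration).
decoded = {
    "q": (
        "ig_hashtag(love) { media.after(J0HV-nGYwAAAF0HV-nGXAAAAFjgA, 10) {\n"
        "  count,\n"
        "  page_info\n"
        "}\n"
        " }"
    ),
    "ref": "tags::show",
}

# Encoding once gives a body in the same style as the cURL --data string.
body = urlencode(decoded, safe="()")

# Encoding an already-encoded value escapes its % and + signs again.
double_encoded = urlencode({"q": "media.after(CURSOR%2C+10)"}, safe="()")

print(body)
print(double_encoded)
```

With FormRequest, a dict like decoded would go in as formdata; with a plain Request, a string like body would go in as the body, together with the content-type header.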


So was my first request formatted correctly? I will probably stick with Request rather than FormRequest, since cURL does not give a dictionary, just a string. – Fuledbyramen