
Data scraping from pogdesign.co.uk/cat/

I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.

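I want to get the channel and air time of each show, but they are not displayed by default. They only appear after the settings have been configured and saved manually.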

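As far as I can tell from the Network tab in Chrome DevTools, clicking "Save settings" sends a POST request with the relevant form parameters (e.g. 's_networks': 'on'), followed by a GET request that retrieves the HTML page with the channels and air times shown.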

I tried to reproduce this sequence (a POST request, then a GET request) with Python's requests and mechanicalsoup packages.

requests:

import requests

s = requests.Session()
# Submit the settings form, then fetch the page again with the same session
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')

mechanicalsoup:

import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')

However, the response I receive does not contain the channel and air-time data.

The only difference I noticed is that the POST request from the browser returns status code 302, while the one from my Python code returns 200.
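On the 302 versus 200 difference: requests follows redirects by default, so the status code on the returned response belongs to the final page, while any intermediate redirect is kept in `response.history`. A minimal sketch for inspecting that chain follows. `status_chain` is a hypothetical helper name, and the fake object only stands in for a real response so the snippet runs without network access.

```python
from types import SimpleNamespace


def status_chain(response):
    """Return the status code of every hop: redirects first, final response last."""
    return [hop.status_code for hop in response.history] + [response.status_code]


# Stand-in for a response whose POST was answered with a 302 that was then followed.
# A real requests.Response exposes the same two attributes.
fake = SimpleNamespace(
    status_code=200,
    history=[SimpleNamespace(status_code=302)],
)
print(status_chain(fake))
```

With a real session this would be called as `status_chain(s.post(url, data=...))`. Passing `allow_redirects=False` to the POST should instead return the redirect response itself, if the server sends one.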

Answer


This is because the user information is stored in cookies. You can try the following code:

import requests 

s = requests.Session() 
data = { 
    "style": 3, 
    "timezone": "GMT", 
    "s_numbers": "on", 
    "s_epnames": "on", 
    "s_airtimes": "on", 
    "s_popups": "on", 
    "s_wunwatched": "on", 
    "s_sortbyname": "on", 
    "s_weekstyle": "on", 
    "s_24hr": "on", 
    "settings": None  # note: requests omits keys whose value is None
} 
# Copy these values from your browser's developer tools
cookies = {
    "CAT_UID": "",
    "PHPSESSID": "",
    "_ga": "",
    "_gid": "",
    "_gat": ""
}
# Save the settings, then fetch the page again with the same cookies
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text
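
If typing each cookie by hand is tedious, the whole `Cookie` request header can be copied from the developer tools and split into a dict. This is only a sketch: `parse_cookie_header` is a hypothetical helper, and the values in the example string are placeholders, not real cookies from the site.

```python
def parse_cookie_header(header):
    """Turn a 'name=value; name2=value2' Cookie header string into a dict."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Placeholder values only; real ones come from your own browser session.
example_header = "CAT_UID=abc123; PHPSESSID=def456"
cookies = parse_cookie_header(example_header)
print(cookies)
```

The resulting dict can be passed as `cookies=cookies` in the `s.post(...)` and `s.get(...)` calls above.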