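I have a script that scrapes web data from many different pages and writes it to a txt file. However, I don't need the first 1200 lines of HTML from each page, so I want to skip those lines and write the rest of the text to my txt file. How can I skip lines when writing from a web page to a file?
Is there a way to do this at write time, or should I skip them earlier, when I retrieve the URL? Thanks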
import requests
from requests import session

payload = {
    'action': 'login',
    'username': '',
    'password': ''
}

with session() as c:  # Create a cookie session to log in to the protected page
    page_offset = 0
    result_list = []
    c.post('login page url here', payload)
    while page_offset <= 1000:
        url = "actual url to scrape"
        request = c.get(url)
        if not request.ok:
            print("error")
            # Something went wrong
        for block in request.iter_content(1024):
            if not block:
                break
            result_list.append(block)
        page_offset += 25
        #print (page_offset)
        #print (result_list)
    end_data = ','.join([str(i) for i in result_list])
    with open("terapeak.txt", 'wb') as text_file:
        text_file.write(bytes(end_data.strip(), 'UTF-8'))
What does the HTML look like? Is it separated by newlines? – yayu 2014-09-10 22:54:33
The text file is too large for pastebin, so here is a Dropbox link: https://www.dropbox.com/s/9mmbbk53y8wilm2/example.txt?dl=0 – Goose 2014-09-10 23:22:16
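One possible way to drop the leading lines, as a minimal sketch rather than a definitive answer: assuming the HTML really is newline-separated (which the comment above asks about), the decoded page text can be split into lines and everything before line N discarded before it is written out. The helper name `skip_lines` is hypothetical, and the count of 1200 comes from the question.

```python
from itertools import islice


def skip_lines(text, n):
    """Return `text` with its first `n` lines removed.

    `skip_lines` is a hypothetical helper for illustration. It assumes the
    page body is a decoded string whose lines are separated by newlines.
    """
    # keepends=True preserves the original line endings in the output
    lines = text.splitlines(keepends=True)
    # islice skips the first n items; if there are fewer than n lines,
    # the result is simply an empty string
    return ''.join(islice(lines, n, None))


if __name__ == "__main__":
    sample = "header 1\nheader 2\nbody 1\nbody 2\n"
    print(skip_lines(sample, 2), end="")
```

In the script above this would probably mean working with the whole decoded body, `request.text`, instead of the 1024-byte blocks from `iter_content`, because block boundaries do not line up with line boundaries. For example, `result_list.append(skip_lines(request.text, 1200))` inside the loop would keep only the part of each page after the first 1200 lines.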