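How to copy a string from a web page and write it to a file in Python

I really don't know Python and I have researched a lot, but this is the best I could come up with: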

import urllib2 
import re 

file = open('C:\Users\Sadiq\Desktop\IdList.txt', 'w') 

for a in range(1,11): 
    s = str(a) 
    url='http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s' + s 
    page = urllib2.urlopen(url).read() 
    for x in range(1,21): 
     id = re.search('php?id=(.+?)"',page) 
     file.write(id) 
file.close()

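My code tries to copy the ID numbers. On the web page they look like this:

href="/like_box.php?id=6679099553"

I only want to write the numbers to a txt file, each on a new line. There are 10 pages I want to scrape, and I only want the first 20 IDs from each page. But when I run my code, it shows a 403 error. How can I do this?

Here is the full error: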

C:\Users\Sadiq\Desktop>extractId.py
Traceback (most recent call last):
  File "C:\Users\Sadiq\Desktop\extractId.py", line 7, in <module>
    page = urllib2.urlopen(url).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden


2016-12-24 Sadiq Husain Khan

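Print the URL and you will see that it is incorrect. If you use '+', then you don't need '%s'. To join two strings you need either '"A" + "B"' or '"A%s" % "B"' – furas

btw: 'write()' doesn't add '"\n"', so you need 'write(id + "\n")' – furas

Thanks, but it still didn't help. I still get the same error.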
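
The two points in the comments above can be checked without touching the network. What follows is a minimal Python 3 sketch (the thread itself uses Python 2). The helper name `format_ids` is hypothetical, and the second ID in the demo is a made-up placeholder.

```python
# URL template from the question; "%s" marks where the page number goes.
TEMPLATE = 'http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s'

# What the question does: "+" appends, so the literal "%s" stays in the URL.
broken = TEMPLATE + str(3)

# %-formatting replaces "%s" with the page number instead.
fixed = TEMPLATE % 3


def format_ids(ids):
    """Return the IDs as text with one ID per line (hypothetical helper).

    write() does not add a newline itself, so each ID gets one here.
    """
    return ''.join(item + '\n' for item in ids)


print(broken)  # ends with "page%s3"
print(fixed)   # ends with "page3"
print(repr(format_ids(['6679099553', '42'])))  # '42' is a placeholder ID
```

Note that `re.search` returns a match object rather than a string, so the question's `file.write(id)` would also need something like `id.group(1)` before adding the newline.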

Try BeautifulSoup for HTML scraping:

from requests import request
from bs4 import BeautifulSoup as bs


# raw string so the backslashes in the Windows path are not read as escapes
with open(r'C:\Users\Sadiq\Desktop\IdList.txt', 'w') as out:
    for page in range(1, 11):
        url = 'http://fanpagelist.com/category/top_users/view/list/sort/fans/page%d' % page  # no need to convert 'page' to a string
        html = request('GET', url).text  # the requests module is easier to use
        soup = bs(html, 'html.parser')
        # take the first 20 links ('a') that have the class "like_box"
        for a in soup.find_all('a', {'class': 'like_box'})[:20]:
            out.write(a['href'].split('=')[1] + '\n')
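
The `split('=')[1]` step above works for an href with a single `id` parameter, as in the question. If the links ever carried more than one query parameter, parsing the query string would be more robust. Here is a small Python 3 sketch using only the standard library. The function name `extract_id` is an assumption for illustration and not part of the answer.

```python
from urllib.parse import urlparse, parse_qs


def extract_id(href):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(href).query).get('id')
    return values[0] if values else None


# The href format shown in the question.
print(extract_id('/like_box.php?id=6679099553'))
```

In the loop above this would replace `a['href'].split('=')[1]` with `extract_id(a['href'])`, with a check for `None` before writing.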


2016-12-24 14:30:41

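I get this error: C:\Users\Sadiq\Desktop>extractId.py Traceback (most recent call last): File "C:\Users\Sadiq\Desktop\Batman Quiz\extractId.py", line 9, in <module> soup = bs(html, 'lxml') File "build\bdist.win-amd64\egg\bs4\__init__.py", line 165, in __init__ bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Try removing 'lxml': 'soup = bs(html)'

It didn't work. It told me (it printed a message as if it were a person, weird!) that I should write html.parser next to html. But this code writes all the IDs on each page. I only want the first 20. Could you change your answer to do that? Also, please correct "soap" to "soup". Thanks in advance!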

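Don't scrape with plain regular expressions; use an HTML parser such as Beautiful Soup.

I also think your error comes from how you build your URL. Use the '%' operator to substitute the converted variable, not '+', which only appends.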

from bs4 import BeautifulSoup
import urllib2

for a in range(1, 11):
    s = str(a)
    url = 'http://fanpagelist.com/category/top_users/view/list/sort/fans/page%s' % s
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
    # find all links where the href contains 'php?id='
    # Note: you can also use css selectors or beautifulsoup's regex to do this
    valid_links = []
    for link in soup.find_all('a', href=True):
        if 'php?id=' in link['href']:  # test the href string, not the Tag object
            valid_links.append(link['href'])
    print valid_links
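
To try the same filter (keep only links whose href contains 'php?id=') without installing anything or contacting the site, here is a Python 3 sketch that swaps Beautiful Soup for the standard library's `html.parser`. The class name `IdLinkCollector` and the sample markup are made up for illustration. The limit of 20 comes from the question.

```python
from html.parser import HTMLParser


class IdLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href contains 'php?id='."""

    def __init__(self, limit=20):
        super().__init__()
        self.limit = limit  # stop collecting after this many links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or len(self.links) >= self.limit:
            return
        href = dict(attrs).get('href')
        if href and 'php?id=' in href:
            self.links.append(href)


# Made-up sample markup, modelled on the href shown in the question.
sample = ('<a href="/like_box.php?id=6679099553">Like</a>'
          '<a href="/about.php">About</a>'
          '<a>no href</a>')

collector = IdLinkCollector(limit=20)
collector.feed(sample)
print(collector.links)
```

A real parser library such as Beautiful Soup copes better with malformed markup. This version is only meant to show the filtering and the per-page limit.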


2016-12-24 14:28:23 Tobey

I still get the same error.
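
The replies above report that the 403 persists even with a corrected URL. One common cause of a 403 for scripted requests is a server that rejects the default Python user agent. This is only an assumption here, since nothing in the thread confirms it for this site. Under that assumption, here is a Python 3 sketch that sends a browser-like `User-Agent` header and reports HTTP errors instead of crashing. The names `build_request` and `fetch`, the page URL, and the header value are illustrative choices.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url):
    """Fetch a page as text; return None and report the status on an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:
        print('Request for %s failed with HTTP %d' % (url, err.code))
        return None


# Building the request does not contact the server; only fetch() would.
req = build_request('http://fanpagelist.com/category/top_users/view/list/sort/fans/page1')
print(req.get_header('User-agent'))
```

If the server still answers 403 with such a header, the block may be deliberate, and the header is not the cause.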
