Trouble following links with a web crawler

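I am trying to build a web crawler that parses all the HTML on a page, grabs a specified link (chosen via raw_input), follows that link, and then repeats the process a specified number of times (again chosen via raw_input). I can grab the first link and print it successfully. However, I am having trouble looping the whole process, and it usually grabs the wrong link. This is the first link: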

https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html

(Full disclosure: this question relates to an assignment for a Coursera course.)

Here is my code:

import urllib 
from BeautifulSoup import * 
url = raw_input('Enter - ') 
rpt=raw_input('Enter Position') 
rpt=int(rpt) 
cnt=raw_input('Enter Count') 
cnt=int(cnt) 
count=0 
counts=0 
tags=list() 
soup=None 
while x==0: 
    html = urllib.urlopen(url).read() 
    soup = BeautifulSoup(html) 
# Retrieve all of the anchor tags 
    tags=soup.findAll('a') 
    for tag in tags: 
     url= tag.get('href') 
     count=count + 1 
     if count== rpt: 
      break 
counts=counts + 1 
if counts==cnt:   
    x==1  
else: continue 
print url

2015-12-20 w.c

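I don't fully understand the inputs. The URL is clear enough, but why the position and the count? Also, why would you redo the whole process several times? Don't you just need to grab all the URLs on the page? Normally you load the page once and get all the tags. Could you elaborate? – DJanssens

Sorry if I was unclear. I want the web crawler to grab a link on the page. For example, if the user enters 3 for "position" and 4 for "count", it grabs the third link, passes that link to urllib, parses the resulting page, grabs the third link on that page, and loops 4 times in total, as specified by the "count" input. –

But you don't need to load the page 4 times, right? You can store the parsed links in a list and just use the one the user specified. – DJanssens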
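The suggestion in the comment above (parse the page once, keep the links in a list, then index into it) might look roughly like the sketch below. It is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The names `LinkCollector` and `link_at` and the sample HTML are made up for illustration and do not come from the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are kept as None so that list
            # positions still match the order of <a> tags on the page.
            self.links.append(dict(attrs).get("href"))


def link_at(html, position):
    """Return the href at a 1-based position, or None if there is none."""
    collector = LinkCollector()
    collector.feed(html)
    if 1 <= position <= len(collector.links):
        return collector.links[position - 1]
    return None


# Hypothetical sample page with three links.
sample = '<a href="/one">1</a> <a href="/two">2</a> <a href="/three">3</a>'
print(link_at(sample, 3))   # /three
print(link_at(sample, 10))  # None
```

Checking the position against the list length avoids the IndexError that plain indexing raises when the page has fewer links than expected.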

I believe this is what you are looking for:

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# Perform the loop "count" times.
for _ in xrange(count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all of the anchor tags on the current page.
    tags = soup.findAll('a')
    # If no link exists at that position, report it and stop.
    if position < 1 or position > len(tags):
        print "A link does not exist at that position."
        break
    # Otherwise overwrite url so the next iteration follows that link.
    url = tags[position - 1].get('href')
print url

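The code will now loop the number of times specified in the input. On each pass it takes the href at the given position and replaces url with it, so each iteration goes one level deeper into the link tree.

I suggest using full names for your variables, which are much easier to understand. You can also read and cast the inputs in a single line, which makes the start of the script easier to follow.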
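For readers on Python 3, the same follow-the-link loop could be sketched as below. This is an illustration under assumptions and not the code from the answer: it swaps BeautifulSoup for the standard library's html.parser, and the names `AnchorParser`, `fetch_html`, `follow_links` and the `example.test` pages are hypothetical. The `fetch` parameter exists only so the loop can be tried against in-memory pages without any network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorParser(HTMLParser):
    """Records the href (or None) of each <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def fetch_html(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(start_url, position, count, fetch=fetch_html):
    """Follow the link at 1-based `position`, `count` times; return the last URL."""
    url = start_url
    for _ in range(count):
        parser = AnchorParser()
        parser.feed(fetch(url))
        in_range = 1 <= position <= len(parser.hrefs)
        href = parser.hrefs[position - 1] if in_range else None
        if href is None:
            raise LookupError("no link at position %d on %s" % (position, url))
        # urljoin resolves relative hrefs; absolute ones pass through unchanged.
        url = urljoin(url, href)
    return url


# Hypothetical in-memory pages standing in for real downloads.
pages = {
    "http://example.test/a": '<a href="b">B</a><a href="c">C</a>',
    "http://example.test/b": '<a href="a">A</a><a href="c">C</a>',
    "http://example.test/c": '<a href="a">A</a><a href="b">B</a>',
}
print(follow_links("http://example.test/a", position=2, count=3,
                   fetch=pages.__getitem__))
# http://example.test/c
```

Calling `follow_links(url, position, count)` without the `fetch` argument would download real pages through `fetch_html`.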

2015-12-20 23:11:23 DJanssens

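I appreciate you taking the time to help me, but this is not what I'm looking for. This adds the links to a list and then prints the specified ones. What I want is to use one of the links on the page as the starting point for the next step. So the crawler would get one of the links via 'tag.get('href')', feed it back into the file opener 'urllib.urlopen(url).read', parse that page, get a link, feed that one back into the file opener, and repeat a specified number of times. Not sure whether that makes it any clearer; sorry if I'm not making sense. –

Aha, got it, although I'm still a bit confused about the parameters. So the position indicates which link on the page should be followed, and the count indicates how many times the process should be repeated, each time going one level deeper into the URL tree and taking the third branch? – DJanssens

Yes, that's a good way to explain it. However, I realized that in my original code the inner loop has no way to break after the first iteration, because after the first iteration count will always be greater than the number the user specified. But even with the inner loop fixed, the outer loop never breaks. –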
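The counter problem described in the comment above can be reproduced without any network access. The sketch below (Python 3) uses plain lists of made-up link names to stand in for pages. The function names are hypothetical, and the first function only mimics the shape of the question's inner loop.

```python
def pick_with_shared_counter(pages, position):
    """Mimics the question's inner loop: the counter is never reset."""
    count = 0
    picked = []
    for links in pages:
        url = None
        for href in links:
            url = href
            count += 1
            if count == position:
                break
        picked.append(url)
    return picked


def pick_by_index(pages, position):
    """Indexes each page's links directly, so no counter is needed."""
    return [links[position - 1] for links in pages]


# Two hypothetical pages with four links each.
pages = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
print(pick_with_shared_counter(pages, 3))  # ['a3', 'b4']
print(pick_by_index(pages, 3))             # ['a3', 'b3']
```

On the second page the shared counter is already past 3, so the loop never breaks and falls through to the last link. The outer loop in the question has a separate problem: `x==1` is a comparison, not an assignment, so it never changes `x`.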

Based on DJanssens' answer, I found the solution;

url = tags[position-1].get('href')

did the trick for me!

Thanks for your help!

2016-01-15 16:48:27

I am also taking this course, and with a friend's help I got this working:

import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7        # number of times to follow a link
position = 18  # 1-based position of the link to follow on each page

for _ in range(rpt):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position - 1].get('href')

print url

2016-11-11 05:57:01

Here are my 2 cents:

import urllib
#import ssl
from bs4 import BeautifulSoup
#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')

for x in range(1, count + 1):
    # Collect the href of every anchor tag on the current page.
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    # Follow the link at the requested (1-based) position.
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read(), 'html.parser')

2017-10-09 09:48:50

A more efficient answer is posted here: https://stackoverflow.com/questions/38267954/following-links-in-python-assignment-using-beautifulsoup/46653848#46653848 –
