2015-10-29 92 views

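For educational purposes, I am trying to write a program that prompts the user for a 'url', a 'count', and a 'position'. The 'url' is scraped and its anchor tags are retrieved, which produces a list of tags. The 'position' is then used to pick a new link from that list, and that link becomes the new 'url' to scrape. The 'count' is the number of times this process is repeated. How do I iterate over the tags and redirect to retrieve more tags?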

Code: 
import urllib 
from bs4 import BeautifulSoup as bfs 

# Declare global variables 
href_list = [] 
no_iterations = 0 

# Prompt user for input 
url = raw_input('Enter url - ') 
count = raw_input('Enter count - ') 
position = raw_input('Enter position - ') 

# While loop with condition 
while no_iterations != int(count): 
    no_iterations += 1 

    # Scraping the url 
    html = urllib.urlopen(url).read() 
    soup = bfs(html) 

    # Retrieve all of the anchor tags 
    tags = soup('a') 
    for tag in tags: 
        href_list.append(tag.get('href', None)) 

    # Assigning new url 
    url = href_list[int(position)-1] 

    # Printing info for user 
    print 'Retrieving:', href_list[int(position)-1] 
print 'Last Url:', href_list[int(position)-1] 

This is what I get when I run the program:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4 
Enter position - 3 

Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 

Looking at the output, I can see that the URL is not being reset as it should be. Any suggestions are appreciated.


You aren't incrementing 'position', so it is always the same. So 'url = href_list[int(position)-1]' always assigns the same URL to url –


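You need to create a new list. It is looking at the same list with the same index, so it finds the same URL. Even though you append new items, they do not overwrite the original ones –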


Got it. I reset href_list = [] after assigning the new URL, and it works – haytham
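
To make the point from the comments concrete, here is a minimal sketch that needs no network access. The function names and the sample link lists are hypothetical and exist only for illustration. It compares indexing into one list that keeps growing with indexing into a list that is rebuilt for every page.

```python
def pick_with_shared_list(pages, position):
    """Append every page's links to one list, as the question's code does."""
    href_list = []
    picks = []
    for links in pages:
        href_list.extend(links)
        # The first page's links stay at the front, so this index never moves.
        picks.append(href_list[position - 1])
    return picks


def pick_with_fresh_list(pages, position):
    """Start from an empty list for every page."""
    picks = []
    for links in pages:
        href_list = list(links)
        picks.append(href_list[position - 1])
    return picks


# Hypothetical links found on two consecutive pages.
pages = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
print(pick_with_shared_list(pages, 3))  # ['a3', 'a3']
print(pick_with_fresh_list(pages, 3))   # ['a3', 'b3']
```

With the shared list, the link at the chosen position always comes from the first page, which matches the repeated "Retrieving:" lines in the question.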

Answer


I solved it by resetting the list in which I store the retrieved tags. Code:

import urllib 
from bs4 import BeautifulSoup as bfs 

# Declare global variables 
href_list = [] 
no_iterations = 0 

# Prompt user for input 
url = raw_input('Enter url - ') 
count = raw_input('Enter count - ') 
position = raw_input('Enter position - ') 

# While loop with condition 
while no_iterations != int(count): 
    no_iterations += 1 

    # Scraping the url 
    html = urllib.urlopen(url).read() 
    soup = bfs(html) 

    # Retrieve all of the anchor tags 
    tags = soup('a') 
    for tag in tags: 
        href_list.append(tag.get('href', None)) 

    # Assigning new url, then resetting the list for the next page 
    url = href_list[int(position)-1] 
    href_list = [] 

    # Printing info for user 
    print 'Retrieving:', url 
print 'Last Url:', url 

So the new output is now:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4 
Enter position - 3 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Mhairade.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Butchi.html 
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html 
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html 
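
For readers on Python 3, the same loop could be sketched roughly as follows. This is not the code from the answer: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and the names `AnchorCollector`, `fetch_html`, and `follow_links` are hypothetical. It assumes every link on the page is an absolute URL (as in the output above) and that pages decode as UTF-8. A new collector is created on each pass, so the link list starts empty for every page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Like tag.get('href', None): anchors without href yield None.
            self.hrefs.append(dict(attrs).get("href"))


def fetch_html(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, count, position, fetch=fetch_html):
    """Follow the link at `position` (1-based) `count` times.

    Returns the list of URLs that were picked, in order.
    """
    visited = []
    for _ in range(count):
        collector = AnchorCollector()  # fresh collector, fresh link list
        collector.feed(fetch(url))
        url = collector.hrefs[position - 1]
        visited.append(url)
    return visited
```

Calling `follow_links(url, 4, 3)` with the URL entered above would perform the same four hops as the session shown. The `fetch` parameter is there only so the loop can be exercised with canned HTML instead of real requests.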

Thanks for your support.