
Beautiful Soup - unable to get links from a paginated page

I am unable to scrape the links to the articles on the paginated pages. In addition, I sometimes get a blank screen as my output, and I cannot find the problem in my loop. The csv file also never gets created.

from pprint import pprint 
import requests 
from bs4 import BeautifulSoup 
import lxml 
import csv 
import urllib2 

def get_url_for_search_key(search_key):
    for i in range(1, 100):
        base_url = 'http://www.thedrum.com/'
        response = requests.get(base_url + 'search?page=%s&query=' + search_key + '&sorted=') % i
        soup = BeautifulSoup(response.content, "lxml")
        results = soup.findAll('a')
        return [url['href'] for url in soup.findAll('a')]
        pprint(get_url_for_search_key('artificial intelligence'))

with open('StoreUrl.csv', 'w+') as f: 
    f.seek(0) 
    f.write('\n'.join(get_url_for_search_key('artificial intelligence'))) 
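Two bugs stand out in the snippet above: the `% i` is applied to the `Response` object returned by `requests.get(...)` rather than to the URL string (which raises a `TypeError`), and the unconditional `return` exits on the first loop iteration, so at most one page is ever fetched. A minimal corrected sketch of the same function, keeping the question's URL pattern (whether thedrum.com actually accepts these query parameters is an assumption):

import requests
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    urls = []
    for i in range(1, 100):
        # Interpolate the page number into the URL string, not the Response.
        url = 'http://www.thedrum.com/search?page=%s&query=%s&sorted=' % (i, search_key)
        soup = BeautifulSoup(requests.get(url).content, "lxml")
        # Accumulate hrefs across pages instead of returning on the first pass.
        urls.extend(a['href'] for a in soup.findAll('a', href=True))
    return urls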

Answers

Are you sure you only need the first 100 pages? Perhaps there are more of them...

Below is my take on your task. It collects the links from every page by following exactly the pagination ("Next page") button link:

import requests 
from bs4 import BeautifulSoup 


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence' 
response = requests.get(base_url) 
soup = BeautifulSoup(response.content, "lxml") 

res = [] 

while True:
    results = soup.findAll('a')
    res.append([url['href'] for url in results])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
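One thing to watch for: if the site serves the "Next page" href as a relative path (e.g. `/search?page=2&...`), `requests.get` will reject it; whether thedrum.com does so is an assumption. A hedged variant of the same loop that resolves the href against the page just fetched, using `urllib.parse.urljoin` (Python 3):

from urllib.parse import urljoin

current_url = base_url
while True:
    res.append([a['href'] for a in soup.findAll('a', href=True)])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    # Resolve a possibly relative href against the current page's URL.
    current_url = urljoin(current_url, next_button['href'])
    response = requests.get(current_url)
    soup = BeautifulSoup(response.content, "lxml")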

EDIT: another approach, collecting only the article links:

import requests 
from bs4 import BeautifulSoup 


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence' 
response = requests.get(base_url) 
soup = BeautifulSoup(response.content, "lxml") 

res = [] 

while True:
    search_results = soup.find('div', class_='search-results')  # narrow the search window to the block holding the article links
    article_link_tags = search_results.findAll('a')  # the usual scheme from here on
    res.append([url['href'] for url in article_link_tags])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")

To print the links, use:

for i in res:
    for j in i:
        print(j)
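And since the original question also wanted the results in a CSV file, here is a minimal sketch that flattens `res` and writes one URL per row, reusing the `StoreUrl.csv` name from the question (Python 3):

import csv

with open('StoreUrl.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for page_links in res:
        for link in page_links:
            writer.writerow([link])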

I only took the first 100 pages for initial testing. The problem is that when I try to print the links based on your solution, I see a series of "None"s printed one below the other. – Rrj17


How are you printing them? Please provide the complete code –


Just used `pprint(res.append([url['href'] for url in article_link_tags]))` right after the snippet you provided. I'm not sure whether that's correct. Very confused. – Rrj17
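That is exactly where the "None"s come from: `list.append` mutates the list in place and returns `None`, so wrapping it in `pprint` prints `None` once per call. A self-contained sketch of the difference (the hrefs here are hypothetical):

from pprint import pprint

links = ['https://example.com/a', 'https://example.com/b']  # hypothetical hrefs
res = []

print(res.append(links))  # append returns None, so this prints: None
pprint(res)               # the links were still stored: [['https://...', ...]]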