
I am working on a larger piece of code that displays the links returned by a Google Newspaper search and then analyzes those links for particular keywords, context, and data. I have that part working; the problem comes when I try to iterate through the pages of results. I don't know how to do this without an API, and I don't know how to use one. I just need to be able to iterate through multiple pages of search results so that I can apply the analysis to them. There seems like there should be a simple way to step through the result pages, but I am not seeing it.

Are there any suggestions on how to approach this? I am somewhat new to Python and have been teaching myself all of these scraping techniques, so I am not sure whether I am simply missing something obvious here. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I have seen examples of this for regular Google searches, but not for Google Newspaper searches.

Below is the body of the code. Any pointers on where to look would be helpful. Thanks in advance!

import csv
import requests
from lxml import html


def get_page_tree(url):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)


def find_other_news_sources(initial_url):
    forwarding_identifier = '/url?q='
    google_news_search_url = "https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro"
    google_news_search_tree = get_page_tree(url=google_news_search_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0]
                                for a_link in google_news_search_tree.xpath('//a//@href')
                                if forwarding_identifier in a_link]
    return other_news_sources_links


links = find_other_news_sources("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")

with open('textanalysistest.csv', 'wt') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for row in links:
        print(row)
        wr.writerow([row])  # write each extracted link out as a csv row

Answer


I was looking into building a parser for a site with a structure similar to Google's (i.e. a string of consecutive result pages, each with a table of content of interest).

A combination of the Selenium package (for page-element-based site navigation) and BeautifulSoup (for html parsing) seems to be the weapon of choice for harvesting written content. You may find them useful as well, although I have no idea what defenses Google has in place to deter scraping.

A possible implementation for Mozilla Firefox using Selenium, BeautifulSoup and geckodriver:

from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver


def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter


def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for the maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    print("Done scraping")
    return


def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls the function to extract tags from the html of the whole page"""
    html = driver.page_source  # Get html from page
    filter_html_table(html)  # Call function to extract desired html tags
    return


def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, and calls the function to write to file"""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Function call to store extracted tags in a local file.
    return


def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist or appends to the existing file, and writes the extracted tags to it."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
    else:
        f = codecs.open(fpath, 'w')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
    f.write(output)
    f.close()
    return

After that, it is just a matter of calling it:

link = <link to site to scrape> 
driver, n_iter = first_page(link) 
nth_page(driver, n_iter, 1000) # the 1000 lets us scrape 1000 of the result pages 

Note that this script assumes that the result pages you are trying to scrape are numbered sequentially, and that those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the Selenium documentation.
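If the pages you are targeting expose a "Next" control rather than numbered links, the lookup in nth_page could be swapped for something along these lines (just a sketch; the "Next" label and the 'pnnext' id are assumptions about the target page's html, not something I have verified):

from selenium.common.exceptions import NoSuchElementException

def next_page(driver):
    """Advance by clicking a 'Next' control instead of a numbered page link (selectors are guesses)."""
    try:
        next_link = driver.find_element_by_partial_link_text("Next")  # assumed link label
    except NoSuchElementException:
        next_link = driver.find_element_by_xpath("//a[@id='pnnext']")  # assumed id of the next-page arrow
    next_link.click()
    scrape_page(driver)  # reuse the scraping function defined above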

Also note that you will need to download the packages this code depends on, as well as the driver Selenium needs in order to talk to your browser (in this case geckodriver): download geckodriver, put it in a folder, and then point 'executable_path' at the executable.
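As an aside, newer Selenium releases (4.x) deprecate the executable_path argument in favor of a Service object; a minimal sketch of the equivalent setup, assuming Selenium 4 and the same placeholder geckodriver location as above:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4-style driver construction; the geckodriver path is a placeholder to adapt
service = Service("C://example/geckodriver.exe")
driver = webdriver.Firefox(service=service)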

If you do end up using these packages, it can also help to spread out your server requests using the time package (native to Python), to avoid exceeding the maximum number of requests allowed by the server you are scraping. I didn't end up needing it for my own project, but see the second answer to the original question for an implementation example that uses the time module in its fourth code block.
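For example, the paging loop above could pause for a few seconds between clicks; the delay range here is an arbitrary choice, not something the site mandates:

from random import uniform
from time import sleep

def nth_page(driver, counter, max_iter):
    """Same paging loop as above, with a randomized pause between requests."""
    while counter <= max_iter:
        sleep(uniform(2, 5))  # wait 2-5 seconds before each request; tune to the site's tolerance
        pageLink = driver.find_element_by_link_text(str(counter))
        pageLink.click()
        scrape_page(driver)
        counter += 1
    print("Done scraping")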

Yeeeeaaaahhh... if someone with higher rep could edit this and add links to the BeautifulSoup, Selenium and time documentation, that would be great, thaaaanks.