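Getting results from a search engine

The search engine in question is search.lycos.co.uk. I can query it from a script, but I can't extract each individual result from the page source. Any help would be much appreciated. Edit: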

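import urllib
import urllib2

# query, page and User_Agent are defined elsewhere in the script
host = 'http://search.lycos.co.uk/?query=%s&page2=%s' % (urllib.quote_plus(str(query)), page)
req = urllib2.Request(host)
req.add_header('User-Agent', User_Agent)
response = urllib2.urlopen(req)
source = response.read()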

I'm not sure where to go from here to get each result.

2012-01-19 PyFan

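Please elaborate: code samples and so on. – theglauber

Could you show the script? – aayoubi

Does the search engine offer an actual programmatic API, rather than having to parse pages written for human end users? – bot403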

I tried this:

import urllib2

query = 'testing!'
page = 1
host = 'http://search.lycos.co.uk/?query=%s&page2=%s' % (str(query), repr(page))
print urllib2.urlopen(host).read()

Run that and see whether it works for you. It works here.

I also built a urllib2.Request, and that works here too:

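import urllib
import urllib2

data = {'query': 'testing', 'page2': '1'}
# host is the URL built in the previous snippet
req = urllib2.Request(host, data=urllib.urlencode(data))
# replace '<yours>' with your own User-Agent string
req.add_header('User-Agent', '<yours>')
print urllib2.urlopen(req).read()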
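
One detail worth checking in the Request snippet: in urllib, supplying `data` makes the request a POST, whereas the first snippet sends a GET with the parameters in the URL. Below is a minimal sketch of the difference, written for Python 3 (where `urllib2` became `urllib.request`). The helper name `build_requests` and the default `'Mozilla/5.0'` User-Agent are placeholders for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = 'http://search.lycos.co.uk/'  # host taken from the question


def build_requests(query, page, user_agent='Mozilla/5.0'):
    """Return (get_request, post_request) for the same parameters.

    Hypothetical helper for illustration; it only builds the Request
    objects and does not open any connection.
    """
    params = urlencode({'query': query, 'page2': page})
    headers = {'User-Agent': user_agent}
    # Parameters in the URL and no body: urllib sends this as a GET.
    get_request = Request(BASE_URL + '?' + params, headers=headers)
    # Parameters in the body: urllib sends this as a POST.
    post_request = Request(BASE_URL, data=params.encode('ascii'), headers=headers)
    return get_request, post_request


get_req, post_req = build_requests('testing!', 1)
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

`urlencode` also percent-escapes characters such as `!` and spaces, so the query does not need to be escaped by hand. Whether the site accepts a POST at all is something to verify against the live server.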

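As a follow-up, there are good modules available if you want to scrape the data (lxml and BeautifulSoup are suggested in the comments below).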

2012-01-19 22:28:00 wleao

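Sorry, you may have misunderstood. I can do that part, but I need to get each result separately from the page source, and I'm not sure how to go about it. – PyFan

So this is a different kind of question. Take a look at lxml or BeautifulSoup. I had a look at the response, and there is a very simple way to extract the results. Maybe you should edit your question. Cheers! – wleao

Sorry, I reread the question and it wasn't clear at all. I'll look into them. They're new to me, so I'm having some difficulty. – PyFan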
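
For anyone else new to these libraries, here is a minimal sketch of the "get each result separately" step. It uses Python 3's standard-library `html.parser` rather than lxml or BeautifulSoup so that it runs without extra installs; those libraries make the same thing shorter. The markup in `SAMPLE_HTML` and the `result` class name are made up for illustration and are not what Lycos actually serves, so the matching rule has to be adapted after inspecting the real page source.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a results page (an assumption).
SAMPLE_HTML = """
<ul>
  <li class="result"><a href="http://example.com/a">First <b>hit</b></a></li>
  <li class="result"><a href="http://example.com/b">Second hit</a></li>
  <li class="ad"><a href="http://example.com/ad">Sponsored</a></li>
</ul>
"""


class ResultParser(HTMLParser):
    """Collect (url, title) for the first link inside each <li class="result">."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_result = False   # inside an <li class="result">
        self._href = None         # href of the link being read, if any
        self._text = []           # text fragments of that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            self._in_result = 'result' in (attrs.get('class') or '').split()
        elif tag == 'a' and self._in_result and self._href is None:
            self._href = attrs.get('href')

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.results.append((self._href, ''.join(self._text).strip()))
            self._href, self._text = None, []
            self._in_result = False   # only the first link per result
        elif tag == 'li':
            self._in_result = False


def extract_results(html):
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return parser.results


for url, title in extract_results(SAMPLE_HTML):
    print(url, title)
```

In a real script, `source` from the question would be decoded to text and passed to `extract_results` in place of `SAMPLE_HTML`.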

Lycos encrypts its search results. However, you can try Google instead.

import re
import urllib
import urllib2
from bs4 import BeautifulSoup

def scrapping_google(query):
    # Build the search URL; quote_plus escapes spaces and special characters.
    g_url = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % urllib.quote_plus(query)
    # Send a browser-like User-Agent with the request.
    request = urllib2.Request(g_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0'})
    read_url = urllib2.urlopen(request).read()
    g_soup = BeautifulSoup(read_url, 'html.parser')

    remove_tag = re.compile(r'<.*?>')

    # Maps each result URL to its snippet text.
    g_dict = {}

    # The element with id "resultStats" holds the estimated result count.
    scrap_count = g_soup.find('div', attrs={'id': 'resultStats'})
    count = remove_tag.sub('', str(scrap_count)).replace('.', '')
    only_count = count[0:-16]
    print 'Prediction result: ', only_count
    print '\n'

    # Each result is an <li class="g">: its first link is the result URL
    # and <span class="st"> holds the snippet.
    for li in g_soup.findAll('li', attrs={'class': 'g'}):
        links = li.find('a')
        if links is None or not links.get('href'):
            continue  # skip blocks that carry no link
        print links['href']
        scrap_content = li.find('span', attrs={'class': 'st'})
        # Strip tags and periods from the snippet, as in the count above.
        content = remove_tag.sub('', str(scrap_content)).replace('.', '')
        print content
        g_dict[links['href']] = content

    return g_dict

if __name__ == '__main__':
    fetch_links = scrapping_google('jokowi')
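
The URL in the function above always requests `start=0`. Assuming `start` is the zero-based offset of the first result and `num` is the page size (an assumption read off the URL, not confirmed against any documentation), later pages could be requested by stepping `start`. Here is a sketch in Python 3; `google_page_urls` is a hypothetical helper that only builds the URLs and fetches nothing.

```python
from urllib.parse import urlencode


def google_page_urls(query, pages=3, per_page=100):
    """Yield one search URL per page (hypothetical helper, builds URLs only)."""
    for page in range(pages):
        params = urlencode({
            'q': query,
            'num': per_page,
            'hl': 'en',
            'start': page * per_page,  # assumed: offset of the first result
        })
        yield 'http://www.google.com/search?' + params


for url in google_page_urls('jokowi', pages=2):
    print(url)
```

Each generated URL could then be fetched and parsed the same way as the single page in `scrapping_google`.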

2015-02-22 15:22:17
