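How do I get URLs from a Google query?

Hi everyone, I have been trying to get URLs from Google, but it returns 0 URLs! How do I get URLs from a Google query?

What is wrong with my code?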

import string, sys, time, urllib2, cookielib, re, random, threading, socket, os

def Search(go_inurl, maxc):
    # Pool of User-Agent strings; one is picked at random for each request.
    header = ['Mozilla/4.0 (compatible; MSIE 5.0; SunOS 5.10 sun4u; X11)',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.2pre) Gecko/20100207 Ubuntu/9.04 (jaunty) Namoroka/3.6.2pre',
              'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser;',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
              'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)',
              'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)',
              'Microsoft Internet Explorer/4.0b1 (Windows 95)',
              'Opera/8.00 (Windows NT 5.1; U; en)',
              'amaya/9.51 libwww/5.4.0',
              'Mozilla/4.0 (compatible; MSIE 5.0; AOL 4.0; Windows 95; c_athome)',
              'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
              'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; ZoomSpider.net bot; .NET CLR 1.1.4322)',
              'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QihooBot 1.0 [email protected])',
              'Mozilla/4.0 (compatible; MSIE 5.0; Windows ME) Opera 5.11 [en]']
    gnum = 100  # results requested per page
    uRLS = []
    counter = 0
    while counter < int(maxc):
        jar = cookielib.FileCookieJar("cookies")
        query = 'q=' + go_inurl
        results_web = 'http://www.google.com/cse?'+'cx=011507635586417398641%3Aighy9va8vxw&ie=UTF-8&'+'&'+query+'&num='+str(gnum)+'&hl=en&lr=&ie=UTF-8&start=' + repr(counter) + '&sa=N'
        request_web = urllib2.Request(results_web)
        agent = random.choice(header)
        request_web.add_header('User-Agent', agent)
        opener_web = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
        text = opener_web.open(request_web).read()
        # Capture the value of every href="..." attribute on the page.
        strreg = re.compile('(?<=href=")(.*?)(?=")')
        names = strreg.findall(text)
        counter += 100
        for name in names:
            if name not in uRLS:
                if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
                    pass
                elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
                    pass
                else:
                    uRLS.append(name)
    # Keep one URL per domain, and only URLs that carry a query parameter.
    tmpList = []; finalList = []
    for entry in uRLS:
        try:
            t2host = entry.split("/", 3)
            domain = t2host[2]
            if domain not in tmpList and "=" in entry:
                finalList.append(entry)
                tmpList.append(domain)
        except:
            pass
    print "[+] URLS (sorted) :", len(finalList)
    return finalList

I have also made a lot of edits, but still nothing happens! Please tell me what my mistake is. Thanks, guys :)


2011-12-19 jack-X

Please fix your indentation. It looks completely random. – 2011-12-19 09:12:10

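I see two problems here. First, you are using a custom Google search that (apparently) only returns results from google.com. Combine that with a regular expression that looks for "google" in the URL (re.search("google", name)) and skips the URL when it matches, and the list of URLs will always stay empty for this custom search.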
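A small illustration of that first point, using made-up data: the hrefs below are hypothetical stand-ins for what a google.com-only results page might contain, and keep_unblocked is a helper name invented for this sketch. It only shows how a "google" substring check can empty the list on its own.

```python
import re

# Hypothetical hrefs; a real results page would be fetched and parsed first.
hrefs = [
    'http://www.google.com/support/page?hl=en',
    'http://code.google.com/p/example/',
    'http://maps.google.com/maps?q=test',
]

def keep_unblocked(links, blocked_words):
    """Return the links that contain none of the blocked words."""
    return [link for link in links
            if not any(re.search(word, link) for word in blocked_words)]

# With "google" blocked, every link from a google.com-only search is dropped.
with_google_check = keep_unblocked(hrefs, ["google", "youtube"])

# Without it, the same links survive the filter.
without_google_check = keep_unblocked(hrefs, ["youtube"])
```

Under these assumptions, with_google_check ends up empty while without_google_check keeps all three links.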

Also, and more importantly, your logic is incorrect. With the formatting fixed, this is what you are currently doing:

if name not in uRLS:
    if re.search(r'\(', name) or re.search("<", name) or re.search("\A/", name) or re.search("\A(http://)\d", name):
        pass
    elif re.search("google", name) or re.search("youtube", name) or re.search(".gov", name) or re.search("%", name):
        pass
    else:
        uRLS.append(name)

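(Note that the elif and else may be indented one level too far; the problem persists either way.)

Because you check whether name is not in uRLS, name will never be added to that list, since the adding logic sits in your else path.

To fix it, remove the else, reduce the indentation of the append statement, and replace the pass statements with continue.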
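The restructuring described above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the function name collect_links and the two compiled patterns are assumptions made for the example. The patterns mirror the checks in the question, except that the "google" check is left out (per the first point of this answer) and the dot in ".gov" is escaped so that it only matches a literal dot.

```python
import re

# Assumed pattern names; the alternatives mirror the question's re.search calls.
SKIP_SHAPE = re.compile(r'\(|<|\A/|\Ahttp://\d')
SKIP_SITES = re.compile(r'youtube|\.gov|%')

def collect_links(names):
    """Return the links in `names` that pass both filters, without duplicates."""
    urls = []
    for name in names:
        if name in urls:
            continue  # already collected
        if SKIP_SHAPE.search(name):
            continue  # relative link, markup fragment, or numeric host
        if SKIP_SITES.search(name):
            continue  # excluded site or percent-encoded URL
        urls.append(name)  # no else branch: runs for everything not skipped
    return urls
```

Each rejected link leaves the loop iteration early through continue, so the append needs no else branch and sits at the same level as the checks.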


2011-12-19 09:25:25 jro

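So what do I have to do? O_O – 2011-12-19 09:51:59

1. Since your custom search only returns URLs that contain "google", the 're.search("google", name)' check always evaluates to true: remove this function call. --- 2. In the code snippet I posted, remove the line with 'else', and remove four spaces in front of the 'uRLS.append(name)' line. – jro 2011-12-19 10:55:16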

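jro is right. In addition, Google changes the format of its results periodically, not every month but more than once a year, so your regular expression may stop working and you will need to modify it.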
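One way to reduce that dependence on exact markup, offered only as a sketch, is to collect href values with the standard library's HTML parser instead of a regular expression. The class name HrefCollector and the function extract_hrefs are invented for this example. A parser tolerates changes in quoting, attribute order, and letter case, but it will not protect against deeper changes to the page structure.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the question

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return  # ignore <link>, <base>, and other tags with href
        for key, value in attrs:
            if key == 'href' and value:
                self.hrefs.append(value)

def extract_hrefs(html_text):
    """Return all anchor hrefs found in a string of HTML."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs
```

The returned list could then go through the same filtering loop as the names list in the question. On Python 3, the bytes returned by read() would need to be decoded to text before being passed in.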

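I faced a similar problem in the past and chose a simpler solution: these guys provide a google scraper to extract all URLs from search results, and it works great. You supply the keywords, and they scrape and parse Google's results and return the links, anchors, descriptions, and so on. It is a different approach to the solution, but it may help you as well.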


2012-04-04 23:07:50 Lix
