2012-09-06 19 views
# -*- coding: utf-8 -*-

import re
import csv
import urllib
import urllib2
import BeautifulSoup

Filter = [' ab1',' ab2',' dc4',....]
urllists = ['myurl1','myurl2','myurl3',...]
csvfile = file('csv_test.csv','wb')
writer = csv.writer(csvfile)
writer.writerow(['keyword','url'])
for eachUrl in urllists:
    for kword in Filter:
        # Build a "site:<url> <keyword>" query and URL-encode it exactly once.
        keyword = "site:" + eachUrl + kword
        safeKeyword = urllib.quote_plus(keyword)
        fullQuery = 'http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&ie=UTF-8&q=' + safeKeyword

        # Send a browser-like User-Agent with the search request.
        req = urllib2.Request(fullQuery, headers = {'User-Agent': 'Mozilla/15.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/12.04 Chrome/21.0.118083 Safari/535.11'})
        html = urllib2.urlopen(req).read()

        soup = BeautifulSoup.BeautifulSoup(html, fromEncoding = 'utf8')

        # Each result title is an <h3 class="r"> wrapping the result link.
        resultURLList = [t.a['href'] for t in soup.findAll('h3', {'class':'r'})]

        if resultURLList:
            for l in resultURLList:
                # Open every result page and look for the keyword as a whole word.
                needCheckHtml = urllib2.urlopen(l).read()
                if needCheckHtml:
                    x = re.compile(r"\b" + kword + r"\b")
                    p = x.search(needCheckHtml)
                    if p:
                        data = [kword, l]
                        writer.writerow(data)

        else:
            print '%s: No Results' % kword
csvfile.close()

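This is a simple script that takes the URLs shown in Google search results, opens each one, and checks its content against the keywords in the Filter list using re. The code above can raise errors such as HTTPError and URLError, but I don't know how to fix and improve it. Can someone help me, please?

Also, if Google starts refusing the requests, I would like to use os.system("rasdial name user code") to reconnect PPPoE and change the IP address. How should I change the code to do that? Thanks a lot!

Python: loop problem when checking the content of Google search result URLs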

Answers


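I'm not sure how much this helps, but there is a search API you can use without Google blocking your requests and without having to change your IP address, although it has some limitations of its own.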

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=AnT4i 

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","url":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"Identification of aminoglycoside-modifying enzymes by susceptibility \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of aminoglycoside-modifying enzymes by susceptibility ...","content":"In 381 Japanese MRSA isolates, the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, aac(6\u0026#39;)-aph(2\u0026quot;), and aph(3\u0026#39;)-III genes \u003cb\u003e...\u003c/b\u003e Isolates with only the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e gene had coagulase type II or III, but isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","url":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"[ANT(4\u0026#39;)I: a new aminoglycoside nucleotidyltransferase found in ...","content":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u0026quot;staphylococcus aureus\u0026quot; (author\u0026#39;s transl)]. [Article in French]. 
Le Goffic F, Baca B, Soussy CJ, \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://jcm.asm.org/content/27/11/2535","url":"http://jcm.asm.org/content/27/11/2535","visibleUrl":"jcm.asm.org","cacheUrl":"","title":"Use of plasmid analysis and determination of aminoglycoside \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Use of plasmid analysis and determination of aminoglycoside ...","content":"Aminoglycoside resistance pattern determinations revealed the presence of the \u003cb\u003eANT(4\u0026#39;)-I\u003c/b\u003e enzyme (aminoglycoside 4\u0026#39; adenyltransferase) in all group 1 isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://ukpmc.ac.uk/articles/PMC88306","url":"http://ukpmc.ac.uk/articles/PMC88306","visibleUrl":"ukpmc.ac.uk","cacheUrl":"","title":"Identification of Aminoglycoside-Modifying Enzymes by \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of Aminoglycoside-Modifying Enzymes by ...","content":"The technique used three sets of primers delineating specific DNA fragments of the aph(3\u0026#39;)-III, \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, and aac(6\u0026#39;)-aph(2\u0026quot;) genes, which influence the MICs of \u003cb\u003e...\u003c/b\u003e"}],"cursor":{"resultCount":"342","pages":[{"start":"0","label":1},{"start":"4","label":2},{"start":"8","label":3},{"start":"12","label":4},{"start":"16","label":5},{"start":"20","label":6},{"start":"24","label":7},{"start":"28","label":8}],"estimatedResultCount":"342","currentPageIndex":0,"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dAnT4i","searchResultTime":"0.25"}}, "responseDetails": null, "responseStatus": 200} 
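A response of this shape could be consumed roughly as follows. This is only a sketch: `build_api_url` and `extract_result_urls` are hypothetical helper names, the endpoint and the `v`/`q` parameters are taken from the URL above, and the field names (`responseStatus`, `responseData`, `results`, `unescapedUrl`) are taken from the sample response. It assumes the service still answers in this format.

```python
import json

try:                      # Python 3
    from urllib.parse import urlencode
except ImportError:       # Python 2, as used by the script in the question
    from urllib import urlencode

# Endpoint copied from the example URL above.
API_ENDPOINT = 'http://ajax.googleapis.com/ajax/services/search/web'


def build_api_url(query):
    """Return the request URL for a query (hypothetical helper)."""
    params = [('v', '1.0'), ('q', query)]
    return API_ENDPOINT + '?' + urlencode(params)


def extract_result_urls(raw_json):
    """Return the result URLs from a response body, or [] if the call failed."""
    payload = json.loads(raw_json)
    # A non-200 responseStatus or an empty responseData means no usable results.
    if payload.get('responseStatus') != 200 or not payload.get('responseData'):
        return []
    return [item['unescapedUrl'] for item in payload['responseData']['results']]
```

The list returned by `extract_result_urls` could take the place of `resultURLList` in the question's script, so the HTML scraping step with BeautifulSoup would no longer be needed.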

See http://googlesystem.blogspot.hu/2008/04/google-search-rest-api.html
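Whichever search endpoint is used, the HTTPError and URLError failures mentioned in the question can be contained by wrapping each page fetch so that one bad URL does not stop the whole loop. The sketch below is one possible way to do that, not a definitive fix: `fetch_or_none` and `page_mentions` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and stripping the keyword before matching is an assumption (the entries in `Filter` start with a space, which makes the original `\b` pattern hard to satisfy).

```python
import re

try:                      # Python 3
    from urllib.request import urlopen
    from urllib.error import URLError
except ImportError:       # Python 2, as used by the script in the question
    from urllib2 import urlopen, URLError


def fetch_or_none(url, opener=urlopen, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    try:
        raw = opener(url, timeout=timeout).read()
    except URLError as err:   # HTTPError is a subclass of URLError
        print('skipping %s: %s' % (url, err))
        return None
    # Decode leniently so that odd bytes do not raise a second kind of error.
    return raw.decode('utf-8', 'replace')


def page_mentions(body, kword):
    """Return True if kword occurs as a whole word in body."""
    pattern = re.compile(r"\b" + re.escape(kword.strip()) + r"\b")
    return pattern.search(body) is not None
```

In the question's inner loop, `needCheckHtml = fetch_or_none(l)` followed by `if needCheckHtml and page_mentions(needCheckHtml, kword):` would then replace the bare `urllib2.urlopen(l).read()` call and the inline regular expression.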
