2015-07-04

Pandas: Write all results from BeautifulSoup to CSV

I have the beginnings of a Python pandas script that searches Google for a value and scrapes any PDF links it can find on the first results page.

I have two questions, listed below.

import pandas as pd 
from bs4 import BeautifulSoup 
import urllib2 
import re 

df = pd.DataFrame(["Shakespeare", "Beowulf"], columns=["Search"])  

print "Searching for PDFs ..." 

hdr = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11", 
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3", 
    "Accept-Encoding": "none", 
    "Accept-Language": "en-US,en;q=0.8", 
    "Connection": "keep-alive"} 

def crawl(search): 
    google = "http://www.google.com/search?q=" 
    url = google + search + "+" + "PDF" 
    req = urllib2.Request(url, headers=hdr) 

    pdf_links = None 
    placeholder = None #just a column placeholder 

    try: 
     page = urllib2.urlopen(req).read() 
     soup = BeautifulSoup(page) 
     cite = soup.find_all("cite", attrs={"class":"_Rm"}) 
     for link in cite: 
      all_links = re.search(r".+", link.text).group().encode("utf-8") 
      if all_links.endswith(".pdf"): 
       pdf_links = re.search(r"(.+)pdf$", all_links).group() 
      print pdf_links 

    except urllib2.HTTPError, e: 
     print e.fp.read() 

    return pd.Series([pdf_links, placeholder]) 

df[["PDF links", "Placeholder"]] = df["Search"].apply(crawl) 

df.to_csv("output.csv", index=False, sep=",")  # to_csv takes sep, not delimiter

The result of print pdf_links will be:

davidlucking.com/documents/Shakespeare-Complete%20Works.pdf 
sparks.eserver.org/books/shakespeare-tempest.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
www.w3.org/People/maxf/.../hamlet.pdf 
calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf 
www.yorku.ca/inpar/Beowulf_Child.pdf 
www.yorku.ca/inpar/Beowulf_Child.pdf 
https://is.muni.cz/el/1441/.../2._Beowulf.pdf 
https://is.muni.cz/el/1441/.../2._Beowulf.pdf 
https://is.muni.cz/el/1441/.../2._Beowulf.pdf 
https://is.muni.cz/el/1441/.../2._Beowulf.pdf 
www.penguin.com/static/pdf/.../beowulf.pdf 
www.neshaminy.org/cms/lib6/.../380/text.pdf 
www.neshaminy.org/cms/lib6/.../380/text.pdf 
sparks.eserver.org/books/beowulf.pdf 

And the CSV output is as follows:

Search   PDF Links 
Shakespeare calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf 
Beowulf  sparks.eserver.org/books/beowulf.pdf 

Questions:

  • Is there a way to write all of the results to the CSV, instead of just the bottom row? And, if possible, to include a value in Search for each row, corresponding to "Shakespeare" or "Beowulf"?
  • How can the full PDF links be written out, rather than the long links being automatically abbreviated with "..."?
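To be clear, the shape I am after is one CSV row per (search term, link) pair. A minimal offline sketch of that shape, with made-up links standing in for the real scrape results:

```python
import pandas as pd

# Hypothetical scraped results, one (search term, PDF link) pair per hit;
# these rows are made up for illustration only.
rows = [
    ("Shakespeare", "http://example.com/sonnets.pdf"),
    ("Shakespeare", "http://example.com/tempest.pdf"),
    ("Beowulf", "http://example.com/beowulf.pdf"),
]
df = pd.DataFrame(rows, columns=["Search", "PDF links"])
df.to_csv("all_results.csv", index=False)
print(df)
```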

What search terms are you using? –


Hi @PadraicCunningham! I'm using "Shakespeare" and "Beowulf" as the search terms (from the DataFrame). – Winterflags


You were getting the wrong links: http://pastebin.com/Z38X8hWU. Unless you actually want a DataFrame, it could also all be done with the csv module. –

Answer


This uses soup.find_all("a", href=True) to get all of the proper PDF links, and saves them into a DataFrame and a CSV:

hdr = { 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11", 
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3", 
    "Accept-Encoding": "none", 
    "Accept-Language": "en-US,en;q=0.8", 
    "Connection": "keep-alive"} 


def crawl(columns=None, *search): 
    df = pd.DataFrame(columns= columns) 
    for term in search: 
     google = "http://www.google.com/search?q=" 
     url = google + term + "+" + "PDF" 
     req = urllib2.Request(url, headers=hdr) 
     try: 
      page = urllib2.urlopen(req).read() 
      soup = BeautifulSoup(page) 
      pdfs = [] 
      links = soup.find_all("a",href=True) 
      for link in links: 
       lk = link["href"] 
       if lk.endswith(".pdf"): 
        pdfs.append((term, lk)) 
      df2 = pd.DataFrame(pdfs, columns=columns) 
      df = df.append(df2, ignore_index=True) 
     except urllib2.HTTPError, e: 
      print e.fp.read() 
    return df 


df = crawl(["Search", "PDF link"],"Shakespeare","Beowulf") 
df.to_csv("out.csv",index=False) 
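The key difference from the question's approach: the cite elements hold Google's display text, which is where the "..." truncation comes from, while the href attribute of each a tag holds the full URL. A self-contained sketch of that extraction on a static HTML snippet (a stand-in for the live page, which would come from urllib2):

```python
from bs4 import BeautifulSoup

# Stand-in snippet mimicking a results page: the <cite> shows the
# truncated display text, while the <a> href carries the full URL.
html = """
<a href="http://www.w3.org/People/maxf/XSLideMaker/hamlet.pdf">Hamlet</a>
<cite class="_Rm">www.w3.org/People/maxf/.../hamlet.pdf</cite>
<a href="http://example.com/page.html">Not a PDF</a>
"""
soup = BeautifulSoup(html, "html.parser")
pdfs = [a["href"] for a in soup.find_all("a", href=True)
        if a["href"].endswith(".pdf")]
print(pdfs)  # → ['http://www.w3.org/People/maxf/XSLideMaker/hamlet.pdf']
```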

out.csv:

Search,PDF link 
Shakespeare,http://davidlucking.com/documents/Shakespeare-Complete%20Works.pdf 
Shakespeare,http://www.w3.org/People/maxf/XSLideMaker/hamlet.pdf 
Shakespeare,http://sparks.eserver.org/books/shakespeare-tempest.pdf 
Shakespeare,https://phillipkay.files.wordpress.com/2011/07/william-shakespeare-plays.pdf 
Shakespeare,http://www.artsvivants.ca/pdf/eth/activities/shakespeare_overview.pdf 
Shakespeare,http://triggs.djvu.org/djvu-editions.com/SHAKESPEARE/SONNETS/Download.pdf 
Beowulf,http://www.yorku.ca/inpar/Beowulf_Child.pdf 
Beowulf,https://is.muni.cz/el/1441/podzim2013/AJ2RC_STAL/2._Beowulf.pdf 
Beowulf,http://teacherweb.com/IL/Steinmetz/MottramM/Beowulf---Seamus-Heaney.pdf 
Beowulf,http://www.penguin.com/static/pdf/teachersguides/beowulf.pdf 
Beowulf,http://www.neshaminy.org/cms/lib6/PA01000466/Centricity/Domain/380/text.pdf 
Beowulf,http://www.sparknotes.com/free-pdfs/uscellular/download/beowulf.pdf
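As noted in the comments, if a DataFrame isn't actually needed, the stdlib csv module alone can produce the same file. A minimal sketch (Python 3 syntax, with placeholder rows rather than real scrape results):

```python
import csv

# Placeholder (search term, link) pairs standing in for scraped results.
rows = [("Shakespeare", "http://example.com/sonnets.pdf"),
        ("Beowulf", "http://example.com/beowulf.pdf")]

with open("out_csv_module.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Search", "PDF link"])  # header row
    writer.writerows(rows)                   # one row per pair

with open("out_csv_module.csv") as f:
    print(f.read())
```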