2014-04-01

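I don't know whether what I'm trying to do is possible, but here goes. I'm trying to navigate and scrape information from this (simplified) table, following each link in it to get more information.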

> <tr class="transaction odd" id="transaction_7"><td><a href="/show_customer/11111">Erin</a></td></tr>
> <tr class="transaction even" id="transaction_6"><td><a href="/show_customer/2222">Jack</a></td></tr>
> <tr class="transaction odd" id="transaction_5"><td><a href="/show_customer/3333">Carl</a></td></tr>
> <tr class="transaction even" id="transaction_4"><td><a href="/show_customer/4444">Kelly</a></td></tr>

This is the code I have been using to scrape the table output into a CSV. It works well.

columns = ["User Name", "Source", "Staff", "Location", "Attended On", "Used", "Date"]
table = []

for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        table.append(data)
    except:
        print("Bad string value")


import csv 

with open("myfile.csv", "wb") as outf: 

    outcsv = csv.writer(outf) 

    # header row 
    outcsv.writerow(columns) 

    # data 
    outcsv.writerows(table) 

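What I need to do is navigate to every link in the table, such as this one:

<a href="/show_customer/11111">Erin</a>

and grab the customer's email address, which appears in this HTML on the linked page: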

<div class="field">
    <div class="label">Email</div>
    <p>[email protected]</p>
</div>

and then add it to the corresponding row in my CSV.

Any help would be greatly appreciated!

Answers

You need to make an HTTP request for each href in the td. This is how you could modify your existing code to do that:

from urllib2 import urlopen

from bs4 import BeautifulSoup

for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    # find_all() returns a list-like ResultSet, so read the href
    # from each <a> tag individually to build the list of links
    links = [a.get('href') for a in row.find_all('a')]
    try:
        data = [td.get_text() for td in tds]
        for field, value in zip(columns, data):
            print("{}: {}".format(field, value))
        # For every href, make a request, get the page,
        # and create a BeautifulSoup object from it.
        # Each link is still a relative href at this point and must be
        # joined with the site's URL before urlopen() can fetch it.
        for link in links:
            link_soup = BeautifulSoup(urlopen(link))

            # Use the link_soup instance to get the email by
            # navigating the div and p, and add it to your data

        table.append(data)
    except:
        print("Bad string value")
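The email lookup that the loop above leaves as a comment could be sketched roughly as follows. This is only an illustration written for Python 3 with the `bs4` package: the helper name `extract_email` is hypothetical, the markup is the snippet from the question, and `erin@example.com` is a made-up placeholder address.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the address is a made-up placeholder
CUSTOMER_PAGE = """
<div class="field">
    <div class="label">Email</div>
    <p>erin@example.com</p>
</div>
"""


def extract_email(page_soup):
    """Return the text of the <p> that follows the "Email" label, or None."""
    for label in page_soup.find_all("div", class_="label"):
        if label.get_text(strip=True) == "Email":
            # The value sits in the next <p> sibling of the label div
            value = label.find_next_sibling("p")
            if value is not None:
                return value.get_text(strip=True)
    return None


link_soup = BeautifulSoup(CUSTOMER_PAGE, "html.parser")
email = extract_email(link_soup)
print(email)
```

Inside the loop, something like `data.append(extract_email(link_soup))` before `table.append(data)` would put the address in the matching row, provided an "Email" entry is also added to `columns` so the CSV header lines up.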

Note that your hrefs are relative to the site's URL. So after you extract an href, you have to prepend the site's URL to it to form a valid URL.
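One way to build the full URL is the standard library's `urljoin`, sketched here for Python 3 (in Python 2, which the `urllib2` import suggests, the same function lives in the `urlparse` module). `PAGE_URL` is a hypothetical stand-in for the address of the page that holds the table; the hrefs are the ones from the question.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the transactions table
PAGE_URL = "http://example.com/transactions"

# Relative hrefs as they appear in the table rows
hrefs = ["/show_customer/11111", "/show_customer/2222"]

# urljoin resolves each relative href against the page's URL
absolute_urls = [urljoin(PAGE_URL, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

Using `urljoin` rather than plain string concatenation avoids doubled or missing slashes and also copes with hrefs that are already absolute.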

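Thanks for your help! I'm getting the error `AttributeError: 'ResultSet' object has no attribute 'get'` on the links line. Maybe I just don't have the foundational knowledge for this. Any pointers? – user3485563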

@user3485563 Fixed. – shaktimaan

It still throws the same error. Thanks for the effort, though. I really appreciate it! – user3485563
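The AttributeError quoted in these comments is what BeautifulSoup raises when `.get()` is called on the list-like `ResultSet` that `find_all()` returns, rather than on a single tag. A minimal sketch of the difference, assuming Python 3 with `bs4` and using one row of the question's table as sample input:

```python
from bs4 import BeautifulSoup

# One row from the question's table, wrapped in <table> for completeness
ROW_HTML = (
    '<table><tr class="transaction odd" id="transaction_7">'
    '<td><a href="/show_customer/11111">Erin</a></td></tr></table>'
)

row = BeautifulSoup(ROW_HTML, "html.parser").find("tr")

# find_all() always returns a ResultSet, even when only one tag matches
anchors = row.find_all("a")

try:
    anchors.get("href")  # .get() belongs to a single Tag, not to the ResultSet
    raised = False
except AttributeError:
    raised = True

# Reading the attribute from each tag in turn works
links = [a.get("href") for a in anchors]
print(raised, links)
```

When each row is known to hold exactly one link, `row.find("a")` returns a single tag (or `None`), so `.get("href")` can be called on it directly.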