BeautifulSoup Absolute URLs Print to CSV

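I have gone through a lot of threads here to see whether I could find a way to fix this code, but I can't seem to get it to work. I am trying to scrape the links from a website and then write them to a CSV file. The code is below.

I found a way to get about 95% of the way there, but I am missing something when it comes to the href: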

from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import csv

j = urllib.request.urlopen("http://cnn.com")
soup = BeautifulSoup(j, "lxml")
data = soup.find_all('a', href=True)

for url in soup.find_all('a', href=True):
    #print(url.get('href'))

    with open('marcel.csv', 'w', newline='') as csvfile:
        write = csv.writer(csvfile)
        write.writerows(data)

2017-02-24 Jarman

This is probably what you are trying to do.

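from bs4 import BeautifulSoup
import requests  # better than urllib
import csv

j = requests.get("http://cnn.com").content
soup = BeautifulSoup(j, "lxml")

# Collect the href value of every <a> tag that has one
data = []
for url in soup.find_all('a', href=True):
    print(url['href'])
    data.append(url['href'])

print(data)

# newline='' avoids blank lines between rows on Windows
with open("marcel.csv", 'w', newline='') as csvfile:
    write = csv.writer(csvfile, delimiter=' ')
    # writerows expects one sequence per row, so wrap each URL in a list;
    # passing the bare strings would split every URL into single characters
    write.writerows([href] for href in data)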

2017-02-26 23:09:50

That solved it! Thanks :) Just so I understand, what does adding data = [] mean? – Jarman

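It simply means "create an empty list named data". That way we can fill it inside the loop with the .append method (which would not work if the list did not already exist). –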
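A minimal sketch of why the empty list has to be created first. The name `missing` is hypothetical and deliberately never defined, and the two hrefs are made-up sample values rather than real scraped output.

```python
# Calling .append on a name that was never assigned raises NameError
try:
    missing.append("/world")  # hypothetical name, never defined
except NameError as exc:
    error = str(exc)

# Creating the empty list first gives the loop something to append to
data = []
for href in ["/world", "/politics"]:  # sample values for illustration
    data.append(href)

print(error)
print(data)
```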

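Is there a way to get only unique values in the output? What I am hoping for is a list of absolute links, e.g. http://cnn.com/ (the site being scraped here), but without any duplicate values. – Jarman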
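One possible way to approach this follow-up, offered as a sketch rather than a confirmed solution from the thread: resolve each href against the page URL with `urllib.parse.urljoin`, then use a set to skip anything already seen. The function name `absolute_unique` and the sample hrefs are assumptions made for illustration. Dropping non-HTTP links such as `mailto:` is a design choice that may not suit every case.

```python
from urllib.parse import urljoin


def absolute_unique(hrefs, base_url):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute = urljoin(base_url, href)
        # Skip mailto:, javascript: and other non-HTTP links (an assumption)
        if not absolute.startswith(("http://", "https://")):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Sample hrefs standing in for the scraped `data` list
sample = ["/world", "http://cnn.com/world", "//edition.cnn.com/x",
          "mailto:a@b.c", "/world"]
links = absolute_unique(sample, "http://cnn.com")
print(links)
```

With the answer's code, the scraped `data` list would be passed in place of `sample`, and each resulting link written as a one-element row, for example `write.writerows([link] for link in links)`.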

I got this working with openpyxl.

from openpyxl import Workbook, load_workbook

I think it is quite easy. This is part of my project; you can give it a try.

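def createExcel(self):
    # write_only=True replaces the older optimized_write flag
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title='書籍列表')  # sheet title: "Book list"
    # Header: no., barcode, title, author, borrow date, return date, location
    row0 = ['編號','條碼號','題名','責任者','借閱日期','歸還日期','館藏地']
    ws.append(row0)
    save_path = 'book_hist.xlsx'
    wb.save(save_path)

def saveToExcel(self, data_list):
    wb = load_workbook(filename='book_hist.xlsx')
    ws = wb['書籍列表']  # get_sheet_by_name is deprecated in openpyxl
    # Append each record as one row below the header
    for row in data_list:
        ws.append(row)
    save_path = 'book_hist.xlsx'
    wb.save(save_path)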

2017-02-24 04:21:57 Zeroxus

Sorry, maybe I misunderstood what you meant – Zeroxus

OK, so I figured out how to get 95% of the way there. Here is what I changed: – Jarman
