Getting duplicates in a CSV from a spider written in Python


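I've created a spider that collects data the way I expected. The only problem I'm facing now is that the results contain a lot of duplicates. I would like to shake the duplicates off and write the results to a CSV file.

Here is the code: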

import csv
import requests
from lxml import html

def Startpoint():
    global writer
    outfile = open('Data.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Price"])
    address = "https://www.sephora.ae/en/stores/"
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"level0")]')
    for title in titles:
        href = title.xpath('.//a[contains(@class,"level0")]/@href')[0]
        Layer2(href)

def Layer2(address):
    global writer
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"amshopby-cat")]')
    for title in titles:
        href = title.xpath('.//a/@href')[0]
        Endpoint(href)

def Endpoint(address):
    global writer
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//div[@class="product-info"]')
    for title in titles:
        Name = title.xpath('.//div[contains(@class,"h3")]/a[@title]/text()')[0]
        Price = title.xpath('.//span[@class="price"]/text()')[0]
        metco = (Name, Price)
        print(metco)
        writer.writerow(metco)

Startpoint()

Source

2017-04-19 SIM

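Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error, and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example. – DyZ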

You don't need the csv module to write a CSV file. Specifying the extension is enough. So turn your code into

import requests
from lxml import html

delimiter = ";"
file_name = 'data.csv'

def Startpoint():
    address = "https://www.sephora.ae/en/stores/"
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"level0")]')
    for title in titles:
        href = title.xpath('.//a[contains(@class,"level0")]/@href')[0]
        Layer2(href)

def Layer2(address):
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"amshopby-cat")]')
    for title in titles:
        href = title.xpath('.//a/@href')[0]
        Endpoint(href)

def Endpoint(address):
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//div[@class="product-info"]')
    for title in titles:
        Name = title.xpath('.//div[contains(@class,"h3")]/a[@title]/text()')[0]
        Price = title.xpath('.//span[@class="price"]/text()')[0]
        metco = (Name, Price)
        print(metco)
        # Append one delimited row per product. The encoding is set on open();
        # calling .encode('utf8') on the joined string would yield bytes, which
        # cannot be concatenated with '\n' on Python 3.
        with open(file_name, 'a', encoding='utf8') as outfile:
            outfile.write(delimiter.join(metco) + '\n')

# Create (or truncate) the file and write the header row before crawling.
with open(file_name, 'w', encoding='utf8') as outfile:
    outfile.write(delimiter.join(["Product Name", "Price"]) + '\n')
Startpoint()

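That should do the trick. Note the UTF-8 encoding, which is there to keep the write from failing with a UnicodeEncodeError. Also note the 'w' and 'a' arguments passed to the open function: the first means "write", the second means "append". That said, even if this code works from a heuristic standpoint, it is far from a "good" idea.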
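To make the 'w' versus 'a' distinction concrete, here is a minimal sketch that needs no network access. The helper name write_rows, the temporary path, and the sample product rows are all made up for illustration; only the open modes and the semicolon delimiter come from the answer.

```python
import os
import tempfile

def write_rows(path, rows, delimiter=";"):
    """Write the header with 'w' (truncates the file), then append each row with 'a'."""
    with open(path, "w", encoding="utf8") as outfile:
        outfile.write(delimiter.join(["Product Name", "Price"]) + "\n")
    for row in rows:
        # 'a' keeps what is already in the file and adds to the end.
        with open(path, "a", encoding="utf8") as outfile:
            outfile.write(delimiter.join(row) + "\n")

# Hypothetical sample data standing in for scraped results.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
write_rows(path, [("Lipstick", "AED 95"), ("Mascara", "AED 120")])

with open(path, encoding="utf8") as infile:
    lines = infile.read().splitlines()
print(lines)
```

Because the header is written with 'w', running the script again starts from an empty file instead of stacking a second copy of the results under the first.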

Source

2017-04-19 23:15:37 Kanak

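Thanks, Tnerual, for your answer. You are very close to what I expected. The encoding did not work in this case, so I dropped that part. All the results are now in the csv file, but instead of sitting in separate columns they end up in a single column, separated by a comma. Can you split them into two columns? Thanks. Here is the link to that output: "https://www.dropbox.com/s/okb7hxocmv5lbax/data.csv?dl=0" – SIM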

@SMth80 You can try changing the variable named "delimiter". Change the semicolon `";"` to a comma `","`. – Kanak
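One caveat with a comma delimiter is that a value which itself contains a comma, such as a price written as 1,000, will be split into two columns when rows are joined by hand. The standard csv module quotes such fields automatically. The sketch below is only an illustration: the function name rows_to_csv and the sample rows are invented, and it writes to an in-memory buffer rather than a real file.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize rows with csv.writer, which quotes fields that contain the delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # defaults: ',' delimiter, minimal quoting
    writer.writerow(["Name", "Price"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample rows; "1,000" contains the delimiter itself.
text = rows_to_csv([("Perfume", "1,000"), ("Lipstick", "95")])
parsed = list(csv.reader(io.StringIO(text)))
print(text)
print(parsed)
```

Reading the text back with csv.reader yields two fields per row, so the quoted price stays in one column.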

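It works, but it runs into a problem when it sees a price above a thousand, because, as you know, thousands are written like "1,000". So whenever it meets a ",", your code creates a separate column. Anyway, I've fixed that and rewritten the corrected question in my first post. Can you tell me how to remove the duplicates when writing to the csv? Thanks. – SIM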
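One common way to drop duplicates while writing is to remember every row already written in a set and skip any row seen before. The following is a minimal sketch, not the thread's accepted solution: the function name write_unique and the sample rows are hypothetical, and it assumes two rows count as duplicates when name and price match after stripping surrounding whitespace.

```python
import csv
import io

def write_unique(rows, outfile):
    """Write each distinct (name, price) row once, in first-seen order."""
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Price"])
    seen = set()
    written = 0
    for name, price in rows:
        # Assumption: rows are equal if they match after whitespace is stripped.
        key = (name.strip(), price.strip())
        if key in seen:
            continue
        seen.add(key)
        writer.writerow(key)
        written += 1
    return written

# Hypothetical scraped rows with repeats.
scraped = [("Lipstick", "95"), ("Perfume", "1,000"),
           ("Lipstick", "95"), (" Perfume ", "1,000")]
buffer = io.StringIO()
count = write_unique(scraped, buffer)
print(count)
print(buffer.getvalue())
```

In the spider, the same idea would mean keeping one module-level set and checking it in Endpoint before calling writer.writerow. With a real file, open it with newline='' as the question's code already does.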
