2017-04-19 52 views
-2

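Getting duplicate content in CSV from a spider written in Python

I have written a spider that collects data the way I expect. The only problem I am facing now is that the results contain a lot of duplicates. I would like to get rid of the duplicates when the results are written to the CSV file.

Here is the code: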

import csv
import requests
from lxml import html

def Startpoint():
    global writer
    outfile=open('Data.csv','w',newline='')
    writer=csv.writer(outfile)
    writer.writerow(["Name","Price"])
    address = "https://www.sephora.ae/en/stores/"
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles=tree.xpath('//li[contains(@class,"level0")]')
    for title in titles:
        href = title.xpath('.//a[contains(@class,"level0")]/@href')[0]
        Layer2(href)

def Layer2(address):
    global writer
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles=tree.xpath('//li[contains(@class,"amshopby-cat")]')
    for title in titles:
        href = title.xpath('.//a/@href')[0]
        Endpoint(href)

def Endpoint(address):
    global writer
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles=tree.xpath('//div[@class="product-info"]')
    for title in titles:
        Name = title.xpath('.//div[contains(@class,"h3")]/a[@title]/text()')[0]
        Price = title.xpath('.//span[@class="price"]/text()')[0]
        metco=(Name,Price)
        print(metco)
        writer.writerow(metco)

Startpoint()
+0

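Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error, and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example. – DyZ

Answer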

1

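You don't need the csv module to write a CSV file. A CSV file is plain text, so writing delimited lines to a file with a .csv extension is enough. So, change your code to: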

import requests
from lxml import html

delimiter = ";"
file_name = 'data.csv'

def Startpoint():
    # Top level: collect the main category links.
    address = "https://www.sephora.ae/en/stores/"
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"level0")]')
    for title in titles:
        href = title.xpath('.//a[contains(@class,"level0")]/@href')[0]
        Layer2(href)

def Layer2(address):
    # Second level: collect the sub-category links.
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//li[contains(@class,"amshopby-cat")]')
    for title in titles:
        href = title.xpath('.//a/@href')[0]
        Endpoint(href)

def Endpoint(address):
    # Product listing: extract name and price and append them to the file.
    page = requests.get(address)
    tree = html.fromstring(page.text)
    titles = tree.xpath('//div[@class="product-info"]')
    for title in titles:
        Name = title.xpath('.//div[contains(@class,"h3")]/a[@title]/text()')[0]
        Price = title.xpath('.//span[@class="price"]/text()')[0]
        metco = (Name, Price)
        print(metco)
        # Python 3: let open() do the UTF-8 encoding; str.encode() would
        # return bytes, which cannot be concatenated with '\n'.
        with open(file_name, 'a', encoding='utf-8') as outfile:
            outfile.write(delimiter.join(metco) + '\n')

# Create (or truncate) the file and write the header row first.
with open(file_name, 'w', encoding='utf-8') as outfile:
    outfile.write(delimiter.join(["Product Name", "Price"]) + '\n')
Startpoint()

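That should do the trick. Note the UTF-8 encoding step, which keeps the write from failing with a UnicodeEncodeError. Also note the mode arguments 'w' and 'a' passed to the open function: the first means "write" (the file is created or truncated) and the second means "append". However, even though this code works from a heuristic point of view, it is far from being a "good" idea.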
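
To make the difference between the two modes concrete, here is a minimal sketch, assuming Python 3 and a throwaway file in a temporary directory (the file name `demo.csv` and the sample rows are made up for illustration). Passing `encoding='utf-8'` to `open` is one way to avoid a `UnicodeEncodeError` on platforms whose default encoding cannot represent some characters.

```python
import os
import tempfile

def write_demo(path):
    # 'w' creates the file, or truncates it if it already exists.
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write('Product Name;Price\n')
    # 'a' keeps the existing content and adds to the end.
    with open(path, 'a', encoding='utf-8') as outfile:
        outfile.write('Crème de Rose;AED 95\n')
    # Opening with 'w' again at this point would discard both lines.
    with open(path, encoding='utf-8') as infile:
        return infile.read().splitlines()

with tempfile.TemporaryDirectory() as tmp:
    lines = write_demo(os.path.join(tmp, 'demo.csv'))
    print(lines)
```

This is why the answer writes the header once with 'w' before the crawl starts and uses 'a' for every product row afterwards.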

+0

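Thanks, Tnerual, for your answer. You are very close to what I expected. The encoding didn't work in this case, so I dropped that line. All the results are now in the csv file, but instead of being in separate columns they sit in a single column, separated by commas. Can you split them into two columns? Thanks. Here is a link to the output: "https://www.dropbox.com/s/okb7hxocmv5lbax/data.csv?dl=0" – SIM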

+1

@SMth80 You could try changing the variable named `delimiter`: change the semicolon ";" to a comma ",". – Kanak
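
One caveat when switching the delimiter to a comma: joining fields by hand does not quote them, so a value that itself contains a comma spills into an extra column. Below is a minimal sketch of how the standard `csv` module avoids this, assuming Python 3 (the product name and price are made-up sample values, not data from the site).

```python
import csv
import io

row = ("Sample Lipstick", "1,000")

# Manual join: the comma inside the price looks exactly like the delimiter.
manual_line = ",".join(row)

# csv.writer quotes any field that contains the delimiter
# (QUOTE_MINIMAL is the default quoting mode).
buffer = io.StringIO(newline='')
csv.writer(buffer, delimiter=",").writerow(row)
quoted_line = buffer.getvalue().strip()

# Reading both lines back shows the difference in column count.
manual_cols = next(csv.reader([manual_line]))
quoted_cols = next(csv.reader([quoted_line]))
print(len(manual_cols), manual_cols)
print(len(quoted_cols), quoted_cols)
```

With quoting in place, a spreadsheet program or `csv.reader` treats the price as a single field regardless of the delimiter chosen.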

+0

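It works, but it runs into trouble when a price is over a thousand because, as you know, thousands are written like "1,000". So whenever it meets a ",", your code creates a separate column. Anyway, I have fixed that and rewritten the corrected question in my first post. Can you tell me how to remove duplicates when writing to the csv? Thanks. – SIM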
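
For the remaining question about duplicates, one common approach (not taken from this thread) is to remember every row already written in a `set` and skip rows that have been seen before. The sketch below assumes Python 3 and that a duplicate means an identical (name, price) pair after trimming whitespace; `write_unique_rows` and the sample rows are hypothetical names for illustration, and in the spider the rows would come from `Endpoint` rather than from a list.

```python
import csv
import io

def write_unique_rows(rows, outfile):
    """Write each distinct (name, price) row once, keeping first-seen order."""
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Price"])
    seen = set()
    written = 0
    for row in rows:
        # Normalise stray whitespace so " Mascara " and "Mascara" match.
        key = tuple(field.strip() for field in row)
        if key in seen:
            continue  # already written, skip the duplicate
        seen.add(key)
        writer.writerow(key)
        written += 1
    return written

sample = [
    ("Lipstick", "95"),
    ("Mascara", "1,000"),
    ("Lipstick", "95"),
    (" Mascara ", "1,000"),
]
buffer = io.StringIO(newline='')
count = write_unique_rows(sample, buffer)
print(count)
print(buffer.getvalue())
```

In the spider itself, the same `seen` set could be kept at module level, or passed into `Endpoint`, and checked just before each `writer.writerow(metco)` call.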