2016-08-24 24 views
1

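I am currently scraping LinkedIn data with Selenium in Python. I can page through the results and scrape the data, but the process breaks after the first few pages because of a Unicode error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 448: ordinal not in range(128)

Here is my code: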

from selenium import webdriver 
from time import sleep 

driver = webdriver.Firefox() 
driver.get('https://www.linkedin.com/jobs/search?locationId=sg%3A0&f_TP=1%2C2&orig=FCTD&trk=jobs_jserp_posted_one_week') 

result = [] 
while True:
    while True:
        try:
            sleep(1)
            result += [i.text for i in driver.find_elements_by_class_name('job-title-text')]
        except:
            sleep(5)
        else:
            break
    try:
        for i in range(50):
            nextbutton = driver.find_element_by_class_name('next-btn')
            nextbutton.click()
    except:
        break

with open('jobtitles.csv', 'w') as f:
    f.write('\n'.join(i for i in result).encode('utf-8').decode('utf-8'))
+1

Why are you calling .encode('utf-8').decode('utf-8')? 'Actual String' -> 'Encode' -> 'Decode' -> 'Actual String'; what is the point?

+0

I want to get the job titles as text and export them to a CSV file.

+0

I tried removing the decode call, but it still only works up to the 9th page and then stops. There are actually 50 pages.

Answers

0

You can use UnicodeWriter (from the Python 2 csv module documentation):

import codecs 
import cStringIO 
import csv 
from time import sleep 

from selenium import webdriver 


class UnicodeWriter: 
    """ 
    A CSV writer which will write rows to CSV file "f", 
    which is encoded in the given encoding. 
    """ 

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


driver = webdriver.Firefox() 
driver.get('https://www.linkedin.com/jobs/search?locationId=sg%3A0&f_TP=1%2C2&orig=FCTD&trk=jobs_jserp_posted_one_week') 

result = [] 
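while True:
    while True:
        try:
            sleep(1)
            result += [i.text for i in driver.find_elements_by_class_name('job-title-text')]
        except:
            sleep(5)
        else:
            break
    try:
        for i in range(50):
            nextbutton = driver.find_element_by_class_name('next-btn')
            nextbutton.click()
    except:
        break


# The Python 2 csv module expects the file to be opened in binary mode.
with open('jobtitles.csv', 'wb') as f:
    doc = UnicodeWriter(f)
    # Each row must be a sequence of cells; passing a bare string would
    # split the title into one column per character.
    doc.writerows([title] for title in result)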
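
As an alternative sketch, not part of the docs recipe above: if one job title per line is all that is needed, opening the file with an explicit encoding avoids the implicit ASCII step. This assumes `io.open`, which accepts an `encoding` argument on both Python 2 and Python 3. `save_titles`, the sample titles and the temporary path are hypothetical and only stand in for the scraped `result` list.

```python
import io
import os
import tempfile


def save_titles(titles, path):
    """Write one title per line as UTF-8 (hypothetical helper)."""
    # io.open encodes the text on write, so no manual .encode() is needed.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'\n'.join(titles))


# Sample data standing in for the scraped `result` list.
sample_titles = [u'Engineer \u2013 Backend', u'Data Analyst']
path = os.path.join(tempfile.mkdtemp(), 'jobtitles.csv')
save_titles(sample_titles, path)
```

The titles stay as text until the file object encodes them, so there is no point at which Python has to fall back to the ASCII codec.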
0

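The problem is that you are declaring the wrong encoding: you claim that a byte stream is encoded as UTF-8 when it is not. According to the UTF-8 implementation, only ASCII characters (0-127) are allowed at the position mentioned in the error, so the UTF-8 decoding fails. I cannot see from your code how and when the UTF-8 decoding fails, so you should trace the exact location yourself. Check the variable types with type(), and note that Python 2 and Python 3 differ in this area.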
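
One way to trace this, as a small sketch: the character named in the traceback, u'\u2013' (an en dash), has no ASCII representation, so any ASCII encode of it fails, while a UTF-8 encode succeeds. The variable names and the sample title are illustrative only.

```python
title = u'Engineer \u2013 Backend'  # text containing an en dash

# Text type: unicode on Python 2, str on Python 3.
print(type(title))

try:
    title.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    # Same error class as in the question's traceback.
    ascii_ok = False
    print(exc)

# UTF-8 can represent the en dash; it takes three bytes instead of one.
encoded = title.encode('utf-8')
print(type(encoded), len(title), len(encoded))
```

Printing `type()` at the point where the data is written shows whether the value is text or bytes, which is the distinction that differs between Python 2 and Python 3.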

0
import sys 
reload(sys) 
sys.setdefaultencoding("utf-8") 
print sys.getdefaultencoding() 

Add this at the top of your code.

Also, you may need to preprocess the text to replace some non-English words:

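import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# `content` is the raw text to clean; it is not defined in this snippet.
words = word_tokenize(content)
word = []
for w in words:
    w = re.sub(r'[^\w\s]', '', w)  # drop punctuation
    w = re.sub("[^A-Za-z]+", " ", w, flags=re.MULTILINE)  # replace anything that is not an English letter
    w = w.strip("\t\n\r")
    word.append(w)
words = word
# Keep words longer than three characters that are not English stop words.
stop_words = set(stopwords.words('english'))
filteredword = [w for w in words if w not in stop_words and 3 < len(w)]
words = " ".join(filteredword)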
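
If losing characters is acceptable, a lighter sketch that needs no NLTK is shown here. It assumes that replacing the en dash with a plain hyphen and discarding every other non-ASCII character is fine for the job titles; `to_ascii` is a hypothetical helper name.

```python
def to_ascii(text):
    """Replace en dashes with hyphens and drop other non-ASCII characters."""
    text = text.replace(u'\u2013', u'-')
    # errors='ignore' silently discards characters ASCII cannot represent.
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = to_ascii(u'Engineer \u2013 Backend (caf\u00e9)')
print(cleaned)
```

This discards information (the accented letter above is simply removed), so writing the file as UTF-8 is usually preferable when the original text matters.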