Strange behavior when parsing a tab-delimited file in Python

I am parsing a tab-delimited file in which the first field is a Twitter hashtag and the second field is the tweet text.

My input file looks like this:

#trumpisanabuser of young black men . calling for the execution of the innocent !url " 
#centralparkfiv of young black men . calling for the execution of the innocent !url " 
#trumppence16 " 
#trumppence16 " 
#america2that @user "

My code filters out duplicate content, such as retweets, by checking whether the second tab-separated field has already been seen.

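import sys
import csv

tweetfile = sys.argv[1]
tweetset = set()   # tweet texts already seen
taglines = []      # kept "hashtag<TAB>tweet" lines (was never initialized)
with open(tweetfile, "rt") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print("hashtag: " + str(row[0]) + "\t" + "tweet: " + str(row[1]))
        # Remove literal "\n" sequences and trailing whitespace
        row[1] = row[1].replace("\\n", "").rstrip()
        if row[1] in tweetset:
            continue  # duplicate tweet (e.g. a retweet)
        # Keep the tweet only if something alphanumeric remains
        # after removing the !url and @user placeholders
        temp = row[1].replace("!url", "")
        temp = temp.replace("@user", "")
        temp = "".join([c if c.isalnum() else "" for c in temp])
        if temp:
            taglines.append(row[0] + "\t" + row[1])
        tweetset.add(row[1])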

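However, the parsing behaves strangely. When I print each parsed item, the output looks like the following. Can anyone explain why the parsing breaks and produces this output (hashtag: #trumppence16 tweet:, a newline, and then #trumppence16)?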

hashtag: #centralparkfive tweet: of young black men . calling for the execution of the innocent !url " 
hashtag: #trumppence16 tweet: 
#trumppence16 
hashtag: #america2that tweet: @user "


2017-01-03 pandagrammer

You must have an unclosed quote in the file. – e4c5
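The effect described in this comment can be reproduced without the original file. The following is a minimal sketch, assuming the tweet field of the affected rows is a lone " character; the sample string is a hypothetical stand-in for the real data, which is not shown in full here.

```python
import csv
import io

# Hypothetical sample mirroring the question's data: the tweet field
# of the first two rows is a lone double-quote character.
sample = (
    '#trumppence16\t"\n'
    '#trumppence16\t"\n'
    '#america2that\t@user "\n'
)

# With the default dialect, a " at the start of a field opens a quoted
# value that runs until the next ", even across a line break.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    print(row)
```

With default quoting, the first two physical lines come back as a single row whose second field starts with a newline, which matches the odd output in the question. A " that appears in the middle of a field, as in the last line, is kept as an ordinary character.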

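Your tweet lines contain " characters. CSV lets a column be quoted by wrapping its value in " characters, and a quoted value may include newlines. Everything from an opening " to the next closing " is treated as a single column value.

You can disable quote handling by setting the quoting option to csv.QUOTE_NONE:

reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

2017-01-03 07:59:31

Oh my gosh, this solved it. Thank you!!!!! – pandagrammer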
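To see the fix in context, here is a small sketch of the deduplication loop with quote handling turned off. The function name unique_tweets and the in-memory sample are assumptions made for illustration, and the sketch leaves out the alphanumeric filter from the question to stay short.

```python
import csv
import io


def unique_tweets(lines):
    """Return 'hashtag<TAB>tweet' strings, skipping tweets already seen."""
    seen = set()
    kept = []
    # QUOTE_NONE makes the reader treat " as an ordinary character,
    # so every physical line is parsed as exactly one row.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        hashtag, tweet = row[0], row[1].rstrip()
        if tweet in seen:
            continue  # duplicate tweet (e.g. a retweet)
        seen.add(tweet)
        kept.append(hashtag + "\t" + tweet)
    return kept


# Hypothetical sample mirroring the question's data
sample = io.StringIO(
    '#trumppence16\t"\n'
    '#trumppence16\t"\n'
    '#america2that\t@user "\n'
)
result = unique_tweets(sample)
print(result)
```

Because csv.reader accepts any iterable of lines, the same function should also work with an open file object in place of the io.StringIO sample.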
