2017-06-15 21 views
1

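Reading an incorrectly formatted CSV into pandas - unescaped quotes

I have inherited a few hundred CSVs that I want to import into pandas dataframes. They are formatted like this: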

username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink 
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112 
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752 
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281 

To read these into a pandas dataframe, I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)

and got this error:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11

I believe this is because there is an unescaped quote inside the text field:

ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...

So I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)

and got a new error (I assume because there is a ; inside the field):

Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5

ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
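One way to check the assumption that stray `;` characters are behind the field-count errors is to count delimiters per line before handing the file to pandas. The following is only a sketch: the helper name `rows_with_extra_delimiters` is hypothetical, the line numbers it reports are plain 1-based positions in the input (not necessarily the numbers pandas prints), and it assumes one record per physical line.

```python
def rows_with_extra_delimiters(lines, sep=";", expected_fields=10):
    """Yield (line_number, field_count) for lines that a naive split gets wrong.

    Quotes are ignored on purpose: this mimics what a parser sees when it
    cannot rely on quoting (e.g. quoting=csv.QUOTE_NONE).
    """
    for number, line in enumerate(lines, start=1):
        count = line.rstrip("\r\n").count(sep) + 1
        if count != expected_fields:
            yield number, count
```

Running it over an open file, for example `list(rows_with_extra_delimiters(open(file)))`, lists the rows whose text contains at least one extra `;`.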

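I cannot regenerate these CSV files. What I would like to know is: how can I preprocess/fix them so that they are properly formatted (i.e., with the quotes inside fields escaped)? Or, is there a way to read them directly into a dataframe even with the unescaped quotes?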

+0

What versions of Python and pandas are you using? I get different results with Python 3.6.1 and pandas 0.19.2 –

+0

Python 3.5.3, pandas 0.20.2 - what happens for you? – Libby

+0

For this case I don't need every column, and adding 'usecols' solved my immediate problem. But it doesn't answer my actual question. Here is the line that works: 'tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE, usecols=["date", "hashtags", "permalink"])' – Libby

Answers

-1

I would clean the data before reading it into pandas. Here is my solution to your current problem.

Edit:
This handles the ; inside double quotes (based on this answer):

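import re

# Remove every ';' that sits between a pair of double quotes, so it is not
# read as a delimiter. This assumes the quotes on each line pair up; a row
# with an unescaped inner quote will be paired wrongly.
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        o.write(re.sub(r'"[^"]*"', lambda m: m.group(0).replace(";", ""), line))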

Original, which removes every '; ' (a semicolon followed by a space):

with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        # Drop every '; ' (semicolon plus space), assumed to occur only inside tweet text
        o.write(line.replace("; ", ""))
+0

The ; in the tweets is not always followed by a space, so this only works for some of them. For example: ';2013-07-15 15:35;1;0;"@CongressionalPhotoADay 15 - Something beautiful: seen from the Speaker's balcony of the US Capitol;... http:// fb.me/2ZHDzR8XQ";;@CongressionalPhotoADay;;"356874563839201280";https://twitter.com/AustinScottGA08/status/356874563839201280' – Libby

+1

@Libby: In that case, use a regular expression like the one in https://stackoverflow.com/a/11096811/2204131. 're.sub(r'"[^"]*"', lambda x: x.group(0).replace(';', '\\;'), lines)' will replace each ';' inside the quotes with '\;'. – ramesh
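For rows where the quotes do not pair up, an alternative that avoids regular expressions is to rely on the column layout instead. The sketch below is not from the thread; it assumes that every record sits on a single line, that the header shown in the question is accurate (four fields before `text`, five after it), and that `text` is the only field that can contain a stray `;` or `"`. The names `unwrap`, `split_row` and `rewrite` are hypothetical helpers.

```python
import csv
import io

N_BEFORE = 4  # username, date, retweets, favorites come before the text
N_AFTER = 5   # geo, mentions, hashtags, id, permalink come after it


def unwrap(field):
    """Drop one pair of wrapping double quotes, if present."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


def split_row(line, sep=";"):
    """Split one raw line into 10 fields, keeping any ';' or '"' inside the text."""
    line = line.rstrip()                      # the sample rows end with a space
    head = line.split(sep, N_BEFORE)          # 4 leading fields + the rest
    if len(head) != N_BEFORE + 1:
        raise ValueError("too few fields: %r" % line)
    tail = head[-1].rsplit(sep, N_AFTER)      # text + 5 trailing fields
    if len(tail) != N_AFTER + 1:
        raise ValueError("too few fields: %r" % line)
    return [unwrap(field) for field in head[:N_BEFORE] + tail]


def rewrite(lines, out, sep=";"):
    """Write the rows to `out` as well-formed CSV (inner quotes get doubled)."""
    writer = csv.writer(out, delimiter=sep, quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        if line.strip():                      # skip blank lines
            writer.writerow(split_row(line, sep))
```

To use it, open the input and output files (the output with `newline=""`, as the `csv` module recommends) and call `rewrite(f, o)`. Because the writer doubles embedded quotes and quotes any field containing `;`, the result should then load with a plain `pd.read_csv("fileOut.csv", sep=";")`. If a tweet can span several lines, or another column can contain `;`, this approach would need adjusting.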