Reading incorrectly formatted CSV into pandas - unescaped quotes

I inherited a few hundred CSVs that I want to import into a pandas DataFrame. They are formatted like this:

username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink 
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112 
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752 
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281

To read one into a pandas DataFrame, I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)

and got this error:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11

I think this is because there is an unescaped quote inside the field:

ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...

So I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)

and got a new error (I assume because there is a ; inside the field):

Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5

ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
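Both failures can be reproduced without pandas. The following is a minimal sketch that uses the standard library's csv module on abbreviated copies of the first two rows; the shortened tweet texts and the helper name `field_counts` are illustrative assumptions, not part of the original files. It only counts how many fields each quoting mode produces, which is the number the ParserError messages complain about.

```python
import csv

# Abbreviated copies of the first two data rows (tweet text shortened for readability).
rows = [
    ';2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people; that is why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112',
    ';2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS; supporting bills.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752',
]


def field_counts(lines, quoting):
    """Return the number of fields the csv module finds in each line."""
    reader = csv.reader(lines, delimiter=";", quoting=quoting)
    return [len(fields) for fields in reader]


# Quotes honoured: the stray quote in row 1 ends the quoted field early,
# so the ; that follows in the text is treated as a delimiter.
default_counts = field_counts(rows, csv.QUOTE_MINIMAL)

# Quotes ignored: every ; inside the tweet text becomes a delimiter.
no_quote_counts = field_counts(rows, csv.QUOTE_NONE)

print(default_counts)
print(no_quote_counts)
```

With quoting enabled only the first row has an extra field; with quoting disabled both rows do. Neither setting alone yields 10 fields everywhere, so the files need repairing before they are parsed.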

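I can't regenerate these CSV files. What I'd like to know is how I can preprocess or repair them so that they are correctly formatted (i.e., with the quotes inside fields escaped). Alternatively, is there a way to read them directly into a DataFrame despite the unescaped quotes?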


2017-06-15 Libby

Which versions of Python and pandas are you using? I got different results with Python 3.6.1 and pandas 0.19.2 –

Python 3.5.3, pandas 0.20.2. What happens for you? – Libby

For this case I don't need every column, and adding 'usecols' solved my immediate problem. But it doesn't answer my actual question. Here is the line that works: 'tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE, usecols=["date", "hashtags", "permalink"])' – Libby

-1

I would clean the data before reading it into pandas. Here is my solution to your current problem.

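Edit:
This removes the ; inside double-quoted fields (adapted from this answer):

import re

# A quoted field opens with " right after a ; and closes with " right before a ; or the line end.
# This is a heuristic: it misfires if the tweet text itself contains the sequence ";
quoted_field = re.compile(r'(?<=;)".*?"(?=;|\s*$)')

with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        # Delete every ; found inside a quoted field, leaving the delimiters alone.
        o.write(quoted_field.sub(lambda m: m.group(0).replace(";", ""), line))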

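Original version, which removes the ; only where it is followed by a space:

# Copy the file, deleting every "; " (semicolon plus space) along the way.
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        o.write(line.replace("; ", ""))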


2017-06-16 00:14:21 ramesh

The ; in the tweets is not always followed by a space, so this only works for some of them. For example ';2013-07-15 15:35;1;0;"@CongressionalPhotoADay 15 - Something beautiful: the view from the Speaker's balcony of the U.S. Capitol;... http:// fb.me/2ZHDzR8XQ";;@CongressionalPhotoADay;;"356874563839201280";https://twitter.com/AustinScottGA08/status/356874563839201280' – Libby

@Libby: In that case, use a regular expression like the one in https://stackoverflow.com/a/11096811/2204131. 're.sub('\"[^]]*\"', lambda x: x.group(0).replace(';', '\;'), lines)' will replace the ';' inside quotes. – ramesh
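As a variant of the regex idea in the comments, the inner quotes can be escaped instead of deleting or backslash-escaping the semicolons, which is closer to what the question asks for. The sketch below is one possible approach under stated assumptions, not the answerer's method: it assumes each tweet occupies a single line, that a quoted field opens with a quote directly after a ; (or at the start of the line) and closes with a quote directly before a ; or the end of the line, and that the tweet text never contains the sequence "; itself. The names `QUOTED_FIELD` and `escape_inner_quotes` are hypothetical, and the sample row is an abbreviated copy of the first row in the question.

```python
import csv
import re

# Heuristic: an opening quote sits at line start or right after a ;,
# and the closing quote sits right before a ; or the end of the line.
QUOTED_FIELD = re.compile(r'(?:^|(?<=;))"(.*?)"(?=;|\s*$)')


def escape_inner_quotes(line):
    """Double every quote inside a quoted field (the usual CSV escape)."""
    def fix(match):
        return '"' + match.group(1).replace('"', '""') + '"'
    return QUOTED_FIELD.sub(fix, line)


# Abbreviated copy of the first data row from the question.
row = (';2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people; '
       'that is why our...";;;;"42993734165594112";'
       'https://twitter.com/AustinScottGA08/status/42993734165594112')

fixed = escape_inner_quotes(row)

# Parse the repaired line with the standard csv module to check the result.
fields = next(csv.reader([fixed], delimiter=";"))
print(len(fields))
print(fields[4])
```

Applied line by line to each file, this keeps both the semicolons and the quotes of the tweet text. Because doubled quotes are the escape convention pandas expects by default, the repaired files should then load with pd.read_csv(file, header=0, sep=';') without a quoting override, although rows that break the assumptions above would still need manual attention.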
