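Python - CSV reader - reading one line at a time

OK, I have a CSV file with many rows (currently more than 40k). Because of the large number of rows, I need to read them one by one and run a series of operations on each. That is the first question. The second is: how do I read the CSV file and encode it as UTF-8, that is, how do I read a UTF-8 file following the example in the csv documentation? Even using the class UTF8Recoder, the output of my print is \xe9 s\xf3. Could someone help me solve this problem?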
import preprocessing
import pymongo
import csv, codecs, cStringIO
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

with open('data/MyCSV.csv', 'rb') as csvfile:
    reader = UnicodeReader(csvfile)
    #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for row in reader:
        print row

def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus
    print "Starting..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"
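A note on the `\xe9 s\xf3` output: in Python 2, `print row` prints the whole list, and a list is displayed using the repr of each element, so `é` inside a unicode string shows up as `\xe9` even though the data was decoded correctly. Printing the cells themselves (for example `print row[0]`) shows the accented characters. As a minimal sketch, assuming Python 3 is an option (the code above is Python 2), the file can be streamed one row at a time without the recoder classes. The helper names `iter_csv_rows` and `format_row` are hypothetical, chosen for illustration.

```python
import csv


def iter_csv_rows(path, encoding="utf-8-sig"):
    """Yield the rows of a CSV file one at a time as lists of str.

    Only one row is held in memory at a time, so a 40k-row file is
    never loaded whole. "utf-8-sig" also skips a leading BOM if present.
    """
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle):
            yield row


def format_row(row):
    """Join the cells for display instead of printing the list itself."""
    return " | ".join(row)


if __name__ == "__main__":
    # Same path as in the question; replace with the real file.
    for row in iter_csv_rows("data/MyCSV.csv"):
        print(format_row(row))
```

Because `iter_csv_rows` is a generator, the expensive per-row processing can run inside the `for` loop before the next row is read.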
Edit 1: I tried Mr. S Ringne's solution, but now I cannot run the operations inside my def. Here is the new code:
import pandas as pd

for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv', encoding='utf-8', sep=',', header='infer', engine='c', chunksize=2):
    def status_processing(csvfile):
        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = csvfile
        print "Fazendo o processo inicial..."
        myCorpus.initial_processing()
        print "Feito."
        print "----------------------------"
and at the end of the script:
def main():
    status_processing(csvfile)

main()
The output, when I use BeautifulSoup to remove the links, is:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
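A likely reading of this error: with `chunksize`, every iteration of `pd.read_csv` yields a DataFrame, not a string, and the call in `main()` runs after the loop, so it receives only the last chunk. Code that expects text (such as a truth test on the markup) then fails on the DataFrame. The following is a minimal sketch, assuming Python 3 and pandas, that calls the processing function inside the loop and passes it one string per row. The column name `status_message`, the sample data, and the body of `status_processing` are hypothetical stand-ins, not taken from the question.

```python
import io

import pandas as pd

# Hypothetical sample data; the real file and column names may differ.
SAMPLE = "status_id,status_message\n1,Só é possível\n2,Olá mundo\n"


def status_processing(text):
    """Stand-in for the real processing step: receives ONE string."""
    return text.strip().lower()


def process_in_chunks(source, column, chunksize=2):
    """Read `source` in chunks and process each row's text individually."""
    results = []
    for chunk in pd.read_csv(source, sep=",", chunksize=chunksize):
        # `chunk` is a DataFrame with at most `chunksize` rows.
        for text in chunk[column].dropna():
            # Pass the cell value (a string), never the DataFrame itself.
            results.append(status_processing(str(text)))
    return results


if __name__ == "__main__":
    for line in process_in_chunks(io.StringIO(SAMPLE), "status_message"):
        print(line)
```

Only one chunk is in memory at a time, so the expensive step runs on a few rows before the next chunk is read.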
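Hmm, but how do I read line by line? In this case, I read one line, run the operations in 'def status_processing', and then go back to read the next line. The word-correction step is very expensive if I read everything at once and then run these operations.
@LeandroS.Matos use chunksize in pd.read_csv: for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1): – Shubham
@LeandroS.Matos: http://stackoverflow.com/questions/29334463/pandas-read-csv-file-line-by-line – Shubham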