pandas DataFrame ValueError: Shape of passed values is (1, 3), indices imply (3, 3)
Here is my code.
The shape of the data:
data_dict.items()
Out[57]:
[('Sympathetic', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('protest', defaultdict(<type 'int'>, {'2011-10-06': 16})),
('occupycanada', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('hating', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('AND', defaultdict(<type 'int'>, {'2011-10-06': 4})),
('c', defaultdict(<type 'int'>, {'2011-10-06': 2})),
...]
data_dict is defined as:
data_dict = defaultdict(lambda: defaultdict(int))
I want to build a DataFrame like this:
columns = ['word','date',"number"]
word date number
"Sympathetic" '2011-10-06' 1
"protest" '2011-10-06' 16
'occupycanada' '2011-10-06' 1
'hating' '2011-10-06' 1
'AND' '2011-10-06' 4
'comunity' '2011-10-06' 2
...
I tried to do it this way, using pandas:
import pandas as pd
for d in data_dict:
    for date in data_dict[d]:
        data=[d,date,data_dict[d][date]]
        dat = pd.DataFrame(data, columns = ['word','date',"number"])
        print dat
But when I run this code, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-56-80b3affa34fe> in <module>()
3 for date in data_dict[d]:
4 data=[d,date,data_dict[d][date]]
----> 5 dat = pd.DataFrame(data, columns = ['word','date',"number"])
6 print dat
....
ValueError: Shape of passed values is (1, 3), indices imply (3, 3)
How can I fix this?
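One possible direction, as a minimal sketch rather than a definitive fix: a flat list such as `[d, date, count]` appears to be read by `pd.DataFrame` as three rows of a single column, which cannot match three column names. Collecting one `[word, date, number]` row per (word, date) pair and building the DataFrame once avoids that. The small `data_dict` below is a hypothetical stand-in filled with a few values from the output above, and `rows` is a name introduced only for illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the real data_dict (values copied from the output above).
data_dict = defaultdict(lambda: defaultdict(int))
data_dict['Sympathetic']['2011-10-06'] = 1
data_dict['protest']['2011-10-06'] = 16
data_dict['AND']['2011-10-06'] = 4

# Collect one [word, date, number] row per (word, date) pair.
rows = []
for word, counts in data_dict.items():
    for date, number in counts.items():
        rows.append([word, date, number])

# Build the DataFrame once, from a list of rows (a 2-D structure),
# so the three column names line up with three values per row.
dat = pd.DataFrame(rows, columns=['word', 'date', 'number'])
```

Wrapping a single row in an outer list, as in `pd.DataFrame([data], columns=[...])`, should also satisfy the constructor, but inside the loop it would create a separate one-row frame on every iteration instead of one table.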
Additional code for data_dict:
from collections import defaultdict
import csv
import re
import sys
def flushPrint(s):
    sys.stdout.write('\r')
    sys.stdout.write('%s' % s)
    sys.stdout.flush()
data_dict = defaultdict(lambda: defaultdict(int))
error_num = 0
line_num = 0
total_num = 0
bigfile = open('D:/Data/ows/ows_sample.txt', 'rb')
chunkSize = 10000000
chunk = bigfile.readlines(chunkSize)
while chunk:
    total_num += len(chunk)
    lines = csv.reader((line.replace('\x00','') for line in chunk), delimiter=',', quotechar='"')
    for i in lines:
        line_num+=1
        if line_num%1000000==0:
            flushPrint(line_num)
        try:
            i[1]= re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+|(?:@[\w_]+)', "", i[1])
            tweets=re.split(r"\W+",i[1])
            date=i[3]
            for word in tweets: # error
                if len(date)==10:
                    data_dict[word][date] += 1
        except Exception, e:
            print e
            error_num+=1
            pass
    chunk = bigfile.readlines(chunkSize)
print line_num, total_num,error_num
Sample data:
['"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n',
'121813144174727168,"RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!",,2011-10-06,5,4,"2011-10-06 05:04:51",N;,Anonops_Cop,401240477,en,,0,"<a href=">web</a>"\n',
'121813146137657344,"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester 2011-10-06,5,4,"2011-10-06 05:04:51",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,"<a href=">web</a>"\n',
'121813150000619521,"@Seductivpancake Right! Those guys have a victory condition: regime change. #ows doesn\'t seem to have a goal I can figure out.",2011-10-06,5,4,"2011-10-06 05:04:52",N;,nerdsherpa,95067344,en,Seductivpancake,19695580,"<a href="nofollow">Echofon</a>"\n',
'121813150701072385,"RT @bembel "Occupy Wall Street" als linke Antwort auf die Tea Party? #OccupyWallStreet #OWS",2011-10-06,5,4,"2011-10-06 05:04:52",N;,hamudistan,35862923,en,,0,"<a href="rel="nofollow">Plume\xc2\xa0\xc2\xa0</a>"\n',
'121813163778899968,"#ows White shirt= Brown shirt.",2011-10-06,5,4,"2011-10-06 05:04:56",N;,kl_knox,419580636,en,,0,"<a href=">web</a>"\n',
'121813169999065088,"RT @TheNewDeal: The #NYPD are Out of Control. Is This a Free Country or a Middle-East Dictatorship? #OccupyWallStreet #OWS #p2",2011-10-06,5,4,"2011-10-06 05:04:57",N;,vickycrampton,32151083,en,,0,"<a href=">web</a>"\n',
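Can you post some code that generates 'data_dict'? – MaxU
@MaxU The code has been uploaded, please help me. –
I would do it completely differently. Can you post a sample data set of 5-7 rows (in the same format as in 'ows_sample.txt')? – MaxU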