2016-02-12 241 views

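Pandas weighted average

I am trying to process some Twitter sentiment data, and I am looking at pandas to do it. So far I have computed the mean of all the data for each date. I would also like to use the score to compute a weighted average per day: if a tweet has a score of 2, it should affect the aggregate result as if it were 2 tweets.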

import datetime
import math

import pandas as pd

tweets = [
    {'tweet_user_verified': 1, 'tweet_user_id': 14631115, 'tweet_favorite_count': 17048, 'tweet_sentiment': 1, 'tweet_retweet_count': 4842, 'tweet_id': 698155842877702144, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 44, 58), 'tweet_lang': 'en', 'tweet_text': 'they fixed my iphone but now the z and the s do nt work ! they re like two of the best letters'},
    {'tweet_user_verified': 0, 'tweet_user_id': 73518190, 'tweet_favorite_count': 1, 'tweet_sentiment': 1, 'tweet_retweet_count': 0, 'tweet_id': 698179827900125185, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 20, 17), 'tweet_lang': 'en', 'tweet_text': 'ihs is nt ghetto , it s full of suburban kids who think their ghetto cause their parents ca nt afford to buy them a new iphone .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 1832492197, 'tweet_favorite_count': 2, 'tweet_sentiment': 5, 'tweet_retweet_count': 0, 'tweet_id': 698179203376623616, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 17, 48), 'tweet_lang': 'en', 'tweet_text': 'how to brick iphone 5s and above 1. set the date and time to january 1st , 1970 on the device 2. restart 3. profit'},
    {'tweet_user_verified': 0, 'tweet_user_id': 70582223, 'tweet_favorite_count': 18, 'tweet_sentiment': 5, 'tweet_retweet_count': 14, 'tweet_id': 698178066292539392, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 13, 17), 'tweet_lang': 'en', 'tweet_text': 'iphone battery s go from 100-75 in seconds'},
    {'tweet_user_verified': 0, 'tweet_user_id': 31050061, 'tweet_favorite_count': 72, 'tweet_sentiment': 1, 'tweet_retweet_count': 40, 'tweet_id': 698176382417903618, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 6, 35), 'tweet_lang': 'en', 'tweet_text': 'a @ tmobile iphone ad featuring a woman wearing hijab is up on the # nyc subway platform walls pic.twitter.com/lsb0dzfymd'},
    {'tweet_user_verified': 0, 'tweet_user_id': 733813, 'tweet_favorite_count': 14, 'tweet_sentiment': 1, 'tweet_retweet_count': 2, 'tweet_id': 698170656203149312, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 43, 50), 'tweet_lang': 'en', 'tweet_text': 'if a modern 4″ iphone does arrive in march i might go for it . would miss 5″ screen but reaching the top left is a constant micro-irritant .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 3098026668, 'tweet_favorite_count': 11, 'tweet_sentiment': 1, 'tweet_retweet_count': 13, 'tweet_id': 698170562250739713, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 43, 28), 'tweet_lang': 'en', 'tweet_text': 'hidden iphone 6s easter egg . pic.twitter.com/op1kqewwqv'},
    {'tweet_user_verified': 1, 'tweet_user_id': 1769191, 'tweet_favorite_count': 11, 'tweet_sentiment': 1, 'tweet_retweet_count': 5, 'tweet_id': 698163741158838272, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 16, 21), 'tweet_lang': 'en', 'tweet_text': 'that it took until iphone 7 for this to happen just shows you how hard it is to find a fabricator on par with samsung'},
    {'tweet_user_verified': 0, 'tweet_user_id': 64334539, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 20, 'tweet_id': 698154074160697344, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 37, 57), 'tweet_lang': 'en', 'tweet_text': 'rt @ iphoneteam : when your iphone corrects omw to on my way ! pic.twitter.com/ptpnrjgqqn'},
    {'tweet_user_verified': 0, 'tweet_user_id': 3003790936, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 7, 'tweet_id': 698154004451323904, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 37, 40), 'tweet_lang': 'en', 'tweet_text': 'mathew brady s business dropped off significantly once old abe got an iphone . # lincolnsbirthday pic.twitter.com/b4i8pzpw7z'},
    {'tweet_user_verified': 0, 'tweet_user_id': 356837905, 'tweet_favorite_count': 8, 'tweet_sentiment': 5, 'tweet_retweet_count': 8, 'tweet_id': 698153555086221312, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 35, 53), 'tweet_lang': 'en', 'tweet_text': 'millennials supporting a # socialist is priceless . ca nt wait for the government to dictate who can have that iphone or internet access .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 2872097713, 'tweet_favorite_count': 20, 'tweet_sentiment': 5, 'tweet_retweet_count': 21, 'tweet_id': 698153222964408321, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 34, 34), 'tweet_lang': 'en', 'tweet_text': 'snapchat no android/iphone pic.twitter.com/bel90svufz'},
    {'tweet_user_verified': 0, 'tweet_user_id': 35453314, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 1, 'tweet_id': 698152530031853568, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 31, 48), 'tweet_lang': 'en', 'tweet_text': 'has anyone seen a place close to the venue that sells iphone chargers ? hurry please , 1 % left i m running out of pow'},
    {'tweet_user_verified': 0, 'tweet_user_id': 231879129, 'tweet_favorite_count': 16, 'tweet_sentiment': 1, 'tweet_retweet_count': 7, 'tweet_id': 698152206965592064, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 30, 31), 'tweet_lang': 'en', 'tweet_text': 'when u get a new iphone it s lit the best thing ever bc ur battery lasts u agesssss'},
    {'tweet_user_verified': 0, 'tweet_user_id': 407435062, 'tweet_favorite_count': 6, 'tweet_sentiment': 5, 'tweet_retweet_count': 5, 'tweet_id': 698151086222372865, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 26, 4), 'tweet_lang': 'en', 'tweet_text': 'and iphone mad corny for mandatory capitalizing kardashian '},
]

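dates = [tw['tweet_date'] for tw in tweets]
sntms = [tw['tweet_sentiment'] for tw in tweets]
# Weight of each tweet: grows logarithmically with retweets, favorites and verified status.
score = [int(1 + math.log(1 + tw['tweet_retweet_count'] + tw['tweet_favorite_count'] + tw['tweet_user_verified'])) for tw in tweets]

ts = pd.Series(sntms, index=dates)
# Unweighted daily mean (older pandas spelled this ts.resample('D', how='mean')).
cv = ts.resample('D').mean()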
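
The requirement that a tweet with a score of 2 should count like 2 tweets can be checked on a toy example. This is only a sketch: the three sentiment values and scores below are made up for illustration and are not taken from the tweet data.

```python
import numpy as np

# Made-up sentiments and integer scores (weights), for illustration only.
sentiment = np.array([1, 5, 3])
score = np.array([1, 2, 3])

# Weighted mean: sum(score * sentiment) / sum(score).
weighted = (sentiment * score).sum() / score.sum()

# Same thing, computed by literally repeating each tweet `score` times.
repeated = np.repeat(sentiment, score).mean()

print(weighted, repeated)
```

Both numbers agree, so dividing the score-weighted sum by the sum of the scores matches the "counts as that many tweets" interpretation.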

Answers

# Add scores to the sentiment (random integers from 1 to 10, for demonstration only).
df = pd.concat([ts.rename('sentiment'),
                pd.Series(np.random.randint(1, 11, len(ts)),
                          index=ts.index, name='score')], axis=1)

# Weighted daily score (the exact value depends on the random scores).
>>> df.groupby(pd.Grouper(freq='D')).apply(
...     lambda x: (x.score * x.sentiment).sum() / float(x.score.sum())).rename('sentiment')
2016-02-12    2.247312
Freq: D, Name: sentiment, dtype: float64
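
As a possible alternative that avoids a per-group lambda, the weighted daily mean can be written as a ratio of two resampled sums. The sketch below assumes a small hypothetical data set with the same keys as the question's tweets, and uses the question's score formula instead of random weights; the helper name `tweet_score` is made up here.

```python
import datetime
import math

import pandas as pd

# Hypothetical mini data set with the same keys as the question's tweets.
tweets = [
    {'tweet_date': datetime.datetime(2016, 2, 12, 15, 0), 'tweet_sentiment': 1,
     'tweet_retweet_count': 0, 'tweet_favorite_count': 0, 'tweet_user_verified': 0},
    {'tweet_date': datetime.datetime(2016, 2, 12, 17, 0), 'tweet_sentiment': 5,
     'tweet_retweet_count': 10, 'tweet_favorite_count': 9, 'tweet_user_verified': 0},
    {'tweet_date': datetime.datetime(2016, 2, 13, 9, 0), 'tweet_sentiment': 3,
     'tweet_retweet_count': 2, 'tweet_favorite_count': 0, 'tweet_user_verified': 0},
]


def tweet_score(tw):
    """Score formula from the question (helper name is an assumption)."""
    return int(1 + math.log(1 + tw['tweet_retweet_count']
                            + tw['tweet_favorite_count']
                            + tw['tweet_user_verified']))


df = pd.DataFrame(
    {'sentiment': [tw['tweet_sentiment'] for tw in tweets],
     'score': [tweet_score(tw) for tw in tweets]},
    index=pd.DatetimeIndex([tw['tweet_date'] for tw in tweets]),
)

# Daily sum of score * sentiment, divided by the daily sum of the scores.
weighted_sum = (df['sentiment'] * df['score']).resample('D').sum()
weight_total = df['score'].resample('D').sum()
daily_weighted = weighted_sum / weight_total

print(daily_weighted)
```

Days without any tweets would have a weight total of 0 and come out as NaN, which may or may not be what you want.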

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.resample.html

import numpy as np
import pandas as pd
import datetime as dt

def weighted(score, data):
    # Weighted mean: sum of score * data divided by the sum of the scores.
    return np.sum(score * data) / np.sum(score)


data = np.random.random(10)
index = pd.date_range(start=dt.date.today(), periods=10, freq='30min')
df = pd.DataFrame(data, index=index, columns=['col'])
# Older pandas spelled this df.resample('1h', how=lambda a: ...).
print(df.resample('1h').apply(lambda a: weighted(np.ones(len(a)), a)))

You can verify that if you pass weights that are all ones, this gives the ordinary mean.
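
That check can be written out explicitly. The following is a minimal sketch with made-up, deterministic data; `weighted_mean` is a hypothetical helper name, and `np.average` with its `weights` argument is used only as an independent cross-check of the formula.

```python
import numpy as np
import pandas as pd


def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (values * weights).sum() / weights.sum()


# Deterministic sample: ten values at 30-minute spacing.
index = pd.date_range('2016-02-12', periods=10, freq='30min')
values = pd.Series(np.arange(10, dtype=float), index=index)

# Hourly weighted mean with all weights equal to one ...
with_ones = values.resample('1h').apply(
    lambda x: weighted_mean(x, np.ones(len(x))))
# ... should match the plain hourly mean.
plain = values.resample('1h').mean()

print(np.allclose(with_ones, plain))
# Cross-check of the formula against NumPy's own weighted average.
print(np.isclose(weighted_mean([1, 5], [1, 3]),
                 np.average([1, 5], weights=[1, 3])))
```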

Alternatively, you can pass all the rows of each bin (both columns) to the aggregation function:

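def weighted2(row):
    # row is the DataFrame holding all rows that fall into one hourly bin.
    a = row['a'].values
    b = row['b'].values
    # Weighted mean of a with weights b.
    return np.sum(a * b) / np.sum(b)

score = np.ones(10)
data = np.random.random(10)
index = pd.date_range(start=dt.date.today(), periods=10, freq='30min')
df = pd.DataFrame(data, index=index, columns=['a'])
df['b'] = score
# Older pandas spelled this df.resample('1h', how=weighted2)['a'].
print(df.groupby(pd.Grouper(freq='1h')).apply(weighted2).to_frame('a'))
print(df[['a']].resample('1h').mean())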

Both give:

                            a
2016-02-12 00:00:00  0.633469
2016-02-12 01:00:00  0.436514
2016-02-12 02:00:00  0.341746
2016-02-12 03:00:00  0.745674
2016-02-12 04:00:00  0.068618
                            a
2016-02-12 00:00:00  0.633469
2016-02-12 01:00:00  0.436514
2016-02-12 02:00:00  0.341746
2016-02-12 03:00:00  0.745674
2016-02-12 04:00:00  0.068618