2016-11-10 50 views
2

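Find the earliest occurrence

I've run into the following problem: I need to find the first time a user clicks on an email (variable 'sending') and put a 1 in the corresponding row when that happens.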

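The dataset has a few thousand users ('hash') who clicked on parts of emails in a newsletter. I tried to group the rows by 'sending' and 'hash' and then find the earliest date, but couldn't get it to work.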

So I went for a small, nasty solution, which however returns something strange:

My dataset (relevant variables):

>>> clicks[['datetime','hash','sending']].head() 

      datetime        hash sending 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5 
3 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5 
4 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5 

There are 6 sending rounds, and 'datetime' has dtype datetime64[ns].

I did it the following way:

clicks['first'] = 0 

for hash in clicks['hash'].unique(): 
    t = clicks.ix[clicks.hash==hash, ['hash','datetime','sending']] 
    part = t['sending'].unique() 

    for i in part: 
     temp = t.ix[t.sending == i,'datetime'] 
     clicks.ix[t[t.datetime == np.min(temp)].index.values,'first']=1 

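First of all, I don't think this is very Pythonic, and it is rather slow. But mainly it returns a strange type! The column holds 0.0 and 1.0 values, but I can't work with them: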

>>> type(clicks.first) 
    <type 'instancemethod'> 

>>> clicks.loc[clicks.first==1] 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1296, in __getitem__ 
    return self._getitem_axis(key, axis=0) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1467, in _getitem_axis 
    return self._get_label(key, axis=axis) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 93, in _get_label 
    return self.obj._xs(label, axis=axis) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1749, in xs 
    loc = self.index.get_loc(key) 
    File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc 
    return self._engine.get_loc(self._maybe_cast_indexer(key)) 
    File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154) 
    File "pandas/index.pyx", line 156, in pandas.index.IndexEngine.get_loc (pandas/index.c:3977) 
    File "pandas/index.pyx", line 373, in pandas.index.Int64Engine._check_type (pandas/index.c:7634) 
KeyError: False 

So, any ideas, please? Many thanks!

----- UPDATE:------

INSTALLED VERSIONS 
    ------------------ 
    commit: None 
    python: 2.7.12.final.0 
    python-bits: 64 
    OS: Darwin 
    OS-release: 15.6.0 
    machine: x86_64 
    processor: i386 
    byteorder: little 
    LC_ALL: None 
    LANG: en_US.UTF-8 

    pandas: 0.18.1 

Answers

3

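I think you need 'groupby' with 'apply', comparing each value with the group minimum. The output is boolean, so it needs to be converted to int (0 and 1) with 'astype'.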

clicks = pd.DataFrame({'hash': {0: '0b1f4745df5925dfb1c8f53a56c43995', 1: '0a73d5953ebf5826fbb7f3935bad026d', 2: '605cebbabe0ba1b4248b3c54c280b477', 3: '0b1f4745df5925dfb1c8f53a56c43995', 4: '0a73d5953ebf5826fbb7f3935bad026d', 5: '605cebbabe0ba1b4248b3c54c280b477', 6: 'd26d61fb10c834292803b247a05b6cb7', 7: '48f8ab83e8790d80af628e391f3325ad'}, 'sending': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5}, 'datetime': {0: pd.Timestamp('2016-11-01 19:13:34'), 1: pd.Timestamp('2016-11-01 10:47:14'), 2: pd.Timestamp('2016-10-31 19:09:21'), 3: pd.Timestamp('2016-11-01 19:13:34'), 4: pd.Timestamp('2016-11-01 11:47:14'), 5: pd.Timestamp('2016-10-31 19:09:20'), 6: pd.Timestamp('2016-10-31 13:42:36'), 7: pd.Timestamp('2016-10-31 10:46:30')}}) 
print (clicks) 
      datetime        hash sending 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5 
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5 
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d  5 
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477  5 
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5 
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5 
# only needed if the 'datetime' column is not already datetime dtype (not necessary for this sample)
clicks.datetime = pd.to_datetime(clicks.datetime) 
clicks['first'] = clicks.groupby(['hash','sending'])['datetime'] \ 
         .apply(lambda x: x == x.min()) \ 
         .astype(int) 
print (clicks) 
      datetime        hash sending first 
0 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5  1 
1 2016-11-01 10:47:14 0a73d5953ebf5826fbb7f3935bad026d  5  1 
2 2016-10-31 19:09:21 605cebbabe0ba1b4248b3c54c280b477  5  0 
3 2016-11-01 19:13:34 0b1f4745df5925dfb1c8f53a56c43995  5  1 
4 2016-11-01 11:47:14 0a73d5953ebf5826fbb7f3935bad026d  5  0 
5 2016-10-31 19:09:20 605cebbabe0ba1b4248b3c54c280b477  5  1 
6 2016-10-31 13:42:36 d26d61fb10c834292803b247a05b6cb7  5  1 
7 2016-10-31 10:46:30 48f8ab83e8790d80af628e391f3325ad  5  1 
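
A possible variant, sketched here under the assumption that a reasonably recent pandas is installed: `groupby(...).transform('min')` broadcasts each group's minimum back onto the original rows, so the comparison stays aligned with the original index. The frame below is a made-up subset of the sample (the short hashes `u1` to `u3` are placeholders), and the column name `is_first` is chosen here, not taken from the question, because `clicks.first` in the question resolved to a method rather than a column.

```python
import pandas as pd

# Made-up subset of the sample data, for illustration only.
clicks = pd.DataFrame({
    'hash': ['u1', 'u2', 'u3', 'u1', 'u2', 'u3'],
    'sending': [5, 5, 5, 5, 5, 5],
    'datetime': pd.to_datetime([
        '2016-11-01 19:13:34', '2016-11-01 10:47:14', '2016-10-31 19:09:21',
        '2016-11-01 19:13:34', '2016-11-01 11:47:14', '2016-10-31 19:09:20',
    ]),
})

# Earliest click per (hash, sending), repeated on every row of its group.
earliest = clicks.groupby(['hash', 'sending'])['datetime'].transform('min')

# 1 where the row's timestamp equals its group's minimum, else 0.
clicks['is_first'] = (clicks['datetime'] == earliest).astype(int)

print(clicks[clicks['is_first'] == 1])
```

Rows that tie for the minimum are all flagged, as in the output above. If exactly one row per group should be flagged, `groupby(...)['datetime'].idxmin()` returns a single index label per group and could be used instead.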

+0

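Wowza, thanks! I tried a lambda but couldn't get it to work; I didn't know how to pick the minimum from it. So this looks good, but I still can't subset on it and get the same error, although 'clicks.first' is finally integer. Do you know why? –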

+0

Maybe the problem is duplicated minimum values. Does it work fine on the sample but not on the real data? – jezrael

+0

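There cannot be duplicates for any 'hash' and 'sending'. The error for the subset says: 'TypeError: cannot do positional indexing on with these indexers [False]', so it looks like it is no longer a 'DataFrame'. –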
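
One possible explanation for these errors, offered as a guess rather than a confirmed diagnosis: the question's own output shows `type(clicks.first)` as an `instancemethod`, which suggests that attribute access picks up a `DataFrame` method named `first` instead of the column. Bracket access refers to the column itself, so a sketch like the following (with a made-up three-row frame) sidesteps the name clash:

```python
import pandas as pd

# Made-up frame; only the column name 'first' matters here.
clicks = pd.DataFrame({
    'hash': ['u1', 'u1', 'u2'],
    'first': [1, 0, 1],
})

# Bracket access returns the column as a Series, regardless of whether
# the DataFrame class also defines an attribute with the same name.
flag = clicks['first']
print(type(flag))

# Boolean mask built from the column; .loc then selects the matching rows.
first_clicks = clicks.loc[clicks['first'] == 1]
print(first_clicks)
```

Renaming the column (for example to `is_first`, a name picked here for illustration) would be another way to keep attribute access unambiguous.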

0

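Note: I'm not familiar with the pandas module, but I do work with Python regularly (in systems engineering).

Why don't you just use the datetime module? You can easily sort the rows by their timestamps. For example: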

Python 2.7.12 (default, Oct 26 2016, 11:37:25) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import datetime 
>>> fmt = '%Y-%m-%d %H:%M:%S' 
>>> timestamps = ['2016-11-01 19:13:34', '2016-11-01 10:47:14', 
...    '2016-10-31 19:09:21', '2016-10-31 13:42:36', 
...    '2016-10-31 10:46:30'] 
>>> def compare_dates(d1, d2): 
...  d1_dt = datetime.datetime.strptime(d1, fmt) 
...  d2_dt = datetime.datetime.strptime(d2, fmt) 
...  if d1_dt > d2_dt: 
...   return 1 
...  elif d1_dt == d2_dt: 
...   return 0 
...  else: 
...   return -1 
... 
>>> timestamps.sort(cmp=compare_dates) 
>>> timestamps 
['2016-10-31 10:46:30', '2016-10-31 13:42:36', '2016-10-31 19:09:21', '2016-11-01 10:47:14', '2016-11-01 19:13:34'] 
>>> 

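As you can see, sorting dates with the datetime module is easy. It seems trivial to write a comparison function and sort the rows by date to find the earliest occurrence.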
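
The `cmp=` argument to `sort` exists only in Python 2. As a rough sketch of the same idea for Python 3, and assuming the clicks are available as plain `(hash, sending, timestamp)` tuples (a layout invented here for illustration, with placeholder hashes), the earliest click per user and sending can be collected in a dictionary without sorting at all:

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# Hypothetical records: (hash, sending, timestamp string).
records = [
    ('u1', 5, '2016-11-01 19:13:34'),
    ('u2', 5, '2016-11-01 10:47:14'),
    ('u1', 5, '2016-10-31 19:09:21'),
    ('u2', 6, '2016-10-31 13:42:36'),
    ('u2', 5, '2016-10-31 10:46:30'),
]

def earliest_clicks(rows):
    """Map each (hash, sending) pair to its earliest click time."""
    earliest = {}
    for user, sending, stamp in rows:
        when = datetime.strptime(stamp, FMT)
        key = (user, sending)
        # Keep the smaller of the stored and the current timestamp.
        if key not in earliest or when < earliest[key]:
            earliest[key] = when
    return earliest

result = earliest_clicks(records)
for key in sorted(result):
    print(key, result[key])
```

A single pass like this takes time proportional to the number of rows, whereas sorting first would add a logarithmic factor.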
