Pandas: count the unique values in each column by iterating over the columns?

I have a very large data frame, and I want to generate the unique values from each column. This is just a sample; there are 20 columns in total.

    CRASH_DT    CRASH_MO_NO    CRASH_DAY_NO
    1/1/2013    01             01
    1/1/2013    01             01
    1/5/2013    03             05

The output I'm hoping for looks like this:

<variable = "CRASH_DT"> 
    <code>1/1/2013</code> 
    <count>2</count> 
    <code>1/5/2013</code> 
    <count>1</count> 
</variable> 
<variable = "CRASH_MO_NO"> 
    <code>01</code> 
    <count>2</count> 
    <code>03</code> 
    <count>1</count> 
</variable> 
<variable = "CRASH_DAY_NO"> 
    <code>01</code> 
    <count>2</count> 
    <code>05</code> 
    <count>1</count> 
</variable>

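I have been trying to use the .sum() or .unique() functions, as suggested by many questions on this topic that I have already looked at.

None of them seem to apply to this problem: they all say that to generate the unique values from each column, you should either use the groupby function or select individual columns. I have a large number of columns (over 20), so it makes no sense to just write out df.unique['col1', 'col2' ... 'col20'].

I have tried .unique(), .value_counts(), and .count, but I can't figure out how to apply any of them across multiple columns, other than with the groupby function or anything suggested in the links above.

My question is: how can I generate a count of the unique values in each column of a truly massive data frame, preferably by looping over the columns themselves? (I apologize if this is a duplicate. I have read through a lot of questions on this topic, and while it seems they should work for my problem too, I can't figure out how to adapt them so that they work for me.)

Here is my code so far: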

import pyodbc 
import pandas.io.sql 

conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb') 

sql_crash = "SELECT * FROM CRASH" 
df_crash = pandas.io.sql.read_sql(sql_crash, conn) 
df_c_head = df_crash.head() 
df_c_desc = df_c_head.describe() 

for k in df_c_desc: 
    df_c_unique = df_c_desc[k].unique() 
    print(df_c_unique.value_counts())  # Generates the error "numpy.ndarray object has no attribute .value_counts()"

Source

2015-09-15 ale19

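Doesn't `df_crash.apply(pd.Series.value_counts)` work? – EdChum

I think it would, but I don't know where to start. If I understand correctly, pd.Series refers to each column, so I think I need to iterate over each column somehow. Is that right? – ale19

I have posted an answer that illustrates this usage in your case. – Romain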
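For readers unsure where to start with EdChum's suggestion, here is a minimal sketch. The small frame is built from the sample rows in the question, and `df` is only a stand-in for `df_crash`. `DataFrame.apply` already calls the function once per column, so no explicit loop is required. Note also that `Series.unique()` returns a NumPy array, which is why calling `.value_counts()` on its result failed in the question's code; calling `value_counts()` on the column itself avoids that.

```python
import pandas as pd

# Stand-in data taken from the sample rows in the question.
df = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_MO_NO": ["01", "01", "03"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})

# apply() hands each column (a Series) to value_counts in turn and
# aligns the results on the union of all values seen.
counts = df.apply(pd.Series.value_counts)

# The equivalent explicit loop: one value_counts Series per column.
per_column = {col: df[col].value_counts() for col in df.columns}

print(counts)
```

The `apply` version produces one wide table with a missing value wherever a value does not occur in a column, while the dictionary keeps each column's counts separate.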

I would loop over value_counts().items() for each column:

>>> df["CRASH_DAY_NO"].value_counts() 
01 2 
05 1 
dtype: int64 
>>> df["CRASH_DAY_NO"].value_counts().items() 
<zip object at 0x7fabf49f05c8> 
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items(): 
...  print(value, count) 
...  
01 2 
05 1

So something like

def vc_xml(df):
    # Yield the output line by line, one <variable> block per column.
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k, v in df[col].value_counts().items():
            yield "    <code>{}</code>".format(k)
            yield "    <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")

gives me

<variable = "CRASH_DAY_NO"> 
    <code>01</code> 
    <count>2</count> 
    <code>05</code> 
    <count>1</count> 
</variable> 
<variable = "CRASH_DT"> 
    <code>1/1/2013</code> 
    <count>2</count> 
    <code>1/5/2013</code> 
    <count>1</count> 
</variable> 
<variable = "CRASH_MO_NO"> 
    <code>01</code> 
    <count>2</count> 
    <code>03</code> 
    <count>1</count> 
</variable>
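One caveat: a tag written as `<variable = "...">` matches the format requested in the question, but an XML parser will reject it because the attribute has no name. If the file has to be read back by XML tooling, a hedged alternative is to build the tree with the standard library's `xml.etree.ElementTree`. In the sketch below, the root element `variables`, the attribute name `name`, and the helper `vc_to_xml` are assumptions for illustration, not part of the original format.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def vc_to_xml(df):
    """Build an element tree with one <variable> element per column."""
    root = ET.Element("variables")  # hypothetical root element
    for col in df.columns:
        # The attribute name "name" is an assumption; the question left it out.
        var = ET.SubElement(root, "variable", name=str(col))
        for value, count in df[col].value_counts().items():
            ET.SubElement(var, "code").text = str(value)
            ET.SubElement(var, "count").text = str(count)
    return root


df = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})

root = vc_to_xml(df)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree this way also escapes characters such as `&` or `<` that may appear in the data.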

Source

2015-09-15 18:09:23 DSM

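Wow, this is perfect! It works great when I use df.head(). Thank you so much. However, when I run it (with a few tweaks) on the whole data frame, not just the columns listed above, it still gives me those runtime errors, and it seems to hang without generating the XML file. I have posted my code, the errors, and the data I think may be causing the problem: http://pastebin.com/ZwcXEkxQ – ale19

Belay my last comment! Running it from the command line doesn't seem to cause any problems, and it generates exactly what I need. Maybe it is a problem with my Anaconda installation or something else. I really appreciate this; I was completely stuck and it was driving me crazy. – ale19

Here is an answer inspired by the answer to this question. However, I don't know whether it is scalable enough for your case.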

import pandas as pd

df = pd.DataFrame({'CRASH_DAY_NO': [1, 1, 5, 2, 2],
                   'CRASH_DT': ['10/2/2014 5:00:08 PM',
                                '5/28/2014 1:29:28 PM',
                                '5/28/2014 1:29:28 PM',
                                '7/14/2014 5:42:03 PM',
                                '6/3/2014 10:33:22 AM'],
                   'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
                   'SEG_PT_LRS_MEAS': [79.940226960000004,
                                       297.80989999000002,
                                       140.56460290999999,
                                       759.43600000000004,
                                       102.566036],
                   'SER_NO': [1, 3, 4, 5, 6]})

# Count the values of every column; rows are the union of all values.
df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)

# Transforming to XML by hand ...
def func(row):
    # Called with axis=0, so "row" is really one column of counts.
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append('    <code>{0}</code>'.format(field))
            xml.append('    <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))

<variable = "CRASH_DAY_NO"> 
    <code>1</code> 
    <count>2.0</count> 
    <code>2</code> 
    <count>2.0</count> 
    <code>5</code> 
    <count>1.0</count> 
</variable> 
<variable = "CRASH_DT"> 
    <code>5/28/2014 1:29:28 PM</code> 
    <count>2.0</count> 
    <code>7/14/2014 5:42:03 PM</code> 
    <count>1.0</count> 
    <code>10/2/2014 5:00:08 PM</code> 
    <count>1.0</count> 
    <code>6/3/2014 10:33:22 AM</code> 
    <count>1.0</count> 
</variable> 
....
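The counts above print as `2.0` rather than `2` because `apply` aligns every column's counts on the union of all values, filling the gaps with NaN, which forces a float dtype. A minimal sketch of one way to get integers back, assuming it is acceptable to drop the missing entries one column at a time (the helper name `column_counts` is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "CRASH_DAY_NO": [1, 1, 5, 2, 2],
    "SER_NO": [1, 3, 4, 5, 6],
})

counts = df.apply(lambda x: x.value_counts(sort=False))
# Values that do not occur in a column are NaN, so both columns are float.
print(counts.dtypes)


def column_counts(counts, col):
    """Return the non-missing counts of one column as integers."""
    return counts[col].dropna().astype(int)


day_counts = column_counts(counts, "CRASH_DAY_NO")
print(day_counts)
```

Inside `func`, wrapping the value as `int(row[field])` when formatting would have a similar effect.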

Source

2015-09-15 16:01:43 Romain

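When I run this (replacing df with the huge dataset imported from Access), I get various repeated runtime warnings: "RuntimeWarning: Cannot compare type 'Timestamp' with type 'int', sort order is undefined for incomparable objects" and "RuntimeWarning: unorderable types: float() > str(), sort order is undefined for incomparable objects". I think this is because I have many different types of data (datetimes, floats, strings, etc.). Does value_counts still work for non-int data types? – ale19

For what it's worth, I also get a TypeError about label indexing with the indexers [54.263484954833984], reported as occurring at index CRASH_ID. – ale19

I have simplified my answer to remove an unnecessary call to `transpose`; maybe that was causing the problem. Try the new version and let me know. – Romain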
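On the question of non-integer data: `value_counts` does work on strings, floats, and datetimes. A plausible reading of the warnings (an assumption, not confirmed in this thread) is that they come from combining the counts of differently typed columns into one shared index, which then has to compare, say, a Timestamp with an int. Counting one column at a time keeps each index homogeneous. In the sketch below the data is invented and the `CITY` column is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "CRASH_DT": pd.to_datetime([
        "2014-05-28 13:29:28",
        "2014-05-28 13:29:28",
        "2014-06-03 10:33:22",
    ]),
    "SEG_PT_LRS_MEAS": [79.94, 297.81, 79.94],
    "CITY": ["A", "B", "A"],  # hypothetical string column
})

for col in df.columns:
    # Each column gets its own index, so values of different types
    # are never compared with one another.
    vc = df[col].value_counts()
    print(col, vc.index.dtype)
    for value, count in vc.items():
        print("   ", value, count)
```

This is the same per-column loop used in the other answer, which may be why that version did not run into the mixed-type warnings.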
