Avoiding a one-liner would improve readability and make the code less confusing:
mask = (csv_pd.setA==1) & (csv_pd.setB==0) & (csv_pd.setC==0)
csv_pd[mask].groupby('D').count()
Another possibility, which happens to be a one-liner, is to use the query method:
csv_pd.query('setA==1 & setB==0 & setC==0').groupby('D').count()
Also note that you can pass the column name to groupby instead of the Series of values. Thus, groupby('D') rather than groupby(csv_pd.D).
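As a quick check that the boolean mask and the query string select the same rows, here is a minimal sketch on a small made-up frame. Only the column names (setA, setB, setC, D) come from the text above; the values are assumptions for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for csv_pd; the values are invented for this example.
csv_pd = pd.DataFrame({
    'setA': [1, 1, 0, 1, 1],
    'setB': [0, 0, 0, 1, 0],
    'setC': [0, 0, 1, 0, 0],
    'D':    ['x', 'y', 'x', 'x', 'x'],
})

# Route 1: build the boolean mask first, then group by the column name.
mask = (csv_pd.setA == 1) & (csv_pd.setB == 0) & (csv_pd.setC == 0)
by_mask = csv_pd[mask].groupby('D').count()

# Route 2: the same selection expressed as a query string.
by_query = csv_pd.query('setA==1 & setB==0 & setC==0').groupby('D').count()

print(by_mask)
print(by_mask.equals(by_query))
```

Both routes filter first and group second, so the choice between them is mostly about which reads better in context.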
To compute the sizes of all possible subsets, the powerset recipe and itertools.product are helpful:
import itertools as IT
import numpy as np
import pandas as pd

def powerset(iterable, reverse=False, rvals=None):
    """powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"""
    s = list(iterable)
    N = len(s)
    if rvals is None:
        rvals = range(N, -1, -1) if reverse else range(N + 1)
    return IT.chain.from_iterable(
        IT.combinations(s, r) for r in rvals)

df = pd.DataFrame(np.random.randint(2, size=(10,4)), columns=list('ABCD'))
print(df)

for cols in powerset(df.columns):
    # skip the empty subset
    if not cols: continue
    # try every 0/1 assignment for the chosen columns
    for vals in IT.product([0,1], repeat=len(cols)):
        mask = np.logical_and.reduce([df[c]==v for c, v in zip(cols, vals)])
        cond = ' & '.join(['{}={}'.format(c,v) for c, v in zip(cols,vals)])
        n = len(df[mask])
        print('n({}) = {}'.format(cond, n))
yields output such as the following (the exact counts depend on the random data):
n(A=0) = 8
n(A=1) = 2
n(B=0) = 4
n(B=1) = 6
...
n(A=0 & B=0) = 4
n(A=0 & B=1) = 4
n(A=1 & B=0) = 0
...
n(A=1 & B=1 & C=0 & D=0) = 0
n(A=1 & B=1 & C=0 & D=1) = 1
n(A=1 & B=1 & C=1 & D=0) = 0
n(A=1 & B=1 & C=1 & D=1) = 1
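A different way to arrive at the same counts, not the approach used above, is to let groupby(...).size() count each column subset in one pass and then fill in zeros for value combinations that never occur. The sketch below is a variation under stated assumptions: the fixed seed and the `counts` dictionary are introduced only for this illustration.

```python
import itertools as IT
import numpy as np
import pandas as pd

# Fixed seed so the example is repeatable (an assumption, not in the original).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(2, size=(10, 4)), columns=list('ABCD'))

counts = {}  # hypothetical container: (columns, values) -> number of rows
for r in range(1, len(df.columns) + 1):
    for cols in IT.combinations(df.columns, r):
        # size() reports only the value combinations that actually occur
        sizes = df.groupby(list(cols)).size()
        observed = {}
        for key, n in sizes.items():
            # a single grouping column gives scalar keys; normalize to tuples
            key = key if isinstance(key, tuple) else (key,)
            observed[tuple(int(k) for k in key)] = int(n)
        # combinations that never occur get an explicit count of 0
        for vals in IT.product([0, 1], repeat=r):
            counts[(cols, vals)] = observed.get(vals, 0)

for (cols, vals), n in counts.items():
    cond = ' & '.join('{}={}'.format(c, v) for c, v in zip(cols, vals))
    print('n({}) = {}'.format(cond, n))
```

This performs one groupby per column subset instead of one boolean mask per value combination, at the cost of the extra step that restores the zero counts.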