2017-10-17 31 views
1

Process a list of strings: remove duplicates and add up the corresponding values

I have a CSV with about 10K (10,000) rows that look like this:

1: ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'] 
... 
N: ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70'] 

I need to combine the entries with a repeated name into one, for example:

['Andhra Pradesh-133', 'Meetai-5781'] // 5781 = 1358 + 2146 + 2277 

Can anyone suggest a fast way to do this?

Answers

0

Use a list comprehension with groupby:

import pandas as pd
from itertools import groupby 


df = pd.DataFrame({'a':[['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'], 
         ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']]}) 


data = [] 
for x in df['a']: 
    b = [a.split('-') for a in x] 
    L = [t for k, g in groupby(b, key=lambda x: x[0]) 
     for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(L) 

print (data) 

[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']] 

df['b'] = data 
print (df) 

                                                   a  \
0  [Andhra Pradesh-133, Meetai-1358, Meetai-2146,...   
1    [Andhra Pradesh-20, Rajasthan-60, Rajasthan-70]   

                                    b  
0   [Andhra Pradesh-133, Meetai-5781]  
1  [Andhra Pradesh-20, Rajasthan-130]  

Edit (reading the lines directly from the file):

data = [] 
for line in open('file.csv'): 
    #strip new-line characters, split by [ and get second list 
    items = line.strip('\r\n" ]').split('[')[1] 
    #split lines, remove whitespace 
    items = [item.strip("' ") for item in items.split(',')] 
    #split to sublist 
    items = [a.split('-') for a in items] 
    #sum splitted sublists 
    items = [t for k, g in groupby(items, key=lambda x: x[0]) 
       for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(items) 

print (data) 
[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']] 

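Edit: if the lines of the input file contain a doubled opening bracket ([[), you need to split on the first occurrence of [ and then strip [ and ] as well: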

data = [] 
for line in open('file.csv'): 
    #strip new-line characters, split by [ and get second list 
    items = line.strip('\r\n" ]').split('[', 1)[1] 
    #split lines, remove whitespace 
    items = [item.strip("'[] ") for item in items.split(',')] 
    #split to sublist 
    items = [a.split('-') for a in items] 
    print (items) 
    #sum splitted sublists 
    items = [t for k, g in groupby(items, key=lambda x: x[0]) 
       for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(items) 
+0

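I have a small doubt here. If I take x = ['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51'], the groupby function does not seem to work. With b = [a.split('-') for a in x] and for k, g in groupby(b, key=lambda x: x[0]):, the items are not grouped under 'Uttar Pradesh', even though both 'Uttar Pradesh' keys are the same. Can you help us understand what we are missing? –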

+0

I think the problem is the double '[['. I edited the answer. – jezrael

+0

Apologies for the typo. The list I am trying to process is x = ['panjim-20','Uttar Pradesh-23185','Gujurat-1013','Uttar Pradesh-51']. –
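
A likely explanation for the behaviour described in the comments above: itertools.groupby only merges consecutive items that share a key, so the two 'Uttar Pradesh' entries, which are separated by 'Gujurat-1013', land in different groups. Below is a minimal sketch of one workaround, sorting by name before grouping. The helper name sum_by_name is hypothetical, and the sketch assumes the number always follows the last hyphen.

```python
from itertools import groupby


def sum_by_name(items):
    """Sum the numeric suffixes of 'name-number' strings that share a name."""
    # Split on the last hyphen (assumption: the number itself contains no '-').
    pairs = [item.rsplit('-', 1) for item in items]
    # groupby only merges adjacent equal keys, so sort by name first.
    pairs.sort(key=lambda p: p[0])
    return [name + '-' + str(sum(int(num) for _, num in group))
            for name, group in groupby(pairs, key=lambda p: p[0])]


x = ['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51']
result = sum_by_name(x)
print(result)
```

Sorting changes the order of the names in the output; if the original order matters, a dictionary-based accumulation avoids the sort.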

0

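I would create a dictionary for each row. Parse each string into its name and number, either by splitting or with a regular expression. The name, e.g. 'Andhra Pradesh', is the key and the value is an integer. Add each number to the value of the dictionary entry identified by its name.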
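
The per-row dictionary idea can be sketched roughly as follows. The function name combine_row is hypothetical, and the sketch assumes every item has the form 'name-number' with the number after the last hyphen.

```python
def combine_row(items):
    """Merge 'name-number' strings in one row, summing numbers per name."""
    totals = {}
    for item in items:
        # Split on the last hyphen (assumption: names may contain '-', numbers do not).
        name, number = item.rsplit('-', 1)
        totals[name] = totals.get(name, 0) + int(number)
    # Dicts keep insertion order (Python 3.7+), so names stay in first-seen order.
    return [name + '-' + str(total) for name, total in totals.items()]


rows = [
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'],
    ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70'],
]
combined = [combine_row(row) for row in rows]
print(combined)
```

Unlike itertools.groupby, this does not require equal names to be adjacent, and it needs only a single pass over each row.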

0

Not sure this is the fastest way to do it, but this works for me:

data = [ 
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'], 
    ['Andhra Pradesh-20','Rajasthan-60','Rajasthan-70'] 
] 

# running totals per name, accumulated across all rows
values = {}
for row in data:
    for x in row:
        tokens = x.split('-')
        values[tokens[0]] = values.get(tokens[0], 0) + int(tokens[1])

out = [x + '-' + str(y) for x, y in values.items()]

print(out) # prints: ['Andhra Pradesh-153', 'Meetai-5781', 'Rajasthan-130']
0

In pandas, you can do:

In [3475]: L = ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'] 

In [3476]: s = (pd.DataFrame(x.split('-') for x in L) 
        .assign(v=lambda x: x[1].astype(int)) 
        .groupby(0)['v'].sum()) 

In [3478]: (s.index + '-' + s.values.astype(str)).tolist() 
Out[3478]: ['Andhra Pradesh-133', 'Meetai-5781'] 

Details

In [3480]: pd.DataFrame(x.split('-') for x in L) 
Out[3480]: 
                0     1
0  Andhra Pradesh   133
1          Meetai  1358
2          Meetai  2146
3          Meetai  2277

Column 1 holds strings, so we use assign to add a column v that casts it to int:

In [3481]: pd.DataFrame(x.split('-') for x in L).assign(v=lambda x: x[1].astype(int)) 
Out[3481]: 
                0     1     v
0  Andhra Pradesh   133   133
1          Meetai  1358  1358
2          Meetai  2146  2146
3          Meetai  2277  2277

In [3479]: s 
Out[3479]: 
0
Andhra Pradesh     133
Meetai            5781
Name: v, dtype: int32 