2016-07-30 82 views
0

I would like to know:
Find a specific string in a column, and find the maximum value corresponding to that string

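1) How do I work with a column that holds strings?
2) Once I have found a specific string, how do I find the maximum value that corresponds to it?
3) How do I count the number of strings in that column for each row?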

I have a CSV file called sports.csv:

import pandas as pd 
import numpy as np 

#loading the data into data frame 
X = pd.read_csv('sports.csv') 

The two columns of interest are the Total and Gym columns:

Total Gym 
40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet 
37 Baseball|Tennis 
61 Basketball|Baseball|Ballet 
12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football 
78 Swimming|Basketball 
29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming 
31 Tennis 
54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball 
33 Baseball|Hockey|Swimming|Cycling 
17 Football|Hockey|Volleyball 

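Note that each entry in the Gym column lists the sports offered there. I'm trying to find all the gyms that offer Baseball and, among those, the one with the largest Total. However, I'm only interested in gyms that offer at least two other sports as well, i.e. I don't want to consider: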

Total Gym 
    37 Baseball|Tennis 
+0

Is this actually how your file looks? –

+0

Yes, the sports in that column are separated by the "|" delimiter – M3105

+0

I see some spaces after the pipe characters, i.e. '| Swimming | Cycling |'. Are those also in your file, or is that a typo? –

Answers

1

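You can do this easily with pandas.

First, split each string on the "|" delimiter, then keep only the lists whose length is greater than 2, since your criterion is Baseball together with at least two other sports. Rows that fail the length test are left as empty strings.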

In [4]: df['Gym'] = df['Gym'].str.split('|').apply(lambda x: ' '.join([i for i in x if len(x)>2])) 

In [5]: df 
Out[5]: 
    Total            Gym 
0  40 Football Baseball Hockey Running Basketball Sw... 
1  37             
2  61       Basketball Baseball Ballet 
3  12 Swimming Ballet Cycling Basketball Volleyball ... 
4  78             
5  29 Baseball Tennis Ballet Cycling Basketball Foot... 
6  31             
7  54 Tennis Football Ballet Cycling Running Swimmin... 
8  33     Baseball Hockey Swimming Cycling 
9  17       Football Hockey Volleyball 

Use str.contains to search the Gym column for the string Baseball:

In [6]: df = df.loc[df['Gym'].str.contains('Baseball')] 

In [7]: df 
Out[7]: 
    Total            Gym 
0  40 Football Baseball Hockey Running Basketball Sw... 
2  61       Basketball Baseball Ballet 
3  12 Swimming Ballet Cycling Basketball Volleyball ... 
5  29 Baseball Tennis Ballet Cycling Basketball Foot... 
7  54 Tennis Football Ballet Cycling Running Swimmin... 
8  33     Baseball Hockey Swimming Cycling 

Count the number of strings in each row:

In [8]: df['Count'] = df['Gym'].str.split().str.len()

Finally, select the row of the resulting subset that holds the maximum value in the Total column:

In [9]: df.loc[df['Total'].idxmax()] 
Out[9]: 
Total       61 
Gym  Basketball Baseball Ballet 
Count        3 
Name: 2, dtype: object 
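The steps above can also be sketched without overwriting the Gym column, which keeps the original pipe-separated text available. This is only a minimal sketch: the inline sample rows are copied from the question, the comma-separated layout is an assumption made so that the example runs without the real file, and the names `sports`, `has_baseball` and `candidates` are introduced here for illustration.

```python
import io
import pandas as pd

# A few rows from the question; the comma-separated layout is an assumption.
data = io.StringIO(
    "Total,Gym\n"
    "40,Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet\n"
    "37,Baseball|Tennis\n"
    "61,Basketball|Baseball|Ballet\n"
    "78,Swimming|Basketball\n"
    "17,Football|Hockey|Volleyball\n"
)
df = pd.read_csv(data)

sports = df["Gym"].str.split("|")      # one list of sport names per row
df["Count"] = sports.str.len()         # number of sports in each row
# Membership in the list is an exact-name test, unlike a substring search.
has_baseball = sports.apply(lambda names: "Baseball" in names)

# Baseball plus at least two other sports means three or more names.
candidates = df[has_baseball & (df["Count"] > 2)]
best = candidates.loc[candidates["Total"].idxmax()]
print(best["Total"], best["Gym"], best["Count"])
```

Testing membership in the split list, rather than calling str.contains, avoids matching a name that merely contains "Baseball" as a substring.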
0

You can do it in a single pass as you read the file:

import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # Replace the single "A|B|C" field with one list item per sport.
        row[1:] = row[1].split("|")
        if "Baseball" in row and len(row[1:]) > 2 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # Report the winning row, its total, and its number of sports.
        print(best, mx, len(best[1:]))

That will give you:

(['61', 'Basketball', 'Baseball', 'Ballet'], 61, 3) 
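The same single pass can be wrapped in a function that accepts any iterable of lines, which makes it easy to try on a few sample rows. This is a sketch under assumptions: `best_gym` is a hypothetical helper name, the file is assumed to be space-separated as in the answer, and rows with fewer than two fields or a non-numeric first field (such as the header) are simply skipped.

```python
import csv
import io

def best_gym(lines, sport="Baseball", min_sports=3):
    """Return (total, sports) for the row with the largest total that lists
    `sport` among at least `min_sports` sports, or None if no row qualifies."""
    best = None
    for row in csv.reader(lines, delimiter=" ", skipinitialspace=True):
        # Skip blank lines, the header, and rows with fewer than two fields.
        if len(row) < 2 or not row[0].isdigit():
            continue
        total, sports = int(row[0]), row[1].split("|")
        if sport in sports and len(sports) >= min_sports:
            if best is None or total > best[0]:
                best = (total, sports)
    return best

# Sample rows taken from the question.
sample = io.StringIO(
    "Total Gym\n"
    "37 Baseball|Tennis\n"
    "61 Basketball|Baseball|Ballet\n"
    "33 Baseball|Hockey|Swimming|Cycling\n"
)
print(best_gym(sample))
```

Skipping short rows is a design choice of this sketch: a row that does not split into two fields is ignored instead of raising an error on `row[1]`.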

Another way, without splitting, is to count the pipe characters:

import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # More than one pipe means at least three sports in the field.
        if "Baseball" in row[1] and row[1].count("|") > 1 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # Report the winning row, its total, and its number of pipes.
        print(best, mx, best[1].count("|"))

Note, though, that this version can match a substring rather than the exact word.
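A small sketch of the difference between the two tests. The search term "Ball" is hypothetical and chosen only because it occurs inside "Ballet"; the sport names come from the question's data.

```python
gym = "Basketball|Tennis|Ballet"

# Substring test on the raw field: "Ball" is found inside "Ballet".
substring_hit = "Ball" in gym

# Exact test: split into names first, then compare whole names.
exact_hit = "Ball" in gym.split("|")

print(substring_hit, exact_hit)  # True False
```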

+0

First of all, thank you for your reply, I really appreciate it! I tried running both and got the following error: IndexError: list index out of range – M3105

+0

Do you have any empty lines? –

+0

No, I don't have any empty lines – M3105

0

Try this:

df3.loc[(df3['Gym'].str.contains('Hockey') == True) & (df3["Gym"].str.count(r"\|")>1)].sort_values("Total").tail(1)

Total            Gym 
0  40 Football|Baseball|Hockey|Running|Basketball|Sw... 


df3.loc[(df3['Gym'].str.contains('Baseball') == True) & (df3["Gym"].str.count(r"\|")>1)].sort_values("Total").tail(1)

    Total       Gym 
2  61 Basketball|Baseball|Ballet
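The same filter can be sketched as a small reusable function. This is an illustration under assumptions rather than part of the answer: `top_gym` is a hypothetical helper, the sample rows are copied from the question in an assumed comma-separated layout, and the regular expression anchors the sport name between pipes (or the ends of the string) so that only whole names match.

```python
import io
import re
import pandas as pd

# A few rows from the question; the comma-separated layout is an assumption.
df3 = pd.read_csv(io.StringIO(
    "Total,Gym\n"
    "40,Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet\n"
    "37,Baseball|Tennis\n"
    "61,Basketball|Baseball|Ballet\n"
    "17,Football|Hockey|Volleyball\n"
))

def top_gym(df, sport, min_pipes=2):
    """Return the one-row DataFrame with the largest Total among rows that
    list `sport` as a whole name and contain at least `min_pipes` pipes."""
    # Whole-name match: the sport must be bounded by a pipe or a string end.
    pattern = r"(?:^|\|)" + re.escape(sport) + r"(?:\||$)"
    mask = df["Gym"].str.contains(pattern) & (df["Gym"].str.count(r"\|") >= min_pipes)
    return df[mask].nlargest(1, "Total")

print(top_gym(df3, "Baseball"))
print(top_gym(df3, "Hockey"))
```

Here `nlargest(1, "Total")` plays the role of `sort_values("Total").tail(1)`, and the raw string `r"\|"` keeps the backslash from being read as a Python string escape.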