2014-02-27 68 views

Need help using the CSV module in Python

I've been working on my Python skills. This is the raw text file of the data I'm working with: Titanic data

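Each row represents one person on board. The file has several columns, including whether the person survived (third column). I'm trying to count the number of people in each demographic on board (i.e. how many males and how many females) and the number of survivors in each group.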

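I'm trying to do this in three stages: First, add a column for the prefix associated with the person (Mr, Mrs, Miss). Then, define a function, get_avg(), that identifies the column where the information will be found and that column's possible values, and feeds them to the grab_values function. Third, grab_values() counts the number of people and the number of survivors in each group.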

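This is all fine and dandy... but it doesn't work. I keep getting 0 for the counts and totals. I've tried sticking print commands in wherever possible and made some progress, but I still can't figure out what I should do. I have a feeling the functions aren't running over all of the rows (or any of them), but I don't know whether that's the real reason or how to deal with it.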

Can anyone please help?

import csv 

titanic = open('shorttitanic.txt', "rb") 
reader = csv.reader(titanic) 


prefix_list = ["Mr ", "Mrs", "Mis"]  # used to determine if passenger's name includes a prefix


# There are several demographic details we can count passengers and survivors with, this is a dictionary to map them out along with col number. 
details = {"embarked":[5, "Southampton", "Cherbourg", "Queenstown", ""], 
      "sex":[10, "male","female"], "pclass":[1,"1st","2nd","3rd"], 
      "prefix":[12,"Mr ", "Mrs", "Mis"]}  # first item is col number (starts at 0), other items are the possible values 



# Adding another column for prefix: 
rownum = 0 
for row in reader: 
    # Finding the header: 
    if rownum == 0: 
     header = row 
     header.append("Prefix") 
#  print header 
    else: 
     prefix_location = row[3].find(",") + 2    # finds the position of the comma, the prefix starts after the comma and after a space (+2) 
     prefix = row[3][prefix_location:prefix_location+3] # grabs the 3 first characters of the prefix 
#  print len(prefix), prefix 
     if prefix in prefix_list:       # if there's a prefix in the passenger's name, it's appended to the row
      if prefix == "Mis": 
       row.append("Miss")       # Mis is corrected to Miss on appending, since we must work with 3 chars 
      else: 
       row.append(prefix) 
     else: 
      row.append("Other/Unknown")      # for cases where there's no prefix in the passenger's name


#  print len(row), rownum, row[3], prefix, row[11] 
# print row 

    rownum += 1 


# grab_values() will run on all rows and count the number of passengers in each demographic and the number of survivors 
def grab_values(col_num,i): 
    print col_num, "item name", i 
    count = 0 
    tot = 0 
    for row in reader: 
#  print type(row[col_num][0] 
     print row[col_num] 
     if row[col_num] == i: 
      count += 1 
      if row[2] == int(1): 
       tot += 1 
#  print count, tot 
    return count, tot 



# get_avg() finds the column number and possible values of demographic x. 

def get_avg(x):    # x is the category (sex, embarked...) 
    col_num = details[x][0] 
    for i in details[x][1:]: 
     print col_num, i 
#  print type(i) 


     grab_values(col_num,i) 

     count,tot = grab_values(col_num,i) 
     print count,tot 

#  print i, count, tot 



get_avg("sex") 



titanic.close() 

EDIT: changed the prefix values in the dictionary to: "prefix":[12,"Mr ", "Mrs", "Mis"]}, which got a lot of it working.

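EDIT 2: Here is the finished code, in case anyone is interested. I took warunsl's advice about the nature of the problem, but his solution did not work, at least not with the modifications I made, so I can't mark it as the correct solution, in case others find this thread and try to learn from it. Many thanks to the helpers!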

import csv 

titanic = open('titanic.txt', "rb") 
reader = csv.reader(titanic) 


prefix_list = ["Mr ", "Mrs", "Mis"]  # used to determine if passenger's name includes a prefix. Using 3 chars because of Mr.


# There are several demographic details we can count passengers and survivors with, this is a dictionary to map them out along with col number. 
details = {"embarked":[5, "Southampton", "Cherbourg", "Queenstown", ""], 
      "sex":[10, "male","female"], "pclass":[1,"1st","2nd","3rd"], 
      "prefix":[11,"Mr ", "Mrs", "Miss", "Unknown"]}  # first item is col number (starts at 0), other items are the possible values 

# try to see how the prefix values can be created by using 11 and a reference to prefix_list


# Here we'll do 2 things: 
# I - Add another column for prefix, and - 
# II - Create processed_list with each of the rows in reader, since we can only run over reader once, 
# and since I don't know much about handling CSVs or generator yet we'll run on the processed_list instead 

processed_list = [] 
rownum = 0 
for row in reader: 
    # Finding the header: 
    if rownum == 0: 
     header = row 
     header.append("Prefix") 
    else: 
     prefix_location = row[3].find(",") + 2    # finds the position of the comma, the prefix starts after the comma and after a space (+2) 
     prefix = row[3][prefix_location:prefix_location+3] # grabs the 3 first characters of the prefix 

     if prefix in prefix_list:       # if there's a prefix in the passenger's name, it's appended to the row
      if prefix == "Mis": 
       row.append("Miss")       # Mis is corrected to Miss on appending, since we must work with 3 chars 
      else: 
       row.append(prefix) 
     else: 
      row.append("Unknown")       # for cases where there's no prefix in the passenger's name

    processed_list.append(row) 

    rownum += 1 

# grab_values() will run on all rows and count the number of passengers in each demographic and the number of survivors 
def grab_values(col_num,i): 
# print col_num, "item name", i 
    num_on_board = 0 
    num_survived = 0 
    for row in processed_list: 
     if row[col_num] == i: 
      num_on_board += 1 
      if row[2] == "1": 
       num_survived += 1 
    return num_on_board, num_survived 



# get_avg() finds the column number and possible values of demographic x. 

def get_avg(x):    # x is the category (sex, embarked...)
    col_num = details[x][0]
    for i in details[x][1:]:
        print "Looking for: ", i, "at col num: ", col_num

        # grab_values() only needs to be called once per value
        num_on_board, num_survived = grab_values(col_num, i)

        try:
            # format the percentage here, so the fallback string below is never multiplied by 100
            proportion_survived = "%.2f%%" % (float(num_survived) / num_on_board * 100)
        except ZeroDivisionError:
            proportion_survived = "Cannot be calculated"

        print "Number of %s passengers on board: " % i, num_on_board, "\n" \
              "Number of %s passengers survived: " % i, num_survived, "\n" \
              "Proportion of %s passengers survived: " % i, proportion_survived, "\n"



print ("Hello! I can calculate the proportion of passengers that survived "
       "according to these parameters: \n Embarked \n Sex \n Pclass \n Prefix \n")

def get_choice():
    possible_choices = ["embarked","sex","pclass","prefix"]
    choice = raw_input("Please enter your choice: ").lower()
    if choice not in possible_choices:
        print "Sorry, I can only work with Embarked/Sex/Pclass/Prefix. Please try again."
        return get_choice()    # return the retry's result, otherwise the invalid choice is returned
    return choice

user_choice = get_choice() 

get_avg(user_choice) 

titanic.close() 

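You exhaust the whole 'reader' object before you ever run either function, so the loop inside 'grab_values' does nothing. You also seem to expect the changes to 'row' in your first loop to persist, but you are really only changing a local variable in the loop and then throwing it away. You probably want to append each row to a new list. – geoffspear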

What do you use the prefix for? Are you counting the number of each prefix, or the number of males and females? – stmfunk

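@stmfunk I thought a nice demographic would be to look at the survival proportion by prefix. Basically it's just good practice - adding a variable created from an existing one with some logic. – Optimesh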

Answers

1

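If you read the documentation for csv.reader, you can see that the call returns a reader object that implements the iterator protocol. This means that the csv.reader function returns a one-pass iterator (it behaves like a generator), not a list as you expected.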

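The elements of a generator can only be consumed once. To reuse it, you have to re-initialize the reader object. This answer has a thorough explanation of how generators work in Python.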
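
The one-pass behaviour is easy to see in a small sketch. The sample rows here are made up for illustration (they are not taken from the Titanic file), and the sketch uses Python 3 syntax with an in-memory io.StringIO standing in for a real file:

```python
import csv
import io

# Made-up sample data standing in for the real file
sample = io.StringIO("name,sex\nAllen,female\nBraund,male\n")
reader = csv.reader(sample)

first_pass = [row for row in reader]   # consumes every row
second_pass = [row for row in reader]  # nothing is left to read

print(len(first_pass))   # 3 (the header plus two rows)
print(len(second_pass))  # 0
```

This is the same situation as in the question: the first loop uses up the reader, so a later loop over the same reader sees no rows.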

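So you can either add all the rows to another list during the first read and use that new list afterwards, or re-initialize the generator before using it again. The second option is the better approach, especially when you are reading a large file.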

In your grab_values, do this before 'for row in reader:':

titanic = open('titanic.txt', "rb") 
reader = csv.reader(titanic) 

and your code will work.
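
As a variation on re-opening the file, rewinding the already open file with seek(0) and building a fresh reader has the same effect. The sketch below uses Python 3 syntax; the sample data and the count_matching helper are hypothetical and only there to illustrate the idea:

```python
import csv
import io

# Made-up sample data; with a real file this would be an open file object
titanic = io.StringIO(
    "name,sex,survived\nAllen,female,1\nBraund,male,0\nMoran,male,0\n"
)

def count_matching(source, col_num, value):
    """Count rows whose column col_num equals value (hypothetical helper)."""
    source.seek(0)               # rewind so the new reader starts at the top
    reader = csv.reader(source)  # fresh reader for this pass
    next(reader)                 # skip the header row
    return sum(1 for row in reader if row[col_num] == value)

males = count_matching(titanic, 1, "male")
females = count_matching(titanic, 1, "female")
print(males, females)
```

Each pass re-reads the rows from the file, so anything appended to a row during an earlier pass is not there the next time around.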

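EDIT: Since you are modifying each row the first time you read the csv file, you have to add the modified rows to a new list and use the new list in your grab_values method.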

# Adding another column for prefix: 
processed_list = [] # Declare a new array 
rownum = 0 
for row in reader: 
    if rownum == 0: 
     header = row 
     header.append("Prefix") 
    else: 
     prefix_location = row[3].find(",") + 2 
     prefix = row[3][prefix_location:prefix_location+3] 
     if prefix in prefix_list: 
      if prefix == "Mis": 
       processed_list.append("Miss") #Change this 
      else: 
       processed_list.append(prefix) #Change this 
     else: 
      processed_list.append("Other/Unknown") #Change this 

In your grab_values, change 'for row in reader' to 'for row in processed_list'.
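
For the counting to work on the new list, each entry has to be a whole row with the prefix appended, not just the prefix string, because grab_values indexes other columns such as row[2]. A minimal sketch of that, assuming the name is in column 3 and the survival flag in column 2 as in the question; the sample rows and the add_prefix helper are made up for illustration, and the sketch uses Python 3 syntax:

```python
import csv
import io

# Made-up rows; column 2 is "survived" and column 3 is the name
sample = io.StringIO(
    'id,pclass,survived,name\n'
    '1,1st,1,"Allen, Miss Elisabeth"\n'
    '2,3rd,0,"Braund, Mr Owen"\n'
    '3,3rd,1,"Smith, Mrs Jane"\n'
    '4,2nd,0,"Jones, Dr Alan"\n'
)
prefix_list = ["Mr ", "Mrs", "Mis"]

def add_prefix(row):
    """Return the row with a prefix value appended (hypothetical helper)."""
    start = row[3].find(",") + 2       # prefix starts after ", "
    prefix = row[3][start:start + 3]   # first 3 characters of the prefix
    if prefix not in prefix_list:
        prefix = "Unknown"
    elif prefix == "Mis":
        prefix = "Miss"
    return row + [prefix]

reader = csv.reader(sample)
header = next(reader) + ["Prefix"]
processed_list = [add_prefix(row) for row in reader]  # keep the whole rows

def grab_values(col_num, value):
    """Count passengers and survivors whose column col_num equals value."""
    matching = [row for row in processed_list if row[col_num] == value]
    survived = sum(1 for row in matching if row[2] == "1")
    return len(matching), survived

print(grab_values(4, "Miss"))
print(grab_values(4, "Mr "))
```

Because processed_list is an ordinary list, grab_values can be called as many times as needed without re-reading the file.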

But he is trying to change values in the first pass, so re-opening the file won't help. – geoffspear

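You are right, my answer was more about revisiting a generator object. I overlooked the fact that he is pre-processing the rows. Please see the updated answer. – shaktimaan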

Thanks. It doesn't work for prefix. How do I call the function on the data with the additional column? – Optimesh