2017-04-19 72 views

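I have two kinds of files, Excel and CSV, that I read data from. Each has two permanent columns, Question and Answer, and two optional columns, Word and Replacement, which may or may not be present. How can I read the data from the Excel or CSV file depending on which columns are available?

I have written separate functions for reading CSV and Excel files; the appropriate one is called based on the file's extension.

Is there a way to read the optional columns (Word and Replacement) when they exist and skip them when they don't? Please refer to the function definitions below: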

1) Function to read CSV data:

import string
import pandas

def read_csv_file(path):
    # Column names are supplied explicitly here.
    colnames = ['Question', 'Answer']
    data = pandas.read_csv(path, names=colnames)
    quesData = data.Question.tolist()
    ansData = data.Answer.tolist()

    # Strip punctuation from each question.
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in quesData]

    # Drop any non-ASCII characters from each question.
    asciiIgnoreQues = []
    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii', 'ignore'))

    return asciiIgnoreQues, ansData, quesData

2) Function to read Excel data:

import string
from xlrd import open_workbook

def read_excel_file(path):
    book = open_workbook(path)
    sheet = book.sheet_by_index(0)
    quesData = []
    ansData = []

    # Start at row 1 to skip the first row of the sheet.
    for row in range(1, sheet.nrows):
        quesData.append(sheet.cell(row, 0).value)
        ansData.append(sheet.cell(row, 1).value)

    # Strip punctuation from each question.
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in quesData]

    # Drop any non-ASCII characters from each question.
    asciiIgnoreQues = []
    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii', 'ignore'))

    return asciiIgnoreQues, ansData, quesData

Have you considered 'pandas.read_csv' and 'pandas.read_excel'? They automatically read whichever columns are present. – tmrlvi

@tmrlvi, I did use pandas.read_csv in my CSV-reading function, but the column headers have to be supplied in colnames. What if I don't have the Word and Replacement columns? –

You don't have to supply them. If you don't, 'pandas' infers the names from the header row. Or does your data not contain headers? – tmrlvi

Answers

I'm not entirely sure what you are trying to achieve, but the data can be read and transformed like this:

import string
import pandas as pd

def read_file(path, typ):
    if typ == "excel":
        df = pd.read_excel(path, sheet_name=0)  # 0 (the first sheet) is the default
    else:  # Assuming "csv". You can make it explicit
        df = pd.read_csv(path)

    # Strip punctuation from each question, then drop non-ASCII characters.
    qWithoutPunctuation = df["Question"].apply(lambda s: ''.join(c for c in s if c not in string.punctuation))
    df["asciiIgnoreQues"] = qWithoutPunctuation.apply(lambda x: x.encode('ascii', 'ignore'))

    return df

# Call it like this:
read_file("file1.csv", "csv")
read_file("file2.xls", "excel")
read_file("file2.xlsx", "excel")

If the data does not include Word and Replacement, this returns a DataFrame with the columns ["Question", "Answer", "asciiIgnoreQues"]. If it does include them, the columns are ["Question", "Word", "Replacement", "Answer", "asciiIgnoreQues"].
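To see the header inference in isolation, here is a minimal sketch. It assumes the files have a header row, and the sample rows are made up for illustration (they are not from the question's data). It uses in-memory text via io.StringIO instead of real files:

```python
import io
import pandas as pd

# Hypothetical sample data: one file with the optional columns, one without.
csv_with_optional = io.StringIO(
    "Question,Word,Replacement,Answer\n"
    "What is AI?,AI,artificial intelligence,A field of computer science\n"
)
csv_without_optional = io.StringIO(
    "Question,Answer\n"
    "What is AI?,A field of computer science\n"
)

# No `names=` argument: pandas takes the column names from the header row.
df_full = pd.read_csv(csv_with_optional)
df_min = pd.read_csv(csv_without_optional)

cols_full = list(df_full.columns)
cols_min = list(df_min.columns)
print(cols_full)  # ['Question', 'Word', 'Replacement', 'Answer']
print(cols_min)   # ['Question', 'Answer']
```

The same call handles both layouts, so no per-layout code path is needed at read time.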

Note that I used apply, which lets you run a function element-wise over an entire Series.
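If the rest of your code needs to know whether the optional columns were present, one option is to check df.columns after loading. The following is only a sketch: the helper names split_optional and word_replacements are hypothetical, and the sample data is invented for illustration.

```python
import io
import pandas as pd

OPTIONAL_COLUMNS = ["Word", "Replacement"]  # column names taken from the question

def split_optional(df, optional=OPTIONAL_COLUMNS):
    """Return (present, missing) lists of optional column names for df."""
    present = [c for c in optional if c in df.columns]
    missing = [c for c in optional if c not in df.columns]
    return present, missing

def word_replacements(df):
    """Return (Word, Replacement) pairs, or [] if either column is absent."""
    _, missing = split_optional(df)
    if missing:
        return []
    pairs = df[["Word", "Replacement"]].dropna()  # skip rows with empty cells
    return list(zip(pairs["Word"], pairs["Replacement"]))

# Hypothetical sample data, read from memory instead of from disk.
df_full = pd.read_csv(io.StringIO(
    "Question,Word,Replacement,Answer\n"
    "What is AI?,AI,artificial intelligence,A field of computer science\n"
))
df_min = pd.read_csv(io.StringIO(
    "Question,Answer\n"
    "What is AI?,A field of computer science\n"
))

pairs_full = word_replacements(df_full)
pairs_min = word_replacements(df_min)
```

Returning an empty list when the columns are missing lets the calling code loop over the pairs without a separate branch for each file layout.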