2014-03-05 21 views
4

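How to parse an ARFF file without using external libraries

I need to parse an ARFF file like the one below without using any external libraries. I am not sure how to associate the attributes with the values. For example, how can I say that the first value in each row is the age and the second is the sex? Could you also point me to some Python code that parses a similar format?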

@relation cleveland-14-heart-disease 
@attribute 'age' real 
@attribute 'sex' { female, male} 
@attribute 'cp' { typ_angina, asympt, non_anginal, atyp_angina} 
@attribute 'trestbps' real 
@attribute 'chol' real 
@attribute 'fbs' { t, f} 
@attribute 'restecg' { left_vent_hyper, normal, st_t_wave_abnormality} 
@attribute 'thalach' real 
@attribute 'exang' { no, yes} 
@attribute 'oldpeak' real 
@attribute 'slope' { up, flat, down} 
@attribute 'ca' real 
@attribute 'thal' { fixed_defect, normal, reversable_defect} 
@attribute 'class' { negative, positive} 
@data 
63,male,typ_angina,145,233,t,left_vent_hyper,150,no,2.3,down,0,fixed_defect,negative 
37,male,non_anginal,130,250,f,normal,187,no,3.5,down,0,normal,negative 
41,female,atyp_angina,130,204,f,left_vent_hyper,172,no,1.4,up,0,normal,negative 
56,male,atyp_angina,120,236,f,normal,178,no,0.8,up,0,normal,negative 
57,female,asympt,120,354,f,normal,163,yes,0.6,up,0,normal,negative 
57,male,asympt,140,192,f,normal,148,no,0.4,flat,0,fixed_defect,negative 
56,female,atyp_angina,140,294,f,left_vent_hyper,153,no,1.3,flat,0,normal,negative 
44,male,atyp_angina,120,263,f,normal,173,no,0,up,0,reversable_defect,negative 
52,male,non_anginal,172,199,t,normal,162,no,0.5,up,0,reversable_defect,negative 

Here is the sample code I wrote:

arr = []
with open("heart_train.arff") as arff_file:
    for line in arff_file:
        line = line.strip()
        # skip blank lines, @-directives (the header) and %-comments
        if not line or line.startswith(("@", "%")):
            continue
        arr.append(line.split(","))

print(arr[1:30])

But the output is very different from what I expect:

[['37', 'male', 'non_anginal', '130', '250', 'f', 'normal', '187', 'no', '3.5', 'down', '0', 'normal', 'negative'], ['41', 'female', 'atyp_angina', '130', '204', 'f', 'left_vent_hyper', '172', 'no', '1.4', 'up', '0', 'normal', 'negative'], ['56', 'male', 'atyp_angina', '120', '236', 'f', 'normal', '178', 'no', '0.8', 'up', '0', 'normal', 'negative'], ['57', 'female', 'asympt', '120', '354', 'f', 'normal', '163', 'yes', '0.6', 'up', '0', 'normal', 'negative'], ['57', 'male', 'asympt', '140', '192', 'f', 'normal', '148', 'no', '0.4', 'flat', '0', 'fixed_defect', 'negative'], ['56', 'female', 'atyp_angina', '140', '294', 'f', 'left_vent_hyper', '153', 'no', '1.3', 'flat', '0', 'normal', 'negative'], ['44', 'male', 'atyp_angina', '120', '263', 'f', 'normal', '173', 'no', '0', 'up', '0', 'reversable_defect', 'negative'], ['52', 'male', 'non_anginal', '172', '199', 't', 'normal', '162', 'no', '0.5', 'up', '0', 'reversable_defect', 'negative'], ['57', 'male', 'non_anginal', '150', '168', 'f', 'normal', '174', 'no', '1.6', 'up', '0', 'normal', 'negative'], ['54', 'male', 'asympt', '140', '239', 'f', 'normal', '160', 'no', '1.2', 'up', '0', 'normal', 'negative'], ['48', 'female', 'non_anginal', '130', '275', 'f', 'normal', '139', 'no', '0.2', 'up', '0', 'normal', 'negative'], ['49', 'male', 'atyp_angina', '130', '266', 'f', 'normal', '171', 'no', '0.6', 'up', '0', 'normal', 'negative'], ['64', 'male', 'typ_angina', '110', '211', 'f', 'left_vent_hyper', '144', 'yes', '1.8', 'flat', '0', 'normal', 'negative'], ['58', 'female', 'typ_angina', '150', '283', 't', 'left_vent_hyper', '162', 'no', '1', 'up', '0', 'normal', 'negative'], ['50', 'female', 'non_anginal', '120', '219', 'f', 'normal', '158', 'no', '1.6', 'flat', '0', 'normal', 'negative'], ['58', 'female', 'non_anginal', '120', '340', 'f', 'normal', '172', 'no', '0', 'up', '0', 'normal', 'negative'], ['66', 'female', 'typ_angina', '150', '226', 'f', 'normal', '114', 'no', '2.6', 'down', '0', 'normal', 'negative'], ['43', 
'male', 'asympt', '150', '247', 'f', 'normal', '171', 'no', '1.5', 'up', '0', 'normal', 'negative'], ['69', 'female', 'typ_angina', '140', '239', 'f', 'normal', '151', 'no', '1.8', 'up', '2', 'normal', 'negative'], ['59', 'male', 'asympt', '135', '234', 'f', 'normal', '161', 'no', '0.5', 'flat', '0', 'reversable_defect', 'negative'], ['44', 'male', 'non_anginal', '130', '233', 'f', 'normal', '179', 'yes', '0.4', 'up', '0', 'normal', 'negative'], ['42', 'male', 'asympt', '140', '226', 'f', 'normal', '178', 'no', '0', 'up', '0', 'normal', 'negative'], ['61', 'male', 'non_anginal', '150', '243', 't', 'normal', '137', 'yes', '1', 'flat', '0', 'normal', 'negative'], ['40', 'male', 'typ_angina', '140', '199', 'f', 'normal', '178', 'yes', '1.4', 'up', '0', 'reversable_defect', 'negative'], ['71', 'female', 'atyp_angina', '160', '302', 'f', 'normal', '162', 'no', '0.4', 'up', '2', 'normal', 'negative'], ['59', 'male', 'non_anginal', '150', '212', 't', 'normal', '157', 'no', '1.6', 'up', '0', 'normal', 'negative'], ['51', 'male', 'non_anginal', '110', '175', 'f', 'normal', '123', 'no', '0.6', 'up', '0', 'normal', 'negative'], ['65', 'female', 'non_anginal', '140', '417', 't', 'left_vent_hyper', '157', 'no', '0.8', 'up', '1', 'normal', 'negative'], ['53', 'male', 'non_anginal', '130', '197', 't', 'left_vent_hyper', '152', 'no', '1.2', 'down', '0', 'normal', 'negative']] 

Do you know how I can produce output like the following, which was created by the arff library (from Weka)? [image]

+0

This looks like very straightforward parsing. What have you tried? Stack Overflow works better when you post some code. – kindall

+1

Please share what you have tried so far. We will be able to help you better that way. – Gogo

+0

@kindall Updated. –

Answers

3

You say "no external libraries", but can you at least cut and paste into your own code? You may find the source code to the arff module useful (200 lines, about 5.6 KB).

Edit:

You will find this format reference valuable: http://weka.wikispaces.com/ARFF+%28stable+version%29
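To make the name-to-value association concrete, here is a minimal sketch using only the standard library. It is not the arff module's approach: `parse_arff` and `SAMPLE` are names made up for this illustration, and it assumes simple files like the one in the question (attribute names without spaces, no quoted commas in the data).

```python
# Minimal sketch: pair ARFF attribute names with row values (standard library only).
# Assumptions: attribute names contain no spaces, data values contain no quoted commas.

SAMPLE = """\
@relation cleveland-14-heart-disease
@attribute 'age' real
@attribute 'sex' { female, male}
@attribute 'class' { negative, positive}
@data
63,male,negative
41,female,negative
"""

def parse_arff(lines):
    """Return (attribute_names, rows); each row is a dict keyed by attribute name."""
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue                      # skip blank lines and comments
        lowered = line.lower()
        if not in_data:
            if lowered.startswith("@attribute"):
                # the second whitespace-separated token is the attribute name
                names.append(line.split(None, 2)[1].strip("'\""))
            elif lowered.startswith("@data"):
                in_data = True            # everything after this is data
            continue
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(names, values)))
    return names, rows

names, rows = parse_arff(SAMPLE.splitlines())
print(names)
print(rows[0]["age"], rows[0]["sex"])
```

The values stay strings here; converting them according to the declared type (real, nominal, and so on) is the part that needs a little more machinery.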

EDIT2:

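Just for fun, I wrote my own .arff parser; it is almost as long as the WEKA code but should be more readable: only six functions, a dispatch table, and a very modular class. You can iterate over a class instance to get each data row as a namedtuple.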

See what you think:

from collections import namedtuple
from keyword import iskeyword
import re

def NotDone(msg):
    raise NotImplementedError(msg)

def nominal(spec):
    """
    Create an ARFF nominal (enumerated) data type
    """
    spec = spec.lstrip("{ \t").rstrip("} \t")
    good_values = set(val.strip() for val in spec.split(","))

    def fn(s):
        s = s.strip()
        if s in good_values:
            return s
        else:
            raise ValueError("'{}' is not a recognized value".format(s))

    # patch docstring
    fn.__name__ = "nominal"
    fn.__doc__ = """
    ARFF nominal (enumerated) data type

    Legal values are {}
    """.format(sorted(good_values))
    return fn

def numeric(s):
    """
    Convert string to int or float
    """
    try:
        return int(s)
    except ValueError:
        return float(s)

# dispatch table: ARFF type name -> function(spec) returning a converter
field_maker = {
    "date":       (lambda spec: NotDone("date data type not implemented")),
    "integer":    (lambda spec: int),
    "nominal":    (lambda spec: nominal(spec)),
    "numeric":    (lambda spec: numeric),
    "string":     (lambda spec: str),
    "real":       (lambda spec: float),
    "relational": (lambda spec: NotDone("relational data type not implemented")),
}

def file_lines(fname):
    # lazy file reader; ensures file is closed when done,
    # returns lines without trailing spaces or newline
    with open(fname) as inf:
        for line in inf:
            yield line.rstrip()

def no_data_yet(*items):
    raise ValueError("ArffRow not fully defined (haven't seen a @data directive yet)")

def make_field_name(s):
    """
    Mangle string to make it a valid Python identifier
    """
    s = s.lower()                               # force to lowercase
    s = "_".join(re.findall("[a-z0-9]+", s))    # strip all invalid chars; join what's left with "_"
    if iskeyword(s) or re.match("[0-9]", s):    # if the result is a keyword or starts with a digit
        s = "f_" + s                            # make it a safe field name
    return s

class ArffReader:
    line_types = ["blank", "comment", "relation", "attribute", "data"]

    def __init__(self, fname):
        # get input file
        self.fname = fname
        self.lines = file_lines(fname)

        # prepare to read file header
        self.relation = '(not specified)'
        self.data_names = []
        self.data_types = []
        self.dtype = no_data_yet

        # read file header
        line_tests = [
            (getattr(self, "line_is_{}".format(item)), getattr(self, "line_do_{}".format(item)))
            for item in self.__class__.line_types
        ]
        for line in self.lines:
            done = False    # reset per line so an unrecognized line cannot leave it unbound
            for is_, do in line_tests:
                if is_(line):
                    done = do(line)
                    break
            if done:
                break

        # use header fields to build data type (and make it print as requested)
        class ArffRow(namedtuple('ArffRow', self.data_names)):
            __slots__ = ()
            def __str__(self):
                items = (getattr(self, field) for field in self._fields)
                return "({})".format(", ".join(repr(it) for it in items))
        self.dtype = ArffRow

    #
    # figure out input-line type
    #

    def line_is_blank(self, line):
        return not line

    def line_is_comment(self, line):
        return line.lower().startswith('%')

    def line_is_relation(self, line):
        return line.lower().startswith('@relation')

    def line_is_attribute(self, line):
        return line.lower().startswith('@attribute')

    def line_is_data(self, line):
        return line.lower().startswith('@data')

    #
    # handle input-line type
    #

    def line_do_blank(self, line):
        pass

    def line_do_comment(self, line):
        pass

    def line_do_relation(self, line):
        self.relation = line[10:].strip()

    def line_do_attribute(self, line):
        m = re.match(
            r"^@attribute"              # line starts with '@attribute'
            r"\s+"                      #
            r"("                        # name is one of:
                r"(?:'[^']+')"          #   ' string in single-quotes '
                r'|(?:"[^"]+")'         #   " string in double-quotes "
                r"|(?:[^ \t'\"]+)"      #   single_word_string (no spaces)
            r")"                        #
            r"\s+"                      #
            r"("                        # type is one of:
                r"(?:\{[^}]+\})"        #   { set, of, nominal, values }
                r"|(?:\w+)"             #   datatype
            r")"                        #
            r"\s*"                      #
            r"("                        # spec string
                r".*"                   #   anything to end of line
            r")$",                      #
            line, flags=re.I)           # case-insensitive
        if m:
            name, type_, spec = m.groups()
            self.data_names.append(make_field_name(name))
            if type_[0] == '{':
                type_, spec = 'nominal', type_
            # type names are case-insensitive in the file, lowercase in the table
            self.data_types.append(field_maker[type_.lower()](spec))
        else:
            raise ValueError("failed parsing attribute line '{}'".format(line))

    def line_do_data(self, line):
        return True     # flag end of header

    #
    # make the class iterable
    #

    def __iter__(self):
        return self

    def __next__(self):
        """
        Return one data row at a time
        """
        line = next(self.lines)
        # skip blank lines and comments inside the data section
        while self.line_is_blank(line) or self.line_is_comment(line):
            line = next(self.lines)
        data = line.split(',')
        return self.dtype(*(fn(dat) for fn, dat in zip(self.data_types, data)))

    next = __next__     # Python 2 spelling of the same method

It can be used like this:

for row in ArffReader('mydata.arff'): 
    print(row) 

which results in

(63.0, 'male', 'typ_angina', 145.0, 233.0, 't', 'left_vent_hyper', 150.0, 'no', 2.3, 'down', 0.0, 'fixed_defect', 'negative') 
(37.0, 'male', 'non_anginal', 130.0, 250.0, 'f', 'normal', 187.0, 'no', 3.5, 'down', 0.0, 'normal', 'negative') 
(41.0, 'female', 'atyp_angina', 130.0, 204.0, 'f', 'left_vent_hyper', 172.0, 'no', 1.4, 'up', 0.0, 'normal', 'negative') 
(56.0, 'male', 'atyp_angina', 120.0, 236.0, 'f', 'normal', 178.0, 'no', 0.8, 'up', 0.0, 'normal', 'negative') 
(57.0, 'female', 'asympt', 120.0, 354.0, 'f', 'normal', 163.0, 'yes', 0.6, 'up', 0.0, 'normal', 'negative') 
(57.0, 'male', 'asympt', 140.0, 192.0, 'f', 'normal', 148.0, 'no', 0.4, 'flat', 0.0, 'fixed_defect', 'negative') 
(56.0, 'female', 'atyp_angina', 140.0, 294.0, 'f', 'left_vent_hyper', 153.0, 'no', 1.3, 'flat', 0.0, 'normal', 'negative') 
(44.0, 'male', 'atyp_angina', 120.0, 263.0, 'f', 'normal', 173.0, 'no', 0.0, 'up', 0.0, 'reversable_defect', 'negative') 
(52.0, 'male', 'non_anginal', 172.0, 199.0, 't', 'normal', 162.0, 'no', 0.5, 'up', 0.0, 'reversable_defect', 'negative') 

The fields are also addressable by name, i.e.

for patient in ArffReader('mydata.arff'): 
    print("{} year old {}".format(patient.age, patient.sex)) 

which gives

63.0 year old male 
37.0 year old male 
41.0 year old female 
56.0 year old male 
57.0 year old female 
57.0 year old male 
56.0 year old female 
44.0 year old male 
52.0 year old male 
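Because each row is a namedtuple, the usual namedtuple helpers should apply as well. The sketch below uses a hypothetical three-field `Patient` type as a stand-in for the generated row type, so that it runs without a data file; the sample values are taken from the rows in the question.

```python
from collections import namedtuple

# Hypothetical three-field stand-in for the row type built by the reader.
Patient = namedtuple("Patient", ["age", "sex", "chol"])

rows = [
    Patient(63.0, "male", 233.0),
    Patient(41.0, "female", 204.0),
    Patient(57.0, "female", 354.0),
]

# Filtering by a named field reads the same as the loop above.
females = [p for p in rows if p.sex == "female"]

# _fields and _asdict() are part of the documented namedtuple API.
print(Patient._fields)
print(dict(rows[0]._asdict()))

# Column-wise statistics follow directly from attribute access.
mean_chol = sum(p.chol for p in rows) / len(rows)
print(len(females), mean_chol)
```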

and you can see the field names with

>>> print(repr(patient)) 
ArffRow(age=63.0, sex='male', cp='typ_angina', trestbps=145.0, chol=233.0, fbs='t', restecg='left_vent_hyper', thalach=150.0, exang='no', oldpeak=2.3, slope='down', ca=0.0, thal='fixed_defect', f_class='negative') 

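The field names follow the ARFF header, forced to lowercase (and, in the case of 'class', prefixed with 'f_', because class is a Python keyword and therefore cannot be used as a field name).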
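The renaming rule can be tried in isolation. This is a standalone sketch of the same idea; `safe_field_name` and the sample names are made up for this illustration and are not part of any library.

```python
from collections import namedtuple
from keyword import iskeyword
import re

def safe_field_name(name):
    """Lowercase, keep only [a-z0-9] runs joined by '_', and prefix unsafe results."""
    s = "_".join(re.findall("[a-z0-9]+", name.lower()))
    if iskeyword(s) or re.match("[0-9]", s):
        s = "f_" + s    # keywords and leading digits are not valid field names
    return s

# Illustrative attribute names, including a keyword and a leading digit.
raw_names = ["'age'", "'class'", "Resting BP", "2nd-opinion"]
fields = [safe_field_name(n) for n in raw_names]
print(fields)

# The mangled names are accepted by namedtuple.
Row = namedtuple("Row", fields)
print(Row(63.0, "negative", 145.0, "no"))
```

namedtuple also accepts rename=True, which replaces invalid names with positional ones such as _1; explicit mangling keeps the names readable instead.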

+0

I know I could do that, but I wanted to refresh my knowledge, and I think this is a good way to do it. Thanks for your suggestion as well. –
