2013-04-12 61 views

How do I parse a *.cat file?

I want to parse this file. Here is a sample snippet.

'13138' => { 'REFERENCE' => '13138', 'NAME' => 'DRAPER Five 125mm Medium Grade Aluminium Oxide Sanding Discs', 'PRICE' => 108, 'MIN_QUANTITY_ORDERABLE' => 1, 'MAX_QUANTITY_ORDERABLE' => 0, 'OUT_OF_STOCK' => 0, 'DATE_PROMPT' => '', 'OTHER_INFO_PROMPT' => '', 'PRICING_MODEL' => 0, 'TAX_1' => '101=2000.00=0=', 'OPAQUE_SHIPPING_DATA' => '0.054', 'ALT_WEIGHT' => '', 'SHIP_SEPARATELY' => 0, 'SHIP_CATEGORY' => '', 'SHIP_SUPPLEMENT' => 0, 'SHIP_SUPPLEMENT_ONCE' => 0, 'HAND_SUPPLEMENT' => 0, 'HAND_SUPPLEMENT_ONCE' => 0, 'SHIP_QUANTITY' => 1, 'COST_PRICE' => 0, 'EXCLUDE_FROM_SHIP' => 0, 'ASSEMBLY_PRODUCT' => 0, 'STOCK_AISLE' => '', 'STOCK_RACK' => '', 'STOCK_SUB_RACK' => '', 'STOCK_BIN' => '', 'BARCODE' => '', 'REPORT_DESC' => '', 'PRICES' => { 
    1 => [ 
     [0,108], 
    ], 
}, 
    'CUSTOMVARS' => 
     { 
     }, 
'NO_ORDERLINE' => 0, 'AUTOSHIP' => 0, 'PRODUCT_GROUP' => -1, 'THUMBNAIL' => '', 'IMAGE' => '13138_694.jpg', 'ALSOBOUGHT' => [], 'RELATED' => [], }, 
    '13139' => { 'REFERENCE' => '13139', 'NAME' => 'DRAPER Five 125mm Coarse Grade Aluminium Oxide Sanding Discs', 'PRICE' => 96, 'MIN_QUANTITY_ORDERABLE' => 1, 'MAX_QUANTITY_ORDERABLE' => 0, 'OUT_OF_STOCK' => 0, 'DATE_PROMPT' => '', 'OTHER_INFO_PROMPT' => '', 'PRICING_MODEL' => 0, 'TAX_1' => '101=2000.00=0=', 'OPAQUE_SHIPPING_DATA' => '0.066', 'ALT_WEIGHT' => '', 'SHIP_SEPARATELY' => 0, 'SHIP_CATEGORY' => '', 'SHIP_SUPPLEMENT' => 0, 'SHIP_SUPPLEMENT_ONCE' => 0, 'HAND_SUPPLEMENT' => 0, 'HAND_SUPPLEMENT_ONCE' => 0, 'SHIP_QUANTITY' => 1, 'COST_PRICE' => 0, 'EXCLUDE_FROM_SHIP' => 0, 'ASSEMBLY_PRODUCT' => 0, 'STOCK_AISLE' => '', 'STOCK_RACK' => '', 'STOCK_SUB_RACK' => '', 'STOCK_BIN' => '', 'BARCODE' => '', 'REPORT_DESC' => '', 'PRICES' => { 
    1 => [ 
     [0,96], 
    ], 
}, 
    'CUSTOMVARS' => 
     { 
     }, 
'NO_ORDERLINE' => 0, 'AUTOSHIP' => 0, 'PRODUCT_GROUP' => -1, 'THUMBNAIL' => '', 'IMAGE' => '13139_694.jpg', 'ALSOBOUGHT' => [], 'RELATED' => [], }, 
    '13140' => { 'REFERENCE' => '13140', 'NAME' => 'DRAPER Five Extra Coarse Grade Aluminium Oxide Sanding Discs', 'PRICE' => 96, 'MIN_QUANTITY_ORDERABLE' => 1, 'MAX_QUANTITY_ORDERABLE' => 0, 'OUT_OF_STOCK' => 0, 'DATE_PROMPT' => '', 'OTHER_INFO_PROMPT' => '', 'PRICING_MODEL' => 0, 'TAX_1' => '101=2000.00=0=', 'OPAQUE_SHIPPING_DATA' => '0.055', 'ALT_WEIGHT' => '', 'SHIP_SEPARATELY' => 0, 'SHIP_CATEGORY' => '', 'SHIP_SUPPLEMENT' => 0, 'SHIP_SUPPLEMENT_ONCE' => 0, 'HAND_SUPPLEMENT' => 0, 'HAND_SUPPLEMENT_ONCE' => 0, 'SHIP_QUANTITY' => 1, 'COST_PRICE' => 0, 'EXCLUDE_FROM_SHIP' => 0, 'ASSEMBLY_PRODUCT' => 0, 'STOCK_AISLE' => '', 'STOCK_RACK' => '', 'STOCK_SUB_RACK' => '', 'STOCK_BIN' => '', 'BARCODE' => '', 'REPORT_DESC' => '', 'PRICES' => { 
    1 => [ 
     [0,96], 
    ], 
}, 
    'CUSTOMVARS' => 
     { 
     }, 
'NO_ORDERLINE' => 0, 'AUTOSHIP' => 0, 'PRODUCT_GROUP' => -1, 'THUMBNAIL' => '', 'IMAGE' => '13140_694ii.jpg', 'ALSOBOUGHT' => [], 'RELATED' => [], }, 

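It contains 3 items. Each one starts with a string such as '13138' => { 'REFERENCE' and ends just before the next string of the same kind. How can I split these parts? I tried re.search(r"{ 'REFERENCE'.*?(?={ 'REFERENCE')", catstr), but it does not match.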
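One likely reason the attempted pattern finds nothing is that each item spans several lines, and "." does not match a newline unless re.DOTALL is set. Even with that flag, re.search returns only the first item, and the last item has no following { 'REFERENCE' for the lookahead to hit. The sketch below (Python 3) shows both effects and then splits by locating every item start instead. The shortened catstr sample and the names item_start, starts, ends and items are assumptions made for illustration; the braces are escaped here, which does not change what the pattern matches.

```python
import re

# Hypothetical two-item sample, trimmed to the same shape as the snippet above.
catstr = """'13138' => { 'REFERENCE' => '13138', 'PRICES' => {
    1 => [
     [0,108],
    ],
},
'RELATED' => [], },
    '13139' => { 'REFERENCE' => '13139', 'PRICES' => {
    1 => [
     [0,96],
    ],
},
'RELATED' => [], },
"""

pattern = r"\{ 'REFERENCE'.*?(?=\{ 'REFERENCE')"

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.search(pattern, catstr)
# With re.DOTALL the pattern matches, but only the body of the first item.
with_dotall = re.search(pattern, catstr, re.DOTALL)

# Splitting instead: find where every item starts, then slice between starts.
item_start = re.compile(r"'[^']*'\s*=>\s*\{\s*'REFERENCE'")
starts = [m.start() for m in item_start.finditer(catstr)]
ends = starts[1:] + [len(catstr)]
items = [catstr[begin:end].strip() for begin, end in zip(starts, ends)]

print(without_dotall is None, with_dotall is not None, len(items))
```

Slicing between start offsets avoids the lookahead entirely, so the final item is kept as well.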

Are these Ruby hashes?

@limelights Not sure. I found it on a website created with "Actinic". They declare '' in the HTML, and 'A000253.cat' is the file I want to parse. [Sample](http://pastie.org/7461356).

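Also, you may want to fix the '[this][1]' marker so that it points to an actual file. If you are unsure about the markup, you can use the "link" button at the top of the editor.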

Answers

3

Why not just replace => with a colon:

'CUSTOMVARS' : 
     { 
     }, 
'NO_ORDERLINE' : 0, 'AUTOSHIP' : 0, 'PRODUCT_GROUP' : -1, ... 

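Then evaluate the result with ast.literal_eval. It evaluates only literals, not executable code, so no sanitizing is needed (except perhaps guarding against excessively large input):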

ast.literal_eval(node_or_string) 

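Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.

This can be used for safely evaluating strings containing Python expressions from untrusted sources without the need to parse the values oneself.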
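As a minimal sketch of the idea (Python 3), the fragment below is a shortened, hypothetical version of the question's data, wrapped in an outer pair of braces so that it forms a single dict literal. That wrapping and the names snippet and catalog are assumptions for illustration. The plain str.replace used here would also alter any => that happens to sit inside a string value.

```python
import ast

# Hypothetical, shortened fragment wrapped in outer braces to form one dict.
snippet = """{
'13138' => { 'REFERENCE' => '13138', 'PRICE' => 108, 'PRICES' => {
    1 => [
     [0,108],
    ],
},
'PRODUCT_GROUP' => -1, 'RELATED' => [], },
}"""

# Turn the hash arrows into dict colons, then parse the literal safely.
catalog = ast.literal_eval(snippet.replace("=>", ":"))

print(catalog['13138']['PRICE'])
print(catalog['13138']['PRICES'][1])
```

Trailing commas, nested containers, and negative numbers such as -1 are all valid Python literals, so the converted text needs no further cleanup in this case.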

Edit: working example

#!/usr/bin/env python2 
# -*- encoding: utf8 -*- 

import urllib2 
import ast 
import re 
from pprint import PrettyPrinter 

pp = PrettyPrinter() 
resp = urllib2.urlopen("http://pastie.org/pastes/7461356/download") 
content = resp.read() 
content = re.search(r"\s+=\s+({(?:.|\n)+});", content).group(1) 
# Fix following line to handle => inside strings, if needed 
content = re.sub(r"=>", r":", content) 
parsed = ast.literal_eval(content) 
pp.pprint(parsed) 

For information on replacing => only outside strings, see this answer:

Edit

The given file contains other tokens besides the hash itself. The re.search regular expression above strips those extra tokens:

\s+=\s+  # This marks the = before the start of the hash 
({   # Capture the first { 
    (?:.|\n)+ # This matches all characters. 
      # The (?: is to prevent capture-inside-capture 
})   # Capture the last } 
;   # This is not captured 
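The same breakdown can be written as a commented pattern with re.VERBOSE. The sketch below (Python 3) does that on a made-up input; the $catalog variable name and the one-item hash are assumptions, since the real file's wrapper is not shown here. It also swaps (?:.|\n)+ for .+ with re.DOTALL, which matches the same text.

```python
import re

# Hypothetical file content: an assignment wrapping the hash, ending in ";".
content = "$catalog = {\n'13138' => { 'PRICE' => 108, },\n};\n"

HASH_BODY = re.compile(r"""
    \s+=\s+      # the = sign before the start of the hash
    (\{          # start capturing at the first opening brace
        .+       # everything in between (DOTALL lets . match newlines)
    \})          # stop capturing at the last closing brace
    ;            # the trailing semicolon is matched but not captured
""", re.VERBOSE | re.DOTALL)

match = HASH_BODY.search(content)
body = match.group(1)
print(body)
```

Because .+ is greedy, the capture runs to the last closing brace that is followed by a semicolon, which is what keeps the nested braces inside the group.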

I wouldn't even mention 'eval' in a case like this ;) – mata

@mata Thanks for the reminder; 'literal_eval' is clearly the better choice here. I removed the mention of 'eval'.

The only thing to worry about is a '=>' appearing inside one of the name strings; unless you are very careful, this approach will wrongly change it.
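One way to guard against a => inside a string is to let a single regular expression match either a whole quoted string or a bare arrow, and rewrite only the arrows. This is a sketch under the assumption that every string in the file is single-quoted and uses backslash escapes; the names TOKEN, arrows_to_colons and sample are made up for illustration, and double-quoted strings are not handled.

```python
import ast
import re

# Match either a complete single-quoted string (honoring backslash escapes)
# or a bare => arrow. Strings are consumed whole, so arrows inside them
# are never seen as separate tokens.
TOKEN = re.compile(r"'(?:\\.|[^'\\])*'|=>")

def arrows_to_colons(text):
    """Replace => with : everywhere except inside single-quoted strings."""
    def swap(match):
        token = match.group(0)
        return ":" if token == "=>" else token
    return TOKEN.sub(swap, text)

sample = "{ 'NAME' => 'A => B', 'PRICE' => 108, }"
converted = arrows_to_colons(sample)
parsed = ast.literal_eval(converted)

print(converted)
print(parsed)
```

The result of arrows_to_colons can be passed to ast.literal_eval in place of the output of a plain substitution.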