2014-10-03 17 views
1

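Parsing values from a block of text based on specific keys

I'm parsing some text from a source outside my control, and it isn't in a very convenient format. I have lines like this: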

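Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.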

I want to split the line by key, like this:

Problem_Category = "Human Endeavors" 
Problem_Subcategory = "Space Exploration" 
Problem_Type = "Failure to Launch" 
Software_Version = "9.8.77.omni.3" 
Problem_Details = "Issue with signal barrier chamber." 

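The keys will always be in the same order and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what I can use as a delimiter to parse this, since colons and spaces can also appear in the values. How can I parse this text?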

+1

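The best solution, if possible, is to "track down the developer whose code creates this block of text and ask them to output it as something more parseable, such as JSON". Then you wouldn't need any regex/splitting trickery at all! – Kevin 2014-10-03 14:29:36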

+0

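I am that developer! I'm reading the data from a huge Excel file. That Excel file comes from a database, which in turn comes from another database. Should I post the code I've written? I thought it would be a distraction from what I'm trying to accomplish, but maybe (obviously?) I'm wrong. – JesseBikman 2014-10-03 14:32:11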

+0

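Also, I can't access this data programmatically straight from the database to fix this at the source. Please let me know if I can clear up any further ambiguity here. – JesseBikman 2014-10-03 14:39:15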

Answers

4

If your block of text is a string like this:

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.' 

then

import re 
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details'] 

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.' 

pat = r'({}):'.format('|'.join(names)) 
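# the capture group keeps the keys, so the split yields ['', key, value, key, value, ...]; 
# drop the leading item and pair the rest up two at a time as (key, value) 
data = dict(zip(*[iter(re.split(pat, text)[1:])]*2)) 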
print(data) 

which produces the dictionary

{'Problem Category': ' Human Endeavors ', 
'Problem Details': ' Issue with signal barrier chamber.', 
'Problem Subcategory': ' Space Exploration', 
'Problem Type': ' Failure to Launch', 
'Software Version': ' 9.8.77.omni.3'} 
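
The values above still carry the whitespace that surrounded them in the input. As a minimal sketch of one way to tidy that up (my own variation, not part of the answer as posted), the keys can be escaped with `re.escape`, and the split result can be sliced into keys and values and stripped. The name `clean` is just an illustrative choice.

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')

# Escape each key so any regex metacharacters in it are matched literally.
# Longer keys go first so a key that is a prefix of another cannot win the alternation.
alternation = '|'.join(re.escape(n) for n in sorted(names, key=len, reverse=True))
pat = '({}):'.format(alternation)

# With one capture group, re.split returns [prefix, key1, value1, key2, value2, ...].
parts = re.split(pat, text)
keys, values = parts[1::2], parts[2::2]

# Strip the whitespace left around each value.
clean = {k: v.strip() for k, v in zip(keys, values)}
print(clean['Problem Category'])
```

This still assumes that none of the key strings followed by a colon ever appears inside a value.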

So you could assign

text = df_dict['NOTE_DETAILS'][0] 
... 
df_dict['NOTE_DETAILS'][0] = data 

and then you can access the subcategories by indexing into the dictionary:

df_dict['NOTE_DETAILS'][0]['Problem Category'] 

Be careful, though. Deeply nested dicts (a dict of lists of DataFrames, say) are usually bad design. As the Zen of Python says, flat is better than nested.
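
To illustrate what "flat" could look like here, the following is a rough sketch that merges the parsed fields into the same record as top-level entries, using the underscore names from the question. The helper `flatten_keys` and the `NOTE_ID` field are hypothetical, made up for the example; they are not from the asker's data.

```python
# Parsed fields for one note, as produced by any of the approaches in this thread.
parsed = {
    'Problem Category': 'Human Endeavors',
    'Problem Subcategory': 'Space Exploration',
    'Problem Type': 'Failure to Launch',
    'Software Version': '9.8.77.omni.3',
    'Problem Details': 'Issue with signal barrier chamber.',
}

def flatten_keys(record):
    """Return a copy of record whose keys use underscores instead of spaces."""
    return {key.replace(' ', '_'): value for key, value in record.items()}

# Hypothetical existing row; NOTE_ID is an invented column name.
row = {'NOTE_ID': 1}

# Merge the parsed fields in as siblings instead of nesting a dict inside the row.
row.update(flatten_keys(parsed))
print(row['Problem_Category'])
```

Each field then sits one lookup away, for example `row['Problem_Type']`, instead of behind several levels of indexing.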

+0

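I see (and appreciate!) what you're suggesting, but I can't use the colon alone as the delimiter. For example, I need to use "Problem Category:" as the first delimiter. I'll update my question to be more specific. – JesseBikman 2014-10-03 13:46:19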

+1

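If the category names themselves don't contain colons, then you can use 'line.split(':', 1)' to split on the *first* colon. I've updated my answer with an example (note the 'ProductID' line in particular). – unutbu 2014-10-03 14:02:11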

+0

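The category names do contain the colon; the colon is part of the category name. Your edit to my question was misleading, because the block of text does not contain newlines (I rolled it back). – JesseBikman 2014-10-03 14:06:57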

3

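Given that you know the keys in advance, partition the text into "everything before the current key" and "the remaining text", then keep partitioning the remaining text on the next key.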

# get input from somewhere 
raw = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.' 

# these are the keys, in order, without the colon, that will be captured 
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details'] 
prev_key = None 
remaining = raw 
out = {} 

for key in keys: 
    # get the value from before the key and after the key 
    prev_value, _, remaining = remaining.partition(key + ':') 

    # start storing values after the first iteration, since we need to partition the second key to get the first value 
    if prev_key is not None: 
        out[prev_key] = prev_value.strip() 

    # what key to store next iteration 
    prev_key = key 

# capture the final value (since it lags behind the parse loop) 
out[prev_key] = remaining.strip() 

# out now contains the parsed values, print it out nicely 
for key in keys: 
    print('{}: {}'.format(key, out[key])) 

This prints:

Problem Category: Human Endeavors 
Problem Subcategory: Space Exploration 
Problem Type: Failure to Launch 
Software Version: 9.8.77.omni.3 
Problem Details: Issue with signal barrier chamber. 
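
One thing to watch for: `str.partition` returns an empty separator when the key is not found, so a missing or out-of-order key would silently produce wrong values. Below is a sketch of the same idea wrapped in a function that raises instead. The name `parse_by_keys` and the choice of `ValueError` are my own assumptions, not part of the original answer.

```python
def parse_by_keys(raw, keys):
    """Split raw into {key: value}, given the keys in the order they appear."""
    out = {}
    remaining = raw
    prev_key = None
    for key in keys:
        before, sep, after = remaining.partition(key + ':')
        if not sep:
            # partition found no match: the key is missing or out of order
            raise ValueError('key not found in order: {!r}'.format(key))
        # the text before this key is the value of the previous key
        if prev_key is not None:
            out[prev_key] = before.strip()
        prev_key = key
        remaining = after
    # whatever is left belongs to the last key
    if prev_key is not None:
        out[prev_key] = remaining.strip()
    return out


raw = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
       'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
       'Problem Details: Issue with signal barrier chamber.')
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type',
        'Software Version', 'Problem Details']
result = parse_by_keys(raw, keys)
print(result['Problem Type'])
```

Any text that appears before the first key is discarded, which matches the behaviour of the loop above.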
3

I hate and fear regular expressions, so here's a solution that uses only built-in string methods.

#splits a string using multiple delimiters. 
def multi_split(s, delims): 
    strings = [s] 
    for delim in delims: 
        strings = [x for s in strings for x in s.split(delim) if x] 
    return strings 

s = "Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber." 
categories = ["Problem Category", "Problem Subcategory", "Problem Type", "Software Version", "Problem Details"] 
headers = [category + ": " for category in categories] 

details = multi_split(s, headers) 
print(details) 

details_dict = dict(zip(categories, details)) 
print(details_dict) 

Result (line breaks added by me for readability):

[ 
    'Human Endeavors ', 
    'Space Exploration', 
    'Failure to Launch', 
    '9.8.77.omni.3', 
    'Issue with signal barrier chamber.' 
] 

{ 
    'Problem Subcategory': 'Space Exploration', 
    'Problem Details': 'Issue with signal barrier chamber.', 
    'Problem Category': 'Human Endeavors ', 
    'Software Version': '9.8.77.omni.3', 
    'Problem Type': 'Failure to Launch' 
} 
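
Two caveats with the approach above: the headers are built as `category + ": "`, so a key followed directly by its value with no space would not be split, and because `multi_split` drops empty strings, an empty value would shift every later value onto the wrong key. A possible regex-free alternative, sketched here under the assumption that no `Key:` marker ever occurs inside a value, is to locate each marker with `str.index` and slice between them. The function name `slice_by_keys` is hypothetical.

```python
def slice_by_keys(s, categories):
    """Map each category to the text between its 'Category:' marker and the next one."""
    # Locate each marker, searching forward from the end of the previous one.
    spans = []
    pos = 0
    for category in categories:
        marker = category + ':'
        start = s.index(marker, pos)  # raises ValueError if a key is missing
        pos = start + len(marker)
        spans.append((category, start, pos))

    # Each value runs from the end of its marker to the start of the next marker.
    result = {}
    for i, (category, _, value_start) in enumerate(spans):
        value_end = spans[i + 1][1] if i + 1 < len(spans) else len(s)
        result[category] = s[value_start:value_end].strip()
    return result


s = ("Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
     "Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3"
     "Problem Details: Issue with signal barrier chamber.")
categories = ["Problem Category", "Problem Subcategory", "Problem Type",
              "Software Version", "Problem Details"]
details_by_index = slice_by_keys(s, categories)
print(details_by_index["Software Version"])
```

Because values are sliced by position, an empty value stays attached to its own key as an empty string.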
2

This is a job for general BNF parsing, which handles this kind of ambiguity well. I used Perl and Marpa, a general BNF parser. Hope this helps.

use 5.010; 
use strict; 
use warnings; 

use Marpa::R2; 

my $g = Marpa::R2::Scanless::G->new({ source => \(<<'END_OF_SOURCE'), 

    :default ::= action => [ name, values ] 

    pairs ::= pair+ 

    pair ::= name (' ') value 

    name ::= 'Problem Category:' 
    name ::= 'Problem Subcategory:' 
    name ::= 'Problem Type:' 
    name ::= 'Software Version:' 
    name ::= 'Problem Details:' 

    value ::= [\s\S]+ 

    :discard ~ whitespace 
    whitespace ~ [\s]+ 

END_OF_SOURCE 
}); 

my $input = <<EOI; 
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber. 
EOI 

my $ast = ${ $g->parse(\$input) }; 

my @pairs; 

ast_traverse($ast); 

for my $pair (@pairs){ 
    my ($name, $value) = @$pair; 
    say "$name = $value"; 
} 

sub ast_traverse{ 
    my $ast = shift; 
    if (ref $ast){ 
        my ($id, @children) = @$ast; 
        if ($id eq 'pair'){ 

            my ($name, $value) = @children; 

            chop $name->[1]; 

            shift @$value; 
            $value = join('', @$value); 
            chomp $value; 

            push @pairs, [ $name->[1], '"' . $value . '"' ]; 
        } 
        else { 
            ast_traverse($_) for @children; 
        } 
    } 
} 

This prints:

Problem Category = "Human Endeavors " 
Problem Subcategory = "Space Exploration" 
Problem Type = "Failure to Launch" 
Software Version = "9.8.77.omni.3" 
Problem Details = "Issue with signal barrier chamber."