Parsing values from a block of text based on specific keys

I'm parsing some text from a source outside my control, and it isn't in a very convenient format. I have lines like this:

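Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.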

I want to split the line by key, like this:

Problem_Category = "Human Endeavors" 
Problem_Subcategory = "Space Exploration" 
Problem_Type = "Failure to Launch" 
Software_Version = "9.8.77.omni.3" 
Problem_Details = "Issue with signal barrier chamber."

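The keys will always be in the same order and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what to use as a delimiter to parse this, since colons and spaces can also appear in the values. How can I parse this text?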


2014-10-03 JesseBikman

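The best solution, if possible, is "track down the developer whose code creates this block of text, and ask them to output it as something more parseable, such as JSON". Then you wouldn't need any regex/splitting trickery at all! – Kevin 2014-10-03 14:29:36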

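I am that developer! I'm reading the data from a massive Excel file. That Excel file comes from a database, which in turn comes from another database. Should I post the code I've written? I thought it would be a distraction from what I'm trying to accomplish, but perhaps (obviously?) I'm wrong. – JesseBikman 2014-10-03 14:32:11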

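Also, I can't programmatically access this data directly from the database to get around the problem. Please let me know if there is any further ambiguity I can clear up. – JesseBikman 2014-10-03 14:39:15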

If your block of text is a string like this:

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

then

import re 
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details'] 

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.' 

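# Match any known key followed by a colon; the capturing group makes re.split keep the keys.
pat = r'({}):'.format('|'.join(names))
# re.split returns ['', key1, value1, key2, value2, ...]; drop the leading piece and pair up the rest.
# (No flags are needed here; a third positional argument would be read as maxsplit, not flags.)
data = dict(zip(*[iter(re.split(pat, text)[1:])]*2))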
print(data)

yields the dictionary

{'Problem Category': ' Human Endeavors ', 
'Problem Details': ' Issue with signal barrier chamber.', 
'Problem Subcategory': ' Space Exploration', 
'Problem Type': ' Failure to Launch', 
'Software Version': ' 9.8.77.omni.3'}
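
Note that the values above still carry their surrounding whitespace. As a minimal sketch of the same regex-split idea, assuming the key names from the example and that no key occurs inside a value, the version below escapes the key names with re.escape and strips each value. The slicing into parts[1::2] and parts[2::2] is just one possible way to pair keys with values.

```python
import re

# Key names taken from the example above; adjust for your own data.
names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']

text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')

# re.escape guards against key names that contain regex metacharacters.
pat = r'({}):'.format('|'.join(re.escape(name) for name in names))

# With a capturing group, re.split returns [prefix, key1, value1, key2, value2, ...].
parts = re.split(pat, text)

# Keys sit at the odd indices and values at the even indices after the prefix.
data = {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}
print(data)
```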

So you could assign

text = df_dict['NOTE_DETAILS'][0] 
... 
df_dict['NOTE_DETAILS'][0] = data

and then you can access the individual values with dict indexing:

df_dict['NOTE_DETAILS'][0]['Problem Category']

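Be careful, though. Deeply nested dicts, or dicts of lists of DataFrames, are usually a sign of bad design. As the Zen of Python says, flat is better than nested.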
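
One way to keep things flat is to give each parsed key its own top-level column rather than storing a dict inside a cell. The sketch below is only an illustration: it assumes df_dict is a plain dict mapping column names to lists of cells (the question does not show its actual shape), and parse_note is a hypothetical helper name.

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
pat = r'({}):'.format('|'.join(re.escape(name) for name in names))


def parse_note(text):
    """Split one note string into a {key: value} dict with stripped values."""
    parts = re.split(pat, text)
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}


# Hypothetical stand-in for the asker's data: column name -> list of cells.
df_dict = {'NOTE_DETAILS': [
    'Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
    'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
    'Problem Details: Issue with signal barrier chamber.',
]}

parsed = [parse_note(note) for note in df_dict['NOTE_DETAILS']]
for name in names:
    # One flat column per key; None marks a key that is absent from a note.
    df_dict[name] = [row.get(name) for row in parsed]

print(df_dict['Problem Category'][0])
```

Each value is then reachable with a single lookup such as df_dict['Software Version'][0], with no nested dict in between.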


2014-10-03 13:43:18 unutbu

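I see (and appreciate!) what you're suggesting, but I can't use the colon as the delimiter. For example, I need to use "Problem Category:" as the first delimiter. I'll update my question to be more specific. – JesseBikman 2014-10-03 13:46:19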

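If the category names themselves do not contain a colon, then you could use line.split(':', 1) to split on the *first* colon. I've updated my answer with an example (note the 'ProductID' line in particular). – unutbu 2014-10-03 14:02:11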

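The category names do contain the colon; the colon is part of the category name. Your edit to my question was misleading, since the block of text does not contain newlines (I rolled it back). – JesseBikman 2014-10-03 14:06:57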

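Given that you know the keys ahead of time, partition the text at the current key to get the remaining text, then keep partitioning that remaining text with the next key.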

# get input from somewhere 
raw = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.' 

# these are the keys, in order, without the colon, that will be captured 
keys = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details'] 
prev_key = None 
remaining = raw 
out = {} 

for key in keys: 
    # get the value from before the key and after the key 
    prev_value, _, remaining = remaining.partition(key + ':') 

    # start storing values after the first iteration, since we need to partition the second key to get the first value 
    if prev_key is not None: 
        out[prev_key] = prev_value.strip()

    # what key to store next iteration 
    prev_key = key 

# capture the final value (since it lags behind the parse loop) 
out[prev_key] = remaining.strip() 

# out now contains the parsed values, print it out nicely 
for key in keys: 
    print('{}: {}'.format(key, out[key]))

This prints:

Problem Category: Human Endeavors 
Problem Subcategory: Space Exploration 
Problem Type: Failure to Launch 
Software Version: 9.8.77.omni.3 
Problem Details: Issue with signal barrier chamber.
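
The loop above relies on every key being present. If one is missing, str.partition returns the whole remainder as its first element and an empty string as the rest, so the later values are lost. As a hedged variation, the sketch below uses str.find to locate the keys and skips any that do not occur. The function name parse_by_keys and the shortened sample record are made up for illustration, and it still assumes that the keys appear in the given order and that no key text occurs inside a value.

```python
def parse_by_keys(raw, keys):
    """Return {key: value} for the keys found in raw, assuming they appear in order."""
    # Locate each key (with its trailing colon), searching forward from the previous match.
    found = []
    pos = 0
    for key in keys:
        idx = raw.find(key + ':', pos)
        if idx == -1:
            continue  # key absent from this record: skip it
        found.append((key, idx))
        pos = idx + len(key) + 1

    # Each value runs from the end of its key to the start of the next key that was found.
    out = {}
    for i, (key, idx) in enumerate(found):
        start = idx + len(key) + 1
        end = found[i + 1][1] if i + 1 < len(found) else len(raw)
        out[key] = raw[start:end].strip()
    return out


keys = ['Problem Category', 'Problem Subcategory', 'Problem Type',
        'Software Version', 'Problem Details']

# A shortened record in which three of the five keys are missing.
raw = 'Problem Category: Human EndeavorsSoftware Version: 9.8.77.omni.3'
print(parse_by_keys(raw, keys))
```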


2014-10-03 14:22:02 davidism

I hate and fear regular expressions, so here is a solution that uses only built-in methods.

# Splits a string using multiple delimiters.
def multi_split(s, delims):
    strings = [s]
    for delim in delims:
        # Split every piece on this delimiter, dropping empty pieces.
        strings = [x for part in strings for x in part.split(delim) if x]
    return strings

s = "Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber."
categories = ["Problem Category", "Problem Subcategory", "Problem Type", "Software Version", "Problem Details"]
headers = [category + ": " for category in categories]

details = multi_split(s, headers)
print(details)

# Pair each category with its value; this relies on the values appearing in category order.
details_dict = dict(zip(categories, details))
print(details_dict)

Result (line breaks added by me for readability):

[ 
    'Human Endeavors ', 
    'Space Exploration', 
    'Failure to Launch', 
    '9.8.77.omni.3', 
    'Issue with signal barrier chamber.' 
] 

{ 
    'Problem Subcategory': 'Space Exploration', 
    'Problem Details': 'Issue with signal barrier chamber.', 
    'Problem Category': 'Human Endeavors ', 
    'Software Version': '9.8.77.omni.3', 
    'Problem Type': 'Failure to Launch' 
}
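
One caveat: because empty pieces are dropped, a key with an empty value shifts every later value onto the wrong category when the lists are zipped. The sketch below is a possible variation, not part of the approach above. It keeps empty pieces and discards only the text before the first header, and it splits on the header without the trailing space so that a value may be missing entirely. The name split_keep_empty and the sample with an empty "Problem Type" are made up for illustration, and it still assumes that no header text occurs inside a value.

```python
def split_keep_empty(s, delims):
    """Split s on each delimiter in turn, keeping empty pieces."""
    pieces = [s]
    for delim in delims:
        pieces = [x for piece in pieces for x in piece.split(delim)]
    return pieces


categories = ["Problem Category", "Problem Subcategory", "Problem Type",
              "Software Version", "Problem Details"]
headers = [category + ":" for category in categories]

# Sample with an empty "Problem Type" value (made up for illustration).
s = ("Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
     "Problem Type:Software Version: 9.8.77.omni.3"
     "Problem Details: Issue with signal barrier chamber.")

# The first piece is whatever precedes the first header, so drop it.
values = [piece.strip() for piece in split_keep_empty(s, headers)[1:]]
details_dict = dict(zip(categories, values))
print(details_dict)
```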


2014-10-03 14:53:56 Kevin

This is just a job for general BNF parsing, which handles ambiguity well. I used Perl and Marpa, a general BNF parser. Hope this helps.

use 5.010; 
use strict; 
use warnings; 

use Marpa::R2; 

my $g = Marpa::R2::Scanless::G->new({ source => \(<<'END_OF_SOURCE'), 

    :default ::= action => [ name, values ] 

    pairs ::= pair+ 

    pair ::= name (' ') value 

    name ::= 'Problem Category:' 
    name ::= 'Problem Subcategory:' 
    name ::= 'Problem Type:' 
    name ::= 'Software Version:' 
    name ::= 'Problem Details:' 

    value ::= [\s\S]+ 

    :discard ~ whitespace 
    whitespace ~ [\s]+ 

END_OF_SOURCE 
}); 

my $input = <<EOI; 
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber. 
EOI 

my $ast = ${ $g->parse(\$input) }; 

my @pairs; 

ast_traverse($ast); 

for my $pair (@pairs){ 
    my ($name, $value) = @$pair; 
    say "$name = $value"; 
} 

sub ast_traverse{
    my $ast = shift;
    if (ref $ast){
        my ($id, @children) = @$ast;
        if ($id eq 'pair'){

            my ($name, $value) = @children;

            chop $name->[1];

            shift @$value;
            $value = join('', @$value);
            chomp $value;

            push @pairs, [ $name->[1], '"' . $value . '"' ];
        }
        else {
            ast_traverse($_) for @children;
        }
    }
}

This prints:

Problem Category = "Human Endeavors " 
Problem Subcategory = "Space Exploration" 
Problem Type = "Failure to Launch" 
Software Version = "9.8.77.omni.3" 
Problem Details = "Issue with signal barrier chamber."


2014-10-03 15:44:30 rns
