This should help:
import re
# The numbers appear at different positions in each string:
str1 = "I am a smoker male of 25 years who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees"
str2 = "I am a smoker male of 25 years who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"
str3 = "I am a smoker male of years 25 who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"
str4 = "I am a smoker male of 25 years who wants a for 30 policy yrs with a sum assured amount of 1000000 rupees"
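# Groups 3/5 capture the age (before/after "years"), groups 7/8 the policy
# term (after/before "policy"), and group 9 the 7-digit amount.
regex = r".*?(((\d{2})\s?years)|(years\s?(\d{2}))).*(policy.*?(\d{2})|(\d{2}).*?policy).*(\d{7}).*$"
# Groups that took no part in the match become empty strings (Python 3.5+).
replacements = r"\9 \7 \8 \3 \5"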
res_str1 = re.sub(regex, replacements, str1)
res_str2 = re.sub(regex, replacements, str2)
res_str3 = re.sub(regex, replacements, str3)
res_str4 = re.sub(regex, replacements, str4)
def clean_spaces(string):
    return re.sub(r"\s{1,2}", ' ', string)
print(clean_spaces(res_str1))
print(clean_spaces(res_str2))
print(clean_spaces(res_str3))
print(clean_spaces(res_str4))
Output:
1000000 30 25
1000000 30 25
1000000 30 25
1000000 30 25
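The space clean-up is only needed because re.sub leaves an empty string for every group that did not match. A minimal sketch of an alternative, assuming the same pattern: read the groups from the match object and keep whichever one of each pair took part in the match. The name extract_numbers is illustrative and not part of the code above.

```python
import re

# Same pattern as above: groups 3/5 hold the age, 7/8 the policy term,
# and group 9 the 7-digit amount.
regex = re.compile(
    r".*?(((\d{2})\s?years)|(years\s?(\d{2}))).*"
    r"(policy.*?(\d{2})|(\d{2}).*?policy).*(\d{7}).*$"
)


def extract_numbers(text):
    """Return (amount, policy_term, age) as strings, or None if no match."""
    match = regex.match(text)
    if match is None:
        return None
    # Only one group of each pair participates; the other one is None.
    age = match.group(3) or match.group(5)
    policy_term = match.group(7) or match.group(8)
    return match.group(9), policy_term, age


print(extract_numbers(
    "I am a smoker male of years 25 who wants a for 30 policy yrs "
    "with a sum assured amount of 1000000 rupees"
))
```

This returns a tuple instead of a space-separated string, so no whitespace post-processing is required.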
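Update
The regex above has some bugs. While trying to improve it, I noticed that it is inefficient and ugly, because it works through the string one character at a time on every attempt. We can do better if we stick to the original approach of parsing words. So my new solution is: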
# Algorithm
# for each_word in the_list:
#     maintain a pre_list of terms that come before a number
#     if each_word is a number:
#         if any element of desired_terms_list exists in pre_list:
#             pair the number & the desired_term and insert into the_dictionary
#             remove this desired_term from desired_terms_list
#             reset the pre_list
#         else:
#             put the number in number_at_hand
#     else:
#         if no number_at_hand:
#             add the current word into pre_list
#         else:
#             if current_word is an element of desired_terms_list:
#                 pair the number & the desired_term and insert into the_dictionary
#                 remove this desired_term from desired_terms_list
#                 reset number_at_hand
Code:
from pprint import pprint


class Extractor:
    def __init__(self, search_terms, string):
        self.pre_list = list()  # words collected while no number is pending
        self.list = string.split()
        # Copy the terms: extract() removes entries as they are paired.
        self.terms_to_look_for = list(search_terms)
        self.dictionary = {}

    @staticmethod
    def is_number(string):
        try:
            int(string)
            return True
        except ValueError:
            return False

    def check_pre_list(self):
        # Return the first wanted term already seen before the number, if any.
        for term in self.terms_to_look_for:
            if term in self.pre_list:
                return term
        return None

    def extract(self):
        number_at_hand = str()
        for word in self.list:
            if Extractor.is_number(word):
                check_result = self.check_pre_list()
                if check_result is not None:
                    # The term preceded the number, e.g. "years 25".
                    self.dictionary[check_result] = word
                    self.terms_to_look_for.remove(check_result)
                    self.pre_list = list()
                else:
                    # No term yet: hold the number until a term follows it.
                    number_at_hand = word
            else:
                if number_at_hand == '':
                    self.pre_list.append(word)
                else:
                    if word in self.terms_to_look_for:
                        # The term followed the number, e.g. "25 years".
                        self.dictionary[word] = number_at_hand
                        self.terms_to_look_for.remove(word)
                        number_at_hand = str()
        return self.dictionary
Usage:
ex1 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of 25 years who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees')
ex2 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of 25 years who wants a for 30 yrs policy with a sum assured amount of 1000000 rupees')
ex3 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of years 25 who wants a policy for 30 yrs with a sum assured amount of 1000000 rupees')
ex4 = Extractor(['years', 'policy', 'amount'],
                'I am a smoker male of years 25 who wants a for 30 yrs policy with a sum assured amount of 1000000 rupees')
pprint(ex1.extract())
pprint(ex2.extract())
pprint(ex3.extract())
pprint(ex4.extract())
Output:
{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}
{'amount': '1000000', 'policy': '30', 'years': '25'}
I expect this version to perform better as well.
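The performance expectation can be measured rather than assumed. Below is a rough sketch using timeit on a single sample sentence. The names by_regex, by_words and compare are illustrative, by_words is a condensed restatement of the word-based pass, and timings will vary with the machine and the input, so treat any result as indicative only.

```python
import re
import timeit

TEXT = ("I am a smoker male of 25 years who wants a policy for 30 yrs "
        "with a sum assured amount of 1000000 rupees")
TERMS = ["years", "policy", "amount"]

# Pattern copied from the first version of the answer.
REGEX = re.compile(
    r".*?(((\d{2})\s?years)|(years\s?(\d{2}))).*"
    r"(policy.*?(\d{2})|(\d{2}).*?policy).*(\d{7}).*$"
)


def is_number(word):
    try:
        int(word)
        return True
    except ValueError:
        return False


def by_regex(text):
    """Regex approach; unmatched groups become '' (Python 3.5+)."""
    return REGEX.sub(r"\9 \7 \8 \3 \5", text).split()


def by_words(terms, text):
    """Word-based approach: one pass over the whitespace-split words."""
    remaining = list(terms)  # copy, so the caller's list is untouched
    pre_list, found, number_at_hand = [], {}, ""
    for word in text.split():
        if is_number(word):
            term = next((t for t in remaining if t in pre_list), None)
            if term is not None:
                found[term] = word
                remaining.remove(term)
                pre_list = []
            else:
                number_at_hand = word
        elif number_at_hand == "":
            pre_list.append(word)
        elif word in remaining:
            found[word] = number_at_hand
            remaining.remove(word)
            number_at_hand = ""
    return found


def compare(repeats=2000):
    """Return (regex_seconds, words_seconds) for `repeats` runs of each."""
    regex_seconds = timeit.timeit(lambda: by_regex(TEXT), number=repeats)
    words_seconds = timeit.timeit(lambda: by_words(TERMS, TEXT), number=repeats)
    return regex_seconds, words_seconds


regex_seconds, words_seconds = compare()
print(f"regex: {regex_seconds:.4f}s  words: {words_seconds:.4f}s")
```

A single short sentence is a narrow benchmark; longer inputs, or inputs where the regex fails to match, may shift the comparison considerably.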
Please post your code. – CunivL
Have you made any attempt at this problem? – citizen2077
To filter the string and get only the integer values, you can use this line of code: 'integer_values = [e for e in fmt_string.split() if isinstance(e, int)]' – CunivL
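When filtering integer tokens out of a whitespace-split string, note that str.split() returns strings, so an isinstance(e, int) check never succeeds and the list comes back empty. A minimal sketch that tests each token with str.isdigit() instead; fmt_string is the name used in the comment, and the sample text is assumed for illustration.

```python
# fmt_string is the name from the comment; the sample sentence is assumed.
fmt_string = ("I am a smoker male of 25 years who wants a policy for 30 yrs "
              "with a sum assured amount of 1000000 rupees")

# Keep only the tokens made up entirely of digits, converted to int.
integer_values = [int(token) for token in fmt_string.split() if token.isdigit()]

print(integer_values)  # [25, 30, 1000000]
```

This recovers the numbers in order of appearance but does not say which number belongs to which term, which is what the answer above addresses.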