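Removing non-English subheadings and paragraphs

Hello, I have a script that removes subheadings and their paragraphs, but I cannot get it to remove non-English subheadings and the paragraphs under them.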
For example (original text):
=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
== External links ==
Business acronyms and abbreviations
Business acronyms
== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.
The result I get from my code is:
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.
This is what I want to achieve (expected result):
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.
My script is as follows:
import re

# Matches wiki-style section headings such as "== Title ==" or "=== Title ==="
section_title_re = re.compile(r"^=+\s+.*\s+=+$")
content = []
skip = False
# Read the file that contains the original text
with open('asd.text', 'r') as f1:
    for l in f1.read().splitlines():
        line = l.strip()
        if "== external links ==" in line.lower():
            skip = True   # start skipping at the "External links" heading
            continue
        if section_title_re.match(line):
            skip = False  # any other heading ends the skipped section
            continue
        if skip:
            continue
        content.append(line)
content = '\n'.join(content) + '\n'
# Write the filtered text to a new file
with open('NoRef.text', 'w') as f2:
    f2.write(content + "\n")
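Question: So far, my code can remove the paragraphs under subheadings whose names are known in advance, such as "External links".
But how can I remove the non-English subheadings and their paragraphs?
Thanks.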
Have you tried Google's language-detection library? A quick search turned up this: https://pypi.python.org/pypi/langdetect –
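One way this suggestion could be applied is to group the lines into sections and ask a language check whether each section (heading plus body) looks English. The sketch below is only an illustration under assumptions: `looks_english`, `split_sections` and `drop_non_english` are hypothetical names, and the default check is a crude ASCII-letter ratio rather than real language detection, chosen so the example runs with the standard library alone. If langdetect is installed, a predicate built on its `detect` function (which returns a language code such as 'en') could be passed in instead.

```python
import re

SECTION_TITLE_RE = re.compile(r"^=+\s+.*\s+=+$")


def looks_english(text, threshold=0.95):
    """Crude stand-in for a language detector (an assumption, not langdetect):
    treat text as English if nearly all of its letters are ASCII."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True
    ascii_count = sum(1 for c in letters if ord(c) < 128)
    return ascii_count / len(letters) >= threshold


def split_sections(lines):
    """Group lines into (title, body_lines) pairs; text before the first
    heading goes into a section with an empty title."""
    sections = [("", [])]
    for raw in lines:
        line = raw.strip()
        if SECTION_TITLE_RE.match(line):
            sections.append((line.strip("= "), []))
        else:
            sections[-1][1].append(line)
    return sections


def drop_non_english(lines, is_english=looks_english):
    """Keep only the body text of sections that pass the language check.
    "External links" is skipped by name, as in the question's script."""
    kept = []
    for title, body in split_sections(lines):
        if title.lower() == "external links":
            continue
        if body and not is_english(" ".join([title] + body)):
            continue
        kept.extend(body)
    return kept


# With langdetect installed, this predicate could replace the default:
#     from langdetect import detect
#     drop_non_english(lines, is_english=lambda s: detect(s) == "en")

sample = [
    "=== Personal finance ===",
    "Protection against unforeseen personal events, as well as events in the wider economies",
    "Transference of family wealth across generations (bequests and inheritance)",
    "=== Corporate finance ===",
    "Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.",
    "== External links ==",
    "Business acronyms and abbreviations",
    "== Kūrybinės Industrijos ==",
    "Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.",
]
result = drop_non_english(sample)
print("\n".join(result))
```

Checking the heading together with its body gives the check more text to work with than the heading alone. The ASCII ratio would misjudge non-English text written without accented letters, which is where a real detector would do better.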
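If you know in advance all the (English) headings you may encounter, just check whether each heading is in your list (actually, a `set` is better), and skip the whole section if it is not. – Julien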
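A minimal sketch of this whitelist idea, assuming the allowed headings really are known up front. `KNOWN_TITLES` and `keep_known_sections` are hypothetical names, and the two titles in the set are simply taken from the example text in the question.

```python
import re

# Capture the heading text between the "=" markers
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")

# Hypothetical whitelist of accepted headings, lower-cased for comparison
KNOWN_TITLES = {"personal finance", "corporate finance"}


def keep_known_sections(lines, known_titles=KNOWN_TITLES):
    """Drop every heading line, and drop the body of any section whose
    heading is not in the whitelist."""
    kept = []
    skip = False  # text before the first heading is kept
    for raw in lines:
        line = raw.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            skip = match.group(1).lower() not in known_titles
            continue
        if not skip:
            kept.append(line)
    return kept


sample = [
    "=== Personal finance ===",
    "Protection against unforeseen personal events, as well as events in the wider economies",
    "== External links ==",
    "Business acronyms",
    "== Kūrybinės Industrijos ==",
    "Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.",
]
result = keep_known_sections(sample)
print("\n".join(result))
```

Membership tests on a `set` take constant time on average, which is why it is preferred over a list here.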
Hi Julien, I don't know all the possible English headings, which is exactly where my problem lies. – windboy