Automatically inserting LTR marks

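I'm working with bidirectional text (mixed English and Hebrew) for a project. The text is displayed in HTML, so an LTR or RTL mark (&lrm; or &rlm;) is sometimes required to make "weak characters" such as punctuation display correctly. Due to technical limitations these marks are not present in the source text, so we need to add them for the final displayed text to appear correct.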

For example, the following text: (example: מדגם) sample renders as sample (מדגם :example) in right-to-left mode. The corrected string would look like &lrm;(example:&lrm; מדגם) sample and would render as sample (מדגם (example:.
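To see which characters in the example are affected, the following minimal sketch looks up each character's bidirectional class with Python's standard `unicodedata` module. The helper names `bidi_classes` and `is_strong` are made up for illustration; only `unicodedata.bidirectional` is a real library call. It assumes that "strong" means the classes L, R and AL, so the parenthesis and colon come out as non-strong, while U+200E (class L) can act as a strong anchor next to them.

```python
import unicodedata

# Bidirectional classes treated as "strong" in this sketch (an assumption
# based on the usual L / R / AL grouping).
STRONG = {"L", "R", "AL"}

def bidi_classes(text):
    """Return (character, bidi class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

def is_strong(ch):
    """True if the character has a strong direction of its own."""
    return unicodedata.bidirectional(ch) in STRONG

if __name__ == "__main__":
    sample = "(example: \u05de\u05d3\u05d2\u05dd) sample"
    for ch, cls in bidi_classes(sample):
        print(repr(ch), cls, "strong" if is_strong(ch) else "not strong")
```

Running it on the sample string shows that the Latin and Hebrew letters are strong, while the parentheses, colon and spaces take their direction from their surroundings.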

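We'd like to insert these marks on the fly rather than re-author all of the text. At first this seemed simple: just append &lrm; to every instance of punctuation. However, some of the text that needs to be modified on the fly contains HTML and CSS. The reasons for this are unfortunate and unavoidable.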
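The sketch below shows why the "append &lrm; to every punctuation character" idea breaks down once markup is involved. The function name `naive_append` is hypothetical, and "punctuation" is assumed to mean the ASCII set in Python's `string.punctuation`.

```python
import re
import string

LRM_ENTITY = "&lrm;"

# Character class matching any single ASCII punctuation character.
_PUNCT = re.compile("([%s])" % re.escape(string.punctuation))

def naive_append(text):
    """Append &lrm; after every ASCII punctuation character, markup included."""
    return _PUNCT.sub(lambda m: m.group(1) + LRM_ENTITY, text)

if __name__ == "__main__":
    # Plain text: the marks land where intended.
    print(naive_append("(example: x)"))
    # Markup: the angle brackets and slash are punctuation too, so the tags
    # themselves get corrupted.
    print(naive_append("<b>hi</b>"))
```

On plain text the result is what was asked for, but on HTML the marks also end up inside the tags, which is why some awareness of the markup is needed.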

Short of parsing the HTML/CSS, is there a known algorithm for inserting Unicode directional marks (pseudo-strong characters) on the fly?

Source

2011-03-08 Philip Hanson

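I'm not aware of an algorithm that safely inserts directional marks into an HTML string without parsing it. Parsing the HTML into a DOM and manipulating the text nodes is the safest way to make sure you don't accidentally add directional marks to text inside <script> and <style> tags.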

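Here is a short Python script that can help you transform your files automatically. The logic should be easy to translate into other languages if needed. I'm not familiar enough with the RTL rules you're trying to encode, but you can tweak the regexp '(\W)([^\W]+)(\W)' and the substitution pattern '\u200e\\1\\2\\3\u200e' to get your expected result:

import re
import lxml.html

# re.ASCII keeps \w and \W ASCII-only, so Hebrew letters count as
# non-word characters and only ASCII words are wrapped
_RE_REPLACE = re.compile(r'(\W)([^\W]+)(\W)', re.M | re.ASCII)

def _replace(text):
    # element text and tails may be None or empty
    if not text:
        return text
    # wrap each match in LEFT-TO-RIGHT MARK (U+200E) characters
    return _RE_REPLACE.sub('\u200e\\1\\2\\3\u200e', text)

text = '''
<html><body>
    <div>sample (\u05de\u05d3\u05d2\u05dd :example)</div>
    <script type="text/javascript">var foo = "ignore this";</script>
    <style type="text/css">div { font-size: 18px; }</style>
</body></html>
'''

# convert the text into an html dom
tree = lxml.html.fromstring(text)
body = tree.find('body')
# iterate over all descendants of the <body> tag
for node in body.iterdescendants():
    # transform the text that trails after the current html tag
    node.tail = _replace(node.tail)
    # ignore text inside script and style tags
    if node.tag in ('script', 'style'):
        continue
    # transform text inside the current html tag
    node.text = _replace(node.text)

# render the modified tree back to html; tostring() returns ASCII bytes
# with non-ASCII characters written as numeric character references
print(lxml.html.tostring(tree).decode('ascii'))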

Output:

python convert.py 

<html><body> 
    <div>sample (&#1502;&#1491;&#1490;&#1501; &#8206;:example)&#8206;</div> 
    <script type="text/javascript">var foo = "ignore this";</script> 
    <style type="text/css">div { font-size: 18px; }</style> 
</body></html>

Source

2011-03-08 18:58:07 samplebias

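One thing that makes this more difficult is broken HTML, but a forgiving parser can help with that. For this application we actually work with HTML fragments, so parsing is sketchy. The real solution is to push for changes earlier in the process. – 2011-03-11 15:49:15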
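For HTML fragments, one possible approach is a streaming pass that re-emits the markup and only touches text outside <script> and <style>. The sketch below swaps in Python's standard `html.parser.HTMLParser`, which does not require a complete document, in place of lxml. The names `MarkInserter`, `add_marks` and `mark_fragment` are hypothetical, the pattern is the one from the answer restricted to ASCII word characters, and end tags are re-emitted in normalized form rather than byte for byte.

```python
import re
from html.parser import HTMLParser

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
# Same pattern as in the answer; re.ASCII keeps \w and \W ASCII-only.
PATTERN = re.compile(r"(\W)([^\W]+)(\W)", re.ASCII)

def add_marks(text):
    """Wrap every pattern match in LRM characters."""
    return PATTERN.sub(lambda m: LRM + m.group(0) + LRM, text)

class MarkInserter(HTMLParser):
    """Re-emit a fragment, adding marks only to text outside script/style."""

    def __init__(self):
        # Keep entity and character references as separate events so they
        # can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.raw_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in ("script", "style"):
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in ("script", "style") and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        self.out.append(data if self.raw_depth else add_marks(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

def mark_fragment(fragment):
    """Return the fragment with LRM marks added to its visible text."""
    parser = MarkInserter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Because nothing is built into a tree, unbalanced or partial markup passes through as it arrived. The trade-off is that a match can never span an entity reference or a tag boundary, which may or may not suit the rules you need.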
