Split HTML after N words in Python

Is there any way to split a long string of HTML after N words? Obviously I could use:

' '.join(foo.split(' ')[:n])

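to get the first n words of a plain-text string, but that may split in the middle of an HTML tag, and it won't produce valid HTML because it won't close the tags that have been opened.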

I need to do this on a zope/plone site - if there is something standard in those products that can do it, that would be ideal.

For example, say I have the text:

<p>This is some text with a 
    <a href="http://www.example.com/" title="Example link"> 
    bit of linked text in it 
    </a>. 
</p>

If I ask it to split after 5 words, it should return:

<p>This is some text with</p>

After 7 words:

<p>This is some text with a 
    <a href="http://www.example.com/" title="Example link"> 
    bit 
    </a> 
</p>

Source

2008-12-11 rjmunro

Do you want to ignore the tags, so that they never get split? In other words, fetch and split only the text that is not part of a tag. – monkut 2008-12-11 17:03:32

Are you trying to split the document text between tags (e.g. between <p> and </p> tags)? – gotgenes 2008-12-11 17:05:12

Take a look at the truncate_html_words function in django.utils.text. Even if you aren't using Django, the code there does exactly what you want.

Source

2008-12-11 18:03:44

I've heard that Beautiful Soup is very good at parsing HTML. It may help you get well-formed HTML out.

Source

2008-12-11 16:58:58 recursive

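I was going to mention the base HTMLParser that is built into Python. Since I'm not sure what end result you are trying to reach, it may or may not get you there; you would mainly be working with its handlers.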
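As a rough sketch of what working with the handlers could look like (Python 3 standard library only, not tried on Zope/Plone; `WordTruncator` and `truncate_html` are names made up for this illustration): the parser copies markup through, counts whitespace-separated words in text nodes, stops at the limit, and then closes whatever tags are still open. It assumes reasonably well-formed input, drops comments and doctype declarations, and re-escapes text after entity decoding, so script/style content and exact entity spelling are not preserved.

```python
import html
import re
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WordTruncator(HTMLParser):
    """Copy HTML into self.out until max_words words of text were emitted."""

    def __init__(self, max_words):
        super().__init__(convert_charrefs=True)
        self.max_words = max_words
        self.words = 0
        self.out = []
        self.open_tags = []            # stack of elements still open
        self.done = max_words <= 0     # nothing to emit for a zero limit

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # keep the tag verbatim
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Stray or mismatched end tags are simply ignored.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.done:
            return
        matches = list(re.finditer(r"\S+", data))
        remaining = self.max_words - self.words
        if len(matches) < remaining:
            self.out.append(html.escape(data, quote=False))
            self.words += len(matches)
        else:
            # Cut right after the last allowed word and stop copying.
            cut = matches[remaining - 1].end()
            self.out.append(html.escape(data[:cut], quote=False))
            self.done = True


def truncate_html(html_text, max_words):
    parser = WordTruncator(max_words)
    parser.feed(html_text)
    parser.close()
    # Close the elements that were open at the cut point, innermost first.
    closing = "".join("</%s>" % tag for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = ('<p>This is some text with a <a href="http://www.example.com/" '
          'title="Example link">bit of linked text in it</a>.</p>')
print(truncate_html(sample, 5))
print(truncate_html(sample, 7))
```

Words are counted only in text nodes, so attribute values such as title="Example link" never count toward the limit.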

Source

2008-12-11 17:07:16 curtisk

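You can use a mix of regex, BeautifulSoup or Tidy (I prefer BeautifulSoup). The idea is simple: strip all the HTML tags first, find the nth word (n = 7 here), then count how many times that word occurs within the first n words, because only its last occurrence there should be used as the truncation point.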

Here is a piece of code; it is a bit messy, but it works:

import re 
from BeautifulSoup import BeautifulSoup 
import tidy 

def remove_html_tags(data): 
    p = re.compile(r'<.*?>') 
    return p.sub('', data) 

input_string='<p>This is some text with a <a href="http://www.example.com/" '\
    'title="Example link">bit of linked text in it</a></p>'

s=remove_html_tags(input_string).split(' ')[:7] 

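# Only the last occurrence of the nth word within the first n words is used
# for truncating, because the nth word could be 'a'/'and'/'is' etc., which
# may occur several times within those n words.
# Note: this is a plain substring search on the raw HTML, so it can also
# match inside a tag or inside a longer word.
last_word = s[-1]
k = s.count(last_word)   # occurrences among the first n words
start = 0
for _ in range(k):
    # continue searching just past the previous match
    j = input_string.find(last_word, start)
    start = j + len(last_word)
output_string = input_string[:start]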

print "\nBeautifulSoup\n", BeautifulSoup(output_string) 
print "\nTidy\n", tidy.parseString(output_string)

The output is what you want:

BeautifulSoup 
<p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p> 

Tidy 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN"> 
<html> 
<head> 
<meta name="generator" content= 
"HTML Tidy for Linux/x86 (vers 6 November 2007), see www.w3.org"> 
<title></title> 
</head> 
<body> 
<p>This is some text with a <a href="http://www.example.com/" 
title="Example link">bit</a></p> 
</body> 
</html>

Hope this helps.

Edit: a better regex:

`p = re.compile(r'<[^<]*?>')`
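A small comparison of the two patterns, as an illustration only (the sample string is made up): with a stray `<` in the text, the original pattern swallows everything up to the next `>`, while the second one cannot cross another `<` and so removes only the real tags.

```python
import re

lazy_any = re.compile(r'<.*?>')       # original pattern
no_inner_lt = re.compile(r'<[^<]*?>')  # pattern from the edit

text = '1 < 2 <b>ok</b>'

# The stray '<' starts a match that runs through '<b>', so ' 2 ' is lost.
print(lazy_any.sub('', text))

# A match may not contain another '<', so only '<b>' and '</b>' are removed.
print(no_inner_lt.sub('', text))
```

Neither pattern is a real HTML parser; both still misfire on things like a `>` inside an attribute value.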

Source

2008-12-11 18:24:11
