How do I count the number of words between 2 predefined words?

<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>

How can I count the exact number of words between <replace-add> and </replace-add> in the text?

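Source

2017-10-13 Tim

So you mean all whitespace-separated strings that appear between these tags? For clarity, could you include the expected sample output? Also, try formatting the code by indenting it with four spaces. Can we assume the tags will always occur like this, or can they have attributes? –

For "that i dont know you know cause" the output would be 7. Also note that I will have other tags in the text as well, but the example above is enough. – Tim

Without using any libraries: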

def get_tag_indexes(text, tag, start_tag):
    """Return the position of every occurrence of tag in text.

    For an opening tag (start_tag=True) the index just past the tag is stored,
    for a closing tag the index where the tag begins.
    """
    tag_indexes = []
    start_index = -1

    while True:
        # Continue searching just after the previous match.
        start_index = text.find(tag, start_index + 1)

        if start_index != -1:
            if start_tag:
                tag_indexes.append(start_index + len(tag))
            else:
                tag_indexes.append(start_index)
        else:
            return tag_indexes

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

tag_starts = get_tag_indexes(text, "<replace-add>", True)
tag_ends = get_tag_indexes(text, "</replace-add>", False)

# Pair each opening tag with its closing tag and split the text in between.
for start, end in zip(tag_starts, tag_ends):
    words = text[start:end].split()
    print("{} words - {}".format(len(words), words))

This gives you:

7 words - ['that', 'i', 'dont', 'know', 'you', 'know', 'cause'] 
1 words - ['us'] 
1 words - ['from'] 
2 words - ['clear', 'dire']

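This uses a function that returns a list of the positions at which a given tag occurs in the text. Those positions can then be used to extract the text between the two tags.

As an alternative, this can also be done with BeautifulSoup: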

from bs4 import BeautifulSoup

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(text, "lxml")

# Each block is one <replace-add> element; its text is split on whitespace.
for block in soup.find_all('replace-add'):
    words = block.text.split()
    print("{} words - {}".format(len(words), words))

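Source

2017-10-13 09:34:40

Hey Martin, I am not supposed to import any libraries. – Tim

@Tim None at all?! Are you allowed standard library stuff? Is this a requirement of an assignment or something? –

I mean that imports such as os, difflib, etc. can be used, but it is better to stay away from them unless they are essential and not part of the task itself. – Tim

Depending on how trustworthy the source is, you can do one of two things. Given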

source = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

you can use a regular expression, like this:

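import re
from itertools import chain

# Non-greedy match of everything between an opening and a closing tag.
word_pattern = re.compile(r"(?<=<replace-add>).*?(?=</replace-add>)")
# Split every match on whitespace and flatten the results into one list.
re_words = list(chain.from_iterable(map(str.split, word_pattern.findall(source))))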

This will only work if the tags in the source match exactly, without any attributes etc.
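If attributes might appear on the opening tag, the pattern can be loosened. The following is only a sketch: the names `TAG_PATTERN` and `words_between_tags` are made up for illustration and are not part of the answer above, and a regular expression still will not cope with nested or malformed tags.

```python
import re

# Assumption: the opening tag may carry attributes, e.g. <replace-add id="1">.
# re.DOTALL lets the tagged text span several lines.
TAG_PATTERN = re.compile(
    r"<replace-add(?:\s[^>]*)?>(.*?)</replace-add\s*>", re.DOTALL
)

def words_between_tags(text):
    """Return the whitespace-separated words inside every <replace-add> element."""
    words = []
    for inner in TAG_PATTERN.findall(text):
        words.extend(inner.split())
    return words

sample = '<replace-add id="1">that i dont</replace-add> x <replace-add>us</replace-add>'
print(words_between_tags(sample))
```

As before, `len(words_between_tags(sample))` is the number of words.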

Another option from the standard library is HTML parsing:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def get_words(self, html):
        # read_words is True only while inside a <replace-add> element.
        self.read_words = False
        self.words = []
        self.feed(html)
        return self.words

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.read_words = True

    def handle_data(self, data):
        # Collect words only from text inside the tag of interest.
        if self.read_words:
            self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.read_words = False


parser = MyParser()
html_words = parser.get_words(source)

This approach will be more reliable, and might be a bit more efficient, since it uses a tool that is focused entirely on this task.

Now, doing

print(re_words) 
print(html_words)

we get

['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire'] 
['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']

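(Of course, the len of this list is the number of words.)

If you strictly need only the count, you can keep a running total and add the length of data.split() to it for each piece of data encountered.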
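A counting-only variant might look like the sketch below. The class name `WordCounter`, the method `count_words`, and its `tag` parameter are hypothetical names chosen for this example; only `HTMLParser` and its handler methods come from the standard library.

```python
from html.parser import HTMLParser

class WordCounter(HTMLParser):
    """Keeps a running total of words instead of storing them."""

    def count_words(self, html, tag="replace-add"):
        self.reset()          # allow the same parser to be reused
        self.target = tag
        self.inside = False
        self.total = 0
        self.feed(html)
        self.close()
        return self.total

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.target:
            self.inside = False

    def handle_data(self, data):
        # Add the number of words in this chunk to the running total.
        if self.inside:
            self.total += len(data.split())

sample = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
total = WordCounter().count_words(sample)
print(total)
```

Passing a different `tag`, for example `"replace-del"`, counts the words inside that element instead.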

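If you really cannot use any imports, you will either have to make some sacrifices or implement your own regex engine / HTML parser. If this is a requirement of a homework assignment, then you really should show some prior effort when posting the question.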
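As an illustration of the "sacrifices" route, here is a rough import-free sketch built only on `str.split` and `str.partition`. The function name `count_words_between` and its parameters are invented for this example. It assumes the tags appear exactly as written, with no attributes and no nesting.

```python
def count_words_between(text, open_tag="<replace-add>", close_tag="</replace-add>"):
    """Return one word count per open_tag ... close_tag pair found in text."""
    counts = []
    # Everything before the first opening tag is discarded by [1:].
    for chunk in text.split(open_tag)[1:]:
        inner, found, _ = chunk.partition(close_tag)
        if found:  # skip an opening tag that is never closed
            counts.append(len(inner.split()))
    return counts

sample = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
counts = count_words_between(sample)
print(counts, sum(counts))
```

The list holds the count for each tagged span, and `sum(counts)` gives the overall total.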

Source

2017-10-13 09:47:07
