Finding hyphen-connected tokens

I'm writing code that takes a list of text tokens as input:

tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

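The code should find all tokens that either contain a hyphen or are connected to a neighbouring token by a hyphen. The output should be: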

[["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]]

I wrote some code, but for some reason it does not return the hyphen-connected tokens. You can try it here: http://goo.gl/iqov0q

def find_hyphens(self):
    tokens_with_hypens = []

    for i in range(len(self.tokens)):
        hyp_leng = 0

        while self.hypen_between_two_tokens(i + hyp_leng):
            hyp_leng += 1

        if self.has_hypen_in_middle(i) or hyp_leng > 0:
            if hyp_leng == 0:
                tokens_with_hypens.append(self.tokens[i:i + 1])
            else:
                tokens_with_hypens.append(self.tokens[i:i + hyp_leng])
                i += hyp_leng - 1

    return tokens_with_hypens

What am I doing wrong? Is there a more performant solution? Thanks.


2015-11-29 John Smith

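I found 3 errors in your code:

1) Here you are checking the last 2 characters of tok1, rather than the last character of tok1 and the first character of tok2: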

if "-" in joined[len(tok1) - 2: len(tok1)]: 
# instead, do this: 
if "-" in joined[len(tok1) - 1: len(tok1) + 1]:

2) Here you omit the last matching token. Increase the end index of your slice by 1:

tokens_with_hypens.append(self.tokens[i:i + hyp_leng]) 
# instead, do this: 
tokens_with_hypens.append(self.tokens[i:i + 1 + hyp_leng])

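3) You cannot manipulate the index of a for i in range loop in Python. The next iteration fetches the next index from the range and overwrites your change. Instead, you can use a while loop like this: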

i = 0 
while i < len(self.tokens): 
    [...] 
    i += 1

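With these 3 corrections the test passes: http://goo.gl/fd07oL

Still, I couldn't resist writing an algorithm from scratch that solves your problem as simply as possible: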
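For reference, here is a minimal standalone sketch of how the three corrections might fit together. The helper bodies `hyphen_between` and `has_hyphen_in_middle` are assumptions reconstructed from the snippets above (your original helpers are not shown in full), all function names are hypothetical, and tokens are assumed to be non-empty strings.

```python
def hyphen_between(tok1, tok2):
    # Assumed helper: look at the last character of tok1 and the
    # first character of tok2 (correction 1).
    joined = tok1 + tok2
    return "-" in joined[len(tok1) - 1: len(tok1) + 1]


def has_hyphen_in_middle(token):
    # Assumed helper: a hyphen strictly inside the token.
    return "-" in token[1:-1]


def find_hyphens(tokens):
    groups = []
    i = 0
    # A while loop lets us skip ahead past a whole chain (correction 3).
    while i < len(tokens):
        hyp_len = 0
        while (i + hyp_len + 1 < len(tokens)
               and hyphen_between(tokens[i + hyp_len], tokens[i + hyp_len + 1])):
            hyp_len += 1
        if hyp_len > 0 or has_hyphen_in_middle(tokens[i]):
            # The slice end includes the last chained token (correction 2).
            groups.append(tokens[i:i + 1 + hyp_len])
        i += hyp_len + 1
    return groups


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(find_hyphens(tokens))
```

Because the index jumps past each chain, every token is visited a constant number of times.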

def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

for group in get_hyphen_groups(tokens):
    print("".join(group))

To exclude 1-element groups, as in your expected result, wrap the yield in this if:

if i_end - i_start > 1: 
    yield tokens[i_start:i_end]

To still include 1-element groups that already contain a hyphen, change the if to something like this:


if i_end - i_start > 1 or "-" in tokens[i_start]: 
    yield tokens[i_start:i_end]
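Put together, the generator with that condition in place could look like the sketch below. It reuses the code from above unchanged apart from the added if, and the comparison against the expected result from the question is only a quick check.

```python
def get_hyphen_groups(tokens):
    i_start, i_end = 0, 1
    while i_start < len(tokens):
        # Extend the group while exactly one side of the boundary
        # carries the hyphen (this is what the XOR expresses).
        while (i_end < len(tokens) and
               (tokens[i_end].startswith("-") ^ tokens[i_end - 1].endswith("-"))):
            i_end += 1
        # Keep real chains, plus single tokens that contain a hyphen.
        if i_end - i_start > 1 or "-" in tokens[i_start]:
            yield tokens[i_start:i_end]
        i_start, i_end = i_end, i_end + 1


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
expected = [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"],
            ["Man", "-Hum", "-Zuh-UH-", "glit"]]
print(list(get_hyphen_groups(tokens)) == expected)
```

Note that because of the XOR, two tokens such as "a-" and "-b" that both carry a hyphen at the boundary are not joined. If you want them joined, replacing `^` with `or` would be one option.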


2015-11-29 20:45:40 Felk

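One problem is that you try to change the value of i inside the for i in range(len(self.tokens)) loop. That does not work, because on every iteration i is assigned the next value from range, ignoring your change.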
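A small illustration of that behaviour (the variable names here are only for the demo): the same `i += 2` is discarded in a for loop but takes effect in a while loop.

```python
visited_for = []
for i in range(5):
    visited_for.append(i)
    i += 2  # overwritten by the next value from range()

visited_while = []
i = 0
while i < 5:
    visited_while.append(i)
    i += 2  # this change persists into the next iteration

print(visited_for)    # [0, 1, 2, 3, 4]
print(visited_while)  # [0, 2, 4]
```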

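I changed your algorithm to an iterative one that pops one item at a time from the list and decides what to do with it. It uses a buffer to store the items belonging to a chain until the chain is complete.

The complete code is: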

class Hyper:

    def __init__(self, tokens):
        self.tokens = tokens

    def find_hyphens(self):
        tokens_with_hypens = []

        # work on a copy so the original token list is left untouched
        copy = list(self.tokens)

        buffer = []
        while len(copy) > 0:
            item = copy.pop(0)
            if self.has_hyphen_in_middle(item) and item[0] != '-' and item[-1] != '-':
                # words with hyphens that are not part of a bigger chain
                tokens_with_hypens.append([item])
            elif item[-1] == '-' or (len(copy) > 0 and copy[0][0] == '-'):
                # part of a chain - append to the buffer
                buffer.append(item)
            elif len(buffer) > 0:
                # the last word in a chain - the buffer contains the complete chain
                buffer.append(item)
                tokens_with_hypens.append(buffer)
                buffer = []

        return tokens_with_hypens

    @staticmethod
    def has_hyphen_in_middle(token):
        # [1:-1] drops only the first and last character;
        # [1:-2] would miss a hyphen in a token such as "ab-c"
        return len(token) > 2 and "-" in token[1:-1]


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh", "Man", "-Hum", "-Zuh-UH-", "glit"]

hyper = Hyper(tokens)

result = hyper.find_hyphens()

print(result)

print(result == [["Tap-", "Berlin"], ["Was-ISt"], ["das", "-ist"], ["Man", "-Hum", "-Zuh-UH-", "glit"]])
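On the performance question: `list.pop(0)` has to shift every remaining element, so popping from the front of a long list repeatedly is quadratic overall. A possible variant, sketched here as a plain function with the hypothetical name `find_hyphen_chains`, uses `collections.deque`, whose `popleft()` is constant time. It assumes all tokens are non-empty strings.

```python
from collections import deque


def find_hyphen_chains(tokens):
    queue = deque(tokens)  # deque.popleft() is O(1), list.pop(0) is O(n)
    result, buffer = [], []
    while queue:
        item = queue.popleft()
        if "-" in item[1:-1] and item[0] != "-" and item[-1] != "-":
            # hyphenated word that is not part of a bigger chain
            result.append([item])
        elif item[-1] == "-" or (len(queue) > 0 and queue[0][0] == "-"):
            # part of a chain, keep collecting
            buffer.append(item)
        elif buffer:
            # last word of a chain, the buffer is now complete
            buffer.append(item)
            result.append(buffer)
            buffer = []
    return result


tokens = ["Tap-", "Berlin", "Was-ISt", "das", "-ist", "cool", "oh",
          "Man", "-Hum", "-Zuh-UH-", "glit"]
print(find_hyphen_chains(tokens))
```

For a token list of this size the difference is negligible; it only starts to matter with very long inputs.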


2015-11-29 20:45:56 Szymon
