2016-08-14 24 views
-1

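Conditional list modification when a tuple is longer than one element. I have a sentence, together with tuples that indicate the token positions where a country or a number occurs: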

sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."

Then:

tokenIDs2number = {(22,): 592.00, (25,): 92630.00,(34,): 7734.00} 
tokenIDs2location = {(8,9): 'Hong Kong'}

I need to create sentence variants for the different combinations of these tuples, which I call slot sentences:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo. 

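However, my current code essentially takes combinations of the individual elements inside the tuples, so I end up with sentences like these two: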

In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

These are just examples.

How can I fix this so that, when a tuple key has len > 1, all the tokens in that key are filled as a single LOCATION or NUMBER slot, as I want?

Current code:

for locationTokenIDs, location in tokenIDs2location.items(): 
        for numberTokenIDs, number in tokenIDs2number.items():  
         sentenceDict = {}  
         sentenceDict["sentence"] = sample  
         sentenceDict["location-value-pair"] = {location:number} 
         for locationTokenID in locationTokenIDs: 
          for numberTokenID in numberTokenIDs:         
           finalTokens = cleanSample.split() 
           finalTokens[numberTokenID] = "NUMBER_SLOT" 
           finalTokens[locationTokenID] = "LOCATION_SLOT" 
           slotSentence = (" ").join(finalTokens) 
           sentenceDict["parsedSentence"] = slotSentence 

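Note that I am building a dictionary that also keeps track of the location-value pair and the original sentence for each slot sentence combination. The key part is generating the correct slotSentence.

Note also that this is only an example: the number could even be 24000000 where the value in the sentence is written as 24 million, and the same goes for trillion, million, billion and thousand.

If this is not possible, another option is to fill in every slot of the combination: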

In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

and then perhaps post-process the sentence to remove the consecutive slots, but I would prefer to do it all in one go.

Answers

0

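I have solved my use case, but in a roundabout way (via a list of sentences).

I first allow slot sentences that contain multiple LOCATION_SLOT or NUMBER_SLOT tokens: if a tuple in the combination covers two or more token positions, I fill in all of them: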

sentences2location2values = []

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        # Start from a fresh copy of the tokens for every combination
        # (assumes sample is the whitespace-tokenized sentence)
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"

        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"

        # Build and store the slot sentence once per combination
        slotSentence = " ".join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)

Then I modify the parsed sentences to remove consecutive location and number slots:

for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i, token in enumerate(sampleTokens):
        # Skip a slot token that merely repeats the slot token right before it
        if i > 0 and token in ("LOCATION_SLOT", "NUMBER_SLOT") and token == sampleTokens[i-1]:
            continue
        newTokens.append(token)

    sentence['parsedSentence'] = ' '.join(newTokens)
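
The same collapsing step could also be written with itertools.groupby. This is only a minimal sketch, assuming the slot markers are exactly the strings LOCATION_SLOT and NUMBER_SLOT; the helper name collapse_slots is hypothetical and not part of the code above:

```python
from itertools import groupby

SLOT_TOKENS = {"LOCATION_SLOT", "NUMBER_SLOT"}

def collapse_slots(sentence):
    """Collapse runs of identical consecutive slot tokens into a single token."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOT_TOKENS:
            collapsed.append(token)   # keep one copy of a repeated slot
        else:
            collapsed.extend(run)     # ordinary words are kept, repeats included
    return " ".join(collapsed)

print(collapse_slots("2004 LOCATION_SLOT LOCATION_SLOT 's airport handled NUMBER_SLOT flights"))
```

Only slot tokens are collapsed, so a legitimately repeated word in the sentence is left alone.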
0

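The code treats each locationTokenID as a slot of its own, when the locationTokenIDs really represent the endpoints of the slice of tokens that should be treated as one slot. So we need to remove the for locationTokenID in locationTokenIDs: loop (which iterates over each locationTokenID as though it were a slot) and replace the slice defined by the pair of locationTokenIDs with a single slot.

The code below addresses the problem raised in the OP, but other issues remain (for example, only the last generated slotSentence is kept; I will let you resolve that, since I do not know which data structure you want to store the slot sentences in):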

sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo." 

tokenIDs2number = {(21,): 592, (24,): 92630,(30,): 7734} 
tokenIDs2location = {(7,8): 'Hong Kong'} 

for locationTokenIDs, location in tokenIDs2location.items(): 
    for numberTokenIDs, number in tokenIDs2number.items():  
     sentenceDict = {}  
     sentenceDict["sentence"] = sample  
     sentenceDict["location-value-pair"] = {location:number} 
     for numberTokenID in numberTokenIDs:         
      finalTokens = sample.split() 
      finalTokens[numberTokenID] = "NUMBER_SLOT" 
      finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = "LOCATION_SLOT" 
      slotSentence = (" ").join(finalTokens) 
      sentenceDict["parsedSentence"] = slotSentence 
      print(slotSentence) 

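Output:

In the first 11 months of 2004 L O C A T I O N _ S L O T 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 L O C A T I O N _ S L O T 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 L O C A T I O N _ S L O T 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.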
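
With the code above the location slot comes out as one character per token, because assigning a bare string to a list slice iterates over the string. A minimal sketch of the difference, using a shortened, hypothetical token list; wrapping the slot name in a one-element list keeps it whole:

```python
tokens = "of 2004 Hong Kong 's airport".split()

# Assigning a bare string to a slice inserts one list element per character.
spread = list(tokens)
spread[2:4] = "LOCATION_SLOT"
print(" ".join(spread))

# Assigning a one-element list replaces the whole slice with a single token.
whole = list(tokens)
whole[2:4] = ["LOCATION_SLOT"]
print(" ".join(whole))
```

The first print shows the spaced-out letters, the second shows a single LOCATION_SLOT token.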

This can be extended to work for locations and numbers that contain any number of spaces.

sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo." 

tokenIDs2number = {(22,22): '592', (25,26): '92 630',(32,33): '7 734'} 
tokenIDs2location = {(7,9): 'Hong Kong Central'} 

for locationTokenIDs, location in tokenIDs2location.items(): 
    for numberTokenIDs, number in tokenIDs2number.items():  
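     finalTokens = sample.split()
     # Assign one-element lists so that each slot stays a single token. The number
     # span is replaced first because it lies to the right of the location span,
     # which keeps the location indices valid.
     finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
     finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]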
     slotSentence = (" ").join(finalTokens) 
     print(slotSentence) 

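Output:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.

We achieve this by having both numberTokenIDs and locationTokenIDs be 2-element tuples that specify a range of tokens for each location/number.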
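
To keep every combination rather than only the last one, the slice replacement can be wrapped in a function that collects one dictionary per pair. This is a minimal sketch, not a definitive implementation. It assumes that the first and last element of each key delimit a contiguous token span (so a 1-tuple such as (21,) and a pair such as (7, 8) are handled alike) and that indices are 0-based positions in sample.split(). The function name make_slot_sentences is hypothetical:

```python
def make_slot_sentences(sample, tokenIDs2location, tokenIDs2number):
    """Return one dict per (location, number) pair with both spans replaced by slots."""
    results = []
    for locationTokenIDs, location in tokenIDs2location.items():
        for numberTokenIDs, number in tokenIDs2number.items():
            tokens = sample.split()
            spans = [
                (locationTokenIDs[0], locationTokenIDs[-1], "LOCATION_SLOT"),
                (numberTokenIDs[0], numberTokenIDs[-1], "NUMBER_SLOT"),
            ]
            # Replace the right-most span first so the earlier indices stay valid.
            for start, end, slot in sorted(spans, reverse=True):
                tokens[start:end + 1] = [slot]
            results.append({
                "sentence": sample,
                "location-value-pair": {location: number},
                "parsedSentence": " ".join(tokens),
            })
    return results

sample = ("In the first 11 months of 2004 Hong Kong 's international airport at "
          "Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , "
          "and more than 7,734 tons of cargo.")
tokenIDs2number = {(21,): 592, (24,): 92630, (30,): 7734}
tokenIDs2location = {(7, 8): "Hong Kong"}

for entry in make_slot_sentences(sample, tokenIDs2location, tokenIDs2number):
    print(entry["parsedSentence"])
```

Sorting the spans in reverse order means the code does not depend on whether the location or the number comes first in the sentence, and using the first and last element of each key avoids a separate branch for single-token keys.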

+0

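This is a good answer and logically sound. Can you explain why the location slot is separated by whitespace? Also, how can I make this generic? Sometimes a slot spans more than two tokens (for example a country like "Democratic Republic of the Congo"), and numbers, not just locations, may also span multiple tokens. I am using 'len(locationTokenIDs)' but I am not masking the necessary countries –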

+0

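This works for countries with any number of spaces, because the values in locationTokenIDs represent slice endpoints and are treated as such in the code. I have updated my answer; the code now works for locations and numbers with any number of spaces –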

+0

I just adapted your code, but unfortunately it does not let me add multiple slot sentence examples in separate 'sentenceDicts'. I also had to include an if statement, such as 'if len(numberTokenIDs) > 1: finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = "NUMBER_SLOT" else: finalTokens[numberTokenID] = "NUMBER_SLOT"' –

0

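Consider using str.replace() instead of splitting and slicing the sentence string. For this you need to render the values in tokenIDs2number with thousands separators, which, as @JonClements comments, can be done with format(int, ',') in Python 2.7+: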

sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights "
          "92,630 passengers , and more than 7,734 tons of cargo.")
tokenIDs2number = {(22,): 592, (25,): 92630, (34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}

sentenceList = []
# Iterate over every (sentence, location item, number item) combination
for item in [[s, i, j] for s in [sample]
             for i in tokenIDs2location.items()
             for j in tokenIDs2number.items()]:
    sentenceDict = {}
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    # format(n, ',') renders 92630 as '92,630', matching the text in the sentence
    sentenceDict["parsedSentence"] = (sample.replace(item[1][1], 'LOCATION_SLOT')
                                            .replace(format(item[2][1], ','), 'NUMBER_SLOT'))
    sentenceList.append(sentenceDict)

Output:

[{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than NUMBER_SLOT tons of cargo.", 'location-value-pair': {'Hong Kong': 7734}} 
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights 92,630 passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 592}} 
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights NUMBER_SLOT passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 92630}}] 
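
One caveat with plain str.replace() is that it also matches inside longer tokens, so replacing 592 would alter a token such as 1,592 as well. A minimal sketch of a stricter variant, assuming the sentence is tokenized with whitespace around each number; the helper name replace_number and the example sentence are hypothetical:

```python
import re

def replace_number(sentence, value, slot="NUMBER_SLOT"):
    """Replace `value`, written with thousands separators, only where it is a whole token."""
    text = format(value, ",")   # e.g. 92630 -> '92,630'
    # (?<!\S) and (?!\S) require whitespace or a string boundary on both sides.
    pattern = r"(?<!\S)" + re.escape(text) + r"(?!\S)"
    return re.sub(pattern, slot, sentence)

print(format(92630, ","))
print(replace_number("handled 592 flights and 1,592 vans", 592))
print("handled 592 flights and 1,592 vans".replace("592", "NUMBER_SLOT"))
```

The last line shows what the plain replace would do to the same sentence, for comparison.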
+0

It is nice that you credit Mike DeSimone's recipe, but for 2.7+ you can now simply write 'format(int_value, ',')' ... –

+0

@JonClements Does that mean I can replace 'replace(intWithCommas(item[2][1]), 'NUMBER_SLOT')' with 'replace(format(item[2][1], ','), 'NUMBER_SLOT')'? –

+0

@DhruvGhulati yes ... –
