2016-08-14 24 views


sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."


tokenIDs2number = {(22,): 592.00, (25,): 92630.00, (34,): 7734.00}
tokenIDs2location = {(8,9): "Hong Kong"}


In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo. 


In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 

In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 




for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        # Every token ID is handled on its own here, so a two-token
        # location such as (8, 9) yields two sentences with one slot each.
        for locationTokenID in locationTokenIDs:
            for numberTokenID in numberTokenIDs:
                finalTokens = cleanSample.split()
                finalTokens[numberTokenID] = "NUMBER_SLOT"
                finalTokens[locationTokenID] = "LOCATION_SLOT"
                slotSentence = " ".join(finalTokens)
                sentenceDict["parsedSentence"] = slotSentence


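Note that this is only an example. The number could even be 24000000 where the sentence says "24 million", and the same goes for trillion, million, billion and thousand.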
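To make that note concrete, here is a minimal sketch of how a phrase such as "24 million" could be mapped to the stored value. The function name phrase_to_number and the multiplier table are assumptions for illustration, not part of the original code, and only the "figure plus one scale word" pattern is handled.

```python
# Hypothetical helper: the names and the multiplier table are assumptions.
MULTIPLIERS = {
    "thousand": 10 ** 3,
    "million": 10 ** 6,
    "billion": 10 ** 9,
    "trillion": 10 ** 12,
}


def phrase_to_number(phrase):
    """Convert a phrase like '24 million' or '92,630' to a float."""
    tokens = phrase.replace(",", "").split()
    value = float(tokens[0])
    # An optional second token scales the leading figure.
    if len(tokens) > 1:
        value *= MULTIPLIERS[tokens[1].lower()]
    return value


print(phrase_to_number("24 million"))
print(phrase_to_number("92,630"))
```

A number written this way occupies two tokens in the sentence, so its key in tokenIDs2number would cover both token positions.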


In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo. 





I first allow the sentence to contain several LOCATION_SLOT and NUMBER_SLOT slots: if a tuple in the combination covers two or more tokens, I fill all of them:

sentences2location2values = []

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        # Start from a fresh token list for every pair (assumes the
        # token IDs index into sample.split()).
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"
        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"
        slotSentence = " ".join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)

# Collapse runs of the same adjacent slot into a single slot.
for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i, token in enumerate(sampleTokens):
        if i > 0 and ((token == "LOCATION_SLOT" and sampleTokens[i-1] == "LOCATION_SLOT") or (token == "NUMBER_SLOT" and sampleTokens[i-1] == "NUMBER_SLOT")):
            continue  # skip the repeated slot
        newTokens.append(token)
    sentence['parsedSentence'] = ' '.join(newTokens)
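The collapsing step can also be written with itertools.groupby, which groups equal adjacent items. This is only a sketch of an alternative; the helper name collapse_slots and the SLOTS set are assumptions, not names from the question.

```python
from itertools import groupby

# Assumed set of slot markers; extend it if other slot types are used.
SLOTS = {"LOCATION_SLOT", "NUMBER_SLOT"}


def collapse_slots(sentence):
    """Merge each run of the same adjacent slot token into one token."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOTS:
            collapsed.append(token)   # one copy per run of slots
        else:
            collapsed.extend(run)     # ordinary repeated words are kept
    return " ".join(collapsed)


print(collapse_slots("of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport"))
```

Only slot markers are merged, so a legitimately repeated word in the sentence is left alone.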

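The code treats each locationTokenID as a slot of its own, when the locationTokenIDs really are the endpoints of the slice of tokens that should become a single slot. So we need to remove the for locationTokenID in locationTokenIDs: loop (which iterates over each locationTokenID as though it were a slot) and replace the slice defined by the pair of locationTokenIDs with one slot.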


sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."

tokenIDs2number = {(21,): 592, (24,): 92630, (30,): 7734}
tokenIDs2location = {(7,8): 'Hong Kong'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        for numberTokenID in numberTokenIDs:
            finalTokens = sample.split()
            finalTokens[numberTokenID] = "NUMBER_SLOT"
            # Assign a one-element list: a bare string would be spliced in
            # character by character ("L O C A T I O N _ S L O T").
            finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
            slotSentence = " ".join(finalTokens)
            sentenceDict["parsedSentence"] = slotSentence
            print(slotSentence)


In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.


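To handle locations and numbers that span several tokens, both numberTokenIDs and locationTokenIDs are 2-length tuples that give the range of tokens for each location or number:

sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo."

tokenIDs2number = {(22,22): '592', (25,26): '92 630', (32,33): '7 734'}
tokenIDs2location = {(7,9): 'Hong Kong Central'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # Replace the number first: it sits to the right of the location,
        # so shrinking its span does not shift the location indices.
        # One-element lists keep each slot as a single token.
        finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
        finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
        slotSentence = " ".join(finalTokens)
        print(slotSentence)


In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.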
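Two details matter when replacing token ranges like this. Slice assignment iterates over its right-hand side, so a bare string contributes one list element per character, and replacing a span changes the indices of everything to its right. The sketch below handles both by assigning one-element lists and working from right to left. The helper name fill_slots and the example sentence are made up for illustration.

```python
def fill_slots(tokens, spans):
    """Return a copy of tokens with each inclusive (start, end) span
    replaced by its slot label."""
    result = list(tokens)
    # Right-most span first, so earlier indices stay valid after a
    # multi-token span shrinks to a single token.
    for (start, end), label in sorted(spans.items(), reverse=True):
        result[start:end + 1] = [label]  # a list, not a bare string
    return result


# Hypothetical sentence with a five-token location and a two-token number.
tokens = "In 2004 Democratic Republic of the Congo handled 92 630 passengers".split()
spans = {(2, 6): "LOCATION_SLOT", (8, 9): "NUMBER_SLOT"}
print(" ".join(fill_slots(tokens, spans)))
# In 2004 LOCATION_SLOT handled NUMBER_SLOT passengers
```

Because the spans are sorted, the caller does not have to know whether the number sits before or after the location in the sentence.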


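This is a great answer and logically sound. Can you explain why the location slot is separated by whitespace? Also, how can I make this generic? Sometimes a slot spans more than two tokens, e.g. a country like "Democratic Republic of the Congo", and numbers as well as locations can span several tokens. I am using 'len(locationTokenIDs)' but I am not masking out the necessary countries –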


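This works for countries with any number of spaces, because the values in locationTokenIDs are slice endpoints and the code treats them as such. I have updated my answer with code that works for locations and numbers with any number of spaces –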


I just adapted your code, but unfortunately it does not let me add the multi-slot sentence examples to separate 'sentenceDicts'. I also had to include an if statement, such as 'if len(numberTokenIDs) > 1: finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = "NUMBER_SLOT" else: finalTokens[numberTokenID] = "NUMBER_SLOT"' –
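One way to avoid the if statement is to normalise every key to an inclusive (start, end) pair before slicing, and to append a new dict for each location/number pair. The following is a sketch under those assumptions; normalize_span and build_sentence_dicts are hypothetical names, and the token IDs are taken to be 0-based indices into sample.split().

```python
def normalize_span(token_ids):
    """Turn (i,) or (start, end) into an inclusive (start, end) pair."""
    return (token_ids[0], token_ids[-1])


def build_sentence_dicts(sample, ids2location, ids2number):
    """Build one dict per location/number pair from a fresh token list."""
    results = []
    for loc_ids, location in ids2location.items():
        for num_ids, number in ids2number.items():
            tokens = sample.split()
            spans = [(normalize_span(loc_ids), "LOCATION_SLOT"),
                     (normalize_span(num_ids), "NUMBER_SLOT")]
            # Replace the right-most span first so indices do not shift.
            for (start, end), label in sorted(spans, reverse=True):
                tokens[start:end + 1] = [label]
            results.append({
                "sentence": sample,
                "location-value-pair": {location: number},
                "parsedSentence": " ".join(tokens),
            })
    return results


sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights , "
          "92,630 passengers , and more than 7,734 tons of cargo.")
dicts = build_sentence_dicts(sample, {(7, 8): "Hong Kong"},
                             {(21,): 592, (24,): 92630, (30,): 7734})
for d in dicts:
    print(d["parsedSentence"])
```

Each pair gets its own dict, so nothing is overwritten between iterations.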


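Consider using str.replace() instead of splitting and slicing the sentence string. For this you need to render the values in tokenIDs2number with thousands separators, which, as @JonClements points out in the comments, can be done with format(int, ',') in Python 2.7+: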

sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights "
          "92,630 passengers , and more than 7,734 tons of cargo.")
tokenIDs2number = {(22,): 592, (25,): 92630, (34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}

sentenceList = []
# Pair the sample with every (location, number) combination.
for item in [[s, i, j] for s in [sample]
             for i in tokenIDs2location.items()
             for j in tokenIDs2number.items()]:
    sentenceDict = {}
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    # format(n, ',') renders 92630 as '92,630', matching the sentence text.
    sentenceDict["parsedSentence"] = (
        sample.replace(item[1][1], 'LOCATION_SLOT')
              .replace(format(item[2][1], ','), 'NUMBER_SLOT'))
    sentenceList.append(sentenceDict)

print(sentenceList)


[{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than NUMBER_SLOT tons of cargo.", 'location-value-pair': {'Hong Kong': 7734}},
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights 92,630 passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 592}},
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights NUMBER_SLOT passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 92630}}] 
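One caveat with plain str.replace() is that it matches substrings anywhere, so a value such as 592 would also be replaced inside a longer token like 1592. A possible guard, sketched here with the standard re module, is to require whitespace or the string edge on both sides of the match. The helper name replace_exact and the example sentence are made up for illustration.

```python
import re


def replace_exact(sentence, text, slot):
    """Replace text with slot only where it stands as a whole
    whitespace-delimited chunk."""
    # (?<!\S) and (?!\S) demand whitespace or the string edge on each side.
    pattern = r"(?<!\S)" + re.escape(text) + r"(?!\S)"
    return re.sub(pattern, slot, sentence)


sentence = "In 1592 the port handled 592 ships"  # hypothetical example
print(sentence.replace("592", "NUMBER_SLOT"))         # also hits the year
print(replace_exact(sentence, "592", "NUMBER_SLOT"))  # only the standalone 592
print(replace_exact("about 92,630 passengers", format(92630, ','), "NUMBER_SLOT"))
```

This still replaces every standalone occurrence, so if the same value appears twice in a sentence the token-index approach is the safer choice.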

While it's nice that you credit Mike DeSimone's recipe ... for 2.7+ you can now write that as 'format(int_value, ',')' ... –


@JonClements Does this mean I can replace 'replace(intWithCommas(item[2][1]), 'NUMBER_SLOT')' with 'replace(format(item[2][1], ','), 'NUMBER_SLOT')'? –


@DhruvGhulati yes ... –
