Grouping documents that share a phone number

My database contains a large collection of hotels (approx. 121,000).

This is what a document in my collection looks like:

{ 
    "_id" : ObjectId("57bd5108f4733211b61217fa"), 
    "autoid" : 1, 
    "parentid" : "P01982.01982.110601173548.N2C5", 
    "companyname" : "Sheldan Holiday Home", 
    "latitude" : 34.169552, 
    "longitude" : 77.579315, 
    "state" : "JAMMU AND KASHMIR", 
    "city" : "LEH Ladakh", 
    "pincode" : 194101, 
    "phone_search" : "9419179870|253013", 
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR", 
    "email" : "", 
    "website" : "", 
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/", 
    "area" : "Leh Ladakh", 
    "data_city" : "Leh Ladakh" 
}

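Each document can have one or more phone numbers, separated by the "|" delimiter.

I have to group together the documents that share a phone number.

This needs to work in real time: when a user opens a particular hotel in the web interface to see its details, I should be able to display all the hotels grouped with it by common phone numbers.

The grouping is transitive: if one hotel links to another, and that hotel links to a third, all three hotels should be grouped together.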

Example : Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C has phone numbers 2|3, then A, B and C should be grouped together.

from pymongo import MongoClient 
from pprint import pprint #Pretty print 
import re #for regex 
#import unicodedata 

client = MongoClient() 

cLen = 0 
cLenAll = 0 
flag = 0 
countA = 0 
countB = 0 
list = [] 
allHotels = [] 
conContact = [] 
conId = [] 
hotelTotal = [] 
splitListAll = [] 
contactChk = [] 

#We'll be passing the value later as parameter via a function call 
#hId = 37443; 

regx = re.compile("^Vivanta", re.IGNORECASE) 

#Connection 
db = client.hotel 
collection = db.hotelData 

#Finding hotels wrt search input 
for post in collection.find({"companyname":regx}): 
    list.append(post) 

#Copying all hotels in a list 
for post1 in collection.find(): 
    allHotels.append(post1) 

hotelIndex = 11 #Index of hotel selected from search result 
conIndex = hotelIndex 
x = list[hotelIndex]["companyname"] #Name of selected hotel 
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel 

try: 
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList' 
except: 
    splitList = y 


print "Contact details of",x,":" 

#Printing all contacts... 
for contact in splitList: 
    print contact 
    conContact.extend(contact) 
    cLen = cLen+1 

print "No. of contacts in",x,"=",cLen 


for i in allHotels: 
    yAll = allHotels[countA]["phone_search"] 
    try: 
     splitListAll.append(yAll.split("|")) 
     countA = countA+1 
    except: 
     splitListAll.append(yAll) 
     countA = countA + 1 
# print splitListAll 

#count = 0 

#This block has errors 
#Add code to stop when no new links occur and optimize the outer for loop 
#for j in allHotels: 
for contactAll in splitListAll: 
    if contactAll in conContact: 
     conContact.extend(contactAll) 
#  contactChk = contactAll 
#  if (set(conContact) & set(contactChk)): 
#   conContact = contactChk 
#   contactChk[:] = [] #drop contactChk list 
     conId = allHotels[countB]["autoid"] 
    countB = countB+1 

print "Printing the list of connected hotels..." 
for final in collection.find({"autoid":conId}): 
    print final

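This is the code I wrote in Python. In it, I try to perform a linear search inside a for loop. I am getting some errors right now, but it should run once they are corrected.

I need an optimized version, because linear search has poor time complexity.

I am pretty new to this, so any other suggestions for improving the code are welcome.

Thanks.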

Source

2016-08-30 Anubhav

I don't know much about it, but could Elasticsearch help me solve this problem? – Anubhav

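The simplest answer to any in-memory search problem in Python is "use a dictionary". A dictionary gives O(1) average-case key lookup, whereas searching a list is O(N).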

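Also remember that you can put a Python object into as many dictionaries (or lists) as you like, and into the same dictionary or list as many times as you need. The object is not copied; each entry is just a reference to it.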
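A minimal sketch of the reference behaviour described above. The sample record and the names by_phone and by_name are made up for illustration; only the phone_search field mirrors the question's schema.

```python
# Hypothetical sample record (not taken from the real collection).
hotel = {"companyname": "Hotel A", "phone_search": "1|2"}

# The same dict object is stored under three different keys.
by_phone = {"1": [hotel], "2": [hotel]}
by_name = {"Hotel A": hotel}

# Every entry refers to one and the same object; nothing was copied.
same_object = by_phone["1"][0] is by_phone["2"][0] is by_name["Hotel A"]

# A change made through one reference is visible through all the others.
by_name["Hotel A"]["city"] = "Leh"
city_via_phone_index = by_phone["2"][0]["city"]
```

This is why indexing 121,000 hotels under several phone numbers each costs only a few extra references per hotel, not extra copies of the documents.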

So the essentials look like this:

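# hotels: any iterable of hotel documents, e.g. the result of collection.find()
hotelsbyphone = {}  # maps a phone number to the list of hotels that use it
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)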

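At the end of this loop, hotelsbyphone["123456"] will be a list of the hotel objects that have "123456" as one of the numbers in their phone_search string. The key coding feature is the .setdefault(key, []) method, which initializes an empty list if the key is not yet in the dictionary, so that you can then append to that list.

Once you have built this index, lookups will be fast: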

try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    hotels = []

Or, instead of try ... except, test with if x in hotelsbyphone: before doing the lookup.
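The question also asks for transitive grouping (A has 1|2, B has 3|4, C has 2|3, so A, B and C belong together). One way to get that from the phone index is a breadth-first walk, sketched below. This is only an illustration under assumptions: the function names phones_of, build_phone_index and connected_hotels are made up, autoid is assumed to be unique per hotel, and empty phone strings are skipped so that hotels with no number are not all linked to each other.

```python
from collections import deque

def phones_of(hotel):
    # str() is a guard in case a single number was stored as an integer.
    return [p for p in str(hotel["phone_search"]).split("|") if p]

def build_phone_index(hotels):
    index = {}
    for hotel in hotels:
        for phone in phones_of(hotel):
            index.setdefault(phone, []).append(hotel)
    return index

def connected_hotels(start, index):
    """Return every hotel reachable from start through shared phone numbers."""
    seen_ids = {start["autoid"]}
    seen_phones = set()
    group = [start]
    queue = deque([start])
    while queue:
        hotel = queue.popleft()
        for phone in phones_of(hotel):
            if phone in seen_phones:
                continue  # this number's hotels were already expanded
            seen_phones.add(phone)
            for other in index.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group

# Sample data following the A/B/C example in the question, plus an unrelated D.
hotels = [
    {"autoid": 1, "companyname": "A", "phone_search": "1|2"},
    {"autoid": 2, "companyname": "B", "phone_search": "3|4"},
    {"autoid": 3, "companyname": "C", "phone_search": "2|3"},
    {"autoid": 4, "companyname": "D", "phone_search": "9"},
]
index = build_phone_index(hotels)
group_names = sorted(h["companyname"] for h in connected_hotels(hotels[0], index))
alone_names = [h["companyname"] for h in connected_hotels(hotels[3], index)]
```

Each hotel and each phone number is expanded at most once, so a lookup costs time proportional to the size of the group it returns rather than to the whole collection.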

Source

2016-08-30 09:14:03 nigel222

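Thanks for your answer. How do I iterate over "phones" in the for loop? Please correct me if I'm wrong, but since it is an integer value, it can't be iterated over, can it? – Anubhav

I am using the try/except block because, for documents that contain only one phone number, I get an error since the "|" delimiter is not present. – Anubhav

'hotel["phone_search"]' is a string, and the '.split("|")' method returns a list of strings (a one-element list if there is no vertical bar). So 'for phone in phones' iterates over each string in that list. – nigel222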
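A short sketch of the .split("|") behaviour mentioned in the last comment. The helper name split_phones is hypothetical, and the str() call reflects an assumption rather than something confirmed in the thread: that the error on single-number documents may come from the value being stored as an integer instead of a string.

```python
def split_phones(value):
    # str() is an assumption: it covers a lone number stored as an integer.
    # Empty pieces (e.g. from an empty string) are dropped.
    return [p for p in str(value).split("|") if p]

two = split_phones("9419179870|253013")   # string with a delimiter
one = split_phones("253013")              # string without a delimiter
from_int = split_phones(253013)           # integer value
none = split_phones("")                   # empty field
```

With a helper like this, no try/except is needed around the split.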
