2017-07-28 73 views
0

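Python word match

I have a list of URLs that I want to filter by specific keywords, say word1 and word2, and by a list of stop words [stop1, stop2, stop3]. Is there a way to filter the links without writing many conditions? When I use one condition per stop word I get the correct output, but that does not scale to a longer stop list. Here is the brute-force approach: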

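links = []
for link in url:
    # `word1 or word2 in link` is always true for a non-empty word1,
    # so each keyword is tested separately
    if word1 in link or word2 in link:
        if stop1 not in link:
            if stop2 not in link:
                if stop3 not in link:
                    links.append(link)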
+1

Search for list comprehensions – Abend

+0

Would something like 'stop_words = [stop1, stop2, stop3]' and 'key_words = [word1, word2]', then 'for word in key_words:' 'if word in stop_words:' '# filter code' work for you? –

+0

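I want to keep the URLs that contain word1 or word2 but do not contain any of the stop words. I tried brute force, which is polynomial time. Something like: 'for each in url:' 'if word1 or word2 in each:' 'if any(x for x in stop_words) not in each:' 'print each' – Joe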
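
A note on the two conditions quoted above, as a minimal sketch rather than a definitive fix. The sample link and the string values given to word1, word2 and the stop words are assumptions for illustration only. It shows why `word1 or word2 in link` does not test both keywords, and where the membership test belongs when `any` is used.

```python
word1, word2 = "word1", "word2"
stop_words = ["stop1", "stop2", "stop3"]
link = "http://somewebsite.tld/other"  # hypothetical link with no keyword in it

# Parsed as `word1 or (word2 in link)`. A non-empty string is truthy,
# so the expression evaluates to word1 itself and the test always passes.
always_truthy = word1 or word2 in link

# Testing each keyword separately gives the intended result.
intended = word1 in link or word2 in link

# any(...) returns a bool, so `any(...) not in link` would compare a bool
# with a string; the membership test belongs inside the generator instead.
has_stop_word = any(stop in link for stop in stop_words)

print(always_truthy, intended, has_stop_word)
```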

Answers

0

It would help if you could give an example. If we take sample URLs like the ones below:

def urlSearch():
    end_words = ['gmail', 'finance']
    Key_word = ['google']
    urlList = ['google.com//d/gmail', 'google.com/finance', 'google.com/sports', 'google.com/search']
    for i in urlList:
        main_part = i.split('/')
        # Only look at URLs whose last path segment is one of the end words
        if main_part[-1] in end_words:
            # Split the remaining segments on '.' to get the individual words
            word = []
            for k in main_part[:-1]:
                for j in k.split('.'):
                    word.append(j)
            print(word)
            # Checked inside the `if` so a stale `word` from an earlier URL is never reused
            for p in Key_word:
                if p in word:
                    print("Url is: " + i)

urlSearch()
1

Here are a couple of options I would consider in your situation.

You can use the built-in any and all functions in a list comprehension to filter the unwanted URLs out of the list:

urls = ['http://somewebsite.tld/word', 
     'http://somewebsite.tld/word1', 
     'http://somewebsite.tld/word1/stop3', 
     'http://somewebsite.tld/word2', 
     'http://somewebsite.tld/word2/stop2', 
     'http://somewebsite.tld/word3', 
     'http://somewebsite.tld/stop3/word1', 
     'http://somewebsite.tld/stop4/word1'] 

includes = ['word1', 'word2'] 
excludes = ['stop1', 'stop2', 'stop3'] 

filtered_url_list = [url for url in urls if any(include in url for include in includes) if all(exclude not in url for exclude in excludes)] 
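
To check what the any/all comprehension above keeps, here is a minimal self-contained sketch. The helper name `filter_urls` is hypothetical (it is not from the answer), and matching is plain substring matching, as in the comprehension.

```python
def filter_urls(urls, includes, excludes):
    """Keep URLs that contain any include and none of the excludes (substring match)."""
    return [
        url for url in urls
        if any(word in url for word in includes)
        and not any(stop in url for stop in excludes)
    ]

urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']

kept = filter_urls(urls, ['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
print(kept)
# ['http://somewebsite.tld/word1', 'http://somewebsite.tld/word2',
#  'http://somewebsite.tld/stop4/word1']
```

Note that 'stop4' is not in the exclude list, so the last URL is kept.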

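Or you can write a function that takes a URL as its argument and returns True for the URLs you want to keep and False for the ones you don't, then pass that function, together with the unfiltered list of URLs, to the built-in filter function: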

def urlfilter(url):
    includes = ['word1', 'word2']
    excludes = ['stop1', 'stop2', 'stop3']
    for include in includes:
        if include in url:
            for exclude in excludes:
                if exclude in url:
                    return False
            else:
                # for/else: runs only when no stop word was found
                return True
    # No keyword matched
    return False

urls = ['http://somewebsite.tld/word', 
     'http://somewebsite.tld/word1', 
     'http://somewebsite.tld/word1/stop3', 
     'http://somewebsite.tld/word2', 
     'http://somewebsite.tld/word2/stop2', 
     'http://somewebsite.tld/word3', 
     'http://somewebsite.tld/stop3/word1', 
     'http://somewebsite.tld/stop4/word1'] 

filtered_url_list = list(filter(urlfilter, urls))  # filter() returns a lazy iterator in Python 3
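
If the keyword and stop lists should not be hard-coded inside the predicate, one possible variation is a small factory that builds the predicate for filter. This is only a sketch: `make_url_filter` and `keep` are hypothetical names, not part of the answer above, and the short URL list is just sample data.

```python
def make_url_filter(includes, excludes):
    """Build a predicate suitable for the built-in filter()."""
    def keep(url):
        # True when the URL has at least one keyword and no stop word
        return (any(word in url for word in includes)
                and not any(stop in url for stop in excludes))
    return keep

urls = ['http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word3']

keep = make_url_filter(['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
# list() materialises the result, since filter() is lazy in Python 3
filtered = list(filter(keep, urls))
print(filtered)  # ['http://somewebsite.tld/word1']
```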
-1

I would use sets and a list comprehension:

must_in = {word1, word2}
musnt_in = {stop1, stop2, stop3}
# Assumes each x is a list of words; for a plain URL string, set(x) would be a set of single characters
links = [x for x in url if must_in & set(x) and not (musnt_in & set(x))]
print(links)

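The code above works with any number of keywords and stop words; it is not limited to two keywords (word1, word2) and three stop words (stop1, stop2, stop3).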
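
If the URLs are plain strings rather than lists of words, the set approach needs each URL split into tokens first. A minimal sketch, assuming that splitting on non-alphanumeric characters is an acceptable tokenisation. The helper `tokens` and the sample URLs are assumptions for illustration.

```python
import re

def tokens(url):
    """Return the set of alphanumeric pieces of a URL string."""
    return set(re.findall(r"[A-Za-z0-9]+", url))

must_in = {"word1", "word2"}
musnt_in = {"stop1", "stop2", "stop3"}
urls = ["http://somewebsite.tld/word1",
        "http://somewebsite.tld/word1/stop3",
        "http://somewebsite.tld/word10"]

links = []
for u in urls:
    t = tokens(u)
    # Keep the URL if it shares a token with must_in and none with musnt_in
    if must_in & t and musnt_in.isdisjoint(t):
        links.append(u)

print(links)  # ['http://somewebsite.tld/word1']
```

Unlike a substring test, whole-token matching does not treat 'word10' as a match for 'word1'; whether that is desirable depends on the data.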

+0

Why not just use 'x in must_in and x not in musnt_in'? – perigon

+0

@perigon x is a list, so with lists you would have to check it element by element; sets are more efficient and clearer (see the Zen of Python). – Zoltan