Removing duplicate URLs in Python

I want to remove duplicate URLs from a file that contains a list of URLs. My bugun_url_given.txt contains "http://www.bugun.com.tr/ara/Ak%20Parti/1"; I fetch all of the URLs from that page, but they come out duplicated. All of the unique URLs should be saved to "bugun_url_collection.txt". Here is my code:

from cookielib import CookieJar
import urllib2
import json
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    text_file = open('bugun_url_given.txt', 'r')
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select('div.nwslist a')
        for link in links:
            print link
            #unique_url = set(map(lambda url : url.strip("/ "), links))
            with open('bugun_url_collection.txt', 'a') as f:
                for link in links:
                    f.write(link.get('href') + '\n')
except ValueError:
    pass

What have you tried so far? – 2014-10-10 22:44:18

Answers

for link in links:
    f.write(link.get('href') + '\n')

could become

for link in set(link.get('href') for link in links):
    f.write(link + '\n')

In response to the comment (which is correct), let's rewrite this properly:

from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))


def write_links_to_file(links):
    with open('bugun_url_collection.txt', 'a') as f:
        for link in links:
            f.write(link + '\n')


def get_links_from_file(text_file):
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select('div.nwslist a')
        for link in links:
            yield link.get('href')


with open('bugun_url_given.txt', 'r') as text_file:
    unique_links = set(get_links_from_file(text_file))

write_links_to_file(unique_links)

Given the code in the question, the set() in that code will not remove all of the duplicates. – jfs 2014-10-10 22:47:06
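The point of that comment: in the question's code the deduplication happens inside the per-page loop, so a URL that appears on more than one result page is still written more than once. A minimal sketch with hypothetical data:

pages = [['http://example.com/a', 'http://example.com/b'],
         ['http://example.com/a']]

written = []
for hrefs in pages:
    # a set built per page only removes duplicates within that page
    for href in set(hrefs):
        written.append(href)

print written        # 'http://example.com/a' still appears twice
print set(written)   # collect everything first, then deduplicate once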


You could do:

hrefs = [] 
for link in links: 
    print link 
    hrefs.append(link.get('href')) 
hrefs = list(set(hrefs)) 
with open('bugun_url_collection.txt', 'a') as f: 
    f.write('\n'.join(hrefs)) 

You should separate the code that generates the links from the code that saves them:

def generate_urls(filename, urlopen):
    with open(filename) as file:
        for line in file:
            soup = BeautifulSoup(urlopen(line.strip()))
            for link in soup.select('div.nwslist a[href^="http"]'):
                yield link['href']

links = set(generate_urls('bugun_url_given.txt', opener.open))
with open('bugun_url_collection.txt', 'w') as file:
    file.write("\n".join(links))
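A side note on the design choice: because generate_urls takes urlopen as a parameter, it can be exercised without any network access. A minimal sketch, using a hypothetical fake_urlopen stub that returns a fixed HTML page (and assuming bugun_url_given.txt exists):

def fake_urlopen(url):
    # ignore the url and return canned HTML instead of hitting the network
    return '<div class="nwslist"><a href="http://example.com/1">x</a></div>'

# every input line now yields 'http://example.com/1', so the set has one element
print set(generate_urls('bugun_url_given.txt', fake_urlopen))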

You nested your for loops, so you iterate over the links len(links) times.

links = soup.select('div.nwslist a')
for link in links:
    ...
    with open('bugun_url_collection.txt', 'a') as f:
        for link in links:
            f.write(link.get('href') + '\n')
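To see the effect with plain data, a minimal sketch (hypothetical href list):

links = ['a', 'b', 'c']
written = []
for link in links:            # outer loop: once per link
    for link in links:        # inner loop: writes every link again on each pass
        written.append(link)
print len(written)            # 9, i.e. each href ends up written len(links) times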

What you really want is:

with open('bugun_url_given.txt', 'r') as text_file, open('bugun_url_collection.txt', 'a') as f:
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = set(link for link in soup.select('div.nwslist a'))
        for link in links:
            print link
            #unique_url = set(map(lambda url : url.strip("/ "), links))
            f.write(link.get('href') + '\n')

@pcurry: When I use your code it gives only one URL, but there are actually 14 URLs altogether.. What about the rest? – 2014-10-10 23:17:00