Removing duplicate URLs in Python

I want to remove duplicate URLs from a file that contains a list of URLs. My bugun_url_given.txt contains "http://www.bugun.com.tr/ara/Ak%20Parti/1"; I fetch all of the URLs from that page, but they come out duplicated. All of the unique URLs should be saved to "bugun_url_collection.txt". Here is my code:

from cookielib import CookieJar
import urllib2
import json
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    text_file = open('bugun_url_given.txt', 'r')
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select('div.nwslist a')
        for link in links:
            print link
            #unique_url = set(map(lambda url : url.strip("/ "), links))
            with open('bugun_url_collection.txt', 'a') as f:
                for link in links:
                    f.write(link.get('href') + '\n')
except ValueError:
    pass

What have you tried so far? – 2014-10-10 22:44:18

Answers

for link in links:
    f.write(link.get('href') + '\n')

could become

for link in set(link.get('href') for link in links):
    f.write(link + '\n')

In response to the comment (which is correct), let's rewrite this properly:

from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))


def write_links_to_file(links):
    with open('bugun_url_collection.txt', 'a') as f:
        for link in links:
            f.write(link + '\n')


def get_links_from_file(text_file):
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select('div.nwslist a')
        for link in links:
            yield link.get('href')


with open('bugun_url_given.txt', 'r') as text_file:
    unique_links = set(get_links_from_file(text_file))

write_links_to_file(unique_links)

Given the code in the question, the set() in that code will not remove all of the duplicates. – jfs 2014-10-10 22:47:06
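The point of that comment: in the question's code the deduplication happens inside the per-page loop, so a URL that appears on more than one result page is still written more than once. A minimal sketch with hypothetical data:

pages = [['http://example.com/a', 'http://example.com/b'],
         ['http://example.com/a']]

written = []
for hrefs in pages:
    # a set built per page only removes duplicates within that page
    for href in set(hrefs):
        written.append(href)

print written        # 'http://example.com/a' still appears twice
print set(written)   # collect everything first, then deduplicate once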


You could do:

hrefs = [] 
for link in links: 
    print link 
    hrefs.append(link.get('href')) 
hrefs = list(set(hrefs)) 
with open('bugun_url_collection.txt', 'a') as f: 
    f.write('\n'.join(hrefs)) 

You should separate the code that generates the links from the code that saves them:

def generate_urls(filename, urlopen):
    with open(filename) as file:
        for line in file:
            soup = BeautifulSoup(urlopen(line.strip()))
            for link in soup.select('div.nwslist a[href^="http"]'):
                yield link['href']

links = set(generate_urls('bugun_url_given.txt', opener.open))
with open('bugun_url_collection.txt', 'w') as file:
    file.write("\n".join(links))
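A side note on the design choice: because generate_urls takes urlopen as a parameter, it can be exercised without any network access. A minimal sketch, using a hypothetical fake_urlopen stub that returns a fixed HTML page (and assuming bugun_url_given.txt exists):

def fake_urlopen(url):
    # ignore the url and return canned HTML instead of hitting the network
    return '<div class="nwslist"><a href="http://example.com/1">x</a></div>'

# every input line now yields 'http://example.com/1', so the set has one element
print set(generate_urls('bugun_url_given.txt', fake_urlopen))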

You nested your for loops, so you iterate over the links len(links) times.

links = soup.select('div.nwslist a')
for link in links:
    ...
    with open('bugun_url_collection.txt', 'a') as f:
        for link in links:
            f.write(link.get('href') + '\n')
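To see the effect with plain data, a minimal sketch (hypothetical href list):

links = ['a', 'b', 'c']
written = []
for link in links:            # outer loop: once per link
    for link in links:        # inner loop: writes every link again on each pass
        written.append(link)
print len(written)            # 9, i.e. each href ends up written len(links) times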

What you really want is:

with open('bugun_url_given.txt', 'r') as text_file, open('bugun_url_collection.txt', 'a') as f:
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = set(link for link in soup.select('div.nwslist a'))
        for link in links:
            print link
            #unique_url = set(map(lambda url : url.strip("/ "), links))
            f.write(link.get('href') + '\n')

@pcurry: When I use your code it gives only one URL, but there are actually 14 URLs altogether.. What about the rest? – 2014-10-10 23:17:00