I want to remove duplicate URLs from a file containing a list of URLs. My bugun_url_given.txt has "http://www.bugun.com.tr/ara/Ak%20Parti/1"; I fetch all the URLs from it, but they come out duplicated. I want to save only the unique URLs to "bugun_url_collection.txt". Here is my code (removing duplicate URLs in Python):
from cookielib import CookieJar
import urllib2
import json
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    text_file = open('bugun_url_given.txt', 'r')
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select('div.nwslist a')
        for link in links:
            print link
        #unique_url = set(map(lambda url : url.strip("/ "), links))
        with open('bugun_url_collection.txt', 'a') as f:
            for link in links:
                f.write(link.get('href') + '\n')
except ValueError:
    pass
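
For reference, here is a minimal sketch of one way to deduplicate (an assumption about the intent, not the asker's code): collect every href into a set while looping over the pages, then write the set out once at the end so each URL appears only one time. The file names, the div.nwslist a selector, and the strip("/ ") normalisation are taken from the code above; everything else is illustrative, using the same Python 2 / urllib2 setup.

    # Sketch: accumulate hrefs in a set, then write unique URLs once at the end
    from cookielib import CookieJar
    import urllib2
    from bs4 import BeautifulSoup

    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    unique_urls = set()
    with open('bugun_url_given.txt', 'r') as text_file:
        for line in text_file:
            soup = BeautifulSoup(opener.open(line.strip()))
            for link in soup.select('div.nwslist a'):
                href = link.get('href')
                if href:
                    # normalise the same way as the commented-out line above
                    unique_urls.add(href.strip('/ '))

    # open in 'w' mode (not 'a') so reruns do not re-append old results
    with open('bugun_url_collection.txt', 'w') as f:
        for url in sorted(unique_urls):
            f.write(url + '\n')
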
So what have you tried so far? – 2014-10-10 22:44:18