
Python: only the last line of output gets written

I am trying to write a program that extracts URLs from a website. The printed output is fine, but when I try to write the output to a file, only the last record gets written. Here is the code:

import re
import urllib.request

# Retrieves URLs from the HTML source code of a website
def extractUrls(url, unique=True, sort=True, restrictToTld=None):
    # Prepend "www." if not present
    if url[0:4] != "www.":
        url = "".join(["www.", url])
    # Open a connection
    with urllib.request.urlopen("http://" + url) as h:
        # Grab the headers
        headers = h.info()
        # Default charset
        charset = "ISO-8859-1"
        # If a charset is in the headers then override the default
        for i in headers:
            match = re.search(r"charset=([\w\-]+)", headers[i], re.I)
            if match != None:
                charset = match.group(1).lower()
                break
        # Grab and decode the source code
        source = h.read().decode(charset)
        # Find all URLs in the source code
        matches = re.findall(r"http\:\/\/(www.)?([a-z0-9\-\.]+\.[a-z]{2,6})\b", source, re.I)
        # Abort if no URLs were found
        if matches == None:
            return None
        # Collect URLs
        collection = []
        # Go over URLs one by one
        for url in matches:
            url = url[1].lower()
            # If there are more than one dot then the URL contains
            # subdomain(s), which we remove
            if url.count(".") > 1:
                temp = url.split(".")
                tld = temp.pop()
                url = "".join([temp.pop(), ".", tld])
            # Restrict to TLD if one is set
            if restrictToTld:
                tld = url.split(".").pop()
                if tld != restrictToTld:
                    continue
            # If only unique URLs should be returned
            if unique:
                if url not in collection:
                    collection.append(url)
            # Otherwise just add the URL to the collection
            else:
                collection.append(url)
        # Done
        return sorted(collection) if sort else collection

# Test 
url = "msn.com" 
print("Parent:", url) 
for x in extractUrls(url): 
    print("-", x) 

f = open("f2.txt", "w+", 1) 
f.write(x) 
f.close() 

The output is:

Parent: msn.com 
- 2o7.net 
- atdmt.com 
- bing.com 
- careerbuilder.com 
- delish.com 
- discoverbing.com 
- discovermsn.com 
- facebook.com 
- foxsports.com 
- foxsportsarizona.com 
- foxsportssouthwest.com 
- icra.org 
- live.com 
- microsoft.com 
- msads.net 
- msn.com 
- msnrewards.com 
- myhomemsn.com 
- nbcnews.com 
- northjersey.com 
- outlook.com 
- revsci.net 
- rsac.org 
- s-msn.com 
- scorecardresearch.com 
- skype.com 
- twitter.com 
- w3.org 
- yardbarker.com 
[Finished in 0.8s] 

Only "yardbarker.com" gets written to the file. I would appreciate some help, thanks.


Open the file before the loop, write to it inside the loop, and close it afterwards. – alexvassel


I tried that, but it still ends with the same problem. –

Answers

url = "msn.com" 
print("Parent:", url) 
f = open("f2.txt", "w",) 
for x in extractUrls(url): 
    print("-", x) 
    f.write(x) 
f.close() 
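
Only the last record appears in the original because f.write(x) runs once, after the loop has finished, when x still holds the value from the final iteration; moving the write inside the loop fixes that. A sketch of the same fix using a with-block, which closes the file automatically, plus a newline so each URL lands on its own line:

# Same fix, sketched with a context manager so the file is
# closed automatically even if an exception is raised.
url = "msn.com"
print("Parent:", url)
with open("f2.txt", "w") as f:
    for x in extractUrls(url):
        print("-", x)
        f.write(x + "\n")  # newline keeps one URL per line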

You need to open your file and then write each x inside the for loop.

Finally, you can close the file.

f = open("f2.txt", "w+",1) 

for x in extractUrls(url): 
    print("-", x) 
    f.write(x) 

f.close() 

I already tried that, but it didn't work. I still have the same problem. Thanks for the suggestion though. –


As the other answers say, the file write must go inside the loop. Also, try writing a newline character \n after each x:

f = open("f2.txt", "w+") 
for x in extractUrls(url): 
    print("-", x) 
    f.write(x + '\n')
f.close() 

Also, the line `return sorted(collection) if sort else collection` has two levels of indentation; it should have one.
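
For illustration, a minimal sketch of that structure with a simpler, hypothetical function: the return sits at one indent level, in the function body, so it runs after the with-block has already closed the file:

# Minimal sketch of the indentation point; read_lines is a made-up
# example, not part of the original code.
def read_lines(path):
    collection = []
    with open(path) as f:              # two indents: inside the with-block
        for line in f:
            collection.append(line.rstrip("\n"))
    return collection                  # one indent: file already closed here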

Also, your subdomain-stripping code may not give you what you expect: something like www.something.com.au will come out of the `for` loop as just com.au.
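
To see that concretely, here is the dot-counting logic from extractUrls pulled out into a hypothetical helper (strip_subdomains is an illustrative name, not part of the original code):

# The subdomain-stripping logic from extractUrls, isolated for demonstration.
def strip_subdomains(url):
    # Keep only the last two dot-separated parts, as the original does.
    if url.count(".") > 1:
        temp = url.split(".")
        tld = temp.pop()
        url = "".join([temp.pop(), ".", tld])
    return url

print(strip_subdomains("www.msn.com"))           # msn.com (as intended)
print(strip_subdomains("www.something.com.au"))  # com.au (registered domain lost)

Counting dots cannot distinguish subdomains from multi-part TLDs; the usual fix is to consult the Public Suffix List, for example via a third-party package such as tldextract.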