java.net.MalformedURLException: no protocol: /intl/en/policies/ on GET request

I have been trying to write a simple program that goes through all the links on a page, visits each one, and then recurses. However, once it is running it stops with the following error:

java.net.MalformedURLException: no protocol: /intl/en/policies/ 
at java.net.URL.<init>(Unknown Source) 
at java.net.URL.<init>(Unknown Source) 
at java.net.URL.<init>(Unknown Source) 
at me.dylan.WebCrawler.WebC.sendGetRequest(WebC.java:67) 
at me.dylan.WebCrawler.WebC.<init>(WebC.java:27) 
at me.dylan.WebCrawler.WebC.main(WebC.java:36)

My code:

package me.dylan.WebCrawler; 

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.HttpURLConnection; 
import java.net.MalformedURLException; 
import java.net.URL; 
import java.util.ArrayList; 

import javax.swing.text.BadLocationException; 
import javax.swing.text.EditorKit; 
import javax.swing.text.MutableAttributeSet; 
import javax.swing.text.html.HTML; 
import javax.swing.text.html.HTMLDocument; 
import javax.swing.text.html.HTMLEditorKit; 

public class WebC { 
// FileUtil f; 
    int linkamount=0; 
    ArrayList<URL> visited = new ArrayList<URL>(); 
    ArrayList<String> urls = new ArrayList<String>(); 
    public WebC() { 

     try { 
//   f= new FileUtil(); 
      sendGetRequest("http://www.google.com"); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 
     catch (BadLocationException e) { 
      e.printStackTrace(); 
     } 
    } 
    public static void main(String[] args) { 
     new WebC(); 
    } 
    public void sendGetRequest(String path) throws IOException, BadLocationException, MalformedURLException { 

     URL url = new URL(path); 
     HttpURLConnection con = (HttpURLConnection) url.openConnection(); 
     con.setRequestMethod("GET"); 
     con.setRequestProperty("Content-Language", "en-US"); 
     BufferedReader rd = new BufferedReader(new InputStreamReader(con.getInputStream())); 
     EditorKit kit = new HTMLEditorKit(); 
     HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument(); 
     doc.putProperty("IgnoreCharsetDirective", new Boolean(true)); 
     kit.read(rd, doc, 0); 

     //Get all <a> tags (hyperlinks) 
     HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); 
     while (it.isValid()) 
     { 
      MutableAttributeSet mas = (MutableAttributeSet)it.getAttributes(); 
      //get the HREF attribute value in the <a> tag 
      String link = (String)mas.getAttribute(HTML.Attribute.HREF); 
      if(link!=null && !link.isEmpty()) { 
       urls.add(link); 
      } 

      it.next(); 
     } 
     for(int i=urls.size()-1;i>=0;i--) { 
      if(urls.get(i)!=null) { 
       if(/*f.searchforString(urls.get(i)) ||*/ visited.contains(new URL(urls.get(i)))) { 
        urls.remove(i); 
        continue; 
       } else { 
        System.out.println(linkamount++); 
        System.out.println(path); 
        visited.add(new URL(path)); 
        //f.write(urls.get(i)); 
        sendGetRequest(urls.get(i)); 
       } 
       try { 
        Thread.sleep(100); 
       } catch (InterruptedException e) { 
        e.printStackTrace(); 
       } 
      } 
     }   
    } 
}

Honestly, I have no idea how to fix this. Apparently Google has an href attribute whose value is not a valid URL. How can I work around this?

2013-03-29 Dylan Katz

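You have to prepend the base URL to the relative path. The URL constructor expects an absolute URL of the form http://abc.com/int/etc/etc.

The HREF values on the page are in relative form, so just prepend http://www.google.com to each HREF you extract before making the GET request.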
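As an illustration of the advice above, here is a minimal sketch that resolves an HREF against the URL of the page it was found on, using `java.net.URI.resolve` from the standard library. The class name `LinkResolver` and the method `resolve` are hypothetical and not part of the question's code. It assumes the page URL ends with a path (for example a trailing `/`) and that the HREF is a syntactically valid URI.

```java
import java.net.URI;

class LinkResolver {
    // Hypothetical helper: turns an HREF found on pageUrl into an absolute URL string.
    // URI.create throws IllegalArgumentException if either argument is not a valid URI.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // The relative link from the stack trace, resolved against the start page.
        System.out.println(resolve("http://www.google.com/", "/intl/en/policies/"));
    }
}
```

The returned string can then be passed to `sendGetRequest`, since it now carries a protocol and a host.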

2013-03-29 17:25:00

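Thanks, got it working now!

Another problem popped up. With this code: http://pastie.org/7164939 I get: http://pastie.org/7164948

The problem is the ".." at the end of www.google.com. You want to check for this edge case before opening the connection. In this particular case, you want to check whether the HREF contains "." before appending it to the link.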

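A quick fix is to append urls.get(i) to requestPath before making the call. That gives it a protocol and a domain to use. The only drawback is that if you do not check the current URL in the loop for an existing protocol and domain, you may end up with something like this: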

http://www.google.com/http://www.yahoo.com/policies
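One way to avoid the doubled URL shown above is to check whether the HREF already has a scheme before prepending anything. The following is a rough sketch only: `HrefNormalizer` and `withBase` are hypothetical names, and it assumes the base has no trailing slash and that any HREF without a scheme is root-relative (starts with `/`), as in the question. Other relative forms are not handled.

```java
import java.net.URI;

class HrefNormalizer {
    // Hypothetical helper: prepends base only when href lacks a scheme.
    // Assumes base has no trailing slash and relative hrefs start with "/".
    // URI.create throws IllegalArgumentException for hrefs that are not valid URIs.
    static String withBase(String base, String href) {
        URI uri = URI.create(href);
        if (uri.isAbsolute()) {
            // href already has a scheme such as http:, so leave it alone
            return href;
        }
        return base + href;
    }

    public static void main(String[] args) {
        String base = "http://www.google.com";
        System.out.println(withBase(base, "/intl/en/policies/"));
        System.out.println(withBase(base, "http://www.yahoo.com/policies"));
    }
}
```

With this check, the Yahoo link is returned unchanged instead of being glued onto the Google host.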

2013-03-29 17:26:41
