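I wrote code to crawl a web page and save its images. For some reason I get an error that I don't know how to fix: a java.lang.IllegalArgumentException when using Jsoup in Java.
I use a method to make sure that every image I index actually exists, so I don't know why this is happening.
Here is my code: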
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.*;
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.*;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
public class jsoup {
    public static void main(String[] args) throws IOException {
        crawl("http://www.istockphoto.com/photo");
    }

    public static void crawl(String crawlurl) throws IOException {
        Document doc = Jsoup.connect(crawlurl).get();
        getImgFromLinks(doc);
    }

    public static void getImgFromLinks(Document doc) throws IOException {
        Elements links = doc.select("a[href]");
        //System.out.println(links);
        for (int i = 0; i < links.size(); i++) {
            if (exists(links.get(i).attr("href"))) {
                System.out.println("crawled: " + links.get(i).attr("href"));
                getImages(doc, links.get(i).attr("href"));
            } else {
                System.out.println("I couldnt crawl: " + links.get(i).attr("href"));
            }
        }
    }

    public static String smartUrl(String url, String src) {
        if (exists(src)) {
            return (src);
        } else {
            return (url + src);
        }
    }

    public static void getImages(Document doc, String url) throws IOException {
        for (int i = 0; i < doc.getElementsByTag("img").size(); i++) {
            Element image = doc.select("img").get(i);
            String imgsrc = image.attr("src");
            if (imgsrc.toLowerCase().contains("png") || imgsrc.toLowerCase().contains("jpg") || imgsrc.toLowerCase().contains("jpeg") || imgsrc.toLowerCase().contains("gif")) {
                int slashIndex = smartUrl(url, imgsrc).lastIndexOf('/');
                String finalUrl = smartUrl(url, imgsrc).substring(slashIndex);
                URL imgurl = new URL(smartUrl(url, imgsrc));
                if (exists(imgurl.toString())) {
                    Image crawledimg = ImageIO.read(imgurl);
                    ImageIO.write((RenderedImage) crawledimg, "gif", new File("/Users/Jonathan/Desktop/crawledimages" + finalUrl));
                    System.out.println("I got an image from:" + url + " Image Name: " + finalUrl);
                }
            }
        }
    }

    public static boolean exists(String URLName) {
        try {
            HttpURLConnection.setFollowRedirects(false);
            //HttpURLConnection.setInstanceFollowRedirects(false);
            HttpURLConnection con =
                (HttpURLConnection) new URL(URLName).openConnection();
            con.setRequestMethod("HEAD");
            return (con.getResponseCode() == HttpURLConnection.HTTP_OK);
        }
        catch (Exception e) {
            return false;
        }
    }
}
Here is the output:
crawled: http://www.istockphoto.com/
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /facebook.png
I got an image from:http://www.istockphoto.com/ Image Name: /twitter.png
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /cartWhite.png
I couldnt crawl: #
I couldnt crawl: http://www.istockphoto.com/sign-in/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw==
I couldnt crawl: http://www.istockphoto.com/join/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw==
crawled: http://www.istockphoto.com/photo
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /facebook.png
I got an image from:http://www.istockphoto.com/photo Image Name: /twitter.png
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
Exception in thread "main" java.lang.IllegalArgumentException: im == null!
at javax.imageio.ImageIO.write(ImageIO.java:1457)
at javax.imageio.ImageIO.write(ImageIO.java:1527)
at jsoup.getImages(jsoup.java:68)
at jsoup.getImgFromLinks(jsoup.java:34)
at jsoup.crawl(jsoup.java:24)
at jsoup.main(jsoup.java:19)
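The images are saved up until the error occurs.
Does anyone know how to fix this?
Also, for some reason, the same images on the page are being saved multiple times.
Thanks for your time,
Jonathan Oren.
Have you tried running the code in a debugger to determine how you end up with a null value? – jtahlborn 2013-02-22 19:51:52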
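Going by the stack trace, one likely cause is that ImageIO.read returns null when no registered ImageReader can decode the data it receives (for example, when the URL serves HTML or an unsupported format). ImageIO.write then throws IllegalArgumentException: im == null!. The exists() check only confirms an HTTP 200 response, not that the body is a decodable image. The repeated saves probably come from getImages scanning the same doc once for every link, on top of the page itself using the same image several times. Below is a minimal sketch of a null guard plus de-duplication. The class name SafeImageSaver and the helpers saveImage and distinct are hypothetical names for illustration, not part of the original code, and "png" is an assumed output format.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.imageio.ImageIO;

// Hypothetical helper class; the names here are illustrative only.
class SafeImageSaver {

    // Tries to decode the image at 'source' and write it to 'target'.
    // Returns false instead of throwing when the data cannot be decoded,
    // because ImageIO.read returns null in that case.
    static boolean saveImage(URL source, File target, String format) throws IOException {
        BufferedImage img = ImageIO.read(source);
        if (img == null) {
            return false; // not a decodable image, so skip it
        }
        // ImageIO.write itself returns false if no writer supports 'format'.
        return ImageIO.write(img, format, target);
    }

    // Drops repeated URLs while keeping first-seen order, so that each
    // image is downloaded and saved only once.
    static Set<String> distinct(List<String> urls) {
        return new LinkedHashSet<>(urls);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java SafeImageSaver <image-url> <output-file>");
            return;
        }
        URL source = URI.create(args[0]).toURL();
        boolean saved = saveImage(source, new File(args[1]), "png");
        System.out.println(saved
                ? "saved " + args[1]
                : "skipped (not a decodable image): " + args[0]);
    }
}
```

In the crawler above, this would mean collecting the img src values first, passing them through distinct, and calling saveImage for each one instead of calling ImageIO.write directly on the result of ImageIO.read.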