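I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can I check for that? For strings we can use the contains method, but there is nothing like notContains. Any suggestions would be appreciated. I also want to crawl the URLs whose content type is not text/html.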
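There is no notContains method on String, but the same check can be written by negating contains with the ! operator. Below is a minimal sketch of that idea; the class name NotContainsDemo and the method isNotHtml are made up for illustration, and treating a missing header as "not HTML" is an assumption. It checks application/xhtml+xml because that is the registered XHTML media type.

```java
import java.util.Locale;

final class NotContainsDemo {
    /** Returns true when the header value does NOT mention an HTML media type. */
    static boolean isNotHtml(String headerValue) {
        // Assumption: a missing Content-Type is treated as "not HTML".
        if (headerValue == null) {
            return true;
        }
        // Media types are case-insensitive, so compare in lower case.
        String lower = headerValue.toLowerCase(Locale.ROOT);
        // "not contains" is simply the negation of contains.
        return !(lower.contains("text/html") || lower.contains("application/xhtml+xml"));
    }

    public static void main(String[] args) {
        System.out.println(isNotHtml("text/html; charset=ISO-8859-1")); // prints false
        System.out.println(isNotHtml("application/pdf"));               // prints true
    }
}
```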
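The key variable contains:
    Content-Type=[text/html; charset=ISO-8859-1]
Below is the code that checks for text/html. I also tried to handle content types other than text/html, but that branch also prints URLs whose content type is text/html.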
        try {
            URL url1 = new URL(url);
            System.out.println("URL:- " + url1);
            URLConnection connection = url1.openConnection();
            Map responseMap = connection.getHeaderFields();
            Iterator iterator = responseMap.entrySet().iterator();
            while (iterator.hasNext())
            {
                String key = iterator.next().toString();
                if (key.contains("text/html") || key.contains("text/xhtml"))
                {
                    System.out.println(key);
                    // Content-Type=[text/html; charset=ISO-8859-1]
                    if (filters.matcher(key) != null) {
                        System.out.println(url1);
                        try {
                            final File parentDir = new File("crawl_html");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            // Creates crawl_html/<hash>.txt if it does not exist yet
                            boolean success = file.createNewFile();
                            System.out.println("hash:-" + hash);
                            System.out.println(file);
                            // Open the file in append mode
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);
                            // Also could be written as follows on one line
                            // PrintWriter out = new PrintWriter(new FileWriter(args[0]));
                            // Write text to file
                            Tika t = new Tika();
                            String content = t.parseToString(new URL(url1.toString()));
                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            out.println(success);
                            out.println(content);
                            out.println("===============================================================");
                            out.close();
                            fos.flush();
                            fos.close();
                        } catch (FileNotFoundException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (IOException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (TikaException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        }
                        // http://google.com
                    }
                }
                else if (!connection.getContentType().startsWith("text/html")) // prints duplicate records of each url
                //else if (!key.contains("text/html"))
                {
                    if (filters.matcher(key) != null) {
                        try {
                            final File parentDir = new File("crawl_media");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            // Creates crawl_media/<hash>.txt if it does not exist yet
                            boolean success = file.createNewFile();
                            System.out.println("hash:-" + hash);
                            Tika t = new Tika();
                            String content_media = t.parseToString(new URL(url1.toString()));
                            // Open the file in append mode
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);
                            // Also could be written as follows on one line
                            // PrintWriter out = new PrintWriter(new FileWriter(args[0]));
                            // Write text to file
                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            out.println(success);
                            out.println(content_media);
                            //out.println("===============================================================");
                            out.close();
                            fos.flush();
                            fos.close();
                        } catch (FileNotFoundException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (IOException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (TikaException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("=============");
    }
}
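A side note on the filters.matcher(key) != null test in the code above: Pattern.matcher always returns a Matcher object and never null, so that condition is always true and does not filter anything. A rough sketch of a check that actually consults the pattern is shown here. The regular expression is a made-up stand-in, since the real filters pattern is not shown, and the names FilterCheck and isFiltered are hypothetical.

```java
import java.util.regex.Pattern;

final class FilterCheck {
    // Hypothetical pattern: the real `filters` field is not shown in the question.
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png)$", Pattern.CASE_INSENSITIVE);

    /** True when the whole URL string matches the filter pattern. */
    static boolean isFiltered(String url) {
        // matches() asks whether the input matches; matcher() alone only builds the Matcher.
        return FILTERS.matcher(url).matches();
    }

    public static void main(String[] args) {
        // matcher() never returns null, so a null check is always true.
        System.out.println(FILTERS.matcher("anything") != null);          // prints true
        System.out.println(isFiltered("http://example.com/logo.png"));    // prints true
        System.out.println(isFiltered("http://example.com/index.html"));  // prints false
    }
}
```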
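One approach is to check each content type individually, e.g. for PDF it is application/pdf:
    if (key.contains("application/pdf"))
and the same way for XML... but is there any other approach besides this one?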
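Instead of listing every non-HTML type one by one, it may be simpler to reduce the header value to its bare media type (the part before any ";" parameter) and test it against a small set of HTML types; everything else then counts as non-HTML. This is only a sketch under that assumption. The names MediaTypes, baseType and isHtml are invented for illustration, and the set of HTML types is my choice rather than something from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class MediaTypes {
    // Assumption: these two media types are the ones that count as "HTML".
    private static final Set<String> HTML_TYPES =
            new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));

    /** Strips parameters such as "; charset=..." and normalises case. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True only for the HTML media types listed above; anything else is "media". */
    static boolean isHtml(String contentType) {
        return HTML_TYPES.contains(baseType(contentType));
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=ISO-8859-1")); // prints text/html
        System.out.println(isHtml("application/pdf"));                 // prints false
    }
}
```

With a helper like this, the non-HTML branch becomes !isHtml(connection.getContentType()) and no per-type checks for PDF, XML and so on are needed.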
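This doesn't work.. it also picks up the links whose content type is text/html.. any other ideas? I have also updated the question with the code I use for both text/html and non-text/html. – ferhan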
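Like this? If getContentType also returns the "[", then strip it off by using getContentType.substring(1). – emboss
It is working now, but it prints duplicate records. For a particular URL the response has many headers, so the code checks every header, and whenever that header does not start with "text/html" it prints the URL. So if a particular URL that is not text/html has 8 headers in its response, that URL is printed 8 times.. hope you understand what I am saying. – ferhan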
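One possible way around the duplicates might be to decide once per URL instead of once per header: call getContentType() a single time, outside any loop over the header fields, and pick the output directory from that one value. The following is a rough sketch of that idea and has not been run against the crawler in question. The class CrawlTarget and its two methods are hypothetical names; the directory names crawl_html and crawl_media are taken from the code in the question.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

final class CrawlTarget {
    /** Picks the output directory from a single Content-Type value. */
    static String directoryForType(String contentType) {
        // Assumption: a missing Content-Type is filed under crawl_media.
        if (contentType == null) {
            return "crawl_media";
        }
        String lower = contentType.trim().toLowerCase(Locale.ROOT);
        boolean html = lower.startsWith("text/html")
                || lower.startsWith("application/xhtml+xml");
        return html ? "crawl_html" : "crawl_media";
    }

    /** Opens the URL and classifies it exactly once, with no loop over headers. */
    static String directoryForUrl(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // getContentType() returns the Content-Type header value, or null if unknown.
        return directoryForType(connection.getContentType());
    }

    public static void main(String[] args) {
        System.out.println(directoryForType("text/html; charset=ISO-8859-1")); // prints crawl_html
        System.out.println(directoryForType("application/pdf"));               // prints crawl_media
    }
}
```

Because the decision is made from one value per connection, each URL would be written at most once, whatever the number of headers in its response.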