Getting the URLs whose content type is not text/html while crawling

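I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can I check for that? For strings we can use the contains method, but there is nothing like a notcontains method. Any suggestions would be appreciated.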

The key variable contains: 

Content-Type=[text/html; charset=ISO-8859-1]

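Below is the code that checks for text/html. I also tried to handle content types that are not text/html, but that branch also prints URLs whose content type is text/html.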

try { 
      URL url1 = new URL(url); 
      System.out.println("URL:- " +url1); 
      URLConnection connection = url1.openConnection(); 

      Map responseMap = connection.getHeaderFields(); 
      Iterator iterator = responseMap.entrySet().iterator(); 
      while (iterator.hasNext()) 
      { 
       String key = iterator.next().toString(); 

       if (key.contains("text/html") || key.contains("text/xhtml")) 
       { 
        System.out.println(key); 
        // Content-Type=[text/html; charset=ISO-8859-1] 
        if (filters.matcher(key) != null){ 
         System.out.println(url1); 
         try { 
          final File parentDir = new File("crawl_html"); 
          parentDir.mkdir(); 
          final String hash = MD5Util.md5Hex(url1.toString()); 
          final String fileName = hash + ".txt"; 
          final File file = new File(parentDir, fileName); 
          boolean success =file.createNewFile(); // Creates file crawl_html/abc.txt 


          System.out.println("hash:-" + hash); 

            System.out.println(file); 
          // Create file if it does not exist 



           // File did not exist and was created 
           FileOutputStream fos = new FileOutputStream(file, true); 

           PrintWriter out = new PrintWriter(fos); 

           // Also could be written as follows on one line 
           // Printwriter out = new PrintWriter(new FileWriter(args[0])); 

              // Write text to file 
           Tika t = new Tika(); 
           String content= t.parseToString(new URL(url1.toString())); 


           out.println("==============================================================="); 
           out.println(url1); 
           out.println(key); 
           out.println(success); 
           out.println(content); 

           out.println("==============================================================="); 
           out.close(); 
           fos.flush(); 
           fos.close(); 



         } catch (FileNotFoundException e) { 
          // TODO Auto-generated catch block 
          e.printStackTrace(); 
         } catch (IOException e) { 
          // TODO Auto-generated catch block 

          e.printStackTrace(); 
         } catch (TikaException e) { 
          // TODO Auto-generated catch block 
          e.printStackTrace(); 
         } 


         // http://google.com 
        } 
       } 
    else if (!connection.getContentType().startsWith("text/html"))//print duplicate records of each url 
       //else if (!key.contains("text/html")) 
       { 
        if (filters.matcher(key) != null){ 
        try { 
         final File parentDir = new File("crawl_media"); 
         parentDir.mkdir(); 
         final String hash = MD5Util.md5Hex(url1.toString()); 
         final String fileName = hash + ".txt"; 
         final File file = new File(parentDir, fileName); 
        // Create file if it does not exist 
         boolean success =file.createNewFile(); // Creates file crawl_media/<hash>.txt 


         System.out.println("hash:-" + hash); 

         Tika t = new Tika(); 
         String content_media= t.parseToString(new URL(url1.toString())); 



          // File did not exist and was created 
          FileOutputStream fos = new FileOutputStream(file, true); 

          PrintWriter out = new PrintWriter(fos); 

          // Also could be written as follows on one line 
          // Printwriter out = new PrintWriter(new FileWriter(args[0])); 

             // Write text to file 
          out.println("==============================================================="); 
          out.println(url1); 
          out.println(key); 
          out.println(success); 
          out.println(content_media); 
          //out.println("==============================================================="); 
          out.close(); 
          fos.flush(); 
          fos.close(); 




        } catch (FileNotFoundException e) { 
         // TODO Auto-generated catch block 
         e.printStackTrace(); 
        } catch (IOException e) { 
         // TODO Auto-generated catch block 

         e.printStackTrace(); 
        } catch (TikaException e) { 
         // TODO Auto-generated catch block 
         e.printStackTrace(); 
        } 
        } 

       } 



      } 
     } catch (MalformedURLException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 



     System.out.println("============="); 
    } 
}

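One way is to check each content type individually. For a PDF, for example, it is application/pdf:

if (key.contains("application/pdf"))

and the same way for XML... but is there any other way than this?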


2011-07-11 ferhan

Would this help?

if (!connection.getContentType().startsWith("text/html"))


2011-07-11 18:25:43 emboss

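This doesn't work. It also picks up the links whose content type is text/html. Any other ideas? I have also updated the question with the code I use for both text/html and non-text/html. – ferhan

Like this? If getContentType also returns the "[", then strip it by using getContentType().substring(1). – emboss

It is working now, but it is printing duplicate records. For a given URL there are many headers in the response, so the code checks every header, and whenever a header does not start with "text/html" it prints the URL. So if a particular URL that is not text/html has 8 headers in the response, it prints that URL 8 times. I hope you understand what I mean. – ferhan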
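The duplicate output described in the comment above comes from testing the content type inside the loop over all response headers. A minimal sketch of checking it once per URL instead might look like the following. The class and method names (`ContentTypeCheck`, `isHtml`, `isHtmlUrl`) are made up for illustration, treating a missing header as "not HTML" is an assumption, and `application/xhtml+xml` is included alongside the `text/xhtml` string used in the question.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper, not part of the original code.
class ContentTypeCheck {

    // Returns true when a Content-Type header value denotes HTML or XHTML.
    // A missing header (null) is treated as "not HTML"; that choice is an assumption.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=ISO-8859-1" and normalise the case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("text/xhtml")
                || mediaType.equals("application/xhtml+xml");
    }

    // Opens the URL and classifies it exactly once, without iterating over the headers.
    static boolean isHtmlUrl(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        return isHtml(connection.getContentType());
    }

    public static void main(String[] args) throws IOException {
        // Each URL passed on the command line is printed once, in one of two buckets.
        for (String address : args) {
            System.out.println((isHtmlUrl(address) ? "html:  " : "other: ") + address);
        }
    }
}
```

With this shape the "not text/html" case is simply `!isHtml(...)`, so no separate notcontains method is needed, and the null check avoids a NullPointerException when the server sends no Content-Type header.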

What's wrong with using:

if (key.contains("text/html") || key.contains("text/xhtml")) { 
    //Do stuff 
} else if (key.contains("application/pdf")) { 
    //Do other stuff 
} else { 
    //All other cases 
}

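Since the handling of other formats may differ from one content type to another, you will probably need an explicit case for each content type. If you run into a generic content type, wouldn't a generic approach (the else branch) be sufficient? The Strategy Pattern might be useful to you.

I apologize if I have misunderstood your question. Could you provide a sample printout of the different values that key takes during a test run? (Line 10 of your code.)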
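As a rough illustration of the Strategy Pattern suggestion, the if/else chain could be replaced by a map from media type to handler, with a fallback playing the role of the final else. This is only a sketch under assumptions: `ContentTypeDispatcher`, `register`, and `handle` are hypothetical names, and the handlers here just return a label where real code would write the file.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher, not part of the original code.
class ContentTypeDispatcher {

    // Maps a bare media type such as "application/pdf" to the strategy that handles it.
    private final Map<String, Function<String, String>> strategies = new HashMap<>();
    // Used for every content type that has no registered strategy (the "else" case).
    private final Function<String, String> fallback;

    ContentTypeDispatcher(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    void register(String mediaType, Function<String, String> strategy) {
        strategies.put(mediaType.toLowerCase(Locale.ROOT), strategy);
    }

    // Strips parameters from the header value, then applies the matching strategy to the URL.
    String handle(String contentType, String url) {
        String mediaType = contentType == null
                ? ""
                : contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return strategies.getOrDefault(mediaType, fallback).apply(url);
    }

    public static void main(String[] args) {
        ContentTypeDispatcher dispatcher =
                new ContentTypeDispatcher(url -> "crawl_media: " + url);
        dispatcher.register("text/html", url -> "crawl_html: " + url);
        dispatcher.register("application/pdf", url -> "crawl_pdf: " + url);

        System.out.println(dispatcher.handle("text/html; charset=ISO-8859-1", "http://example.com/"));
        System.out.println(dispatcher.handle("image/png", "http://example.com/logo.png"));
    }
}
```

Adding support for a new content type then means one more `register` call, and anything unregistered still lands in the fallback instead of being silently missed.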


2011-07-11 18:54:45 Grambot

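Thanks for replying. The problem is that I don't know how many content types there are. In my case I need two things: first, all the URLs whose content type is text/html or text/xhtml, and second, all the URLs whose content type is not text/html or text/xhtml. One way is to print out every URL, look at its content type, and then add an if/else branch for that content type. But if someone later adds a page with some other content type, I might miss it. I hope that is clearer now. – ferhan

'key' holds the values of the response headers for a particular URL. Every URL has a content type, which is why I am checking for text/html. – ferhan

In your case it is best to implement a solution for all known content types and warn the user when an unknown one is encountered. Writing a system that works in 100% of cases is impossible, so your goal is to keep the system from failing badly when an unknown content type shows up. Use the 'else' case to catch unknown or unhandled content types, and either print a warning or take a "best effort" approach (perhaps using regular expressions), but be prepared for unexpected behavior. – Grambot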
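One possible reading of that advice, sketched below: known types are matched exactly, a regular expression gives a best-effort guess for media types, and anything else is logged as a warning rather than dropped. The class name, the `Kind` enum, and the particular lists of types are assumptions chosen for illustration, and `Set.of` assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;
import java.util.regex.Pattern;

// Hypothetical classifier, not part of the original code.
class ContentTypeClassifier {

    enum Kind { HTML, DOCUMENT, MEDIA, UNKNOWN }

    private static final Logger LOG =
            Logger.getLogger(ContentTypeClassifier.class.getName());

    // Explicitly known types; these lists are illustrative, not exhaustive.
    private static final Set<String> HTML_TYPES =
            Set.of("text/html", "text/xhtml", "application/xhtml+xml");
    private static final Set<String> DOCUMENT_TYPES =
            Set.of("application/pdf", "application/xml", "text/xml", "text/plain");

    // Best-effort guess: any image, audio, or video subtype counts as media.
    private static final Pattern MEDIA_PATTERN = Pattern.compile("^(image|audio|video)/.+$");

    static Kind classify(String contentType) {
        String mediaType = contentType == null
                ? ""
                : contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (HTML_TYPES.contains(mediaType)) {
            return Kind.HTML;
        }
        if (DOCUMENT_TYPES.contains(mediaType)) {
            return Kind.DOCUMENT;
        }
        if (MEDIA_PATTERN.matcher(mediaType).matches()) {
            return Kind.MEDIA;
        }
        // Unknown or missing type: warn so that new content types are noticed.
        LOG.warning("Unhandled content type: " + contentType);
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("text/html; charset=ISO-8859-1"));
        System.out.println(classify("image/png"));
        System.out.println(classify("application/x-something-new"));
    }
}
```

A crawler could still store `UNKNOWN` results with the other non-HTML URLs, so that the warning only serves as a prompt to add an explicit case later.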
