Change a string with an unknown substring

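Update: I am using Jsoup to parse the text.
While parsing a website, I ran into a problem: when I get the HTML text, some links are corrupted by spaces at random positions. For example: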

What a pretty flower! <a href="www.goo gle.com/...">here</a> and <a href="w ww.google.com...">here</a>

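As you may notice, the position of the space is completely random, but one thing is certain: it is always inside an href attribute. Of course, I could use the replace(" ", "") method, but there may be two or more links. How can I solve this?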

2014-02-21 Groosha

What is wrong with using replace(" ", "") on all the href values? Also, why try to fix data from a site that returns garbage? –

There are also regular expressions, which you can use to identify your links if you just want to use `replace`. Or [JSoup](http://jsoup.org/) (see [this question](http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java)) – eebbesen

Yes, I am parsing with Jsoup, but changing a substring does not change the original string, right? – Groosha
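On the point about substrings: Java strings are immutable, so `replace` never modifies the string it is called on; it returns a new one. A minimal sketch of that for a single link, assuming the href value is delimited by double quotes (the class name, method name, and sample input are made up for illustration):

```java
// Hypothetical helper, not part of Jsoup: shows that String.replace returns a new
// string, so the cleaned href has to be spliced back into a rebuilt copy of the HTML.
class SubstringDemo {

    // Removes spaces from the first double-quoted href value found in html.
    static String fixFirstHref(String html) {
        String marker = "href=\"";
        int at = html.indexOf(marker);
        if (at < 0) {
            return html; // no href attribute, nothing to fix
        }
        int start = at + marker.length();
        int end = html.indexOf('"', start);
        if (end < 0) {
            return html; // unterminated attribute, leave the input alone
        }
        String href = html.substring(start, end);
        String cleaned = href.replace(" ", ""); // href itself is unchanged
        return html.substring(0, start) + cleaned + html.substring(end);
    }

    public static void main(String[] args) {
        String html = "What a pretty flower! <a href=\"www.goo gle.com\">here</a>";
        String fixed = fixFirstHref(html);
        System.out.println(html);  // original, still contains the space
        System.out.println(fixed); // rebuilt copy with the space removed
    }
}
```

With Jsoup, the same idea would mean writing the cleaned value back onto the element with `attr(key, value)` rather than expecting the parsed document to change on its own.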

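This is an old solution, but I would try using the old, retired Apache ECS to parse your HTML and then, only for the href links, remove the spaces and recreate everything :-) If I remember correctly, there is a way to parse HTML into an ECS "DOM".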

http://svn.apache.org/repos/asf/jakarta/ecs/branches/ecs/src/java/org/apache/ecs/html2ecs/Html2Ecs.java

Another option is to use an XPath-like selection to get your hrefs, but you would have to deal with malformed HTML (you could give Tidy a chance - http://infohound.net/tidy/)
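The XPath route might look roughly like this with the JDK's built-in XML APIs. This is only a sketch: it assumes the markup has already been made well-formed (for example by Tidy), and the class name, method name, and sample input are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: selects every a/@href with XPath and strips the spaces.
class HrefXPathSketch {

    // Parses well-formed markup, removes spaces from each href attribute in the DOM,
    // and returns the cleaned values.
    static List<String> cleanHrefs(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> cleaned = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                Node attr = hrefs.item(i);
                attr.setNodeValue(attr.getNodeValue().replace(" ", ""));
                cleaned.add(attr.getNodeValue());
            }
            return cleaned;
        } catch (Exception e) {
            // Malformed input ends up here; real HTML usually needs tidying first.
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<p>What a pretty flower! <a href=\"www.goo gle.com\">here</a>"
                + " and <a href=\"w ww.google.com\">here</a></p>";
        System.out.println(cleanHrefs(markup));
    }
}
```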

2014-02-21 19:04:58 Leo

I'll give it a try, thnx – Groosha

You can use a regular expression to find and "refine" the URLs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLRegex {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        final String INPUT = "Hello World <a href=\"http://ww w.google.com\">Google</a> Second " +
                "Hello World <a href=\"http://www.wiki pedia.org\">Wikipedia</a> Test" +
                "<a href=\"https://www.example.o rg\">Example</a> Test Test";
        System.out.println(INPUT);

        // This pattern matches a sequence of one or more spaces.
        // Precompile it here, so we don't have to do it in every iteration of the loop below.
        Pattern SPACES_PATTERN = Pattern.compile("\\u0020+");

        // The regular expression below is very primitive and does not really check whether the URL is valid.
        // Moreover, only very simple URLs are matched. If a URL uses a different protocol, includes account credentials, ... it is not matched.
        // For more sophisticated regular expressions have a look at: http://stackoverflow.com/questions/161738/
        Pattern PATTERN_A_HREF = Pattern.compile("https?://[A-Za-z0-9\\.\\-\\u0020\\?&\\=#/]+");
        Matcher m = PATTERN_A_HREF.matcher(INPUT);

        // Iterate through all matching strings:
        while (m.find()) {
            String urlThatMightContainSpaces = m.group(); // Get the current match
            Matcher spaceMatcher = SPACES_PATTERN.matcher(urlThatMightContainSpaces);
            System.out.println(spaceMatcher.replaceAll("")); // Replaces all spaces with nothing.
        }

    }
}
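
The program above prints the cleaned URLs but leaves the surrounding HTML untouched. If the goal is a corrected copy of the whole text, one possible variation is to match only the quoted `href` values and rebuild the string with `Matcher.appendReplacement`. A rough sketch, assuming double-quoted attributes; the class name, method name, and sample input are illustrative and not from the answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation: rewrites every href value in place instead of printing it.
class HrefSpaceFixer {

    // Captures the href=" prefix, the attribute value, and the closing quote.
    // Assumption: attribute values are double-quoted and contain no escaped quotes.
    private static final Pattern HREF = Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")");

    // Returns a copy of html in which spaces inside href values are removed.
    static String fixHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group(2).replace(" ", "");
            // quoteReplacement keeps '$' and '\' in a URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + cleaned + m.group(3)));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "What a pretty flower! <a href=\"www.goo gle.com\">here</a>"
                + " and <a href=\"w ww.google.com\">here</a>";
        System.out.println(fixHrefs(html));
    }
}
```

Because only the captured attribute value is touched, spaces in the visible text between the tags are preserved, and any number of links is handled in a single pass.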

2014-02-21 19:32:19 MrSnrub

Hmm.. looks promising – Groosha
