Extracting URLs by type from a string

I am trying to extract URLs from strings. They are not standardized, so some are inside href attributes and others appear on their own.

I also need them categorized, so take the following strings for example:

val txt1: String = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
val txt2: String = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
val txt3: String = """Some more! <a href="http://www.google.com/">Link!</a>"""

So these strings are all concatenated and contain 3 URLs. I am looking for something along the lines of:

val result: List[(String, List[String])] = List(
    "mp3" -> List("http://www.google.com/test.mp3"),
    "img" -> List("http://www.google.com/test.jpg"),
    "url" -> List("http://www.google.com/")
)

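I have looked into regular expressions but only got this far: it extracts the hrefs without determining their type, and it does not pick up URLs that sit outside a tag.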

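import scala.util.matching.Regex
// Capture group 1 holds the URL inside the href attribute.
val hrefRegex = new Regex("""<a.*?href="(http:.*?)".*?>.*?</a>""")
val hrefs: List[String] = hrefRegex.findAllMatchIn(txt1).map(_.group(1)).toList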

Any help is much appreciated, thanks in advance :)

Source

2011-10-14 jhdevuk

You should use an HTML parser such as jsoup. –

Thanks Kim, do you know of any articles to get me started? Dependencies, imports, etc.? – jhdevuk

Assuming val txt = txt1 + txt2 + txt3, you can wrap the text in a root XML element as a string, parse it as XML, and use the standard XML library to extract the anchors.

// can do other cleanup if necessary here such as changing "link!" 
def normalize(t: String) = t.toLowerCase() 

val txtAsXML = xml.XML.loadString("<root>" + txt + "</root>") 
val anchors = txtAsXML \\ "a" 
// returns scala.xml.NodeSeq containing the <a> tags

Then you just need to post-process until the data is organized the way you want:

val tuples = anchors.map(a => normalize(a.text) -> a.attributes("href").toString) 
// Seq[(String, String)] containing elements
// like "mp3" -> "http://www.google.com/test.mp3"

val byTypes = tuples.groupBy(_._1).mapValues(seq => seq.map(_._2)) 
// here grouped by types: 
// Map(img -> List(http://www.google.com/test.jpg), 
//  link! -> List(http://www.google.com/), 
//  mp3 -> List(http://www.google.com/test.mp3))
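
The grouping above keys on the link text, so the last entry comes out as "link!" rather than the "url" category from the question, and URLs that are not inside an <a> tag are not seen at all. A minimal regex-only sketch of one way to cover both, assuming the category should be derived from the URL's file extension and that a simple pattern is good enough to find URLs in this input. The names UrlsByType, UrlRegex, categories, classify and extractByType are hypothetical, and the extension table is an assumption to extend as needed:

```scala
import scala.util.matching.Regex

// Hypothetical helper, for illustration only.
object UrlsByType {
  // Matches http/https URLs up to the next whitespace, quote, or angle bracket,
  // so it finds URLs inside href="..." as well as bare URLs in the text.
  val UrlRegex: Regex = """https?://[^\s"'<>]+""".r

  // Assumed mapping from file extension to category.
  val categories: Map[String, String] = Map(
    "mp3" -> "mp3",
    "jpg" -> "img", "jpeg" -> "img", "png" -> "img", "gif" -> "img"
  )

  // Anything without a known extension falls back to the generic "url" category.
  def classify(url: String): String = {
    // Ignore any query string or fragment, then inspect the last path segment.
    val path = url.takeWhile(c => c != '?' && c != '#')
    val lastSegment = path.substring(path.lastIndexOf('/') + 1)
    val dot = lastSegment.lastIndexOf('.')
    val ext = if (dot >= 0) lastSegment.substring(dot + 1).toLowerCase else ""
    categories.getOrElse(ext, "url")
  }

  // Collect every URL once, in order of appearance, and group by category.
  def extractByType(text: String): Map[String, List[String]] =
    UrlRegex.findAllIn(text).toList.distinct.groupBy(classify)
}
```

Calling UrlsByType.extractByType(txt) on the concatenated strings from the question should give a Map with the keys "mp3", "img" and "url", each holding the one matching URL. The pattern is only an approximation: for example, a bare URL followed directly by a full stop would keep that trailing dot, so real-world HTML is still better handled by a parser.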

Source

2011-10-14 13:39:04 huynhjl

Thanks a lot, that definitely puts me in the right direction :) – jhdevuk
