Parsing with the scala-xml API using multiple attributes

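I have an XML document that I am trying to process with the Scala XML API. I have XPath-style queries to retrieve data from the XML tags. I want to retrieve the <price> tag value from <market>, but by matching on two attributes, _id and type. I want to write a condition with && so that I get a single value for each price tag, for example where market _id = 1 && type = "A".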

For reference, here is the XML:

<publisher> 
    <book _id = "0"> 
     <author _id="0">Dev</author> 
     <publish_date>24 Feb 1995</publish_date> 
     <description>Data Structure - C</description> 
     <market _id="0" type="A"> 
      <price>45.95</price>    
     </market> 
     <market _id="0" type="B"> 
      <price>55.95</price> 
     </market> 
    </book> 
    <book _id="1"> 
     <author _id = "1">Ram</author> 
     <publish_date>02 Jul 1999</publish_date> 
     <description>Data Structure - Java</description> 
     <market _id="1" type="A"> 
      <price>145.95</price>   
     </market> 
     <market _id="1" type="B"> 
      <price>155.95</price>   
     </market> 
    </book> 
</publisher>

The code below runs fine:

import scala.xml._ 

object XMLtoCSV extends App { 

    val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml") 

    val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text //45.95 
    val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text //155.95 

    println("price = " + price) 
    println("price1 = " + price1) 
}

The output is:

price = 45.9555.95 
price1 = 145.95155.95

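The code above gives me both values because I cannot work out how to add the && condition.

Please advise which Scala function I can use instead of filter.
Also, let me know how to get all the attribute names.
If possible, please tell me where I can read about all these APIs.

Thanks in advance.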

2017-08-17 Pardeep Sharma

You can write a custom predicate that checks multiple attributes:

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = { 
    node.attribute("_id").exists(_.text == marketId) && 
    node.attribute("type").exists(_.text == marketType) 
}

Then use it as a filter:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text 
// 45.95 

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text 
// 155.95
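The same idea can be generalized so that one predicate covers any number of attributes, which may help when many tags carry different attribute sets. The following is only a minimal sketch: `hasAttrs` and `AttrFilterSketch` are hypothetical names, not part of scala-xml, and the inline XML is a trimmed copy of the sample from the question. It relies on `MetaData.asAttrMap`, which also gives a way to list an element's attribute names.

```scala
import scala.xml._

object AttrFilterSketch {
  // Trimmed copy of the question's sample, inlined so the sketch is self-contained.
  val doc: Elem = XML.loadString(
    """<publisher>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: true when the node carries every given name/value pair.
  def hasAttrs(required: (String, String)*)(node: Node): Boolean = {
    val attrs = node.attributes.asAttrMap // attribute names mapped to their string values
    required.forall { case (name, value) => attrs.get(name).contains(value) }
  }

  val markets: NodeSeq = doc \ "book" \ "market"

  // Both conditions must hold, so only one <market> survives the filter.
  val price: String = (markets.filter(hasAttrs("_id" -> "1", "type" -> "B")) \ "price").text

  // Attribute names of the first <market> element.
  val names: Set[String] = markets.head.attributes.asAttrMap.keySet

  def main(args: Array[String]): Unit = {
    println("price = " + price)
    println("attribute names = " + names.mkString(", "))
  }
}
```

Because the required pairs are passed as arguments, the same helper can be reused for elements with one attribute or six without writing a new function per tag.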

2017-08-17 16:20:25 chunjef

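I appreciate your solution, but can we do this without writing a function? Is there a built-in Scala function that fits this case?

One more thing: I shared only a sample XML with you, but my real XML is very large. It has almost 200 tags, which would mean writing 200 functions, because the attributes differ from tag to tag, with anywhere from one to six attributes. I suppose I would have to write six functions and vary the parameters.

@PardeepSharma Ask another question with a sample of some of those tags. – ashawley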

If you are interested in getting a CSV file of the data, it would be written like this:

// xmlLoad is the document loaded in the question.
(xmlLoad \ "book").flatMap { bk =>
    (bk \ "market").flatMap { mkt =>
        (mkt \ "price").map { p =>
            Seq(
                bk \@ "_id",
                mkt \@ "_id",
                mkt \@ "type",
                p.text.toFloat
            )
        }
    }
}.map { cols =>
    cols.mkString("\t")
}.foreach {
    println
}

It outputs the following:

0  0  A  45.95 
0  0  B  55.95 
1  1  A  145.95 
1  1  B  155.95

A common pattern worth recognizing when writing Scala: most flatMap ... flatMap ... map chains can be rewritten as for-comprehensions:

for {
    book <- xmlLoad \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
    val cols = Seq(
        book \@ "_id",
        market \@ "_id",
        market \@ "type",
        price.text.toFloat
    )
    println(cols.mkString("\t"))
}
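A for-comprehension also accepts an `if` guard, which is one way to express the && condition from the question inline, without a separate predicate function. The following is only a sketch under that assumption: `GuardSketch` and `priceFor` are hypothetical names, and the XML is a trimmed copy of the question's sample.

```scala
import scala.xml._

object GuardSketch {
  // Trimmed copy of the question's sample, inlined so the sketch is self-contained.
  val doc: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: price text of every <market> matching both attributes.
  def priceFor(id: String, marketType: String): Seq[String] =
    for {
      book   <- doc \ "book"
      market <- book \ "market"
      // The guard keeps a market only when both attribute tests hold.
      if (market \@ "_id") == id && (market \@ "type") == marketType
      price  <- market \ "price"
    } yield price.text

  def main(args: Array[String]): Unit = {
    println("price = " + priceFor("0", "A").mkString(", "))
    println("price1 = " + priceFor("1", "B").mkString(", "))
  }
}
```

Returning a `Seq[String]` rather than calling `.text` on the whole result keeps the matches separate, so an unexpectedly loose condition shows up as two elements instead of one concatenated string.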

2017-08-17 19:40:08 ashawley

I used Spark with a HiveContext, which let me evaluate XPath expressions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{ 

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils") // Path for my winutils.exe 

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]") 
    val sc = new SparkContext(sparkConf) 
    val hiveContext = new HiveContext(sc) 
    val myXmlPath = "D:\\IBM\\DB\\xml" 
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") //XmlFileUtil - this is a private class in scala hence I created a Java class to use it. 

    import hiveContext.implicits._ 

    val xmlDf = xmlRDDList.toDF("tempXMLTable") 
    xmlDf.registerTempTable("tempTable") 

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()  

    /* Output 
     +------+------+ 
     |BookId| Price| 
     +------+------+ 
     |  0| 55.95| 
     |  1|155.95| 
     +------+------+ 
    */ 
}

2017-08-18 14:37:46

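This is unrelated to the original question, which is about parsing XML with scala-xml, not XPath in Spark. – ashawley

I offered an alternative; I did not say this was the answer to my question.

Because XmlFile.withCharset is a private object, I could not use it, so I implemented XmlFileUtil: public class XmlFileUtil { public static RDD withCharset(SparkContext context, String location, String charset, String rowTag) { return XmlFile.withCharset(context, location, charset, rowTag); } }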
