Spark XML parsing

I am trying to parse an XML file with com.databricks.spark.xml:

Dataset<Row> df = spark.read().format("com.databricks.spark.xml") 
      .option("rowTag", "row").load("../1000.xml"); 

df.show(10);

For a large XML file I get the following output:

++
||
++
++

Am I missing something?

Here is a sample row from my XML.

<row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="316" Body="&lt;p&gt;An explicit cast to double isn't necessary.&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = (double)trackBar1.Value/5000.0;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;Identifying the constant as &lt;code&gt;5000.0&lt;/code&gt; (or as &lt;code&gt;5000d&lt;/code&gt;) is sufficient:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = trackBar1.Value/5000.0;&#xA;double trans = trackBar1.Value/5000d;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;" />

Thanks a lot.

2017-02-23 udarajag

This means the data in your XML was not mapped to a columnar structure, so your dataset is empty. – FaigB

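Try putting the _ symbol in front of the XML attribute names in the schema. If that does not work, try the @ symbol. See the example, though it was written for an older Spark version.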
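As a rough illustration of that naming rule, the sketch below uses only the JDK's built-in XML parser (not Spark) to list the attributes of a shortened copy of the sample row and prepend a prefix to each. It assumes the default attribute prefix in spark-xml is _; the class name AttributeColumns and the method columnNames are hypothetical helpers invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

public class AttributeColumns {

    /**
     * Returns the attribute names of the root element of rowXml, each with
     * the given prefix prepended, sorted alphabetically.
     * Hypothetical helper: it only shows which field names a custom schema
     * would need, and does not call Spark.
     */
    public static List<String> columnNames(String rowXml, String prefix) {
        try {
            Element row = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rowXml)))
                    .getDocumentElement();
            NamedNodeMap attributes = row.getAttributes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < attributes.getLength(); i++) {
                names.add(prefix + attributes.item(i).getNodeName());
            }
            // DOM does not guarantee attribute order, so sort for a stable result.
            Collections.sort(names);
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse row XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the sample row from the question.
        String sampleRow =
                "<row Id=\"7\" PostTypeId=\"2\" ParentId=\"4\" Score=\"316\" />";
        System.out.println(columnNames(sampleRow, "_"));
    }
}
```

Each resulting name (for example _Id or _Score) is what a field in the custom schema would be called, for instance through DataTypes.createStructField before passing the schema to spark.read().schema(...). Whether that alone fixes the empty result on the real file has not been verified here.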

2017-03-28 14:23:06

The problem is with your XML data. Try the following as sample XML data:

<row id="7">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology
    society in England, the young survivors lay the
    foundation for a new society.</description>
</row>

with your code sample:

Dataset<Row> df = spark.read().format("com.databricks.spark.xml") 
      .option("rowTag", "row").load("../1000.xml");

To provide a custom schema:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// sc is an existing SparkContext (for example the one provided by spark-shell).
val sqlContext = new SQLContext(sc)

// XML attributes are exposed with a leading underscore by default,
// so the id attribute of each row element becomes the _id column.
val customSchema = StructType(Array(
    StructField("_id", StringType, nullable = true),
    StructField("author", StringType, nullable = true),
    StructField("description", StringType, nullable = true),
    StructField("genre", StringType, nullable = true),
    StructField("price", DoubleType, nullable = true),
    StructField("publish_date", StringType, nullable = true),
    StructField("title", StringType, nullable = true)))

// rowTag must match the element that represents one row in the file.
val df = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "book")
    .schema(customSchema)
    .load("books.xml")

// Write two of the columns back out as XML.
val selectedData = df.select("author", "_id")
selectedData.write
    .format("com.databricks.spark.xml")
    .option("rootTag", "books")
    .option("rowTag", "book")
    .save("newbooks.xml")

Please refer to the databricks xml documentation.

2017-02-23 14:44:27 FaigB

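There is no way to pass XML attributes in spark-xml. If I create a custom schema, how do I make it pick up the XML attributes? – udarajag

Hi FaigB, thanks for your answer. I also read through the databricks documentation, but I could not find a way to pass the attributes of an XML tag, and I cannot change the XML format the way you suggested. – udarajag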
