2014-04-23 37 views

Loading XML with Pig Latin

I have a multi-level XML file and cannot find any example of how to load it.

XML file:

<?xml version="1.0" encoding="UTF-8" ?> 
     <Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0" 
       name="xx" 
       incremental="false" 
       extractDate="2014-04-22T11:00:00.000000"><Categories><Category> <ExternalId>2_5</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Baby]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl></Category><Category><ExternalId>2_3</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Boys 1½-12yrs]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl></Category></Categories> 
       <Products><Product><ExternalId>78094</ExternalId><Name><![CDATA[Sleep Bag]]></Name><Description><![CDATA[A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.]]></Description><Brand>xx</Brand><CategoryExternalId>1_5_1</CategoryExternalId><ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl><ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl><SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl><Price>54.0000</Price><Wasprice>54.0000</Wasprice><ManufacturerPartNumber></ManufacturerPartNumber><EAN></EAN><Colours><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>03 06</Tier3><Tier3Descr><![CDATA[3-6m]]></Tier3Descr><StockStatus>-2</StockStatus><SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl></Variation><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>06 18</Tier3><Tier3Descr><![CDATA[6-18m]]></Tier3Descr> <StockStatus>-2</StockStatus> <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>  </Variation></Colours></Product> 
       </Products> 
     </Feed> 

I tried the following, but it returns empty rows. I also need the products, not only the categories.

REGISTER 'lib/pig/piggybank.jar' 

-- load raw 

raw = load '$Input' using org.apache.pig.piggybank.storage.XMLLoader('Category') 
    as (x:chararray); 

raw_flatten = foreach raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, 
    '<Category>\\n\\s*<ExternalId>(.*)</ExternalId>\\n\\s*<ParentExternalId>(.*)</ParentExternalId>\\n\\s*<Name>(.*)</Name>\\n\\s*<CategoryPageUrl>(.*)</CategoryPageUrl>\\n\\s*</Category>')) 
    as (external_id:chararray, parent_external_id:chararray, name:chararray, categorypageurl:chararray); 

How can I load the XML above?

Thanks in advance.

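Update: if I put a line break after each field, I can read the data. How can I work around this? Other tools do not need the line breaks, and I cannot change the source data.

Formatted XML: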

<?xml version="1.0" encoding="UTF-8" ?> 
<Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0" 
       name="xx" 
       incremental="false" 
       extractDate="2014-04-22T11:00:00.000000"> 
<Categories> 
    <Category> 
    <ExternalId>2_5</ExternalId> 
    <ParentExternalId></ParentExternalId> 
    <Name>Baby</Name> 
    <CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl> 
    </Category> 
    <Category> 
    <ExternalId>2_3</ExternalId> 
    <ParentExternalId></ParentExternalId> 
    <Name>Boys 1½-12yrs</Name> 
    <CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl> 
    </Category> 
</Categories> 
<Products> 
    <Product> 
    <ExternalId>78094</ExternalId> 
    <Name>Sleep Bag</Name> 
    <Description>A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.</Description> 
    <Brand>xx</Brand> 
    <CategoryExternalId>1_5_1</CategoryExternalId> 
    <ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl> 
    <ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl> 
    <SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl> 
    <Price>54.0000</Price> 
    <Wasprice>54.0000</Wasprice> 
    <ManufacturerPartNumber></ManufacturerPartNumber> 
    <EAN></EAN> 
    <Colours> 
    <Variation> 
    <Tier2>MUL</Tier2> 
    <Tier2Descr>Multi Elephant Party</Tier2Descr> 
    <Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url> 
    <Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl> 
    <Tier3>03 06</Tier3> 
    <Tier3Descr>3-6m</Tier3Descr> 
    <StockStatus>-2</StockStatus> 
    <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl> 
    </Variation> 
    <Variation> 
    <Tier2>MUL</Tier2> 
    <Tier2Descr>Multi Elephant Party</Tier2Descr> 
    <Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url> 
    <Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl> 
    <Tier3>06 18</Tier3> 
    <Tier3Descr>6-18m</Tier3Descr> 
    <StockStatus>-2</StockStatus> 
    <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl> 
    </Variation> 
    </Colours> 
    </Product> 
</Products> 
</Feed> 

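I was able to format the XML and can now read the categories, but I cannot read the products because of the nested Variation elements inside them. How do I load this XML? – clairvoyant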

Answer

Your regular expression seems to expect a newline character between the tags:

\\n\\s* 

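Change this to [\\n\\s]* (or simply \\s*, because \s already matches newlines) and it should work. The whitespace between tags then becomes optional, so the pattern matches both the single-line and the formatted XML.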
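
To see why the original pattern returns empty rows on the single-line feed, here is a minimal sketch in Python. It rests on two assumptions: Python's `re.fullmatch` stands in for the Java regex engine that Pig's REGEX_EXTRACT_ALL uses (which must match the whole string), and the record is the text from <Category> through </Category>, as XMLLoader('Category') would presumably return it. The helper `build_pattern` and the variable names are hypothetical and only serve the illustration.

```python
import re

# One category record from the unformatted feed (no newlines between tags).
# Assumption: this is what XMLLoader('Category') hands to the regex.
record = (
    "<Category> <ExternalId>2_5</ExternalId>"
    "<ParentExternalId></ParentExternalId>"
    "<Name><![CDATA[Baby]]></Name>"
    "<CategoryPageUrl>http://www.xx.com/en-US/Clearance/"
    "Baby-0-3yrs-Clothing.html</CategoryPageUrl></Category>"
)

FIELDS = ["ExternalId", "ParentExternalId", "Name", "CategoryPageUrl"]


def build_pattern(sep):
    """Build the category regex with `sep` as the separator between tags."""
    body = sep.join("<%s>(.*)</%s>" % (f, f) for f in FIELDS)
    return "<Category>" + sep + body + sep + "</Category>"


# Original separator: a mandatory newline followed by optional whitespace.
strict = re.compile(build_pattern(r"\n\s*"))
# Suggested separator: optional whitespace, which also covers newlines.
relaxed = re.compile(build_pattern(r"\s*"))

strict_match = strict.fullmatch(record)    # no newline in record, so no match
relaxed_match = relaxed.fullmatch(record)  # matches the single-line record

print(strict_match)
print(relaxed_match.groups() if relaxed_match else None)
```

In the Pig script the relaxed separator would be written as \\s* inside the quoted pattern, following the escaping the question already uses. Note that the captured Name still carries its CDATA wrapper (<![CDATA[Baby]]>), so a further step would be needed to strip it.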