Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
276 views
in Technique[技术] by (71.8m points)

parsing - How to use lateral view explode in Hive for XML data format?

I am trying to load sales data which is in XML format to the Hive table. Below is a small sample of the data.

I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.

Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.

Sample data:

  <Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
      <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
        <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

Desired Output

enter image description here

Table structure:

CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
    TBLPROPERTIES (
    "xmlinput.start"="<Store","xmlinput.end"="</Store>"
);

Output: enter image description here

Tried below query to read the data, It is not showing the results in the way i want.

select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
    LATERAL VIEW explode(DateTime) table1 as basket_dtm 
    LATERAL VIEW explode(BsktNo) table2 as basket_number
    LATERAL VIEW explode(TillNo) table3 as till_number;

Results:

enter image description here

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Explode for array object works like cross join. So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.

You can't map one object from array to another.

Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.

Solution

  • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So
  • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml. If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.

So in your case best solution would be.

  • Convert your XML to JSON. You can use NiFi or some other technology for that.
  • Create Hive table using JSONserde and load this file.
  • Create a view as per your requirement.

JSON for Your XML

{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}

JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.

Hive Table

create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
                    BsktNo:string,
                    DateTime:string,
                    OpID:string,
                    TillNo:string,
                    Itm:array<struct<
                                        GTIN:string,
                                        ItmDsc:string,
                                        ItmSeq:string,
                                        ItmProm:struct<
                                                        OfferID:string,
                                                        PromCD:string
                                                        >

                                    >
                            >
                >
            >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';

enter image description here Create View

SELECT Version,
         StoreId,
         basket.bsktno,
         basket.tillno,
         basket.`datetime`,
         item.itmseq,
         item.itmdsc,
         item.gtin,
         item.itmprom.offerid,
         item.itmprom.promcd
FROM temp.test_json 
lateral view explode(bskt) b AS basket 
lateral view explode(basket.itm) i AS item

enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...