
Error inserting data in Hive partitioned table (by Vizoso)


Status: Closed as External
Type: Bug
ID: 792307
Opened: 7/2/2013 10:02:17 AM
Access Restriction: Public
Workarounds: 0
Users who can reproduce this bug: 2

Description



We are having some trouble trying to copy data from a staging, non-partitioned table to a dynamically partitioned table.

We are copying 800,000 records into 1,348 partitions, and the process ends in an error (a sketch of the copy statement follows the table definitions below).


Staging table:

use Staging;
CREATE TABLE Statistics10Min(
  WindFarm STRING, WTG STRING, LocalTS STRING, TS STRING,
  WindSpeedAvg FLOAT, WindSpeedMin FLOAT, WindSpeedMax FLOAT, WindSpeedStdDev FLOAT,
  ActPowAvg FLOAT, ActPowMin FLOAT, ActPowMax FLOAT, ActPowStdDev FLOAT,
  ReactPowAvg FLOAT, ReactPowMin FLOAT, ReactPowMax FLOAT, ReactPowStdDev FLOAT,
  GenSpeedAvg FLOAT, GenSpeedMin FLOAT, GenSpeedMax FLOAT, GenSpeedStdDev FLOAT,
  RotSpeedAvg FLOAT, RotSpeedMin FLOAT, RotSpeedMax FLOAT, RotSpeedStdDev FLOAT,
  YawDirAvg FLOAT, YawDirMin FLOAT, YawDirMax FLOAT, YawDirStdDev FLOAT,
  VibTowAvg FLOAT, VibTowMin FLOAT, VibTowMax FLOAT, VibTowStdDev FLOAT,
  PitchAvg FLOAT, PitchMin FLOAT, PitchMax FLOAT,
  PitchRateMin FLOAT, PitchRateMax FLOAT, PitchRateStdDev FLOAT,
  GridVoltAvg FLOAT, GridVoltMin FLOAT, GridVoltMax FLOAT,
  EnvTempAvg FLOAT, EnvTempMin FLOAT, EnvTempMax FLOAT,
  GbxTempAvg FLOAT, GbxTempMin FLOAT, GbxTempMax FLOAT,
  GenAccMin FLOAT, GenAccMax FLOAT,
  VibratAvg FLOAT, VibratMax FLOAT,
  GenU1TempAvg FLOAT, GenU1TempMin FLOAT, GenU1TempMax FLOAT,
  RotorSideLTempAvg FLOAT, RotorSideLTempMin FLOAT, RotorSideLTempMax FLOAT,
  WindBin FLOAT, FreqMean FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '59' LINES TERMINATED BY '10';

DW table:

USE DW;
CREATE TABLE Statistics10Min(
  WindFarm STRING, WTG STRING, LocalTS STRING, TS STRING,
  WindSpeedAvg FLOAT, WindSpeedMin FLOAT, WindSpeedMax FLOAT, WindSpeedStdDev FLOAT,
  ActPowAvg FLOAT, ActPowMin FLOAT, ActPowMax FLOAT, ActPowStdDev FLOAT,
  ReactPowAvg FLOAT, ReactPowMin FLOAT, ReactPowMax FLOAT, ReactPowStdDev FLOAT,
  GenSpeedAvg FLOAT, GenSpeedMin FLOAT, GenSpeedMax FLOAT, GenSpeedStdDev FLOAT,
  RotSpeedAvg FLOAT, RotSpeedMin FLOAT, RotSpeedMax FLOAT, RotSpeedStdDev FLOAT,
  YawDirAvg FLOAT, YawDirMin FLOAT, YawDirMax FLOAT, YawDirStdDev FLOAT,
  VibTowAvg FLOAT, VibTowMin FLOAT, VibTowMax FLOAT, VibTowStdDev FLOAT,
  PitchAvg FLOAT, PitchMin FLOAT, PitchMax FLOAT,
  PitchRateMin FLOAT, PitchRateMax FLOAT, PitchRateStdDev FLOAT,
  GridVoltAvg FLOAT, GridVoltMin FLOAT, GridVoltMax FLOAT,
  EnvTempAvg FLOAT, EnvTempMin FLOAT, EnvTempMax FLOAT,
  GbxTempAvg FLOAT, GbxTempMin FLOAT, GbxTempMax FLOAT,
  GenAccMin FLOAT, GenAccMax FLOAT,
  VibratAvg FLOAT, VibratMax FLOAT,
  GenU1TempAvg FLOAT, GenU1TempMin FLOAT, GenU1TempMax FLOAT,
  RotorSideLTempAvg FLOAT, RotorSideLTempMin FLOAT, RotorSideLTempMax FLOAT,
  WindBin FLOAT, FreqMean FLOAT)
PARTITIONED BY (CaptureDate STRING, Farm STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '59' LINES TERMINATED BY '10'
STORED AS SEQUENCEFILE;
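
The failing INSERT itself is not quoted in the report; a minimal sketch of the dynamic-partition copy it describes, assuming the partition keys are derived from staging columns (the derivations below are hypothetical), looks like:

USE DW;
-- Standard Hive dynamic-partition switches (an assumption, not quoted from the report).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Raise the default limits to give headroom for the ~1,348 partitions being created.
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;

INSERT OVERWRITE TABLE Statistics10Min PARTITION (CaptureDate, Farm)
SELECT s.*,
       TO_DATE(s.TS) AS CaptureDate,  -- hypothetical derivation of the partition keys
       s.WindFarm AS Farm             -- dynamic partition columns must come last, in order
FROM Staging.Statistics10Min s;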
Details
Posted by Vizoso on 8/4/2013 at 11:42 PM
Hello

Regarding the small data set: as we said in the description, we are copying 800,000 records into 1,348 partitions. We are using all 1,348 partitions, but the volume of data in each one is really small.

Regarding the ORDER BY: we have tried inserting the data both with and without the ORDER BY, and the result is the same error.

We also tried creating the partitions before inserting the data, but this made the table unavailable, as we explain in another bug (https://connect.microsoft.com/BusinessPlatform/feedback/details/792309/hive-metastore-not-accesible-after-creating-partitions).
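
For reference, pre-creating a partition on the DW table is done with ALTER TABLE; the CaptureDate and Farm values below are hypothetical placeholders, not taken from the report:

USE DW;
-- Adds one empty partition ahead of time instead of creating it dynamically at insert time.
ALTER TABLE Statistics10Min ADD PARTITION (CaptureDate='2013-06-01', Farm='Farm01');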



Posted by Microsoft on 7/24/2013 at 5:33 PM
Pablo,

I am not exactly sure what you mean when you say creating an empty partition will generate an error as well in your cluster. Would you be able to provide the query you are referring to, and does it generate an out of memory error or a different error?

Secondly, you had mentioned you were working with a small data set. Does this data set have much fewer than the 1348 partitions of the original data set described?

Finally, although I do not know what issue ORDER BY was causing previously, I wanted to suggest using a staging table instead. If the out of memory issue is related solely to the large number of partitions being dynamically created, then this should provide the same benefit as ORDER BY without its large time cost. Whereas ORDER BY is constrained to a single reducer to prepare the data before writing into the dynamically partitioned table, a staging table distributes and sorts the data using many reducers. Therefore, if the issue with ORDER BY was the long time it takes a single reducer to process the data, this alternative may work. Below is an annotated example of what this looks like. Let me know how this goes if you are able to try it.

Thanks,
Travis

create table ORDERS_TEMP(
  O_ORDERKEY bigint, O_CUSTKEY int, O_ORDERSTATUS string, O_TOTALPRICE float,
  O_ORDERDATE string, O_ORDERPRIORITY string, O_CLERK string,
  O_SHIPPRIORITY int, O_COMMENT string);


INSERT OVERWRITE TABLE ORDERS_TEMP
SELECT * FROM staging_ORDERS
DISTRIBUTE BY O_ORDERDATE SORT BY O_ORDERDATE;

## Populates the intermediate table using DISTRIBUTE BY and SORT BY on the partition key O_ORDERDATE: all rows with the same DISTRIBUTE BY value are sent to the same reducer, and each reducer sorts the rows it receives. ##

create table ORDERS(
  O_ORDERKEY bigint, O_CUSTKEY int, O_ORDERSTATUS string, O_TOTALPRICE float,
  O_ORDERDATE_PS string, O_ORDERPRIORITY string, O_CLERK string,
  O_SHIPPRIORITY int, O_COMMENT string)
PARTITIONED BY (O_ORDERDATE string);

## Then create the dynamically partitioned table and load it from the intermediate table. ##
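
The final INSERT from ORDERS_TEMP into ORDERS is not shown in the comment; a minimal sketch of that step would look like the following (the SET statements are Hive's standard dynamic-partition switches and are an assumption, not part of the original example):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE ORDERS PARTITION (O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE,
       O_ORDERDATE AS O_ORDERDATE_PS,  -- plain-column copy of the date
       O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT,
       O_ORDERDATE                     -- the dynamic partition key must come last
FROM ORDERS_TEMP;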
Posted by Pablo Álvarez Doval on 7/24/2013 at 4:20 AM
In our original version, we had an ORDER BY statement and it did not work; we removed the ORDER BY at the suggestion of a Microsoft engineer.

However, this does not seem to be the problem; just creating an empty partition generates an error in the current state of the cluster.
Posted by Pablo Álvarez Doval on 7/19/2013 at 9:32 AM
Thanks Travis!

We have tried -Xmx2048m, but it did not help either.

We will try the pre-sorting and report back with the results, but there must be something wrong (perhaps in our implementation), because we are working with quite a small data set.
Posted by Microsoft on 7/3/2013 at 5:32 PM
The out of memory error is due to each mapper not having enough memory to store the information for all of the partitions. First, try to increase the amount of memory for the mappers/reducers by raising mapred.child.java.opts, for example to mapred.child.java.opts=-Xmx1024m (or even 1536m or 2048m). Additionally, the table can be sorted before it is dynamically partitioned, so that each mapper only receives a small number of partition values; this can be done by adding an ORDER BY on the partition columns to the SELECT portion of the query (see the sketch below). Let me know whether these changes still result in an error.

Thanks,
Travis
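
A minimal sketch of both suggestions, reusing the hypothetical ORDERS tables from the staging-table example earlier in the thread (the user's actual query was not quoted; the SET lines and column list are illustrative assumptions):

-- Raise the mapper/reducer heap; try 1536m or 2048m if 1024m is not enough.
SET mapred.child.java.opts=-Xmx1024m;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE ORDERS PARTITION (O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE,
       O_ORDERDATE AS O_ORDERDATE_PS,   -- plain-column copy of the date
       O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT,
       O_ORDERDATE AS O_ORDERDATE_KEY   -- partition key; alias only disambiguates ORDER BY
FROM staging_ORDERS
ORDER BY O_ORDERDATE_KEY;  -- a single reducer sorts by the partition key before the write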
File Name    Submitted By    Submitted On    File Size
error.txt                    7/2/2013        8 KB