Hive LOAD DATA INPATH OVERWRITE on a text-format file causing duplicate SKEY column values
Trying to load a well-formatted ~80 GB text file (CHR 01, i.e. `\001`, delimited) into Hive through beeline:

```
beeline -u "<connector:server:port>;principal=<principal_name>" \
  -e "LOAD DATA INPATH '/path/to/hdfs/dir' OVERWRITE INTO TABLE database.tableName;"
```

The table was created with the correct/appropriate schema and data types (hundreds of columns), with the following parameters:

```
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
```

When the data is loaded, all the columns appear to carry the correct information: the record count is at least the same order of magnitude as the input (tens of millions of records), and a sampling of the values in the columns matches the expected values. However, the first column (coincidentally, the SKEY) is severely duplicated, as if its first occurrence were applied to the records below it.

```
SKEY     ValA  ValB  ValC  ValD
Record1  1     2     3     Apple
Record2  7     12    22    Baseball
Record3  9     28    10    Tennis
Record4  6     2     3     Jeans
...
RecordN  8     4     12    Wishbone
...
```

Becomes:

```
SKEY     ValA  ValB  ValC  ValD
Record1  1     2     3     Apple
Record1  7     12    22    Baseball
Record1  9     28    10    Tennis
...
Record4  6     2     3     Jeans
Record4  8     4     12    Wishbone
...
```

Does anyone have experience overcoming this issue, or an idea about the root cause? I believe I could get better results with another format (e.g. AVRO), but that is a little unsatisfactory. Is there a maximum limit on text-file imports into Hive?
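One way to sanity-check the "well-formatted" assumption before blaming Hive is to histogram the number of `\001`-delimited fields per record in the raw file; stray delimiters or embedded carriage returns show up as more than one distinct field count. A minimal sketch (the file path in the usage comment is a placeholder, not from the question):

```python
from collections import Counter

def field_count_histogram(path, delimiter=b"\x01"):
    """Map each per-record field count to how many records have it."""
    counts = Counter()
    with open(path, "rb") as f:  # read bytes, since \x01 may not survive text decoding
        for line in f:
            # number of fields = number of delimiters + 1; strip the record terminator first
            counts[line.rstrip(b"\r\n").count(delimiter) + 1] += 1
    return counts

# A clean file should yield exactly one field count, equal to the number
# of columns in the table DDL, e.g.:
#   field_count_histogram("/path/to/local/copy.txt")
```

If more than one field count appears, the records with deviant counts are the first place column misalignment could arise when Hive's text SerDe splits each line.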
What is the data type of the SKEY column?

---- Updated after looking at boethius' comments ----

I would advise you to use STRING, BIGINT, or DECIMAL for your primary key. With FLOAT you lose precision. For example, if you have two SKEYs, 8611317762 and 8611317761, I suspect that as FLOAT they are both interpreted as 8.611317762 × 10^9, and that is why DISTINCT is returning a wrong answer.