Hive LOAD DATA INPATH OVERWRITE on a text-format file causes duplicate SKEY column values


I am trying to load a well-formatted ~80 GB text file (delimited by CHR 01, i.e. Ctrl-A) into Hive through beeline:
beeline -u "<connector:server:port>;principal=<principal_name>" \
-e "LOAD DATA INPATH '/path/to/hdfs/dir' OVERWRITE INTO TABLE database.tableName;"
The table was created with the appropriate schema and data types (hundreds of columns) and the following parameters:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
After the load, all the columns appear to hold the correct information: the row count is of the same order of magnitude as the input (tens of millions of records), and sampled column values match the expected values. However, the first column (coincidentally, the SKEY) is severely duplicated, as if each value were applied to the records below its first occurrence.
SKEY     ValA  ValB  ValC  ValD
Record1  1     2     3     Apple
Record2  7     12    22    Baseball
Record3  9     28    10    Tennis
Record4  6     2     3     Jeans
...
RecordN  8     4     12    Wishbone
...
Becomes:
SKEY     ValA  ValB  ValC  ValD
Record1  1     2     3     Apple
Record1  7     12    22    Baseball
Record1  9     28    10    Tennis
...
Record4  6     2     3     Jeans
Record4  8     4     12    Wishbone
...
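One way to narrow this down is to check whether the SKEY values are already duplicated in the source file or only after the load. The following is a minimal Python sketch of that check on a few lines of input. The sample rows are hypothetical, modelled on the example above, and the helper name skey_counts is made up for illustration. On the real file you would feed it a sample of lines, not all 80 GB.

```python
from collections import Counter

SEP = "\x01"  # the CHR 01 delimiter, written '\001' in the table definition


def skey_counts(lines):
    """Count how often each first-field value (the SKEY) occurs."""
    return Counter(line.rstrip("\n").split(SEP, 1)[0] for line in lines)


# Hypothetical sample rows modelled on the example in the question.
sample = [
    SEP.join(["Record1", "1", "2", "3", "Apple"]),
    SEP.join(["Record2", "7", "12", "22", "Baseball"]),
    SEP.join(["Record3", "9", "28", "10", "Tennis"]),
]

counts = skey_counts(sample)
# Keys that occur more than once in the raw text.
duplicates = {key: n for key, n in counts.items() if n > 1}
print(len(counts), duplicates)
```

If the raw text shows no duplicates but the Hive table does, the problem is more likely in how the column is typed or parsed than in the file itself.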
Does anyone have experience overcoming this issue, or an idea about the root cause? I believe I could get better results with another format (e.g. Avro), but that is a little unsatisfactory.
Is there a maximum size limit on text file imports into Hive?
What is the data type of the key column?
-- Updated after looking at boethius's comments --
I would advise you to use STRING, BIGINT or DECIMAL for your primary key. With FLOAT you lose precision. For example, take two SKEYs, 8611317762 and 8611317761. I suspect that as FLOAT they are both stored as the same value, roughly 8.611318 x 10^9, because a 4-byte float keeps only about 7 significant decimal digits. That would be why DISTINCT returns a wrong answer.
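To illustrate the precision loss, here is a small Python sketch that round-trips the two example keys through a 4-byte IEEE 754 single-precision value using the standard struct module. It assumes the SKEY column was declared as Hive FLOAT (a 4-byte single-precision type); the helper name to_float32 is made up for illustration. It imitates the rounding only and does not run anything in Hive.

```python
import struct


def to_float32(value):
    """Round-trip a number through a 4-byte IEEE 754 single-precision float."""
    return struct.unpack("f", struct.pack("f", value))[0]


skey_a = 8611317762
skey_b = 8611317761

as_float_a = to_float32(skey_a)
as_float_b = to_float32(skey_b)

# Near 8.6 x 10^9 adjacent single-precision values are 1024 apart,
# so both keys round to the same representable number.
print(as_float_a, as_float_b)    # 8611317760.0 8611317760.0
print(as_float_a == as_float_b)  # True
```

Every key within the same 1024-wide window collapses to one value, which would match runs of consecutive records sharing one SKEY.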
