hadoop - How much data is considered "too large" for a Hive MAPJOIN job? -
edit: added more file size details, , other session information.
i have seemingly straightforward hive join query surprisingly requires several hours run.
select a.value1, a.value2, b.value join b on a.key = b.key a.keypart between b.startkeypart , b.endkeypart;
i'm trying determine if execution time normal dataset , aws hardware selection, or if trying join data.
- table a: ~2.2 million rows, 12mb compressed, 81mb raw, 4 files.
- table b: ~245 thousand rows, 6.7mb compressed, 14mb raw, 1 file.
- aws: emr-4.3.0, running on 5 m3.2xlarge ec2 instances.
records matches 1 or more records in b, logically see @ 500 billion rows generated before pruned clause.
4 mappers allocated job, completes in 6 hours. normal type of query , configuration? if not, should improve it?
i've partitioned b on join key, yields 5 partitions, haven't noticed significant improvement.
also, logs show hive optimizer starts local map join task, presumably cache or stream smaller table:
2016-02-07 02:14:13 starting launch local task process map join; maximum memory = 932184064 2016-02-07 02:14:16 dump side-table tag: 1 group count: 5 file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/hashtable-stage-4/mapjoin-mapfile01--.hashtable 2016-02-07 02:14:17 uploaded 1 file to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/hashtable-stage-4/mapjoin-mapfile01--.hashtable (12059634 bytes) 2016-02-07 02:14:17 end of local task; time taken: 3.71 sec.
what causing job run slowly? data set doesn't appear large, , "small-table" size under "small-table" limit of 25mb triggers disabling of mapjoin optimization.
a dump of explain output copied on pastebin reference.
my session enables compression output , intermediate storage. culprit?
set hive.exec.compress.output=true; set hive.exec.compress.intermediate=true; set mapred.output.compress=true; set mapred.output.compression.codec=org.apache.hadoop.io.compress.gzipcodec; set io.compression.codecs=org.apache.hadoop.io.compress.gzipcodec; set io.seqfile.compression.type=block;
my solution problem express join predicate entirely within join on clause, efficient way execute join in hive. why original query slow, believe mappers need time when scanning intermediate data set row row, 100+ billion times.
due hive supporting equality expressions in join on clause , rejecting function calls use both table aliases parameters, there no way rewrite original query's between clause algebraic expression. example, following expression illegal.
-- handles exclusive between join b on a.key = b.key , sign(a.keypart - b.startkeypart) = 1.0 -- keypart > startkeypart , sign(a.keypart - b.endkeypart) = -1.0 -- keypart < endkeypart
i modified source data include every value between startkeypart
, endkeypart
in hive array<bigint>
data type.
create table lookuptable key bigint, startkeypart bigint, endkeypart bigint, keyparts array<bigint>;
alternatively, have generated value inline within queries using custom java method; longstream.rangeclosed() method available in java 8, not part of hive 1.0.0 in aws emr-4.3.0.
now have entire key space in array, can transform array table using lateral view , explode(), rewriting join follows.
with b ( select key, keypart, value lookuptable lateral view explode(keyparts) keypartstable keypart ) select a.value1, a.value2, b.value join b on a.key = b.key , a.keypart = b.keypart;
the end result above query takes approximately 3 minutes complete, when compared original 6 hours on same hardware configuration.
Comments
Post a Comment