Hive_note
Hive 语句简单优化
使用Hive的过程中,因为数据倾斜的问题可能会造成效率十分地下,以下记录一些hive的简单语句的优化操作。大部分都来自网络,查资料的时候看到就顺便记录一下。
1. 设置
set Hive.groupby.skewindata=true;
#解决jion造成的倾斜
set Hive.optimize.skewjoin = true;
2. 简单统计语句
#优化前:
select count(distinct id) from table A;
#优化后
select count(*) from (select distinct id from tableA)t;
3. 列转行
#建表
create table A(
colume1 string,
colume2s array<string>
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;
COLLECTION ITEMS TERMINATED BY ',' ;
#插入数据,B是原始表格,按照colume1插入数据
insert overwrite table A
select max(B.colume1), collect_set(B.colume2) as colume2s from B group by B.colume1;
方法2:
create table A as select max(B.colume1),concat_ws(',',collect_set(B.colume2)) as colume2s from B group by B.colume1;
4. 多列的distinct
#优化前
select t1, count(distinct t2) as t3 from A group by t1;
#优化后
select t1, count(*) as t3
from (select distinct t1, t2 from A) group by t1;
5. 查询的时候还可以通过并行来进行优化
set Hive.exec.parallel=true;
#默认是8
set Hive.exec.parallel. thread.number=n;
select * from
(
select count(t1) from A
where y = 2017 and m = 1
union all
select count(t1) from A
where y = 2017 and m = 2
union all
select count(t1) from A
where y = 2017 and m = 3
)t
6. 减少Job数量
优化实现思路,减少job的数量。 举例,统计购买既物品A又购买物品B的用户的数量。 优化前:
select count(*)
from
(select distinct user
from T1 where item = ‘A’) A
join
(select distinct user
from T1 where item = ‘B’) B
on A.user = B.user;
优化后:
select count(*)
from T1 group by user
having (count(case when item = ‘A’ then 1 end) > 0
and count(case when item = ‘B’ then 1 end) > 0)