hive_note

Hive 语句简单优化

使用Hive的过程中，因为数据倾斜的问题可能会造成效率十分地下，以下记录一些hive的简单语句的优化操作。大部分都来自网络，查资料的时候看到就顺便记录一下。

1. 设置

set Hive.groupby.skewindata=true;
#解决jion造成的倾斜
set Hive.optimize.skewjoin = true;

2. 简单统计语句

#优化前：
select count(distinct id) from table A;
#优化后
select count(*) from (select distinct id from tableA)t;

3. 列转行

#建表
create table A(
colume1 string,
colume2s array<string>
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ; 
COLLECTION ITEMS TERMINATED BY ',' ;
#插入数据，B是原始表格，按照colume1插入数据
insert  overwrite  table  A  
select max(B.colume1),  collect_set(B.colume2)  as  colume2s from B group by B.colume1;

方法2:
create table A as select max(B.colume1),concat_ws(',',collect_set(B.colume2)) as colume2s from B group by B.colume1;

4. 多列的distinct

#优化前 
select t1， count(distinct t2) as t3 from A group by t1;  
#优化后 
select t1， count(*) as t3  
from (select distinct t1， t2 from A) group by t1;

5. 查询的时候还可以通过并行来进行优化

set Hive.exec.parallel=true;
#默认是8
set Hive.exec.parallel. thread.number=n;
select * from   
(  
select count(t1) from A   
where y = 2017 and m = 1  
union all   
select count(t1) from A   
where y = 2017 and m = 2  
union all   
select count(t1) from A   
where y = 2017 and m = 3  
)t

6. 减少Job数量

优化实现思路，减少job的数量。举例，统计购买既物品A又购买物品B的用户的数量。优化前：

select count(*) 
from
(select distinct user 
from T1 where item = ‘A’) A
join 
(select distinct user 
from T1 where item = ‘B’) B 
on A.user = B.user;

优化后：

select count(*) 
from T1 group by user
having (count(case when item = ‘A’ then 1 end) > 0
and count(case when item = ‘B’ then 1 end) > 0)

Chenlini

Stay hundry stay foolish.

Hive 语句简单优化

1. 设置

2. 简单统计语句

3. 列转行

4. 多列的distinct

5. 查询的时候还可以通过并行来进行优化

6. 减少Job数量