Hive高级函数

添码座原创大约 7 分钟

查看Hive内置函数的信息的方法。

# 查看所有内置函数
hive (default)> show functions;

# 查看指定函数的信息
hive (default)> desc function <FUNC_NAME>;

# 查看指定函数的详细信息
hive (default)> desc function extended <FUNC_NAME>;

分组取TopN

分组、排序然后取前N条数据，是最为常见的统计需求之一，主要会用到ROW_NUMBER()和OVER()两个函数。

现在要分别根据用户的余额、积分和信誉的数值进行排序，然后每项取前三。

# 分组表
hive (default)> create external table t_uservalue (
                    id int,
                    name string,
                    item string,
                    value int
                ) comment "分组表" 
                  row format delimited 
                  fields terminated by '\t' 
                  lines terminated by '\n' 
                  location "/data/external";

# 加载数据
hive (default)> load data local inpath '/home/work/volumes/hive/data/user_value.data' into table t_uservalue;

# 查询数据
hive (default)> select * from t_uservalue;
OK
1		李星云	balance	80
2		李星云	point		90
3		李星云	credit	89
4		王林		balance	60
5		王林		point		75
6		王林		credit	80
7		萧炎		balance	79
8		萧炎		point		83
9		萧炎		credit	72
10	秦羽		balance	90
11	秦羽		point		76
12	秦羽		credit	80
13	石昊		balance	98
14	石昊		point		80
15	石昊		credit	70
Time taken: 0.326 seconds, Fetched: 15 row(s)

进行分组及排序。

# 根据item进行分区，然后按value的倒序排序
hive (default)> select * from (
                    select *, row_number() over(partition by item order by value desc) as number from t_uservalue
                ) result where result.number <= 3;
OK
13	石昊		balance	98	1
10	秦羽		balance	90	2
1		李星云	balance	80	3
3		李星云	credit	89	1
6		王林		credit	80	2
12	秦羽		credit	80	3
2		李星云	point		90	1
8		萧炎		point		83	2
14	石昊		point		80	3
Time taken: 40.864 seconds, Fetched: 9 row(s)

# 如果将row_number()换成rank()则排名会出现并列的情况，但并列之后的排名会“跳跃”，例如并列第2之后直接出现了第4名，第3名被跳过去了
hive (default)> select *, rank() over(partition by item order by value desc) as number from t_uservalue;
OK
13	石昊		balance	98	1
10	秦羽		balance	90	2
1		李星云	balance	80	3
7		萧炎		balance	79	4
4		王林		balance	60	5
3		李星云	credit	89	1
6		王林		credit	80	2
12	秦羽		credit	80	2
9		萧炎		credit	72	4 # 这里出现了跳跃，忽略了第3，直接跳到了第4
15	石昊		credit	70	5
2		李星云	point		90	1
8		萧炎		point		83	2
14	石昊		point		80	3
11	秦羽		point		76	4
5		王林		point		75	5
Time taken: 71.303 seconds, Fetched: 15 row(s)

# 如果将row_number()换成dense_rank()排名也会出现并列的情况，但不会出现跳跃的情况
hive (default)> select *, dense_rank() over(partition by item order by value desc) as number from t_uservalue;
OK
13	石昊		balance	98	1
10	秦羽		balance	90	2
1		李星云	balance	80	3
7		萧炎		balance	79	4
4		王林		balance	60	5
3		李星云	credit	89	1
6		王林		credit	80	2
12	秦羽		credit	80	2
9		萧炎		credit	72	3 # 这里正常了，没有直接跳到第4
15	石昊		credit	70	4
2		李星云	point		90	1
8		萧炎		point		83	2
14	石昊		point		80	3
11	秦羽		point		76	4
5		王林		point		75	5
Time taken: 69.303 seconds, Fetched: 15 row(s)

行列互转

Hive的行转列和上图展示的普通的行转列还有些不同。

Hive的行转列是把多行数据转为一列数据，用到的函数有CONCAT_WS()、COLLECT_LIST()和COLLECT_SET()。

原始数据如下。

# 行转列
hive (default)> create external table t_row2col (
                    name string,
                    hobby string
                ) comment "行转列" 
                  row format delimited 
                  fields terminated by '\t' 
                  lines terminated by '\n' 
                  location "/data/external";

# 加载数据
hive (default)> load data local inpath '/home/work/volumes/hive/data/user_hobby1.data' into table t_row2col;

# 查询数据
hive (default)> select * from t_row2col;
OK
李星云	swiming
李星云	football
李星云	traveling
李星云	sing
李星云	swiming
Time taken: 0.337 seconds, Fetched: 5 row(s)

执行行转列。

# 最终的结果希望是这样的
# 李星云 swiming,football,traveling,sing,swiming
# 如果使用collect_set()，那么会对集合中的元素做去重处理
hive (default)> select name, concat_ws(',', collect_list(hobby)) as hobby_list from t_row2col group by name;
OK
李星云	swiming,football,traveling,sing,swiming
Time taken: 43.281 seconds, Fetched: 1 row(s)

列转行和行转列执行相反的过程，它是把一行上的多列数据转为一行数据，用到函数有SPLIT()、EXPLODE()和LATERAL VIEW。

原始数据如下。

# 列转行
hive (default)> create external table t_col2row (
                    name string,
                    hobby string
                ) comment "列转行" 
                  row format delimited 
                  fields terminated by '\t' 
                  lines terminated by '\n' 
                  location "/data/external";

# 加载数据
hive (default)> load data local inpath '/home/work/volumes/hive/data/user_hobby2.data' into table t_col2row;

# 查询数据
hive (default)> select * from t_col2row;
OK
李星云	swiming,football,sing
王林		traveling,swiming
Time taken: 0.322 seconds, Fetched: 2 row(s)

执行列转行。

# 最终的结果希望是这样的
# 李星云 swiming
# 李星云 football
# 李星云 sing
# 王林   traveling
# 王林   swiming
hive (default)> select name, hobby_new from t_col2row lateral view explode(split(hobby, ',')) temptable as hobby_new;
OK
李星云	swiming
李星云	football
李星云	sing
王林		traveling
王林		swiming
Time taken: 0.233 seconds, Fetched: 5 row(s)

排序函数

Hive中与排序相关的关键字包括ORDER BY、SORT BY、DISTRIBUTE BY和CLUSTER BY

ORDER BY的作用是全局排序。

SORT BY会对每一个MapReduce进行排序，有几个Reduce就排序几次，保证数据的局部有序。

有下面这样一张表。

# 排序表
hive (default)> create external table t_sortby (
                    id int
                ) comment "排序表" 
                  row format delimited 
                  fields terminated by '\t' 
                  lines terminated by '\n' 
                  location "/data/external";

# 加载数据
hive (default)> load data local inpath '/home/work/volumes/hive/data/sortby.data' into table t_sortby;

# 查询数据
# 当reduce数量只有1个时候，看起来就是全局有序
hive (default)> select * from t_sortby sort by id;
......
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.29 sec   HDFS Read: 9316 HDFS Write: 227 SUCCESS
OK
1
1
2
2
3
3
4
4
5
5
Time taken: 60.002 seconds, Fetched: 10 row(s)

# 动态设置mapreduce数量为2
hive (default)> set mapreduce.job.reduces=2;

# 再次查询数据
hive (default)> select * from t_sortby sort by id;
......
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.69 sec   HDFS Read: 14009 HDFS Write: 314 SUCCESS
OK
1
3
3
4
5
5
1
2
2
4
Time taken: 84.975 seconds, Fetched: 10 row(s)

如果将上面的SORT BY换成ORDER BY，它不仅是全局有序，而且始终只有1个reduce。

DISTRIBUTE BY是用来控制从map输出到reduce的任务划分的，它后面需要指定一个字段，它根据这个指定的字段进行分区，但不排序。

它一般会和SORT BY结合在一起使用：先分区再排序。

# 这里仍然延续之前的会话，reduce的数量为2
hive (default)> select * from t_sortby distribute by id;
......
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.69 sec   HDFS Read: 14009 HDFS Write: 314 SUCCESS
OK
4
2
4
2
5
3
1
5
3
1
Time taken: 79.583 seconds, Fetched: 10 row(s)

# 再结合sort by
hive (default)> select * from t_sortby distribute by id order by id;
2
2
4
4
1
1
3
3
5
5

最后，CLUSTER BY相当于DISTRIBUTE BY和SORT BY结合在一起的作用（DISTRIBUTE BY和SORT BY使用的是同一个字段才能用CLUSTER BY替换），但CLUSTER BY只支持升序，不支持降序。

cluster by id ==> distribute by id order by id

hive (default)> select * from t_sortby cluster by id;
2
2
4
4
1
1
3
3
5
5

# 加入desc会报错
hive (default)> select * from t_sortby cluster by id desc;
FAILED: ParseException line 1:37 extraneous input 'desc' expecting EOF near '<EOF>'

# 加入asc也会报错
hive (default)> select * from t_sortby cluster by id asc;
FAILED: ParseException line 1:37 extraneous input 'asc' expecting EOF near '<EOF>'

分组和去重

Hive和MySQL使用的是同样的GROUP BY和DISTINCT关键字用来实现分组和去重功能。

例如，如果需要统计订单表中按商品名称分组之后的去重数量，可以这么做。

# 创建订单表
hive (default)> create external table t_order (
                    id int,
                    name string
                ) comment "订单表" 
                  row format delimited 
                  fields terminated by '\t' 
                  lines terminated by '\n' 
                  location "/data/external";

# 加载数据
hive (default)> load data local inpath '/home/work/volumes/hive/data/order.data' into table t_order;

# 查询数据
hive (default)> select count(*) from (select name from t_order group by name) temp;
......
Total jobs = 2
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
OK
5
Time taken: 74.692 seconds, Fetched: 1 row(s)

# 或者（前一种效率更高，因为可以并行执行）
hive (default)> select count(distinct name) from t_order;
......
Total jobs = 1
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
OK
5
Time taken: 81.483 seconds, Fetched: 1 row(s)

一个用Hive解决数据倾斜的样例。

select a.key, sum(a.count) as count
from (
    select key, count(*) as count
    from tablename
    group by key,
        case
            when key='key001' then Hash(Random()) % 50
            else 0
        end
) a
group by a.key;

感谢支持

更多内容，请移步《超级个体》。