七十八、Hive数据仓库实际操作(操作测试)

简介: 七十八、Hive数据仓库实际操作(操作测试)

Hive环境搭建


centos安装hive3.1.2(精讲篇)

https://blog.csdn.net/m0_54925305/article/details/120554242?spm=1001.2014.3001.5502


Hive数据仓库的操作


一、创建数据库


hive> show databases;
OK
default
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> create database if not exists DB;
OK
Time taken: 0.064 seconds
hive> show databases;
OK
db
default
Time taken: 0.018 seconds, Fetched: 2 row(s)

二、查看数据仓库DB的信息及路径


hive> describe database DB;
OK
db  hdfs://master:9000/user/hive/warehouse/db.db  root  USER  
Time taken: 0.065 seconds, Fetched: 1 row(s)

Hive数据表的操作



Hive的数据表分为两种:内部表和外部表。



Hive创建内部表时,会将数据移动到数据仓库指向的路径;若创建外部表,仅记录数据所在的路径,不对数据的位置做任何改变。在删除表的时候,内部表的元数据和数据会被一起删除,而外部表只删除元数据,不删除数据。这样外部表相对来说更加安全些,数据组织也更加灵活,方便共享源数据,生产中常使用外部表。



下面详细介绍对表操作的命令及使用方法:

       即将创建的表,表名不能和已有表名重复,否则会报错,现在我们show tables 一下,查看已存在的表。


一、创建一个名为cat的内部表,有两个字段为cat_id和cat_name,字符类型为string


hive> create table cat(cat_id string,cat_name string);
OK
Time taken: 0.72 seconds
hive> show tables;
OK
cat
Time taken: 0.057 seconds, Fetched: 1 row(s)

二、创建一个外部表,表名为cat2,有两个字段为cat_id和cat_name,字符类型为string


hive> create external table if not exists goods(group_id string,group_name string) row format delimited fields terminated by '\t' location '/user/root/goods';
OK
Time taken: 0.155 seconds
hive> show tables;
OK
cat
goods
Time taken: 0.026 seconds, Fetched: 2 row(s)

三、修改cat表的表结构。对cat表添加两个字段group_id和cat_code


hive> alter table cat add columns(group_id string,cat_code string);
OK
Time taken: 0.372 seconds
hive> desc cat;
OK
cat_id                string                                   
cat_name              string                                   
group_id              string                                   
cat_code              string                                   
Time taken: 0.087 seconds, Fetched: 4 row(s)

四、修改表名cat为cat2


hive> alter table cat rename to cat2;
OK
Time taken: 0.275 seconds

这个命令可以让用户为表更名,数据所在的位置和分区名并不改变。


五、创建与已知表相同结构的表,创建一个与cat表结构相同的表,名为cat3,这里要用到 like 关键字


hive> create table cat3 like cat2;
OK
Time taken: 1.391 seconds
hive> show tables;
OK
cat2
cat3
goods
Time taken: 0.047 seconds, Fetched: 3 row(s)
hive> desc cat3;
OK
cat_id                string                                   
cat_name              string                                   
group_id              string                                   
cat_code              string                                   
Time taken: 0.118 seconds, Fetched: 4 row(s)

Hive中数据的导入与导出


一、从本地文件系统中导入数据到Hive表


首先,在Hive 中创建一个cat_group表,包含group_id和group_name两个字段,字符类型为string,以“\t”为分隔符,并查看结果。


hive> create table cat_group(group_id string,group_name string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.218 seconds
hive> show tables;
OK
cat2
cat3
cat_group
goods
Time taken: 0.048 seconds, Fetched: 4 row(s)

[row format delimited]关键字,是用来设置创建的表在加载数据的时候,支持的列分隔符。

[stored as textfile]关键字,是用来设置加载数据的数据类型,默认是 TEXTFILE,如果文件数据是纯文本,就是使用[stored as textfile],然后从本地直接拷贝到HDFS上,Hive直接可以识别数据。


二、将Linux本地 /input/hive/目录下的myhive 文件导入到 Hive 中的cat_group表中


14.png


hive> load data local inpath '/input/hive/myhive' into table cat_group;
Loading data to table db.cat_group
OK
Time taken: 1.081 seconds

通过select语句查看cat_group表中是否成功导入数据,使用limit关键字限制输出5条记录。


hive> select * from cat_group limit 5;
OK
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯
Time taken: 2.088 seconds, Fetched: 5 row(s)

三、将HDFS中的数据导入到Hive中


       1、首先,另外开启一个操作窗口,在HDFS上创建/output/hive目录


[root@master hive]# hadoop fs -mkdir /output/hive

       2、将本地/iutput/hive/下的myhive文件上传到HDFS的/output/hive上,并查看是否创建成功


15.png


[root@master hive]# hadoop fs -put /input/hive/myhive /output/hive/
[root@master hive]# hadoop fs -ls /output/hive 
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-05 22:19 /output/hive/myhive

       3、在Hive 中创建名为cat_group1的表,创表语句如下


hive> create table cat_group1(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.243 seconds

       4、将HDFS 下/output/hive 中的表cat_group导入到Hive 中的cat_group1表中,并查看结果


hive> load data inpath '/output/hive/myhive'  into table cat_group1;
Loading data to table db.cat_group1
OK
Time taken: 0.539 seconds
hive> select * from cat_group1 limit 5;
OK
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯
Time taken: 0.262 seconds, Fetched: 5 row(s)

       注:数据导入成功。


       HDFS中数据导入到 Hive中与本地数据导入到 hive中的区别是load data后少了local。


四、从别的表中查询出相应的数据并导入到Hive中


       1、首先在Hive 中创建一个名为cat_group2的表。


hive> create table cat_group2(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.111 seconds

       2、用下面两种方式将cat_group1表中的数据导入到cat_group2表中。


hive> insert into table cat_group2 select * from cat_group1;
Query ID = root_20220306040659_42572420-db7d-4412-bbc3-495abd9ce479
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0003, Tracking URL = http://master:8088/proxy/application_1646528951444_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:07:31,799 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:07:51,642 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec
2022-03-06 04:08:00,165 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1646528951444_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-06-59_043_3456913091663343579-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.47 sec   HDFS Read: 13409 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 63.711 seconds
hive> insert overwrite  table cat_group2 select * from cat_group1;
Query ID = root_20220306041024_bf920fd1-b42d-4ed7-ad7b-66955905fa19
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0004, Tracking URL = http://master:8088/proxy/application_1646528951444_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:10:47,981 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:11:12,568 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.33 sec
2022-03-06 04:11:22,231 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1646528951444_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-10-24_167_6531779411761470258-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 13494 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 100 msec
OK
Time taken: 60.895 seconds

       注:insert overwrite会覆盖数据


       3、查询表 cat_group2


hive> select * from cat_group2 limit 5;
OK
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯
Time taken: 0.33 seconds, Fetched: 5 row(s)

       4、在创建表的时候从别的表中查询出相应数据并插入到所创建的表中


Hive中创建表cat_group3并直接从cat_group2中获得数据。


hive> create table  cat_group3 as  select * from cat_group2;
Query ID = root_20220306041630_3200b863-b9b3-4c2e-ac0d-c7caff9b6611
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0005, Tracking URL = http://master:8088/proxy/application_1646528951444_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 04:16:54,438 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:17:02,430 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.58 sec
MapReduce Total cumulative CPU time: 1 seconds 580 msec
Ended Job = job_1646528951444_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/.hive-staging_hive_2022-03-06_04-16-30_327_7813330832683742274-1/-ext-10002
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group3
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.58 sec   HDFS Read: 4969 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec
OK
Time taken: 34.65 seconds

       5、查询表 cat_group3


hive> select * from cat_group3 limit 5;
OK
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯
Time taken: 0.229 seconds, Fetched: 5 row(s)

五、常见的三种数据导出方式


       1、导出到本地文件系统


在本地创建目录 /output/hive  并将Hive中的cat_group表导出到本地文件系统/output/hive/中。


[root@master hive]# mkdir -p /output/hive/
hive> insert overwrite local directory '/output/hive/'
    > row format delimited fields terminated by '\t' select * from cat_group;
Query ID = root_20220306062829_b059a3f5-e4ad-4dd7-a000-e294c4ccbee2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0006, Tracking URL = http://master:8088/proxy/application_1646528951444_0006/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:28:51,743 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:29:00,515 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.59 sec
MapReduce Total cumulative CPU time: 1 seconds 590 msec
Ended Job = job_1646528951444_0006
Moving data to local directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.59 sec   HDFS Read: 4738 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
Time taken: 32.116 seconds
[root@master out]# cd /output/hive/
[root@master hive]# ll
total 4
-rw-r--r--. 1 root root 64 Mar  6 06:29 000000_0
[root@master hive]# cat 000000_0 
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯

注意:方法和导入数据到 Hive不一样,不能用insert into来将数据导出。


       2、Hive中数据导出到HDFS中


将Hive中的表cat_group中的数据导入到HDFS的/output/hive目录里。


hive> insert overwrite directory '/output/hive' 
    > row format delimited fields terminated by '\t' select group_id,
    > group_name from cat_group;
Query ID = root_20220306063621_b359d338-77ee-4571-a425-5415f9c6fb03
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0007, Tracking URL = http://master:8088/proxy/application_1646528951444_0007/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:36:41,866 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:36:55,679 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1646528951444_0007
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/output/hive/.hive-staging_hive_2022-03-06_06-36-21_452_7432529204143275493-1/-ext-10000
Moving data to directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.75 sec   HDFS Read: 4772 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 36.494 seconds

在HDFS上查看结果


[root@master hive]# hadoop fs -ls /output/hive
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-06 06:36 /output/hive/000000_0

       3、导出到Hive的另一张表中


将Hive中表cat_group中的数据导入到cat_group4中(两表字段及字符类型相同)。

首先在Hive 中创建一个表cat_group4,有group_id和group_name 两个字段,字符类型为string,以\t’为分隔符。


hive> create table cat_group4(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.195 seconds

然后将cat_group中的数据导入到cat_group4中。


hive> insert into table cat_group4 select * from cat_group;
Query ID = root_20220306064421_722364dd-7475-4ae5-ba44-553f3df856e2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0008, Tracking URL = http://master:8088/proxy/application_1646528951444_0008/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 06:44:47,514 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:44:58,359 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
2022-03-06 06:45:11,880 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1646528951444_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group4/.hive-staging_hive_2022-03-06_06-44-21_318_6696628966307745769-1/-ext-10000
Loading data to table db.cat_group4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 13474 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
Time taken: 52.617 seconds

导入完成后,查看cat_group4表中数据。

hive> select * from cat_group4 limit 10;
OK
101 孙悟空
102 唐僧
103 猪八戒
104 沙僧
105 托马斯
Time taken: 0.249 seconds, Fetched: 5 row(s)

六、Hive分区表的操作


创建表分区,在 Hive中创建一个分区表goods,包含 goods_id和goods_status两个字段,字符类型为string,分区为cat_id,字符类型为string,以“\t“为分隔符。


hive> create table goods(goods_id string,goods_status string) partitioned by (cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.107 seconds

查看表 goods 结构


hive> desc goods;
OK
goods_id              string                                   
goods_status          string                                   
cat_id                string                                   
# Partition Information    
# col_name              data_type            comment             
cat_id                string                                   
Time taken: 0.108 seconds, Fetched: 7 row(s)

向分区表插入数据,将本地/output/hive下的表goods 中数据,插入到分区表goods中。

[root@master hive]# cat goods 
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052


在Hive中创建一个非分区表goods_1表,用于存储本地/input/hive/下的表goods 中数据。


hive> create table goods_1(goods_id string,goods_status string,cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.179 seconds

将本地/input/hive/下的表goods 中数据导入到Hive中的goods_1表中。


hive> load data local inpath '/input/hive/goods' into table goods_1;
Loading data to table db.goods_1
OK
Time taken: 0.511 seconds

再将表goods_1中的数据导入到分区表goods中


hive> insert into table db.goods partition(cat_id = '52052') select goods_id, goods_status from db.goods_1 where cat_id = '52052';
Query ID = root_20220307041832_30f47fc3-629d-4eda-821a-5f0c3a9edb0d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0002, Tracking URL = http://master:8088/proxy/application_1646636256603_0002/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 04:19:05,274 Stage-1 map = 0%,  reduce = 0%
2022-03-07 04:19:18,487 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
2022-03-07 04:19:27,292 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.59 sec
MapReduce Total cumulative CPU time: 4 seconds 590 msec
Ended Job = job_1646636256603_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-07_04-18-32_060_6446641423854979060-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.59 sec   HDFS Read: 14777 HDFS Write: 320 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 590 msec
OK
Time taken: 59.931 seconds

查看表 goods中数据


hive> select goods_id, goods_status from goods;
OK
1624123 6
1020405 6
1020405 6
1020405 6
1020405 6
1020405 6
1020405 6
1020405 6
Time taken: 0.252 seconds, Fetched: 8 row(s)

修改表分区,将分区表goods中的分区列 cat_id = 52050 改为cat_id = 52051,并查看修改后的分区名。

hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.678 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.139 seconds, Fetched: 1 row(s)

删除表分区


在删除goods分区表之前,先将goods表备份出一个goods_2表

hive> create table goods_2(goods_id string,goods_status string) partitioned by (cat_id string) row format delimited fields terminated by '\t';
OK
Time taken: 0.178 seconds
hive> insert into table goods_2 partition(cat_id='52052') select goods_id,goods_status from goods_1 where cat_id = '52052';
Query ID = root_20220307054238_db58a379-17f6-4ecb-86e0-402e0d7bbf54
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0003, Tracking URL = http://master:8088/proxy/application_1646636256603_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 05:43:04,534 Stage-1 map = 0%,  reduce = 0%
2022-03-07 05:43:17,542 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.76 sec
2022-03-07 05:43:26,197 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1646636256603_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods_2/cat_id=52052/.hive-staging_hive_2022-03-07_05-42-38_498_2225361888387483704-1/-ext-10000
Loading data to table db.goods_2 partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 14813 HDFS Write: 322 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 49.84 seconds

删除goods表中的cat_id 分区


hive> alter table goods drop if exists partition(cat_id = '52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.405 seconds
hive> show partitions goods;
OK
Time taken: 0.137 seconds

七、Hive桶的操作


在建立桶之前,需要设置hive.enforce.bucketing属性为true,使用Hive能识别桶。


1、创建桶


创建一个名为goods_t表,包含两个字段goods_id 和goods_status ,字段类型都为string ,按cat_id string 做分区,按goods_status 列聚类和goods_id列排列,划分成两个桶。


hive> create table goods_t(goods_id string, goods_status string) partitioned by (cat_id string) clustered by(goods_status) sorted by(goods_id) into 2 buckets;
OK
Time taken: 0.148 seconds

2、设置环境变量set hive.enforce.bucketing=true;


hive> set hive.enforce.bucketing=true;

3、向goods_t表中插入goods_2表中的数据


hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods_2;
Query ID = root_20220307060336_c76fa90c-ea59-4fa4-9dd5-654c843421fd
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0004, Tracking URL = http://master:8088/proxy/application_1646636256603_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-07 06:04:01,531 Stage-1 map = 0%,  reduce = 0%
2022-03-07 06:04:12,389 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.73 sec
2022-03-07 06:04:29,170 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 4.23 sec
2022-03-07 06:04:30,371 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.99 sec
MapReduce Total cumulative CPU time: 7 seconds 410 msec
Ended Job = job_1646636256603_0004
Loading data to table db.goods_t partition (cat_id=52063)
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0005, Tracking URL = http://master:8088/proxy/application_1646636256603_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-03-07 06:04:54,726 Stage-3 map = 0%,  reduce = 0%
2022-03-07 06:05:07,008 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
2022-03-07 06:05:16,566 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.93 sec
MapReduce Total cumulative CPU time: 3 seconds 930 msec
Ended Job = job_1646636256603_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.41 sec   HDFS Read: 19414 HDFS Write: 469 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.93 sec   HDFS Read: 11591 HDFS Write: 173 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 340 msec
OK
Time taken: 102.151 seconds

4、抽样桶表


hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status);
OK
Time taken: 0.281 seconds
相关文章
|
8月前
|
SQL 存储 分布式计算
Hive数据仓库设计与优化策略:面试经验与必备知识点解析
本文深入探讨了Hive数据仓库设计原则(分区、分桶、存储格式选择)与优化策略(SQL优化、内置优化器、统计信息、配置参数调整),并分享了面试经验及常见问题,如Hive与RDBMS的区别、实际项目应用和与其他组件的集成。通过代码样例,帮助读者掌握Hive核心技术,为面试做好充分准备。
686 0
|
3月前
|
SQL 分布式计算 关系型数据库
Hadoop-13-Hive 启动Hive 修改启动参数命令行启动测试 几句简单的HQL了解Hive
Hadoop-13-Hive 启动Hive 修改启动参数命令行启动测试 几句简单的HQL了解Hive
79 2
|
3月前
|
SQL 分布式计算 Hadoop
Hadoop-14-Hive HQL学习与测试 表连接查询 HDFS数据导入导出等操作 逻辑运算 函数查询 全表查询 WHERE GROUP BY ORDER BY(一)
Hadoop-14-Hive HQL学习与测试 表连接查询 HDFS数据导入导出等操作 逻辑运算 函数查询 全表查询 WHERE GROUP BY ORDER BY(一)
59 4
|
3月前
|
SQL
Hadoop-14-Hive HQL学习与测试 表连接查询 HDFS数据导入导出等操作 逻辑运算 函数查询 全表查询 WHERE GROUP BY ORDER BY(二)
Hadoop-14-Hive HQL学习与测试 表连接查询 HDFS数据导入导出等操作 逻辑运算 函数查询 全表查询 WHERE GROUP BY ORDER BY(二)
49 2
|
4月前
|
SQL JavaScript 前端开发
基于Python访问Hive的pytest测试代码实现
根据《用Java、Python来开发Hive应用》一文,建立了使用Python、来开发Hive应用的方法,产生的代码如下
85 6
基于Python访问Hive的pytest测试代码实现
|
4月前
|
SQL JavaScript 前端开发
基于Java访问Hive的JUnit5测试代码实现
根据《用Java、Python来开发Hive应用》一文,建立了使用Java、来开发Hive应用的方法,产生的代码如下
83 6
|
8月前
|
SQL 分布式计算 关系型数据库
【数据仓库与联机分析处理】数据仓库工具Hive
【数据仓库与联机分析处理】数据仓库工具Hive
129 6
|
7月前
|
SQL 存储 关系型数据库
杨校老师课题之Hive数据仓库搭建2
杨校老师课题之Hive数据仓库搭建
58 0
|
7月前
|
SQL 存储 关系型数据库
杨校老师课题之Hive数据仓库搭建1
杨校老师课题之Hive数据仓库搭建
86 0
|
8月前
|
SQL 存储 分布式计算
基于Hadoop数据仓库Hive1.2部署及使用
基于Hadoop数据仓库Hive1.2部署及使用