0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110102);
OK
No rows affected (0.088 seconds)
0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110103);
OK
No rows affected (0.067 seconds)
0: jdbc:hive2://hadoopmaster:10000/> alter table supplybypartition add partition(day=20110104);
OK
No rows affected (0.083 seconds)
0: jdbc:hive2://hadoopmaster:10000/> select * from supplybypartition
. . . . . . . . . . . . . . . . . .> where day>=20110102 and day<20110103 and quantity<4;
OK
+-----------------------+-------------------------+-----------------------------+------------------------+--+
| supplybypartition.id  | supplybypartition.part  | supplybypartition.quantity  | supplybypartition.day  |
+-----------------------+-------------------------+-----------------------------+------------------------+--+
+-----------------------+-------------------------+-----------------------------+------------------------+--+
No rows selected (0.162 seconds)
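To confirm the partitions added above are registered in the metastore, SHOW PARTITIONS lists them (the output will also include any partitions created before this step):

```sql
SHOW PARTITIONS supplybypartition;
```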
A number used as the seed for percentage sampling. By changing this number, the user changes which subset of the data is sampled.
Some drawbacks of bucketing:
If data files are loaded into a bucketed table via LOAD, extra MapReduce overhead is incurred.
In practice, bucketing is used relatively rarely in production; partitioning is far more common.
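The seed described above comes into play with percentage sampling. A minimal sketch, assuming a bucketed table named `bucketed_users` (the table name is illustrative):

```sql
-- Sample roughly 0.1% of the table's underlying data.
SELECT * FROM bucketed_users TABLESAMPLE(0.1 PERCENT);

-- Changing hive.sample.seednumber selects a different subset
-- for the same sampling percentage.
SET hive.sample.seednumber=7;
SELECT * FROM bucketed_users TABLESAMPLE(0.1 PERCENT);
```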
II. Transactions
1. Creating the table
hive> create table test_trancaction
    > (user_id Int, name String)
    > clustered by (user_id) into 3 buckets
    > stored as orc TBLPROPERTIES ('transactional'='true');
OK
Time taken: 0.813 seconds
hive> create table test_insert_test(id int, name string)
    > row format delimited fields TERMINATED BY ',';
OK
Time taken: 0.11 seconds
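A table with `'transactional'='true'` is only usable if ACID support is enabled on the session/server. A typical session-level setup, using the commonly documented properties (compactor settings normally belong on the metastore side):

```sql
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;  -- needed on Hive versions before 2.0
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
```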
2. Loading data
hive> insert into test_insert_test values (3, "ma");
hive> delete from test_insert_test where id=1;
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
hive> delete from test_trancaction where user_id=1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160810104829_0e78e0cd-2bc9-4741-89c1-7a8d1f384682
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1470228460967_0010, Tracking URL = http://hadoopmaster:8088/proxy/application_1470228460967_0010/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1470228460967_0010
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2016-08-10 10:48:36,463 Stage-1 map = 0%, reduce = 0%
2016-08-10 10:48:41,784 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 0.97 sec
2016-08-10 10:48:46,913 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.0 sec
2016-08-10 10:48:48,970 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
2016-08-10 10:48:50,020 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 4.1 sec
2016-08-10 10:48:54,117 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.76 sec
MapReduce Total cumulative CPU time: 5 seconds 760 msec
Ended Job = job_1470228460967_0010
Loading data to table default.test_trancaction
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3  Cumulative CPU: 5.76 sec  HDFS Read: 32745  HDFS Write: 701  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
OK
Time taken: 26.074 seconds
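UPDATE works on the transactional table in the same way DELETE does above, with one documented restriction: the bucketing column (`user_id` here) cannot appear in the SET clause. A sketch:

```sql
-- Allowed: update a non-bucketing column.
UPDATE test_trancaction SET name = 'zhang' WHERE user_id = 2;

-- Not allowed: user_id is the clustering column,
-- so SET user_id = ... would be rejected.
```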