DWD Layer Data Import Script
Create the script dwd_db.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall

# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_sku_info partition(dt)
select
    sku.id,
    sku.spu_id,
    sku.price,
    sku.sku_name,
    sku.sku_desc,
    sku.weight,
    sku.tm_id,
    sku.category3_id,
    c2.id category2_id,
    c1.id category1_id,
    c3.name category3_name,
    c2.name category2_name,
    c1.name category1_name,
    sku.create_time,
    sku.dt
from "$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c2.dt='$do_date' and c3.dt='$do_date' and c1.dt='$do_date'
  and sku.id is not null;
"
hive -e "$sql"
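Before the first run, give the script execute permission (assuming it was saved in the current directory):

chmod +x dwd_db.sh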
Execute the script: dwd_db.sh 2020-11-24
INFO : Status: DAG finished successfully in 9.01 seconds INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 1.16s INFO : Prepare Plan 0.16s INFO : Get Query Coordinator (AM) 0.01s INFO : Submit Plan 0.10s INFO : Start DAG 0.70s INFO : Run DAG 9.01s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 506.00 380 0 143 125 INFO : Map 2 0.00 110 0 18 18 INFO : Map 3 4955.00 10,340 90 1,574 1 INFO : Map 5 0.00 510 0 300 300 INFO : Reducer 4 0.00 610 0 1 0 INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : org.apache.tez.common.counters.DAGCounter: INFO : NUM_SUCCEEDED_TASKS: 6 INFO : TOTAL_LAUNCHED_TASKS: 6 INFO : DATA_LOCAL_TASKS: 4 INFO : AM_CPU_MILLISECONDS: 2930 INFO : AM_GC_TIME_MILLIS: 0 INFO : File System Counters: INFO : FILE_BYTES_READ: 4550 INFO : FILE_BYTES_WRITTEN: 36200 INFO : HDFS_BYTES_READ: 53605 INFO : HDFS_BYTES_WRITTEN: 54917 INFO : HDFS_READ_OPS: 13 INFO : HDFS_WRITE_OPS: 7 INFO : HDFS_OP_CREATE: 3 INFO : HDFS_OP_GET_FILE_STATUS: 9 INFO : HDFS_OP_MKDIRS: 2 INFO : HDFS_OP_OPEN: 4 INFO : HDFS_OP_RENAME: 2 INFO : org.apache.tez.common.counters.TaskCounter: INFO : REDUCE_INPUT_GROUPS: 1 INFO : REDUCE_INPUT_RECORDS: 1 INFO : COMBINE_INPUT_RECORDS: 0 INFO : SPILLED_RECORDS: 2 INFO : NUM_SHUFFLED_INPUTS: 4 INFO : NUM_SKIPPED_INPUTS: 1 INFO : NUM_FAILED_SHUFFLE_INPUTS: 0 INFO : MERGED_MAP_OUTPUTS: 1 INFO : GC_TIME_MILLIS: 90 INFO : TASK_DURATION_MILLIS: 5711 INFO : CPU_MILLISECONDS: 11950 INFO : PHYSICAL_MEMORY_BYTES: 4435476480 INFO : VIRTUAL_MEMORY_BYTES: 32940134400 INFO : COMMITTED_HEAP_BYTES: 4435476480 INFO : INPUT_RECORDS_PROCESSED: 2035 INFO : INPUT_SPLIT_LENGTH_BYTES: 53605 INFO : OUTPUT_RECORDS: 444 INFO : OUTPUT_LARGE_RECORDS: 0 INFO : OUTPUT_BYTES: 41244 INFO : OUTPUT_BYTES_WITH_OVERHEAD: 42158 INFO : OUTPUT_BYTES_PHYSICAL: 32070 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 4050 INFO : ADDITIONAL_SPILLS_BYTES_READ: 4050 INFO : ADDITIONAL_SPILL_COUNT: 0 INFO : SHUFFLE_CHUNK_COUNT: 1 INFO : SHUFFLE_BYTES: 31998 INFO : SHUFFLE_BYTES_DECOMPRESSED: 42158 INFO : SHUFFLE_BYTES_TO_MEM: 31642 INFO : SHUFFLE_BYTES_TO_DISK: 0 INFO : SHUFFLE_BYTES_DISK_DIRECT: 356 INFO : NUM_MEM_TO_DISK_MERGES: 0 INFO : NUM_DISK_TO_DISK_MERGES: 0 INFO : SHUFFLE_PHASE_TIME: 1149 INFO : MERGE_PHASE_TIME: 88 INFO : FIRST_EVENT_RECEIVED: 690 INFO : LAST_EVENT_RECEIVED: 690 INFO : DATA_BYTES_VIA_EVENT: 0 INFO : HIVE: INFO : CREATED_DYNAMIC_PARTITIONS: 1 INFO : CREATED_FILES: 2 INFO : DESERIALIZE_ERRORS: 0 INFO : RECORDS_IN_Map_1: 125 INFO : RECORDS_IN_Map_2: 18 INFO : RECORDS_IN_Map_3: 1149 INFO : RECORDS_IN_Map_5: 300 INFO : RECORDS_OUT_0: 1 INFO : RECORDS_OUT_1_gmall.dwd_sku_info: 300 INFO : RECORDS_OUT_INTERMEDIATE_Map_1: 125 INFO : RECORDS_OUT_INTERMEDIATE_Map_2: 18 INFO : RECORDS_OUT_INTERMEDIATE_Map_3: 1 INFO : RECORDS_OUT_INTERMEDIATE_Map_5: 300 INFO : RECORDS_OUT_INTERMEDIATE_Reducer_4: 0 INFO : RECORDS_OUT_OPERATOR_FIL_41: 
1149 INFO : RECORDS_OUT_OPERATOR_FIL_49: 18 INFO : RECORDS_OUT_OPERATOR_FIL_52: 125 INFO : RECORDS_OUT_OPERATOR_FIL_56: 300 INFO : RECORDS_OUT_OPERATOR_FS_22: 300 INFO : RECORDS_OUT_OPERATOR_FS_29: 1 INFO : RECORDS_OUT_OPERATOR_GBY_25: 1 INFO : RECORDS_OUT_OPERATOR_GBY_27: 1 INFO : RECORDS_OUT_OPERATOR_MAPJOIN_44: 1149 INFO : RECORDS_OUT_OPERATOR_MAPJOIN_45: 300 INFO : RECORDS_OUT_OPERATOR_MAPJOIN_54: 125 INFO : RECORDS_OUT_OPERATOR_MAP_0: 0 INFO : RECORDS_OUT_OPERATOR_RS_26: 1 INFO : RECORDS_OUT_OPERATOR_RS_51: 18 INFO : RECORDS_OUT_OPERATOR_RS_55: 125 INFO : RECORDS_OUT_OPERATOR_RS_58: 300 INFO : RECORDS_OUT_OPERATOR_SEL_21: 300 INFO : RECORDS_OUT_OPERATOR_SEL_24: 300 INFO : RECORDS_OUT_OPERATOR_SEL_28: 1 INFO : RECORDS_OUT_OPERATOR_SEL_50: 18 INFO : RECORDS_OUT_OPERATOR_SEL_53: 125 INFO : RECORDS_OUT_OPERATOR_SEL_57: 300 INFO : RECORDS_OUT_OPERATOR_SEL_8: 1149 INFO : RECORDS_OUT_OPERATOR_TS_0: 125 INFO : RECORDS_OUT_OPERATOR_TS_3: 18 INFO : RECORDS_OUT_OPERATOR_TS_6: 1149 INFO : RECORDS_OUT_OPERATOR_TS_9: 300 INFO : TOTAL_TABLE_ROWS_WRITTEN: 300 INFO : Shuffle Errors: INFO : BAD_ID: 0 INFO : CONNECTION: 0 INFO : IO_ERROR: 0 INFO : WRONG_LENGTH: 0 INFO : WRONG_MAP: 0 INFO : WRONG_REDUCE: 0 INFO : Shuffle Errors_Reducer_4_INPUT_Map_3: INFO : BAD_ID: 0 INFO : CONNECTION: 0 INFO : IO_ERROR: 0 INFO : WRONG_LENGTH: 0 INFO : WRONG_MAP: 0 INFO : WRONG_REDUCE: 0 INFO : TaskCounter_Map_1_INPUT_Map_2: INFO : FIRST_EVENT_RECEIVED: 9 INFO : INPUT_RECORDS_PROCESSED: 18 INFO : LAST_EVENT_RECEIVED: 9 INFO : NUM_FAILED_SHUFFLE_INPUTS: 0 INFO : NUM_SHUFFLED_INPUTS: 1 INFO : SHUFFLE_BYTES: 356 INFO : SHUFFLE_BYTES_DECOMPRESSED: 348 INFO : SHUFFLE_BYTES_DISK_DIRECT: 356 INFO : SHUFFLE_BYTES_TO_DISK: 0 INFO : SHUFFLE_BYTES_TO_MEM: 0 INFO : SHUFFLE_PHASE_TIME: 50 INFO : TaskCounter_Map_1_INPUT_c2: INFO : INPUT_RECORDS_PROCESSED: 125 INFO : INPUT_SPLIT_LENGTH_BYTES: 2181 INFO : TaskCounter_Map_1_OUTPUT_Map_3: INFO : ADDITIONAL_SPILLS_BYTES_READ: 0 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 0 INFO : ADDITIONAL_SPILL_COUNT: 0 INFO : DATA_BYTES_VIA_EVENT: 0 INFO : OUTPUT_BYTES: 4131 INFO : OUTPUT_BYTES_PHYSICAL: 2219 INFO : OUTPUT_BYTES_WITH_OVERHEAD: 4387 INFO : OUTPUT_LARGE_RECORDS: 0 INFO : OUTPUT_RECORDS: 125 INFO : SPILLED_RECORDS: 0 INFO : TaskCounter_Map_2_INPUT_c1: INFO : INPUT_RECORDS_PROCESSED: 18 INFO : INPUT_SPLIT_LENGTH_BYTES: 270 INFO : TaskCounter_Map_2_OUTPUT_Map_1: INFO : ADDITIONAL_SPILLS_BYTES_READ: 0 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 0 INFO : ADDITIONAL_SPILL_COUNT: 0 INFO : DATA_BYTES_VIA_EVENT: 0 INFO : OUTPUT_BYTES: 306 INFO : OUTPUT_BYTES_PHYSICAL: 380 INFO : OUTPUT_BYTES_WITH_OVERHEAD: 348 INFO : OUTPUT_LARGE_RECORDS: 0 INFO : OUTPUT_RECORDS: 18 INFO : SPILLED_RECORDS: 0 INFO : TaskCounter_Map_3_INPUT_Map_1: INFO : FIRST_EVENT_RECEIVED: 329 INFO : INPUT_RECORDS_PROCESSED: 125 INFO : LAST_EVENT_RECEIVED: 329 INFO : NUM_FAILED_SHUFFLE_INPUTS: 0 INFO : NUM_SHUFFLED_INPUTS: 1 INFO : SHUFFLE_BYTES: 2195 INFO : SHUFFLE_BYTES_DECOMPRESSED: 4387 INFO : SHUFFLE_BYTES_DISK_DIRECT: 0 INFO : SHUFFLE_BYTES_TO_DISK: 0 INFO : SHUFFLE_BYTES_TO_MEM: 2195 INFO : SHUFFLE_PHASE_TIME: 511 INFO : TaskCounter_Map_3_INPUT_Map_5: INFO : FIRST_EVENT_RECEIVED: 329 INFO : INPUT_RECORDS_PROCESSED: 300 INFO : LAST_EVENT_RECEIVED: 329 INFO : NUM_FAILED_SHUFFLE_INPUTS: 0 INFO : NUM_SHUFFLED_INPUTS: 1 INFO : SHUFFLE_BYTES: 25397 INFO : SHUFFLE_BYTES_DECOMPRESSED: 30468 INFO : SHUFFLE_BYTES_DISK_DIRECT: 0 INFO : SHUFFLE_BYTES_TO_DISK: 0 INFO : SHUFFLE_BYTES_TO_MEM: 25397 INFO : SHUFFLE_PHASE_TIME: 514 INFO : 
TaskCounter_Map_3_INPUT_c3: INFO : INPUT_RECORDS_PROCESSED: 1149 INFO : INPUT_SPLIT_LENGTH_BYTES: 21652 INFO : TaskCounter_Map_3_OUTPUT_Reducer_4: INFO : ADDITIONAL_SPILLS_BYTES_READ: 0 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 0 INFO : ADDITIONAL_SPILL_COUNT: 0 INFO : OUTPUT_BYTES: 6945 INFO : OUTPUT_BYTES_PHYSICAL: 4050 INFO : OUTPUT_BYTES_WITH_OVERHEAD: 6955 INFO : OUTPUT_LARGE_RECORDS: 0 INFO : OUTPUT_RECORDS: 1 INFO : SHUFFLE_CHUNK_COUNT: 1 INFO : SPILLED_RECORDS: 1 INFO : TaskCounter_Map_5_INPUT_sku: INFO : INPUT_RECORDS_PROCESSED: 300 INFO : INPUT_SPLIT_LENGTH_BYTES: 29502 INFO : TaskCounter_Map_5_OUTPUT_Map_3: INFO : ADDITIONAL_SPILLS_BYTES_READ: 0 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 0 INFO : ADDITIONAL_SPILL_COUNT: 0 INFO : DATA_BYTES_VIA_EVENT: 0 INFO : OUTPUT_BYTES: 29862 INFO : OUTPUT_BYTES_PHYSICAL: 25421 INFO : OUTPUT_BYTES_WITH_OVERHEAD: 30468 INFO : OUTPUT_LARGE_RECORDS: 0 INFO : OUTPUT_RECORDS: 300 INFO : SPILLED_RECORDS: 0 INFO : TaskCounter_Reducer_4_INPUT_Map_3: INFO : ADDITIONAL_SPILLS_BYTES_READ: 4050 INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 4050 INFO : COMBINE_INPUT_RECORDS: 0 INFO : FIRST_EVENT_RECEIVED: 23 INFO : LAST_EVENT_RECEIVED: 23 INFO : MERGED_MAP_OUTPUTS: 1 INFO : MERGE_PHASE_TIME: 88 INFO : NUM_DISK_TO_DISK_MERGES: 0 INFO : NUM_FAILED_SHUFFLE_INPUTS: 0 INFO : NUM_MEM_TO_DISK_MERGES: 0 INFO : NUM_SHUFFLED_INPUTS: 1 INFO : NUM_SKIPPED_INPUTS: 1 INFO : REDUCE_INPUT_GROUPS: 1 INFO : REDUCE_INPUT_RECORDS: 1 INFO : SHUFFLE_BYTES: 4050 INFO : SHUFFLE_BYTES_DECOMPRESSED: 6955 INFO : SHUFFLE_BYTES_DISK_DIRECT: 0 INFO : SHUFFLE_BYTES_TO_DISK: 0 INFO : SHUFFLE_BYTES_TO_MEM: 4050 INFO : SHUFFLE_PHASE_TIME: 74 INFO : SPILLED_RECORDS: 1 INFO : TaskCounter_Reducer_4_OUTPUT_out_Reducer_4: INFO : OUTPUT_RECORDS: 0 INFO : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters: INFO : GROUPED_INPUT_SPLITS_Map_1: 1 INFO : GROUPED_INPUT_SPLITS_Map_2: 1 INFO : GROUPED_INPUT_SPLITS_Map_3: 1 INFO : GROUPED_INPUT_SPLITS_Map_5: 1 INFO : INPUT_DIRECTORIES_Map_1: 1 INFO : INPUT_DIRECTORIES_Map_2: 1 INFO : INPUT_DIRECTORIES_Map_3: 1 INFO : INPUT_DIRECTORIES_Map_5: 1 INFO : INPUT_FILES_Map_1: 1 INFO : INPUT_FILES_Map_2: 1 INFO : INPUT_FILES_Map_3: 1 INFO : INPUT_FILES_Map_5: 1 INFO : RAW_INPUT_SPLITS_Map_1: 1 INFO : RAW_INPUT_SPLITS_Map_2: 1 INFO : RAW_INPUT_SPLITS_Map_3: 1 INFO : RAW_INPUT_SPLITS_Map_5: 1 ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 2 .......... container SUCCEEDED 1 1 0 0 0 0 Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Map 5 .......... container SUCCEEDED 1 1 0 0 0 0 Map 3 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 4 ...... container SUCCEEDED 2 2 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 05/05 [==========================>>] 100% ELAPSED TIME: 9.59 s ---------------------------------------------------------------------------------------------- 300 rows affected (11.721 seconds) Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
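To spot-check the load, count one of the freshly written DWD partitions; the run log above reports 300 rows written to gmall.dwd_sku_info for this date:

hive -e "select count(*) from gmall.dwd_sku_info where dt='2020-11-24';"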
DWS Layer: User Behavior Wide Table
1) Why build a wide table
The goal is to aggregate each user's single-day behavior into one multi-column wide table, so that it can later be joined with user dimension information and analyzed from different angles.
Create the user behavior wide table
drop table if exists dws_user_action;
create external table dws_user_action
(
    user_id        string         comment '用户id',
    order_count    bigint         comment '下单次数',
    order_amount   decimal(16,2)  comment '下单金额',
    payment_count  bigint         comment '支付次数',
    payment_amount decimal(16,2)  comment '支付金额'
) COMMENT '每日用户行为宽表'
PARTITIONED BY (`dt` string)
stored as parquet
tblproperties ("parquet.compression"="snappy");
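Once populated, the wide table can be joined back to the user dimension for multi-angle analysis in a single query. The example below is only a sketch: it assumes gmall.dwd_user_info exposes an id and a gender column, which is not shown in this article.

select
    u.gender,
    sum(a.order_amount)   total_order_amount,
    sum(a.payment_amount) total_payment_amount
from gmall.dws_user_action a
join gmall.dwd_user_info u
  on a.user_id = u.id and u.dt = a.dt
where a.dt = '2020-11-24'
group by u.gender;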
User Behavior Wide Table Data Import Script
Create the script dws_db_wide.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall

# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
with tmp_order as
(
    select
        user_id,
        count(*) order_count,
        sum(oi.total_amount) order_amount
    from "$APP".dwd_order_info oi
    where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
    group by user_id
),
tmp_payment as
(
    select
        user_id,
        sum(pi.total_amount) payment_amount,
        count(*) payment_count
    from "$APP".dwd_payment_info pi
    where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
    group by user_id
)
insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
    user_actions.user_id,
    sum(user_actions.order_count),
    sum(user_actions.order_amount),
    sum(user_actions.payment_count),
    sum(user_actions.payment_amount)
from
(
    select
        user_id,
        order_count,
        order_amount,
        0 payment_count,
        0 payment_amount
    from tmp_order

    union all

    select
        user_id,
        0 order_count,
        0 order_amount,
        payment_count,
        payment_amount
    from tmp_payment
) user_actions
group by user_id;
"
hive -e "$sql"
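In the union all, each branch pads the measures it does not have with 0, so the outer group by user_id can simply sum the four columns into one row per user per day. After a run, the wide table can be sampled for a quick sanity check:

hive -e "
select user_id, order_count, order_amount, payment_count, payment_amount
from gmall.dws_user_action
where dt='2020-11-24'
limit 5;"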
Execute the script
dws_db_wide.sh 2020-11-24
INFO : Status: Running (Executing on YARN cluster with App id application_1606125802436_0017) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 4 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 5 ...... container SUCCEEDED 2 2 0 0 0 0 Map 6 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 7 ...... container SUCCEEDED 2 2 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 11.15 s ---------------------------------------------------------------------------------------------- INFO : Status: DAG finished successfully in 10.07 seconds INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 0.72s INFO : Prepare Plan 0.18s INFO : Get Query Coordinator (AM) 0.01s INFO : Submit Plan 4.30s INFO : Start DAG 1.22s INFO : Run DAG 10.07s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 4047.00 7,860 82 1,000 200 INFO : Map 6 4551.00 9,360 115 518 188 INFO : Reducer 2 1011.00 1,730 26 200 307 INFO : Reducer 4 1453.00 3,040 0 388 2 INFO : Reducer 5 443.00 830 8 2 0 INFO : Reducer 7 1011.00 1,970 14 188 188 ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 4 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 5 ...... container SUCCEEDED 2 2 0 0 0 0 Map 6 .......... container SUCCEEDED 1 1 0 0 0 0 .macro.com:8020/warehouse/tablespace/exReducer 7 ...... container SUCCEEDED 2 2 0 0 0 0 3248837236917871195-1/-ext-10000 ---------------------------------------------------------------------------------------------- VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 11.17 s ---------------------------------------------------------------------------------------------- 200 rows affected (17.604 seconds) Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
Check in Hue: the data has been generated.
ADS Layer (Requirement: Total GMV)
Table creation statement
drop table if exists ads_gmv_sum_day;
create external table ads_gmv_sum_day(
    `dt`          string         COMMENT '统计日期',
    `gmv_count`   bigint         COMMENT '当日gmv订单个数',
    `gmv_amount`  decimal(16,2)  COMMENT '当日gmv订单总金额',
    `gmv_payment` decimal(16,2)  COMMENT '当日支付金额'
) COMMENT 'GMV'
row format delimited fields terminated by '\t';
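Since the ADS table is stored as plain tab-delimited text, its files can be inspected directly on HDFS, and the '\t' terminator matches the --input-fields-terminated-by option used by the Sqoop export script further down. A quick look at the data files (the path is the external warehouse location assumed by that export script and may differ on other clusters):

hdfs dfs -cat /warehouse/tablespace/external/hive/gmall.db/ads_gmv_sum_day/* | head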
Data Import Script
Create the script ads_db_gmv.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall

# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
insert into table "$APP".ads_gmv_sum_day
select
    '$do_date' dt,
    sum(order_count) gmv_count,
    sum(order_amount) gmv_amount,
    sum(payment_amount) payment_amount
from "$APP".dws_user_action
where dt ='$do_date'
group by dt;
"
hive -e "$sql"
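Note that the script uses insert into rather than insert overwrite, so re-running it for the same date appends another row for that day instead of replacing it. A quick check of the result in Hive:

hive -e "select * from gmall.ads_gmv_sum_day where dt='2020-11-24';"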
Execute the script
ads_db_gmv.sh 2020-11-24
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging. WARNING: Use "yarn jar" to launch YARN applications. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] Connecting to jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2 20/11/25 14:51:24 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant} 20/11/25 14:51:24 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant} 20/11/25 14:51:24 [main]: INFO jdbc.HiveConnection: Connected to cdh2.macro.com:10000 Connected to: Apache Hive (version 3.1.3000.7.1.3.0-100) Driver: Hive JDBC (version 3.1.3000.7.1.3.0-100) Transaction isolation: TRANSACTION_REPEATABLE_READ INFO : Compiling command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949): insert into table gmall.ads_gmv_sum_day select '2020-11-24' dt, sum(order_count) gmv_count, sum(order_amount) gmv_amount, sum(payment_amount) payment_amount from gmall.dws_user_action where dt ='2020-11-24' group by dt INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:string, comment:null), FieldSchema(name:_col1, type:bigint, comment:null), FieldSchema(name:_col2, type:decimal(16,2), comment:null), FieldSchema(name:_col3, type:decimal(16,2), comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949); Time taken: 0.217 seconds INFO : Executing command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949): insert into table gmall.ads_gmv_sum_day select '2020-11-24' dt, sum(order_count) gmv_count, sum(order_amount) gmv_amount, sum(payment_amount) payment_amount from gmall.dws_user_action where dt ='2020-11-24' group by dt INFO : Query ID = hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: 
[] for queryId: hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949 INFO : Session is already open INFO : Dag name: insert into table gmall.ads_gmv_sum_day...dt (Stage-1) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : Session re-established. INFO : Status: Running (Executing on YARN cluster with App id application_1606125802436_0018) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0 Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 10.00 s ---------------------------------------------------------------------------------------------- INFO : Status: DAG finished successfully in 9.45 seconds INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 0.22s INFO : Prepare Plan 0.06s INFO : Get Query Coordinator (AM) 0.01s INFO : Submit Plan 4.22s INFO : Start DAG 1.31s INFO : Run DAG 9.45s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 3527.00 7,280 102 200 1 INFO : Reducer 2 4397.00 7,540 98 1 2 INFO : Reducer 3 3389.00 630 0 2 0 INFO : ---------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... container SUCCEEDED 2 2 0 0 0 0 tablespace/external/hive/gmall.db/ads_gReducer 3 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 10.01 s ---------------------------------------------------------------------------------------------- 1 row affected (15.842 seconds) Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
Data Export Script
Create the ads_gmv_sum_day table in MySQL
DROP TABLE IF EXISTS ads_gmv_sum_day;
CREATE TABLE ads_gmv_sum_day(
    `dt`          varchar(200)   DEFAULT NULL COMMENT '统计日期',
    `gmv_count`   bigint(20)     DEFAULT NULL COMMENT '当日gmv订单个数',
    `gmv_amount`  decimal(16, 2) DEFAULT NULL COMMENT '当日gmv订单总金额',
    `gmv_payment` decimal(16, 2) DEFAULT NULL COMMENT '当日支付金额'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日GMV统计' ROW_FORMAT = Dynamic;
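The Sqoop export below runs with --update-key dt and --update-mode allowinsert; the MySQL connector implements this as INSERT ... ON DUPLICATE KEY UPDATE, which only deduplicates when the target table has a unique key (the export log later warns about exactly this). The DDL above declares no unique key on dt, so repeated exports of the same day would pile up duplicate rows. If idempotent per-day exports are wanted, one option is to add a unique index, for example:

ALTER TABLE ads_gmv_sum_day ADD UNIQUE KEY uk_dt (`dt`);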
Create the script sqoop_export.sh
vim sqoop_export.sh
#!/bin/bash
export HADOOP_USER_NAME=hive

db_name=gmall

export_data() {
  sqoop export \
    --connect "jdbc:mysql://192.168.0.208:3306/${db_name}?useUnicode=true&characterEncoding=utf-8" \
    --username root \
    --password password \
    --table $1 \
    --num-mappers 1 \
    --export-dir /warehouse/tablespace/external/hive/gmall.db/$1 \
    --input-fields-terminated-by "\t" \
    --update-mode allowinsert \
    --update-key $2 \
    --input-null-string '\\N' \
    --input-null-non-string '\\N'
}

case $1 in
  "ads_gmv_sum_day")
     export_data "ads_gmv_sum_day" "dt"
;;
  "all")
     export_data "ads_gmv_sum_day" "dt"
;;
esac
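The case block makes it straightforward to wire in further ADS tables later: add a branch for the new table with its update key, and call it from the "all" branch as well. A sketch with a hypothetical ads_user_topic table (not created in this article):

case $1 in
  "ads_gmv_sum_day")
     export_data "ads_gmv_sum_day" "dt"
;;
  # hypothetical additional ADS table and its update key
  "ads_user_topic")
     export_data "ads_user_topic" "user_id"
;;
  "all")
     export_data "ads_gmv_sum_day" "dt"
     export_data "ads_user_topic" "user_id"
;;
esac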
Execute the script to export the data
sqoop_export.sh all
Warning: /opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 20/11/25 15:49:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.7.1.3.0-100 20/11/25 15:49:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 20/11/25 15:49:35 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 20/11/25 15:49:35 INFO tool.CodeGenTool: Beginning code generation 20/11/25 15:49:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `ads_gmv_sum_day` AS t LIMIT 1 20/11/25 15:49:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `ads_gmv_sum_day` AS t LIMIT 1 20/11/25 15:49:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce warning: No SupportedSourceVersion annotation found on org.apache.hadoop.hdds.conf.ConfigFileGenerator, returning RELEASE_6. warning: Supported source version 'RELEASE_6' from annotation processor 'org.apache.hadoop.hdds.conf.ConfigFileGenerator' less than -source '1.8' 2 warnings 20/11/25 15:49:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/f6653e4db17d3a1223b93c16acc9d822/ads_gmv_sum_day.jar 20/11/25 15:49:38 WARN manager.MySQLManager: MySQL Connector upsert functionality is using INSERT ON 20/11/25 15:49:38 WARN manager.MySQLManager: DUPLICATE KEY UPDATE clause that relies on table's unique key. 20/11/25 15:49:38 WARN manager.MySQLManager: Insert/update distinction is therefore independent on column 20/11/25 15:49:38 WARN manager.MySQLManager: names specified in --update-key parameter. Please see MySQL 20/11/25 15:49:38 WARN manager.MySQLManager: documentation for additional limitations. 20/11/25 15:49:38 INFO mapreduce.ExportJobBase: Beginning export of ads_gmv_sum_day 20/11/25 15:49:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 20/11/25 15:49:39 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 20/11/25 15:49:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 20/11/25 15:49:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. 
Instead, use mapreduce.job.maps 20/11/25 15:49:40 INFO client.RMProxy: Connecting to ResourceManager at cdh2.macro.com/192.168.0.207:8032 20/11/25 15:49:41 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1606125802436_0019 20/11/25 15:49:45 INFO input.FileInputFormat: Total input files to process : 1 20/11/25 15:49:45 INFO input.FileInputFormat: Total input files to process : 1 20/11/25 15:49:45 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 20/11/25 15:49:45 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 19cb887b751ca6813bec478cf3bf20459deed68b] 20/11/25 15:49:45 INFO mapreduce.JobSubmitter: number of splits:1 20/11/25 15:49:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606125802436_0019 20/11/25 15:49:45 INFO mapreduce.JobSubmitter: Executing with tokens: [] 20/11/25 15:49:45 INFO conf.Configuration: resource-types.xml not found 20/11/25 15:49:45 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 20/11/25 15:49:46 INFO impl.YarnClientImpl: Submitted application application_1606125802436_0019 20/11/25 15:49:46 INFO mapreduce.Job: The url to track the job: http://cdh2.macro.com:8088/proxy/application_1606125802436_0019/ 20/11/25 15:49:46 INFO mapreduce.Job: Running job: job_1606125802436_0019 20/11/25 15:49:54 INFO mapreduce.Job: Job job_1606125802436_0019 running in uber mode : false 20/11/25 15:49:54 INFO mapreduce.Job: map 0% reduce 0% 20/11/25 15:50:02 INFO mapreduce.Job: map 100% reduce 0% 20/11/25 15:50:02 INFO mapreduce.Job: Job job_1606125802436_0019 completed successfully 20/11/25 15:50:02 INFO mapreduce.Job: Counters: 33 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=253343 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=215 HDFS: Number of bytes written=0 HDFS: Number of read operations=4 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 HDFS: Number of bytes read erasure-coded=0 Job Counters Launched map tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=5709 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=5709 Total vcore-milliseconds taken by all map tasks=5709 Total megabyte-milliseconds taken by all map tasks=11692032 Map-Reduce Framework Map input records=1 Map output records=1 Input split bytes=176 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=74 CPU time spent (ms)=1320 Physical memory (bytes) snapshot=285306880 Virtual memory (bytes) snapshot=2846998528 Total committed heap usage (bytes)=320864256 Peak Map Physical memory (bytes)=285306880 Peak Map Virtual memory (bytes)=2846998528 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 20/11/25 15:50:02 INFO mapreduce.ExportJobBase: Transferred 215 bytes in 22.3977 seconds (9.5992 bytes/sec) 20/11/25 15:50:02 INFO mapreduce.ExportJobBase: Exported 1 records.
Check the data in MySQL
MariaDB [gmall]> select * from ads_gmv_sum_day;
+------------+-----------+------------+-------------+
| dt         | gmv_count | gmv_amount | gmv_payment |
+------------+-----------+------------+-------------+
| 2020-11-24 |      1000 |  527431.00 |   268152.00 |
+------------+-----------+------------+-------------+
1 row in set (0.00 sec)