After all the groundwork in the previous posts, we can finally get to the main topic: applying the layered data-warehouse methodology to build a data warehouse in Hive.
ODS Layer
Start the Hive client and create the gmall database.
[root@cdh2 ~]# hive
0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macr> create database gmall;
0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macr> use gmall;
Create the start log table
Analysis of the start log table at the ODS layer:
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE ods_start_log (`line` string)
PARTITIONED BY (`dt` string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Note that because we configured LZO compression for Hive, the table's INPUTFORMAT must be 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'.
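To confirm that the DDL above actually registered the LZO input format, it can help to read the table definition back from the metastore. A minimal sanity check (exact output varies by Hive version):

SHOW CREATE TABLE gmall.ods_start_log;
-- or look for the "InputFormat" line in the storage section of:
DESCRIBE FORMATTED gmall.ods_start_log;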
Loading data into the ODS layer
0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macr> load data inpath '/origin_data/gmall/log/topic_start/2020-11-24' into table gmall.ods_start_log partition(dt='2020-11-24');
INFO : Compiling command(queryId=hive_20201124101935_584d70b8-56bb-4102-aa82-f14e34180f6a): load data inpath '/origin_data/gmall/log/topic_start/2020-11-24' into table gmall.ods_start_log partition(dt='2020-11-24')
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20201124101935_584d70b8-56bb-4102-aa82-f14e34180f6a); Time taken: 0.171 seconds
INFO : Executing command(queryId=hive_20201124101935_584d70b8-56bb-4102-aa82-f14e34180f6a): load data inpath '/origin_data/gmall/log/topic_start/2020-11-24' into table gmall.ods_start_log partition(dt='2020-11-24')
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table gmall.ods_start_log partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/log/topic_start/2020-11-24
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Completed executing command(queryId=hive_20201124101935_584d70b8-56bb-4102-aa82-f14e34180f6a); Time taken: 1.229 seconds
INFO : OK
No rows affected (1.443 seconds)
Querying the data through Hue, we can see that the ods_start_log table now contains data.
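If you prefer Beeline to Hue for this check, a quick query against the new partition works just as well; the statements below are only a sketch, and the row count will depend on how much log data you collected:

-- row count of the partition we just loaded
select count(*) from gmall.ods_start_log where dt='2020-11-24';
-- peek at a couple of raw JSON lines
select line from gmall.ods_start_log where dt='2020-11-24' limit 2;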
DWD Layer
Create the start table
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
  `mid_id` string,
  `user_id` string,
  `version_code` string,
  `version_name` string,
  `lang` string,
  `source` string,
  `os` string,
  `area` string,
  `model` string,
  `brand` string,
  `sdk_version` string,
  `gmail` string,
  `height_width` string,
  `app_time` string,
  `network` string,
  `lng` string,
  `lat` string,
  `entry` string,
  `open_ad_type` string,
  `action` string,
  `loading_time` string,
  `detail` string,
  `extend1` string
)
PARTITIONED BY (dt string);
Data loading script for the DWD-layer start table
Create a new script, dwd_start_log.sh:
#!/bin/bash
# Define a variable to make later changes easier
APP=gmall
# If a date is supplied as the first argument, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_start_log
PARTITION (dt='$do_date')
select
    get_json_object(line,'$.mid') mid_id,
    get_json_object(line,'$.uid') user_id,
    get_json_object(line,'$.vc') version_code,
    get_json_object(line,'$.vn') version_name,
    get_json_object(line,'$.l') lang,
    get_json_object(line,'$.sr') source,
    get_json_object(line,'$.os') os,
    get_json_object(line,'$.ar') area,
    get_json_object(line,'$.md') model,
    get_json_object(line,'$.ba') brand,
    get_json_object(line,'$.sv') sdk_version,
    get_json_object(line,'$.g') gmail,
    get_json_object(line,'$.hw') height_width,
    get_json_object(line,'$.t') app_time,
    get_json_object(line,'$.nw') network,
    get_json_object(line,'$.ln') lng,
    get_json_object(line,'$.la') lat,
    get_json_object(line,'$.entry') entry,
    get_json_object(line,'$.open_ad_type') open_ad_type,
    get_json_object(line,'$.action') action,
    get_json_object(line,'$.loading_time') loading_time,
    get_json_object(line,'$.detail') detail,
    get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log
where dt='$do_date';
"
hive -e "$sql"
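The whole DWD transformation hinges on Hive's get_json_object, which pulls a single field out of the raw JSON line by a JSONPath expression such as '$.mid'. A standalone illustration with a made-up log line (the JSON literal below is hypothetical; it just mirrors the abbreviated field names used in the script):

select
  get_json_object('{"mid":"mid_995","uid":"325","nw":"4G"}', '$.mid') as mid_id,
  get_json_object('{"mid":"mid_995","uid":"325","nw":"4G"}', '$.uid') as user_id,
  get_json_object('{"mid":"mid_995","uid":"325","nw":"4G"}', '$.nw')  as network;
-- expected result: mid_995    325    4G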
Add execute permission to the script:
chmod 777 dwd_start_log.sh
Run the script:
dwd_start_log.sh 2020-11-24
[root@cdh2]# dwd_start_log.sh 2020-11-24
INFO : Compiling command(queryId=hive_20201124203254_3e88639f-e439-40d3-bd7b-bbbb2159fb3b): insert overwrite table gmall.dwd_start_log PARTITION (dt='2020-11-24') select get_json_object(line,'$.mid') mid_id, get_json_object(line,'$.uid') user_id, get_json_object(line,'$.vc') version_code, get_json_object(line,'$.vn') version_name, get_json_object(line,'$.l') lang, get_json_object(line,'$.sr') source, get_json_object(line,'$.os') os, get_json_object(line,'$.ar') area, get_json_object(line,'$.md') model, get_json_object(line,'$.ba') brand, get_json_object(line,'$.sv') sdk_version, get_json_object(line,'$.g') gmail, get_json_object(line,'$.hw') height_width, get_json_object(line,'$.t') app_time, get_json_object(line,'$.nw') network, get_json_object(line,'$.ln') lng, get_json_object(line,'$.la') lat, get_json_object(line,'$.entry') entry, get_json_object(line,'$.open_ad_type') open_ad_type, get_json_object(line,'$.action') action, get_json_object(line,'$.loading_time') loading_time, get_json_object(line,'$.detail') detail, get_json_object(line,'$.extend1') extend1 from gmall.ods_start_log where dt='2020-11-24'
INFO : No Stats for gmall@ods_start_log, Columns: line
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:mid_id, type:string, comment:null), FieldSchema(name:user_id, type:string, comment:null), FieldSchema(name:version_code, type:string, comment:null), FieldSchema(name:version_name, type:string, comment:null), FieldSchema(name:lang, type:string, comment:null), FieldSchema(name:source, type:string, comment:null), FieldSchema(name:os, type:string, comment:null), FieldSchema(name:area, type:string, comment:null), FieldSchema(name:model, type:string, comment:null), FieldSchema(name:brand, type:string, comment:null), FieldSchema(name:sdk_version, type:string, comment:null), FieldSchema(name:gmail, type:string, comment:null), FieldSchema(name:height_width, type:string, comment:null), FieldSchema(name:app_time, type:string, comment:null), FieldSchema(name:network, type:string, comment:null), FieldSchema(name:lng, type:string, comment:null), FieldSchema(name:lat, type:string, comment:null), FieldSchema(name:entry, type:string, comment:null), FieldSchema(name:open_ad_type, type:string, comment:null), FieldSchema(name:action, type:string, comment:null), FieldSchema(name:loading_time, type:string, comment:null), FieldSchema(name:detail, type:string, comment:null), FieldSchema(name:extend1, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20201124203254_3e88639f-e439-40d3-bd7b-bbbb2159fb3b); Time taken: 1.615 seconds
INFO : Executing command(queryId=hive_20201124203254_3e88639f-e439-40d3-bd7b-bbbb2159fb3b): insert overwrite table gmall.dwd_start_log PARTITION (dt='2020-11-24') select get_json_object(line,'$.mid') mid_id, get_json_object(line,'$.uid') user_id, get_json_object(line,'$.vc') version_code, get_json_object(line,'$.vn') version_name, get_json_object(line,'$.l') lang, get_json_object(line,'$.sr') source, get_json_object(line,'$.os') os, get_json_object(line,'$.ar') area, get_json_object(line,'$.md') model, get_json_object(line,'$.ba') brand, get_json_object(line,'$.sv') sdk_version, get_json_object(line,'$.g') gmail, get_json_object(line,'$.hw') height_width, get_json_object(line,'$.t') app_time, get_json_object(line,'$.nw') network, get_json_object(line,'$.ln') lng, get_json_object(line,'$.la') lat, get_json_object(line,'$.entry') entry, get_json_object(line,'$.open_ad_type') open_ad_type, get_json_object(line,'$.action') action, get_json_object(line,'$.loading_time') loading_time, get_json_object(line,'$.detail') detail, get_json_object(line,'$.extend1') extend1 from gmall.ods_start_log where dt='2020-11-24'
INFO : Query ID = hive_20201124203254_3e88639f-e439-40d3-bd7b-bbbb2159fb3b
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20201124203254_3e88639f-e439-40d3-bd7b-bbbb2159fb3b
INFO : Session is already open
INFO : Dag name: insert overwrite table gma...dt='2020-11-24' (Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO : Session re-established.
INFO : Status: Running (Executing on YARN cluster with App id application_1606125802436_0004)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......      container     SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 10.58 s
----------------------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 9.26 seconds
INFO :
INFO : Query Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : OPERATION                            DURATION
INFO : ----------------------------------------------------------------------------------------------
INFO : Compile Query                           1.61s
INFO : Prepare Plan                            0.16s
INFO : Get Query Coordinator (AM)              0.03s
INFO : Submit Plan                             4.28s
INFO : Start DAG                               1.36s
INFO : Run DAG                                 9.26s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO :   VERTICES      DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO :      Map 1           5540.00         11,310           128             502                1
INFO :  Reducer 2            177.00          1,490            12               1                0
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : org.apache.tez.common.counters.DAGCounter:
INFO :    NUM_SUCCEEDED_TASKS: 3
INFO :    TOTAL_LAUNCHED_TASKS: 3
INFO :    DATA_LOCAL_TASKS: 1
INFO :    AM_CPU_MILLISECONDS: 3270
INFO :    AM_GC_TIME_MILLIS: 19
INFO : File System Counters:
INFO :    FILE_BYTES_READ: 3722
INFO :    FILE_BYTES_WRITTEN: 3610
INFO :    HDFS_BYTES_READ: 48652
INFO :    HDFS_BYTES_WRITTEN: 73761
INFO :    HDFS_READ_OPS: 8
INFO :    HDFS_WRITE_OPS: 5
INFO :    HDFS_OP_CREATE: 3
INFO :    HDFS_OP_GET_FILE_STATUS: 7
INFO :    HDFS_OP_OPEN: 1
INFO :    HDFS_OP_RENAME: 2
INFO : org.apache.tez.common.counters.TaskCounter:
INFO :    REDUCE_INPUT_GROUPS: 1
INFO :    REDUCE_INPUT_RECORDS: 1
INFO :    COMBINE_INPUT_RECORDS: 0
INFO :    SPILLED_RECORDS: 2
INFO :    NUM_SHUFFLED_INPUTS: 1
INFO :    NUM_SKIPPED_INPUTS: 1
INFO :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO :    MERGED_MAP_OUTPUTS: 1
INFO :    GC_TIME_MILLIS: 140
INFO :    TASK_DURATION_MILLIS: 5902
INFO :    CPU_MILLISECONDS: 12800
INFO :    PHYSICAL_MEMORY_BYTES: 1645215744
INFO :    VIRTUAL_MEMORY_BYTES: 16442019840
INFO :    COMMITTED_HEAP_BYTES: 1645215744
INFO :    INPUT_RECORDS_PROCESSED: 502
INFO :    INPUT_SPLIT_LENGTH_BYTES: 48652
INFO :    OUTPUT_RECORDS: 1
INFO :    OUTPUT_LARGE_RECORDS: 0
INFO :    OUTPUT_BYTES: 9238
INFO :    OUTPUT_BYTES_WITH_OVERHEAD: 9248
INFO :    OUTPUT_BYTES_PHYSICAL: 3554
INFO :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO :    ADDITIONAL_SPILLS_BYTES_READ: 3554
INFO :    ADDITIONAL_SPILL_COUNT: 0
INFO :    SHUFFLE_CHUNK_COUNT: 1
INFO :    SHUFFLE_BYTES: 3554
INFO :    SHUFFLE_BYTES_DECOMPRESSED: 9248
INFO :    SHUFFLE_BYTES_TO_MEM: 0
INFO :    SHUFFLE_BYTES_TO_DISK: 0
INFO :    SHUFFLE_BYTES_DISK_DIRECT: 3554
INFO :    NUM_MEM_TO_DISK_MERGES: 0
INFO :    NUM_DISK_TO_DISK_MERGES: 0
INFO :    SHUFFLE_PHASE_TIME: 72
INFO :    MERGE_PHASE_TIME: 89
INFO :    FIRST_EVENT_RECEIVED: 49
INFO :    LAST_EVENT_RECEIVED: 49
INFO : HIVE:
INFO :    CREATED_FILES: 2
INFO :    DESERIALIZE_ERRORS: 0
INFO :    RECORDS_IN_Map_1: 502
INFO :    RECORDS_OUT_0: 1
INFO :    RECORDS_OUT_1_gmall.dwd_start_log: 502
INFO :    RECORDS_OUT_INTERMEDIATE_Map_1: 1
INFO :    RECORDS_OUT_INTERMEDIATE_Reducer_2: 0
INFO :    RECORDS_OUT_OPERATOR_FS_10: 1
INFO :    RECORDS_OUT_OPERATOR_FS_3: 502
INFO :    RECORDS_OUT_OPERATOR_GBY_6: 1
INFO :    RECORDS_OUT_OPERATOR_GBY_8: 1
INFO :    RECORDS_OUT_OPERATOR_MAP_0: 0
INFO :    RECORDS_OUT_OPERATOR_RS_7: 1
INFO :    RECORDS_OUT_OPERATOR_SEL_2: 502
INFO :    RECORDS_OUT_OPERATOR_SEL_5: 502
INFO :    RECORDS_OUT_OPERATOR_SEL_9: 1
INFO :    RECORDS_OUT_OPERATOR_TS_0: 502
INFO :    TOTAL_TABLE_ROWS_WRITTEN: 502
INFO : Shuffle Errors:
INFO :    BAD_ID: 0
INFO :    CONNECTION: 0
INFO :    IO_ERROR: 0
INFO :    WRONG_LENGTH: 0
INFO :    WRONG_MAP: 0
INFO :    WRONG_REDUCE: 0
INFO : Shuffle Errors_Reducer_2_INPUT_Map_1:
INFO :    BAD_ID: 0
INFO :    CONNECTION: 0
INFO :    IO_ERROR: 0
INFO :    WRONG_LENGTH: 0
INFO :    WRONG_MAP: 0
INFO :    WRONG_REDUCE: 0
INFO : TaskCounter_Map_1_INPUT_ods_start_log:
INFO :    INPUT_RECORDS_PROCESSED: 502
INFO :    INPUT_SPLIT_LENGTH_BYTES: 48652
INFO : TaskCounter_Map_1_OUTPUT_Reducer_2:
INFO :    ADDITIONAL_SPILLS_BYTES_READ: 0
INFO :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO :    ADDITIONAL_SPILL_COUNT: 0
INFO :    OUTPUT_BYTES: 9238
INFO :    OUTPUT_BYTES_PHYSICAL: 3554
INFO :    OUTPUT_BYTES_WITH_OVERHEAD: 9248
INFO :    OUTPUT_LARGE_RECORDS: 0
INFO :    OUTPUT_RECORDS: 1
INFO :    SHUFFLE_CHUNK_COUNT: 1
INFO :    SPILLED_RECORDS: 1
INFO : TaskCounter_Reducer_2_INPUT_Map_1:
INFO :    ADDITIONAL_SPILLS_BYTES_READ: 3554
INFO :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO :    COMBINE_INPUT_RECORDS: 0
INFO :    FIRST_EVENT_RECEIVED: 49
INFO :    LAST_EVENT_RECEIVED: 49
INFO :    MERGED_MAP_OUTPUTS: 1
INFO :    MERGE_PHASE_TIME: 89
INFO :    NUM_DISK_TO_DISK_MERGES: 0
INFO :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO :    NUM_MEM_TO_DISK_MERGES: 0
INFO :    NUM_SHUFFLED_INPUTS: 1
INFO :    NUM_SKIPPED_INPUTS: 1
INFO :    REDUCE_INPUT_GROUPS: 1
INFO :    REDUCE_INPUT_RECORDS: 1
INFO :    SHUFFLE_BYTES: 3554
INFO :    SHUFFLE_BYTES_DECOMPRESSED: 9248
INFO :    SHUFFLE_BYTES_DISK_DIRECT: 3554
INFO :    SHUFFLE_BYTES_TO_DISK: 0
INFO :    SHUFFLE_BYTES_TO_MEM: 0
INFO :    SHUFFLE_PHASE_TIME: 72
INFO :    SPILLED_RECORDS: 1
INFO : TaskCounter_Reducer_2_OUTPUT_out_Reducer_2:
INFO :    OUTPUT_RECORDS: 0
INFO : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO :    GROUPED_INPUT_SPLITS_Map_1: 1
INFO :    INPUT_DIRECTORIES_Map_1: 1
INFO :    INPUT_FILES_Map_1: 1
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......      container     SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 10.61 s
----------------------------------------------------------------------------------------------
502 rows affected (18.059 seconds)
Checking dwd_start_log in Hue, we can see that the data has been generated.
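The same spot check can be done from Beeline. Since the DWD columns are already parsed out of the JSON, you can select a few of them directly (the values returned will of course depend on your generated data):

select mid_id, user_id, version_code, network, app_time
from gmall.dwd_start_log
where dt='2020-11-24'
limit 5;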
DWS Layer (daily active users)
Analysis of daily active devices
Goal: produce a detail record for every device that was active in the current day, week, and month.
Create a new table, dws_uv_detail_day
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day
(
    `mid_id`       string COMMENT 'unique device id',
    `user_id`      string COMMENT 'user id',
    `version_code` string COMMENT 'app version code',
    `version_name` string COMMENT 'app version name',
    `lang`         string COMMENT 'system language',
    `source`       string COMMENT 'channel',
    `os`           string COMMENT 'Android version',
    `area`         string COMMENT 'region',
    `model`        string COMMENT 'phone model',
    `brand`        string COMMENT 'phone brand',
    `sdk_version`  string COMMENT 'sdkVersion',
    `gmail`        string COMMENT 'gmail',
    `height_width` string COMMENT 'screen height and width',
    `app_time`     string COMMENT 'time the client log was generated',
    `network`      string COMMENT 'network type',
    `lng`          string COMMENT 'longitude',
    `lat`          string COMMENT 'latitude'
)
partitioned by (dt string)
stored as parquet;
Create a new script, dws_log.sh:
[root@cdh2 25k]# cat dws_log.sh
#!/bin/bash
# Define a variable to make later changes easier
APP=gmall
# If a date is supplied as the first argument, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang)) lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area,
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;
"
hive -e "$sql"
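The group by mid_id together with collect_set/concat_ws is what collapses a device's multiple start-log records for the day into a single row: collect_set keeps each distinct value of a column once, and concat_ws('|', ...) joins whatever distinct values remain. A small self-contained illustration with hypothetical literal data (not read from the warehouse):

-- two records from the same device collapse into one row per mid_id
select mid_id,
       concat_ws('|', collect_set(user_id)) as user_id
from (
    select 'mid_1' as mid_id, '101' as user_id
    union all
    select 'mid_1' as mid_id, '101' as user_id
    union all
    select 'mid_2' as mid_id, '202' as user_id
) t
group by mid_id;
-- expected result: mid_1 -> 101, mid_2 -> 202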
Change the execute permission:
chmod 777 dws_log.sh
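After that, the DWS load can be run for a given day just like the DWD script (dws_log.sh 2020-11-24) and the result sanity-checked with a quick count. Since the query groups by mid_id, each active device should appear exactly once in the day partition; the actual numbers depend on your data:

-- run first:  dws_log.sh 2020-11-24
select count(*)               as total_rows,
       count(distinct mid_id) as distinct_devices   -- the two values should match
from gmall.dws_uv_detail_day
where dt='2020-11-24';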