适用于
Oracle 10G或更高版本的单机,本文适用于任何平台。
详情
- Active Session: on the CPU or was waiting for an event that didn't belong to the Idle wait class
- 在Oracle 10G中,我们引入了AWR和ASH采样机制,有一个视图gv$active_session_history会每秒钟将数据库所有节点的Active Session采样一次,
- 基于这个特征,我们可以通过分析v$active_session_history的Session采样情况,来定位问题发生的准确时间范围,并且可以观察每个采样点的top event和top holder。下面通过一个例子来详细说明。
v$active_session_history版
1. Dump出问题期间的ASH数据:
为了不影响生产系统,我们可以将问题大概期间的ASH数据export出来在测试机上分析。
基于gv$active_session_history创建一个新表t_ash,然后将其通过exp/imp导入到测试机。在发生问题的数据库上执行exp:
SQL> conn user/passwd
SQL> create table m_ash_20220416 as select * from v$ACTIVE_SESSION_HISTORY
where sample_time >
to_timestamp('2022-04-16 15:30:00','yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2022-04-16 16:30:00','yyyy-mm-dd hh24:mi:ss');
$ exp user/passwd file=m_ash_20220416.dmp tables=(t_ash) log=t_ash.exp.log
然后导入到测试机:
$ imp user/passwd file=m_ash_20220416.dmp log=m_ash_20220416.imp.log
2. 验证导出的ASH时间范围:
建议采用Oracle SQL Developer来查询以防止输出结果折行不便于观察。
set line 200 pages 1000
col sample_time for a25
col event for a40
alter session set nls_timestamp_format='yyyy-mm-dd hh24:mi:ss.ff';
select
min(sample_time), max(sample_time), count(*) session_count
from m_ash_20220416 t;
MIN(SAMPLE_TIME) MAX(SAMPLE_TIME) SESSION_COUNT
-------------------------------------------------- -------------------------------------------------- -------------
2022/4/16 16:01:14.687 2022/4/16 16:29:59.501 125301
从以上输出可知该数据库采样时间共半小时。
3. 确认问题发生的精确时间范围:
参考如下脚本:
col event for a35
set lines 222
set pages 9999
select
sample_id, sample_time, count(*) session_count
from m_ash_20220416 t
group by sample_id, sample_time
order by sample_time;
SAMPLE_ID SAMPLE_TIME SESSION_COUNT
---------- -------------------------------------------------- -------------
200574254 2022/4/16 16:01:14.687 1087
200574255 2022/4/16 16:01:15.714 1090
200574256 2022/4/16 16:01:16.743 1076
200574257 2022/4/16 16:01:17.769 1080
200574258 2022/4/16 16:01:19.937 1087
200574259 2022/4/16 16:01:20.961 1092
200574260 2022/4/16 16:01:21.989 1093
200574261 2022/4/16 16:01:23.018 1093
200574262 2022/4/16 16:01:24.051 1097
200574263 2022/4/16 16:01:25.076 1100
200574264 2022/4/16 16:01:26.102 1101
200574265 2022/4/16 16:01:27.129 1064
200574266 2022/4/16 16:01:28.155 1067
200574267 2022/4/16 16:01:29.189 1071
200574268 2022/4/16 16:01:30.214 1075
...
200574374 2022/4/16 16:03:21.307 255
200574375 2022/4/16 16:03:22.319 287
200574376 2022/4/16 16:03:23.332 318
200574377 2022/4/16 16:03:24.352 198
200574378 2022/4/16 16:03:25.364 216
200574379 2022/4/16 16:03:26.378 40
200574380 2022/4/16 16:03:27.389 22
200574381 2022/4/16 16:03:28.402 25
200574382 2022/4/16 16:03:29.432 27
200574383 2022/4/16 16:03:30.445 25
注意观察以上输出的每个采样点的active session的数量,数量突然变多往往意味着问题发生了。从以上输出可以确定问题发生的精确时间在2022/4/16 16:01:14 ~ 2022/4/16 16:03:25,问题持续了大约2分钟。
注意: 观察以上的输出有无断档,比如某些时间没有采样。
4. 确定每个采样点的top n event:
在这里我们指定的是top 2 event 并且注掉了采样时间以观察所有采样点的情况。如果数据量较多,您也可以通过开启sample_time的注释来观察某个时间段的情况。
注意最后一列session_count指的是该采样点上的等待该event的session数量。
col event for a35
set lines 1000
set pages 9999
select
t.sample_id,
t.sample_time,
t.event,
t.session_state,
t.c session_count
from (select t.*,
rank() over(partition by sample_time order by c desc) r
from (select /*+ parallel 2 */
t.*,
count(*) over(partition by sample_time, event) c,
row_number() over(partition by sample_time, event order by 1) r1
from m_ash_20220416 t
/*where sample_time >
to_timestamp('2022-04-16 13:59:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2022-04-16 14:10:00',
'yyyy-mm-dd hh24:mi:ss')*/
) t
where r1 = 1) t
where r < 3
order by sample_time, r;
SAMPLE_ID SAMPLE_TIME EVENT SESSION_STATE SESSION_COUNT
---------- -------------------------------------------------- ----------------------------------- ------------- -------------
200574254 2022/4/16 16:01:14.687 buffer busy waits WAITING 870
200574254 2022/4/16 16:01:14.687 log file sync WAITING 143
200574255 2022/4/16 16:01:15.714 buffer busy waits WAITING 870
200574255 2022/4/16 16:01:15.714 log file sync WAITING 146
200574256 2022/4/16 16:01:16.743 buffer busy waits WAITING 869
200574256 2022/4/16 16:01:16.743 log file sync WAITING 133
200574257 2022/4/16 16:01:17.769 buffer busy waits WAITING 869
200574257 2022/4/16 16:01:17.769 log file sync WAITING 136
200574258 2022/4/16 16:01:19.937 buffer busy waits WAITING 869
200574258 2022/4/16 16:01:19.937 log file sync WAITING 144
200574259 2022/4/16 16:01:20.961 buffer busy waits WAITING 868
200574259 2022/4/16 16:01:20.961 log file sync WAITING 148
200574260 2022/4/16 16:01:21.989 buffer busy waits WAITING 868
200574260 2022/4/16 16:01:21.989 log file sync WAITING 150
200574261 2022/4/16 16:01:23.018 buffer busy waits WAITING 869
200574261 2022/4/16 16:01:23.018 log file sync WAITING 152
200574262 2022/4/16 16:01:24.051 buffer busy waits WAITING 869
200574262 2022/4/16 16:01:24.051 log file sync WAITING 156
200574263 2022/4/16 16:01:25.076 buffer busy waits WAITING 870
200574263 2022/4/16 16:01:25.076 log file sync WAITING 158
200574264 2022/4/16 16:01:26.102 buffer busy waits WAITING 869
...
200574396 2022/4/16 16:03:43.716 direct path read WAITING 1
200574396 2022/4/16 16:03:43.716 log file sync WAITING 1
200574397 2022/4/16 16:03:44.730 ON CPU 14
200574397 2022/4/16 16:03:44.730 db file parallel write WAITING 2
200574398 2022/4/16 16:03:45.744 ON CPU 11
200574398 2022/4/16 16:03:45.744 db file parallel write WAITING 3
200574399 2022/4/16 16:03:46.754 ON CPU 11
200574399 2022/4/16 16:03:46.754 Backup: MML create a backup piece WAITING 1
200574399 2022/4/16 16:03:46.754 Disk file operations I/O WAITING 1
200574399 2022/4/16 16:03:46.754 db file sequential read WAITING 1
从以上输出我们可以发现问题期间最严重的等待为buffer busy waits ,高峰期等待该event的session数达到了870个,其次为log file sync,高峰期session为490个。
***注意:
1) 再次确认以上输出有无断档,是否有某些时间没有采样。
2) 注意那些session_state为ON CPU的输出,比较ON CPU的进程个数与您的OS物理CPU的个数,如果接近或者超过物理CPU个数,那么您还需要检查OS当时的CPU资源状况,
比如OSWatcher/NMON等工具,高的CPU Run Queue可能引发该问题,当然也可能是问题的结果,需要结合OSWatcher和ASH的时间顺序来验证。
5. 观察每个采样点的等待链:(更详细显示参考:附1)
其原理为通过 $active_session_history. blocking_session记录的holder来通过connect by级联查询,找出最终的holder.
该输出中isleaf=1的都是最终holder,而iscycle=1的代表死锁了(也就是在同一个采样点中a等待b,b等待c,而c又等待a,这种情况如果持续发生,那么尤其值得关注)。采用如下查询能观察到阻塞链。
col event for a30
col SAMPLE_TIME for a30
set lines 322
set pages 9999
select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.sample_id,
t.sample_time,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.session_state,
t.blocking_session,
t.blocking_session_status
from m_ash_20220416 t
start with blocking_session is not null
connect by nocycle
prior sample_id = sample_id
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by sample_id;
LV ISLEAF ISCYCLE SAMPLE_ID SAMPLE_TIME SESSION_ID SQL_ID SESSION_TYPE EVENT SESSION_STATE BLOCKING_SESSION BLOCKING_SESSION_STATUS
---------- ---------- ---------- ---------- ------------------------------ ---------- ------------- ------------ ------------------------------ ------------- ---------------- -----------------------
1 0 0 200574254 2022/4/16 16:01:14.687 14306 dxxxh350mfyvj FOREGROUND enq: US - contention WAITING 66 VALID
2 0 0 200574254 2022/4/16 16:01:14.687 66 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 4670 VALID
3 0 0 200574254 2022/4/16 16:01:14.687 4670 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 6082 VALID
4 0 0 200574254 2022/4/16 16:01:14.687 6082 dxxxh350mfyvj FOREGROUND log file sync WAITING 10501 VALID
5 1 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write WAITING NO HOLDER
1 0 0 200574254 2022/4/16 16:01:14.687 3829 dxxxh350mfyvj FOREGROUND enq: US - contention WAITING 772 VALID
2 0 0 200574254 2022/4/16 16:01:14.687 772 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 4670 VALID
3 0 0 200574254 2022/4/16 16:01:14.687 4670 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 6082 VALID
4 0 0 200574254 2022/4/16 16:01:14.687 6082 dxxxh350mfyvj FOREGROUND log file sync WAITING 10501 VALID
5 1 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write WAITING NO HOLDER
1 0 0 200574254 2022/4/16 16:01:14.687 9159 dxxxh350mfyvj FOREGROUND enq: US - contention WAITING 922 VALID
2 0 0 200574254 2022/4/16 16:01:14.687 922 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 4670 VALID
3 0 0 200574254 2022/4/16 16:01:14.687 4670 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 6082 VALID
4 0 0 200574254 2022/4/16 16:01:14.687 6082 dxxxh350mfyvj FOREGROUND log file sync WAITING 10501 VALID
5 1 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write WAITING NO HOLDER
1 0 0 200574254 2022/4/16 16:01:14.687 100 dxxxh350mfyvj FOREGROUND enq: US - contention WAITING 1558 VALID
2 0 0 200574254 2022/4/16 16:01:14.687 1558 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 4670 VALID
3 0 0 200574254 2022/4/16 16:01:14.687 4670 dxxxh350mfyvj FOREGROUND buffer busy waits WAITING 6082 VALID
4 0 0 200574254 2022/4/16 16:01:14.687 6082 dxxxh350mfyvj FOREGROUND log file sync WAITING 10501 VALID
5 1 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write WAITING NO HOLDER
1 0 0 200574254 2022/4/16 16:01:14.687 2404 2pa3dfjgqarkg FOREGROUND buffer busy waits WAITING 1624 VALID
2 0 0 200574254 2022/4/16 16:01:14.687 1624 2pa3dfjgqarkg FOREGROUND log file sync WAITING 10501 VALID
3 1 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write WAITING NO HOLDER
***注意:
从上面的输出可见,在相同的采样点上(2022/4/16 16:01:14), 大部分session 在等待buffer busy waits和log file sync,其最终都被 session 10501 阻塞,
而 session 10501 又在等待log file parallel write,因此这里buffer busy waits只是一个表面现象,
问题的原因在于log file parallel write
6. 基于第5步的原理来找出每个采样点的最终top holder:
比如如下SQL列出了每个采样点top 2的blocker session,并且计算了其最终阻塞的session数(参考blocking_session_count)
col event for a30
col SAMPLE_TIME for a30
set lines 322
set pages 9999
select t.lv,
t.iscycle,
t.sample_id,
t.sample_time,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.seq#,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status,
t.c blocking_session_count
from (select t.*,
row_number() over(partition by sample_time order by c desc) r
from (select t.*,
count(*) over(partition by sample_time, session_id) c,
row_number() over(partition by sample_time, session_id order by 1) r1
from (select
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.*
from m_ash_20220416 t
/*where sample_time >
to_timestamp('2013-11-17 13:55:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2013-11-17 14:10:00',
'yyyy-mm-dd hh24:mi:ss')*/
start with blocking_session is not null
connect by nocycle
prior sample_time = sample_time
and prior blocking_session = session_id
and prior
blocking_session_serial# = session_serial#) t
where t.isleaf = 1) t
where r1 = 1) t
where r < 3
order by sample_time, r;
LV ISCYCLE SAMPLE_ID SAMPLE_TIME SESSION_ID SQL_ID SESSION_TYPE EVENT SEQ# SESSION_STATE BLOCKING_INST_ID BLOCKING_SESSION BLOCKING_SESSION_STATUS BLOCKING_SESSION_COUNT
---------- ---------- ---------- ------------------------------ ---------- ------------- ------------ ------------------------------ ---------- ------------- ---------------- ---------------- ----------------------- ----------------------
4 0 200574254 2022/4/16 16:01:14.687 10501 BACKGROUND log file parallel write 43500 WAITING NO HOLDER 1071
4 0 200574255 2022/4/16 16:01:15.714 10501 BACKGROUND log file parallel write 43500 WAITING NO HOLDER 1074
3 0 200574256 2022/4/16 16:01:16.743 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1063
4 0 200574257 2022/4/16 16:01:17.769 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1066
2 0 200574258 2022/4/16 16:01:19.937 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1073
4 0 200574259 2022/4/16 16:01:20.961 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1075
4 0 200574260 2022/4/16 16:01:21.989 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1077
2 0 200574261 2022/4/16 16:01:23.018 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1079
2 0 200574262 2022/4/16 16:01:24.051 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1083
2 0 200574263 2022/4/16 16:01:25.076 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1086
5 0 200574264 2022/4/16 16:01:26.102 10501 BACKGROUND log file parallel write 43501 WAITING NO HOLDER 1087
...
2 0 200575893 2022/4/16 16:29:51.395 10501 BACKGROUND log file parallel write 13558 WAITING NO HOLDER 2
2 0 200575894 2022/4/16 16:29:52.403 10501 BACKGROUND log file parallel write 13777 WAITING NO HOLDER 2
2 0 200575896 2022/4/16 16:29:54.431 10501 BACKGROUND log file parallel write 14150 WAITING NO HOLDER 5
1 0 200575898 2022/4/16 16:29:56.455 18135 FOREGROUND log file sync 32163 WAITING 1 10501 VALID 1
2 0 200575900 2022/4/16 16:29:58.476 10501 BACKGROUND log file parallel write 15150 WAITING NO HOLDER 1
已选择 923 行。
从以上输出,比如第一行,代表在2022/4/16 16:01:14,节点1的session 10501 最终阻塞了1071个session. 顺着时间往下看,可见session 10501 是问题期间最严重的holder,
它在每个采样点都阻塞了1000多个session,并且它持续等待log file parallel write,
注意观察其seq#您会发现该event的seq#在16:01分有几秒没变化,表明session hang住了,过了会才不断变化,表明该session又不完全hang住,
由于10501是redo写日志后台进程,并等待事件为具体的log file parallel write,这很可能是在问题时间段io有问题造成整个数据库无法响应。
因此可以结合oswatch或系统层面的io监控输出进一步确定问题。
**注意:
1) blocking_session_count 指某一个holder最终阻塞的session数,比如 a <- b<- c (a被b阻塞,b又被c阻塞),只计算c阻塞了1个session,因为中间的b可能在不同的阻塞链中发生重复。
2) 如果最终的holder没有被ash采样(一般因为该holder处于空闲),比如 a<- c 并且b<- c (a被c阻塞,并且b也被c阻塞),但是c没有采样,那么以上脚本无法将c统计到最终holder里,这可能会导致一些遗漏。
3) 注意比较blocking_session_count的数量与第3步查询的每个采样点的总session_count数,如果每个采样点的blocking_session_count数远小于总session_count数,那表明大部分session并未记载holder,因此本查询的结果并不能代表什么。
4) 在Oracle 10g中,ASH并没有blocking_inst_id列,在以上所有的脚本中,您只需要去掉该列即可,因此10g的ASH一般只能用于诊断单节点的问题。
附1:根据等待事件查看等待链
select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.sample_id,
t.sample_time,
t.session_id,
t.session_serial#,
t.session_type,
t.session_state,
t.SQL_OPNAME,
t.sql_id,
T.SQL_PLAN_HASH_VALUE,
t.SQL_PLAN_LINE_ID,
T.SQL_CHILD_NUMBER,
T.SQL_EXEC_ID,
T.SQL_EXEC_START,
T.IN_BIND,
t.IN_PARSE,
t.IN_HARD_PARSE,
t.IN_SQL_EXECUTION,
t.IN_PLSQL_COMPILATION,
t.IN_PLSQL_EXECUTION,
t.IN_SEQUENCE_LOAD,
T.IN_PLSQL_RPC,
t.PROGRAM,
t.MACHINE,
t.PGA_ALLOCATED,
t.TEMP_SPACE_ALLOCATED,
t.DELTA_READ_IO_BYTES,
t.DELTA_WRITE_IO_BYTES,
t.TM_DELTA_CPU_TIME,
t.event,
t.blocking_session,
t.blocking_session_serial#,
t.blocking_session_status,
t.WAIT_TIME,
t.TIME_WAITED,
t.CURRENT_OBJ#,
t.CURRENT_ROW#,
t.xid
from m_ash_20220416 t
start with event = 'buffer busy waits'
--start with blocking_session is not null
connect by nocycle prior blocking_session = session_id and prior blocking_session_serial# = session_serial#
and prior sample_time = sample_time order siblings by sample_time;
附2: 查看某个具体session的采样等待信息
select /*+ parallel 4 */
t.sample_time,
t.session_id,
t.session_serial#,
t.session_type,
t.session_state,
t.SQL_OPNAME,
t.sql_id,
T.SQL_PLAN_HASH_VALUE,
t.SQL_PLAN_LINE_ID,
T.SQL_CHILD_NUMBER,
T.SQL_EXEC_ID,
T.SQL_EXEC_START,
T.IN_BIND,
t.IN_PARSE,
t.IN_HARD_PARSE,
t.IN_SQL_EXECUTION,
t.IN_PLSQL_COMPILATION,
t.IN_PLSQL_EXECUTION,
t.IN_SEQUENCE_LOAD,
T.IN_PLSQL_RPC,
t.PROGRAM,
t.MACHINE,
t.PGA_ALLOCATED,
t.TEMP_SPACE_ALLOCATED,
t.DELTA_READ_IO_BYTES,
t.DELTA_WRITE_IO_BYTES,
t.TM_DELTA_CPU_TIME,
t.event,
t.blocking_session,
t.blocking_session_serial#,
t.blocking_session_status,
t.WAIT_TIME,
t.TIME_WAITED,
t.CURRENT_OBJ#,
t.CURRENT_ROW#,
t.xid
from m_ash_20210106 t
where session_id = 12989
order by sample_time;
附3:查看Sql文本
select * from dba_hist_sqltext where sql_id='bxgrjr8wa7g7x';
附4:查看sql执行详细信息参考文档:
分析sql历史执行信息.md