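Background
1. For our 7.x (major version) clusters we wanted to maintain a single version and settled on 7.17, so the clusters still running 7.5, within the same major version, were upgraded.
2. According to the official description, moving _id off-heap costs so little performance that it can be ignored, and the BKD tree has also been optimized.
3. Some time after the upgrade was completed, users started reporting problems.
4. A spot check of some of the upgraded clusters showed that some were affected and others were not. Memory usage improved to some degree on every cluster (as expected).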
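Investigation & Analysis
1. We found a very large number of is_deleted documents and suspected that 7.17 is overly sensitive to fragmentation. Running a force merge had little effect.
2. We ran GET _nodes/hot_threads to see where the time was going; the output was too coarse to yield any key information.
3. We added profile to the query to see where the time was going.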
GET index-v1/_search
{
  "profile": "true",
  "query": {
    "bool": {
      "filter": [
        { "term": { "xid": { "value": "11111111", "boost": 1.0 } } },
        { "terms": { "status": [2, 3, 4], "boost": 1.0 } },
        { "terms": { "platform": ["aaa", "bbb"], "boost": 1.0 } },
        { "terms": { "pId": [1, 2], "boost": 1.0 } }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "sort": [ { "time": { "order": "desc" } } ]
}
The desensitized, simplified result shows that most of the time is spent on the status and pId fields. Both are integer fields, and both are queried with terms.
{
  "took": 554,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 5, "relation": "eq" },
    "max_score": null,
    "hits": [ ... ]
  },
  "profile": {
    "shards": [
      {
        "id": "[APxxxxxxxxxxxxxxQ][index-v1][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "#xid:111111111 #status:{2 3 4} #ConstantScore(platform:aaa platform:bbb) #pId:{1 2}",
                "time_in_nanos": 415205306,
                "breakdown": { ... "build_scorer": 415028271 },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "xid:111111111",
                    "time_in_nanos": 102656,
                    "breakdown": { ..... "build_scorer": 86264 }
                  },
                  {
                    "type": "PointInSetQuery",
                    "description": "status:{2 3 4}",
                    "time_in_nanos": 220394978,
                    "breakdown": { .... "build_scorer": 220385119 }
                  },
                  {
                    "type": "ConstantScoreQuery",
                    "description": "ConstantScore(platform:aaa platform:bbb)",
                    "time_in_nanos": 341845,
                    "breakdown": { ..... "build_scorer": 282277 },
                    "children": [
                      {
                        "type": "BooleanQuery",
                        "description": "platform:aaa platform:bbb",
                        "time_in_nanos": 329042,
                        "breakdown": { ..... "build_scorer": 277752 },
                        "children": [
                          {
                            "type": "TermQuery",
                            "description": "platform:aaa",
                            "time_in_nanos": 62446,
                            "breakdown": { ..... "build_scorer": 37931 }
                          },
                          {
                            "type": "TermQuery",
                            "description": "platform:bbb",
                            "time_in_nanos": 15093,
                            "breakdown": { ..... "build_scorer": 6981 }
                          }
                        ]
                      }
                    ]
                  },
                  {
                    "type": "PointInSetQuery",
                    "description": "pId:{1 2}",
                    "time_in_nanos": 194164297,
                    "breakdown": { .... "build_scorer": 194160452 }
                  }
                ]
              }
            ],
            "rewrite_time": 40044,
            "collector": [
              {
                "name": "SimpleFieldCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 144012
              }
            ]
          }
        ]
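Reading a nested profile by eye gets tedious once there are several shards, so the ranking can be scripted. The following is a minimal sketch, not part of the original investigation: it walks the profile tree and lists the slowest leaf queries. The names leaf_nodes, slowest_leaves and sample_response are made up for illustration, and sample_response is a hand-trimmed copy of the result above that keeps only the fields the script reads.

```python
from typing import Any, Dict, Iterator, List, Tuple

Node = Dict[str, Any]

# Hand-trimmed copy of the profile above (hypothetical variable name).
# Only "type", "description", "time_in_nanos" and "children" are kept.
sample_response: Dict[str, Any] = {
    "profile": {"shards": [{"searches": [{"query": [{
        "type": "BooleanQuery",
        "description": "#xid:111111111 #status:{2 3 4} "
                       "#ConstantScore(platform:aaa platform:bbb) #pId:{1 2}",
        "time_in_nanos": 415205306,
        "children": [
            {"type": "TermQuery", "description": "xid:111111111",
             "time_in_nanos": 102656},
            {"type": "PointInSetQuery", "description": "status:{2 3 4}",
             "time_in_nanos": 220394978},
            {"type": "ConstantScoreQuery",
             "description": "ConstantScore(platform:aaa platform:bbb)",
             "time_in_nanos": 341845,
             "children": [{
                 "type": "BooleanQuery",
                 "description": "platform:aaa platform:bbb",
                 "time_in_nanos": 329042,
                 "children": [
                     {"type": "TermQuery", "description": "platform:aaa",
                      "time_in_nanos": 62446},
                     {"type": "TermQuery", "description": "platform:bbb",
                      "time_in_nanos": 15093},
                 ],
             }]},
            {"type": "PointInSetQuery", "description": "pId:{1 2}",
             "time_in_nanos": 194164297},
        ],
    }]}]}]}
}


def leaf_nodes(nodes: List[Node]) -> Iterator[Node]:
    """Yield the query nodes that have no children (the actual leaf queries)."""
    for node in nodes:
        children = node.get("children", [])
        if children:
            yield from leaf_nodes(children)
        else:
            yield node


def slowest_leaves(response: Dict[str, Any], top: int = 3) -> List[Tuple[str, str, int]]:
    """Return (type, description, time_in_nanos) for the slowest leaf queries."""
    leaves: List[Node] = []
    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            leaves.extend(leaf_nodes(search["query"]))
    # Parent nodes include their children's time, so only leaves are ranked.
    leaves.sort(key=lambda n: n["time_in_nanos"], reverse=True)
    return [(n["type"], n["description"], n["time_in_nanos"]) for n in leaves[:top]]


if __name__ == "__main__":
    for query_type, description, nanos in slowest_leaves(sample_response):
        print(f"{nanos / 1e6:10.2f} ms  {query_type:<16} {description}")
```

Only leaf nodes are ranked because each parent's time_in_nanos already includes its children. With a real multi-shard response the same leaf appears once per shard, so grouping by description before sorting may be more useful.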
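4. A single profile is not enough to explain the problem, so we investigated further and used Arthas to capture a flame graph over a period of time.
The flame graph shows that the CPU is consumed mainly by the BKD data structure.
5. For reference, a similar issue on the official forum: https://discuss.elastic.co/t/very-slow-search-performance-after-upgrade-to-7-16-1/296152/3
6. terms queries on integer fields perform poorly; the BKD optimizations in the official description appear to refer to range queries.
7. Test and verification: after changing the fields to keyword, the CPU time spent on queries returned to the normal range.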
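The write-up does not say how the fields were switched to keyword. Since the type of an existing field cannot be changed in place, one plausible route is a new index plus a reindex. The sketch below only builds the request bodies for that route and sends nothing. The index name index-v2 and the helper names are assumptions made for illustration, and a real mapping would also need the remaining fields copied from the original index.

```python
import json
from typing import Any, Dict, Iterable, List


def keyword_mapping(fields: Iterable[str]) -> Dict[str, Any]:
    """Body for PUT <new-index>: map the given fields as keyword."""
    return {"mappings": {"properties": {f: {"type": "keyword"} for f in fields}}}


def reindex_body(source_index: str, dest_index: str) -> Dict[str, Any]:
    """Body for POST _reindex: copy documents into the new index."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def keyword_terms(field: str, values: List[Any]) -> Dict[str, Any]:
    """terms clause with values written as strings, matching a keyword field."""
    return {"terms": {field: [str(v) for v in values]}}


if __name__ == "__main__":
    # "index-v2" is a hypothetical target index name.
    print("PUT index-v2")
    print(json.dumps(keyword_mapping(["status", "pId"]), indent=2))
    print("POST _reindex")
    print(json.dumps(reindex_body("index-v1", "index-v2"), indent=2))
    print(json.dumps(keyword_terms("status", [2, 3, 4])))
```

One trade-off to keep in mind: keyword suits fields that are only matched exactly, as status and pId are in the query above. A field that also needs numeric range queries is better left as a numeric type.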