Environment
Elasticsearch version: 7.10.0
Self-hosted Elasticsearch cluster with three data nodes
Background
A new Monday, and the routine look at the logs in Kibana showed no output. Widening the time range to the last seven days, the most recent entries were from two days earlier, meaning no logs at all for the weekend (Saturday and Sunday). A fine company may skip weekends, but the scheduled jobs certainly don't 😂, so something was clearly wrong.
Investigation
Routine health checks
# Cluster health status
GET _cluster/health
# Cluster stats
GET _cluster/stats
# Node stats
GET _nodes/stats
Nothing abnormal turned up.
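In hindsight, it also helps to confirm the symptom directly by checking whether new daily log indices are still being created. A minimal check with the cat indices API:
# List indices sorted by creation date, newest first
GET _cat/indices?v&h=index,creation.date.string,pri,rep,docs.count&s=creation.date:desc
If the newest index is from two days ago, index creation has been failing since then.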
Checking the server logs
Since this is a self-hosted cluster and the cluster itself looked healthy, the next stop was the server logs. The entry is long, but the core of it is a single line.
{"type": "server", "timestamp": "2022-07-18T09:21:09,290Z", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "devops", "node.name": "elasticsearch-0", "message": "unexpected error while indexing monitoring document", "cluster.uuid": "y4HAM9yUQF2St0Hu40t6pw", "node.id": "TQwaqTHeRTWVXFJx6A7u7g" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.common.ValidationException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [3000]/[3000] maximum shards open;",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.10.1.jar:7.10.1]",
"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]",
"at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) ~[?:?]",
"at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:?]",
"at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]",
"at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]",
"at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]",
"at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:126) [x-pack-monitoring-7.10.1.jar:7.10.1]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:108) [x-pack-monitoring-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.ActionListener$6.onResponse(ActionListener.java:282) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:533) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:679) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.bulk.TransportBulkAction$1$2.doRun(TransportBulkAction.java:302) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.1.jar:7.10.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]",
"Caused by: org.elasticsearch.common.ValidationException: Validation Failed: 1: this action would add [1] total shards, but this cluster currently has [3000]/[3000] maximum shards open;",
"at org.elasticsearch.indices.ShardLimitValidator.validateShardLimit(ShardLimitValidator.java:80) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.aggregateIndexSettings(MetadataCreateIndexService.java:765) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequestWithV1Templates(MetadataCreateIndexService.java:489) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:370) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:377) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$1.execute(AutoCreateAction.java:137) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.10.1.jar:7.10.1]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.10.1.jar:7.10.1]",
"... 3 more"] }
this action would add [1] total shards, but this cluster currently has [3000]/[3000] maximum shards open;
This is the key line: some action wants to add one more shard, but the cluster has already reached its maximum of 3000 open shards.
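To confirm, the current shard count can be compared against the effective limit (filter_path just trims the responses):
# Active primary and replica shards in the cluster
GET _cluster/health?filter_path=active_shards,active_primary_shards
# Effective cluster.max_shards_per_node, including the default value
GET _cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node
With three data nodes and the default of 1000 shards per node, the cluster-wide cap is exactly the 3000 seen in the error.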
The fix
PUT /_cluster/settings
{
  "transient": {
    "cluster": {
      "max_shards_per_node": 2000
    }
  }
}
Once the cause is known, the fix is straightforward: raise the limit as a first measure. This sets the maximum number of shards per node to 2000; note that the setting is per node. With the three data nodes in this environment, the cluster-wide shard limit is now 6000.
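Note that transient settings are lost after a full cluster restart; to keep the higher limit, the same setting can also be applied persistently (flat-key form shown here):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}
Raising the limit is only a stopgap; the longer-term fix is keeping the shard count down, which is what the lifecycle policy below is for.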
What the Elasticsearch documentation says
Default shard limit
For the 7.10 version in use, the documentation says each node allows a maximum of 1000 shards by default, and the cluster-wide limit is calculated as the per-node maximum * the number of data nodes:
cluster.max_shards_per_node
(Dynamic) Limits the total number of primary and replica shards for the cluster. Elasticsearch calculates the limit as follows:
cluster.max_shards_per_node * number of data nodes
Shards for closed indices do not count toward this limit. Defaults to 1000. A cluster with no data nodes is unlimited.
Elasticsearch rejects any request that creates more shards than this limit allows. For example, a cluster with a cluster.max_shards_per_node setting of 100 and three data nodes has a shard limit of 300. If the cluster already contains 296 shards, Elasticsearch rejects any request that adds five or more shards to the cluster.
This setting does not limit shards for individual nodes. To limit the number of shards for each node, use the cluster.routing.allocation.total_shards_per_node setting.
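The last sentence is easy to confuse: cluster.max_shards_per_node caps the total for the whole cluster, while cluster.routing.allocation.total_shards_per_node caps how many shards may sit on any single node. A sketch of the latter (the value 400 is purely illustrative; its default of -1 means unlimited):
# Per-node allocation cap, independent of the cluster-wide shard limit
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.total_shards_per_node": 400
  }
}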
Elasticsearch index lifecycle
Elasticsearch defines an index lifecycle with five phases, managed by Index Lifecycle Management (ILM); a sketch for checking an index's current phase follows the list.
- Hot: the index is active and can be updated (writes, updates, deletes) and queried.
- Warm: the index no longer accepts updates but can still be queried.
- Cold: the index no longer accepts updates and is queried infrequently; queries are slower.
- Frozen: the index no longer accepts updates and is rarely queried; queries are very slow.
- Delete: the index is no longer needed and can be safely deleted.
https://blog.csdn.net/gybshen/article/details/123794134
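To see which phase each index is currently in, the ILM explain API can be queried; a minimal sketch (the index pattern application-log-* is hypothetical):
# Shows the current phase, action, and step for every matching index
GET application-log-*/_ilm/explain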
Kibana bugs
Version: Kibana 7.10.0
Garbled Chinese names
If an index lifecycle policy is created with a Chinese name, the characters show up garbled when viewing the policy details; after a few refreshes it renders correctly again.
No delete in the actions field
The policy was configured so that data older than ten days enters the delete phase, but there was no actual delete action. Only policies created manually through the API include it; the one created in the Kibana UI did not.
Creating the index lifecycle policy manually
PUT _ilm/policy/application-logstash-clear-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "3d",
        "actions": { }
      },
      "delete": {
        "min_age": "10d",
        "actions": {
          "delete": { }
        }
      }
    }
  }
}
This creates a lifecycle policy named application-logstash-clear-policy for the log indices.
The lifecycle works out to:
- a newly created index is in the hot phase;
- 1 day after creation it moves to warm;
- 3 days after creation it moves to cold;
- 10 days after creation it enters the delete phase.
Looking at the policy details now, the delete phase finally shows a concrete action.
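For the policy to actually apply to new indices, it still has to be attached to them, typically through an index template. A minimal sketch, assuming daily logstash-style indices matching application-log-* (both the template name and the pattern are hypothetical):
PUT _index_template/application-log-template
{
  "index_patterns": ["application-log-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "application-logstash-clear-policy"
    }
  }
}
Since the policy has no rollover action, each min_age is measured from the index creation date, which fits daily log indices well.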