While the cluster was running, all three of our nodes suddenly started logging the following:
2021-09-13 12:08:08,779 ERROR failed to req API:http://10.12.105.24:8848/nacos/v1/ns/distro/checksum
java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-33476 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    at java.lang.Thread.run(Thread.java:748)
2021-09-13 12:08:08,828 ERROR failed to req API:http://10.12.105.26:8848/nacos/v1/ns/distro/checksum
java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-33477 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
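Our initial theory about this exception is that request traffic still hits every node during leader election, which makes the election fail. So we added an ng check in front of the cluster: as soon as traffic is switched to a single node, the other two recover quickly. That is only a temporary workaround, though, and we cannot work out why all three nodes suddenly ran into this problem. The firewalls are all disabled, and there are 6000+ services. It feels like a health-check issue. Could you take a look?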
Original question by GitHub user DoctChen
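Check whether the server has so many connections that it can no longer open new ones. In nacos 1.4.2, internal communication between nodes uses short-lived connections, and once a request finishes the connection moves to the TIME_WAIT state. Every heartbeat sent by a nacos-client triggers data synchronization to the other nodes, and each of those synchronization requests opens a short-lived connection. By default, Linux keeps a TIME_WAIT connection for 1 minute before closing it. If the cause is too many connections, try setting net.ipv4.tcp_tw_recycle to 1 and see whether that helps.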
Original answer by GitHub user amazinglogic
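Before changing kernel parameters (note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12, so it is not available on newer kernels), it may be worth confirming that TIME_WAIT connections really are piling up. The following is a minimal sketch, not part of the original answer: it assumes a Linux host, reads the standard /proc/net/tcp and /proc/net/tcp6 tables, and counts sockets per TCP state whose remote port is 8848, the port shown in the log above. The function name count_tcp_states is made up for this illustration.

```python
from collections import Counter
from pathlib import Path

# TCP state codes as they appear (in hex) in /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines, remote_port=None):
    """Count sockets per TCP state, optionally only for one remote port.

    `lines` are rows in /proc/net/tcp format:
    "sl: local_addr:port remote_addr:port state ..." with hex fields.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a socket entry.
        if len(fields) < 4 or not fields[0].rstrip(":").isdigit():
            continue
        if remote_port is not None:
            # The remote address field ends with ":<port in hex>".
            port = int(fields[2].rsplit(":", 1)[1], 16)
            if port != remote_port:
                continue
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1
    return counts


if __name__ == "__main__":
    lines = []
    # A JVM may open dual-stack sockets, so look at the IPv6 table as well.
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            lines += path.read_text().splitlines()
    # 8848 is the Nacos port seen in the log above; adjust if yours differs.
    for state, n in count_tcp_states(lines, remote_port=8848).most_common():
        print(f"{state:12} {n}")
```

If TIME_WAIT dominates the output and its count is close to the size of the local ephemeral port range (net.ipv4.ip_local_port_range), that would be consistent with the connection-exhaustion explanation. If the counts are small, the timeouts probably have a different cause.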