V2EX, July 21, 13:38
[Programmer] Help: importing data into Milvus crashes the database. Is my configuration the problem?

 

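This post presents the configuration of a Milvus deployment running under Docker Compose (started in standalone mode) and uses a portion of the logs to analyze possible deployment problems. The logs show etcd connection errors: the service components (Query Node, MixCoord, Proxy, StreamingNode, DataNode) losing contact with etcd is the main failure point, and it can prevent the service from starting or running normally. The analysis suggests that etcd's configuration parameters, the health checks, and the dependencies between services are the key areas to investigate.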

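💡 **etcd connectivity and stability underpin a running Milvus deployment**: The logs show that the core components (Query Node, MixCoord, Proxy, StreamingNode, DataNode) all exited after losing their connection to etcd. This makes the etcd configuration (such as `ETCD_QUOTA_BACKEND_BYTES` and `ETCD_SNAPSHOT_COUNT`) and its `healthcheck` settings critical: an unstable or misconfigured etcd service can bring down the whole Milvus deployment.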
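The raw byte values in the etcd `environment` section are hard to read at a glance. The following minimal sketch converts them to binary units. The helper name `human_bytes` is hypothetical; the two values are copied from the Compose file in this post.

```python
# Values copied from the etcd `environment` section of the Compose file.
ETCD_ENV = {
    "ETCD_QUOTA_BACKEND_BYTES": 8589934592,
    "ETCD_MAX_REQUEST_BYTES": 33554432,
}


def human_bytes(n: int) -> str:
    """Format a byte count using binary units (hypothetical helper)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


for name, raw in ETCD_ENV.items():
    print(f"{name} = {raw} ({human_bytes(raw)})")
```

In other words, the backend quota is set to 8 GiB and the maximum request size to 32 MiB.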

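🚀 **Health checks and dependencies between services deserve attention**: In the Compose file, the `standalone` service declares its dependency on `etcd` and `minio` through `depends_on`. The logs contain the error `etcdserver: requested lease not found`, which means etcd no longer holds the lease the client is trying to renew, and each component then disconnects and exits. This suggests a problem with etcd lease management or with communication between etcd and the Milvus components, so the session keep-alive (keepAlive) mechanism and the registration logic of the Milvus components should be checked.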
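To illustrate how a delayed keep-alive can end in "lease not found", here is a toy model of a TTL lease. It is a sketch under stated assumptions, not etcd's or Milvus's implementation: the class names `ToyLeaseStore`, `ToyLease` and `LeaseNotFound` are hypothetical, and the 10-second TTL is an arbitrary value chosen for the example.

```python
from dataclasses import dataclass


class LeaseNotFound(Exception):
    """Raised when a renewal targets a lease the store no longer holds."""


@dataclass
class ToyLease:
    ttl: float         # seconds the lease stays valid after each renewal
    expires_at: float  # absolute time at which the lease expires


class ToyLeaseStore:
    """Toy stand-in for a lease server; time is passed in explicitly."""

    def __init__(self) -> None:
        self._leases: dict[int, ToyLease] = {}
        self._next_id = 1

    def grant(self, ttl: float, now: float) -> int:
        lease_id = self._next_id
        self._next_id += 1
        self._leases[lease_id] = ToyLease(ttl=ttl, expires_at=now + ttl)
        return lease_id

    def keep_alive(self, lease_id: int, now: float) -> None:
        self._drop_expired(now)
        lease = self._leases.get(lease_id)
        if lease is None:
            raise LeaseNotFound("requested lease not found")
        lease.expires_at = now + lease.ttl  # renewal pushes expiry forward

    def _drop_expired(self, now: float) -> None:
        expired = [i for i, l in self._leases.items() if l.expires_at <= now]
        for i in expired:
            del self._leases[i]


store = ToyLeaseStore()
lease = store.grant(ttl=10.0, now=0.0)
store.keep_alive(lease, now=5.0)       # on time: lease is now valid until t=15
try:
    store.keep_alive(lease, now=30.0)  # stalled client: lease expired at t=15
except LeaseNotFound as err:
    print("keep-alive failed:", err)
```

The point of the model is that the renewal itself is not what fails; by the time it arrives, the lease is already gone. Whatever stalled the client or the server for longer than the TTL is the thing to look for.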

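🗄️ **Data persistence and volume mounts affect service availability**: `docker-compose.yaml` configures volumes for the `etcd`, `minio`, and `standalone` (container `milvus`) services to persist data. The logs show no direct evidence of a volume misconfiguration, but data loss or volume permission problems could still indirectly affect etcd session management and Milvus startup, especially across repeated restarts or upgrades.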

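📈 **Resource allocation and log monitoring are key to troubleshooting**: Given the machine (4 TB of memory, 256 CPUs) and the `mem_limit` and `cpus` settings on the services, resources, memory in particular, appear sufficient. However, the `STREAMING_CODE_CHANNEL_FENCED` errors and the `append time tick msg to wal failed` warnings in the logs may be related to etcd write performance, the WAL (write-ahead log) configuration, or concurrency handling. etcd performance metrics and Milvus's WAL-related settings need further analysis.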
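As a quick sanity check on the resource claim, the sketch below adds up the per-service limits from the Compose file and compares them with the machine totals. It assumes "4T" means 4 TiB and treats the `g` suffix in `mem_limit` as GiB; the dictionary layout is only for illustration.

```python
# Limits copied from the Compose file in this post.
LIMITS = {
    "etcd": {"mem_gib": 16, "cpus": 4.0},
    "minio": {"mem_gib": 16, "cpus": 4.0},
    "standalone": {"mem_gib": 1024, "cpus": 32.0},
}
MACHINE_MEM_GIB = 4 * 1024  # "4T" of memory, assumed to mean 4 TiB
MACHINE_CPUS = 256

total_mem = sum(s["mem_gib"] for s in LIMITS.values())
total_cpus = sum(s["cpus"] for s in LIMITS.values())

print(f"memory: {total_mem} GiB of {MACHINE_MEM_GIB} GiB "
      f"({total_mem / MACHINE_MEM_GIB:.1%})")
print(f"CPUs:   {total_cpus:g} of {MACHINE_CPUS} "
      f"({total_cpus / MACHINE_CPUS:.1%})")
```

The limits add up to 1056 GiB and 40 CPUs, roughly a quarter of the machine's memory and under a sixth of its CPUs, so raw capacity is unlikely to be the bottleneck. Note that etcd itself is capped at 16 GiB and 4 CPUs, which is a separate question from the machine total.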

Data volume: 60 million

Milvus docker compose:

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.18
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=8589934592
      - ETCD_SNAPSHOT_COUNT=50000
      - ETCD_MAX_REQUEST_BYTES=33554432
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://etcd:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3
    ulimits:
      nofile:
        soft: 655360
        hard: 655360
    mem_limit: 16g
    cpus: 4.0
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2024-05-28T17-19-04Z
    environment:
      MINIO_ACCESS_KEY: xxxxx
      MINIO_SECRET_KEY: xxxxx
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    ulimits:
      nofile:
        soft: 655360
        hard: 655360
    mem_limit: 16g
    cpus: 4.0
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
  standalone:
    container_name: milvus
    image: milvusdb/milvus:v2.6.0-rc1
    command: ["milvus", "run", "standalone"]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      MQ_TYPE: woodpecker
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/_milvus:/var/lib/milvus
      - ./milvus.yaml:/milvus/configs/milvus.yaml
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "xxxx:19530"
      - "xxxx:9091"
    depends_on:
      - "etcd"
      - "minio"
    ulimits:
      nofile:
        soft: 655360
        hard: 655360
    mem_limit: 1024g
    cpus: 32.0
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
networks:
  default:
    name: milvus

Machine specs: 4 TB of memory, 256 CPUs

milvus.yaml: https://raw.githubusercontent.com/milvus-io/milvus/v2.6.0-rc1/configs/milvus.yaml

Partial logs:

milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_8:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307846, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_8:rw@3 fenced"]
milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_14:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307848, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_14:rw@3 fenced"]
milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_7:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307847, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_7:rw@3 fenced"]
milvus        | [2025/07/18 15:41:28.120 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=7587888197442225626] [error="etcdserver: requested lease not found"]
milvus        | [2025/07/18 15:41:28.121 +00:00] [ERROR] [querynodev2/server.go:188] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/workspace/source/internal/querynodev2/server.go:188"]
milvus        | [2025/07/18 15:41:28.121 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=mixcoord] [LeaseID=7587888197442225598] [error="etcdserver: requested lease not found"]
milvus        | [2025/07/18 15:41:28.122 +00:00] [ERROR] [coordinator/mix_coord.go:107] ["MixCoord disconnected from etcd, process will exit"] [serverID=2] [stack="github.com/milvus-io/milvus/internal/coordinator.(*mixCoordImpl).Register.(*mixCoordImpl).Register.func1.func3\n\t/workspace/source/internal/coordinator/mix_coord.go:107"]
milvus        | [2025/07/18 15:41:28.122 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=proxy] [LeaseID=7587888197442225923] [error="etcdserver: requested lease not found"]
milvus        | [2025/07/18 15:41:28.122 +00:00] [ERROR] [proxy/proxy.go:181] ["Proxy disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/proxy.(*Proxy).Register.func1\n\t/workspace/source/internal/proxy/proxy.go:181"]
milvus        | [2025/07/18 15:41:28.122 +00:00] [WARN] [handler/handler_client_impl.go:178] ["create handler failed"] [pchannel=by-dev-rootcoord-dml_10] [handler=producer] [assignment=by-dev-rootcoord-dml_10:rw@3>2@172.23.0.4:22222] [error="/milvus.proto.streaming.StreamingNodeHandlerService/Produce; streaming error: code = STREAMING_CODE_CHANNEL_NOT_EXIST, cause = by-dev-rootcoord-dml_10 not exist; rpc error: code = FailedPrecondition, desc = "]
milvus        | [2025/07/18 15:41:28.123 +00:00] [INFO] [handler/handler_client_impl.go:183] ["report assignment error"] [pchannel=by-dev-rootcoord-dml_10] [handler=producer] [assignmentError="/milvus.proto.streaming.StreamingNodeHandlerService/Produce; streaming error: code = STREAMING_CODE_CHANNEL_NOT_EXIST, cause = by-dev-rootcoord-dml_10 not exist; rpc error: code = FailedPrecondition, desc = "] []
milvus        | [2025/07/18 15:41:28.120 +00:00] [ERROR] [streamingnode/service.go:389] ["StreamingNode disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/distributed/streamingnode.(*Server).registerSessionToETCD.func1\n\t/workspace/source/internal/distributed/streamingnode/service.go:389"]
milvus        | [2025/07/18 15:41:28.120 +00:00] [ERROR] [datanode/data_node.go:200] ["Data Node disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/datanode.(*DataNode).Register.func1\n\t/workspace/source/internal/datanode/data_node.go:200"]
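When reading logs like these, it can help to pull out which components exited and when. The sketch below does that with a regular expression written against the bracketed layout visible in this post, not against any official format specification. The sample lines are abbreviated from the logs in this post with trailing fields dropped, and the names `LINE_RE`, `EXIT_RE` and `exited_components` are hypothetical.

```python
import re

# Sample lines abbreviated from the logs in this post (trailing fields dropped).
SAMPLE = """\
milvus        | [2025/07/18 15:41:28.120 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=querynode]
milvus        | [2025/07/18 15:41:28.121 +00:00] [ERROR] [querynodev2/server.go:188] ["Query Node disconnected from etcd, process will exit"]
milvus        | [2025/07/18 15:41:28.122 +00:00] [ERROR] [proxy/proxy.go:181] ["Proxy disconnected from etcd, process will exit"]
"""

# Matches: [timestamp] [LEVEL] [source file:line] ["message"]
LINE_RE = re.compile(
    r'\[(?P<ts>\d{4}/\d{2}/\d{2} [\d:.]+ [+-]\d{2}:\d{2})\] '
    r'\[(?P<level>[A-Z]+)\] '
    r'\[(?P<source>[^\]]+)\] '
    r'\["(?P<msg>[^"]*)"\]'
)
EXIT_RE = re.compile(
    r"^(?P<component>.+) disconnected from etcd, process will exit$"
)


def exited_components(log_text: str) -> list[tuple[str, str]]:
    """Return (timestamp, component) for every etcd-disconnect exit message."""
    result = []
    # finditer scans the whole text, so it does not rely on one entry per line.
    for m in LINE_RE.finditer(log_text):
        hit = EXIT_RE.match(m["msg"])
        if hit:
            result.append((m["ts"], hit["component"]))
    return result


for ts, component in exited_components(SAMPLE):
    print(ts, component)
```

Sorting the result by timestamp and comparing it with the first `fail to retry keepAliveOnce` warning shows whether all components lost their leases at the same moment, which would point at etcd rather than at any single component.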
