V2EX · 13 hours ago
[Programmers] Help: importing data into Milvus crashes the database. Is my configuration the problem?

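This post describes a failure in a Milvus deployment run under Docker Compose, where importing data is reported to crash the database. The main symptom is a lost etcd connection, after which key services such as Query Node, MixCoord, Proxy, and StreamingNode disconnect and exit one after another. The logs show that etcd could not find the components' leases and that the StreamingNode failed to append time-tick messages to the WAL, which suggests that etcd, the core coordination service, was unstable or unreachable and set off a cascading failure across Milvus.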

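💡 **The lost etcd connection is the core problem**: The logs show several Milvus components (Query Node, MixCoord, Proxy, StreamingNode, and Data Node) exiting because they lost their connection to etcd. The etcd server answers their keepalive requests with 'requested lease not found', meaning each component's session lease had already expired or been revoked. This is the immediate reason the services became unavailable; the logs do not show why the leases lapsed.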

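📈 **Time-tick sync failures affect data consistency**: The StreamingNode fails to send time-tick sync messages, reporting 'append time tick msg to wal failed' with the error code 'STREAMING_CODE_CHANNEL_FENCED'. The WAL (write-ahead log) here appears to be Milvus's own streaming log (the compose file sets MQ_TYPE: woodpecker) rather than etcd's, so the errors point to a problem with channel ownership or channel management that disrupts Milvus's internal timestamp synchronization.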

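🎛️ **Cascading component failures**: etcd is the metadata store and coordination center for Milvus, so once the connection fails, Query Node, MixCoord, Proxy, and other key services can no longer register or maintain their heartbeats. Each process exits in turn, and the whole deployment stops working.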

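⚙️ **Configuration and resource allocation need review**: The machine (4 TB of RAM, 256 CPUs) and the Docker Compose settings (such as etcd's ulimits, mem_limit, and cpus) look generous, but etcd's instability may stem from its own parameters (such as ETCD_SNAPSHOT_COUNT and ETCD_QUOTA_BACKEND_BYTES), network latency, disk I/O performance, or resource contention between etcd and the Milvus components.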
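
To make the 'requested lease not found' error more concrete, here is a toy model of lease bookkeeping in Python. It is only a sketch of the general idea (a lease is granted with a TTL and must be refreshed before it expires); it is not etcd's implementation, and the class name `ToyLeaseServer`, the function `simulate`, and the 10-second TTL are all made up for illustration.

```python
from dataclasses import dataclass, field


class LeaseNotFound(Exception):
    """Stands in for etcd's 'requested lease not found' error."""


@dataclass
class ToyLeaseServer:
    """Toy lease table; a hypothetical model, not etcd's real code."""
    expiry: dict = field(default_factory=dict)  # lease_id -> expiry time (s)
    ttl: dict = field(default_factory=dict)     # lease_id -> TTL (s)
    next_id: int = 1

    def grant(self, ttl, now):
        """Create a lease that expires `ttl` seconds after `now`."""
        lease_id = self.next_id
        self.next_id += 1
        self.ttl[lease_id] = ttl
        self.expiry[lease_id] = now + ttl
        return lease_id

    def _revoke_expired(self, now):
        """Drop every lease whose expiry time has passed."""
        for lease_id in [l for l, t in self.expiry.items() if t <= now]:
            del self.expiry[lease_id]
            del self.ttl[lease_id]

    def keep_alive(self, lease_id, now):
        """Refresh a lease, or fail if it has already been dropped."""
        self._revoke_expired(now)
        if lease_id not in self.expiry:
            raise LeaseNotFound(f"requested lease not found: {lease_id}")
        self.expiry[lease_id] = now + self.ttl[lease_id]
        return self.expiry[lease_id]


def simulate(keepalive_times, ttl=10):
    """Send keepalives at the given times (seconds) and report the outcome."""
    server = ToyLeaseServer()
    lease = server.grant(ttl, now=0)
    for t in keepalive_times:
        try:
            server.keep_alive(lease, now=t)
        except LeaseNotFound:
            return f"lease lost at t={t}s"
    return "lease alive"


if __name__ == "__main__":
    print(simulate([3, 6, 9, 12]))  # keepalives arrive on time
    print(simulate([3, 6, 30]))     # a long stall between keepalives
```

In this model a client that stalls for longer than the TTL finds its lease gone on the next keepalive. That matches the pattern in the logs, where each component's keepalive retry fails and the process then exits, but it does not by itself say whether the stall happened on the etcd side, the Milvus side, or the network in between.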

Data volume: 60 million (6000w)

Milvus docker compose:

    services:
      etcd:
        container_name: milvus-etcd
        image: quay.io/coreos/etcd:v3.5.18
        environment:
          - ETCD_AUTO_COMPACTION_MODE=revision
          - ETCD_AUTO_COMPACTION_RETENTION=1000
          - ETCD_QUOTA_BACKEND_BYTES=8589934592
          - ETCD_SNAPSHOT_COUNT=50000
          - ETCD_MAX_REQUEST_BYTES=33554432
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
        command: etcd -advertise-client-urls=http://etcd:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
        healthcheck:
          test: ["CMD", "etcdctl", "endpoint", "health"]
          interval: 30s
          timeout: 20s
          retries: 3
        ulimits:
          nofile:
            soft: 655360
            hard: 655360
        mem_limit: 16g
        cpus: 4.0
        logging:
          driver: "json-file"
          options:
            max-size: "100m"
            max-file: "3"
      minio:
        container_name: milvus-minio
        image: minio/minio:RELEASE.2024-05-28T17-19-04Z
        environment:
          MINIO_ACCESS_KEY: xxxxx
          MINIO_SECRET_KEY: xxxxx
        ports:
          - "9001:9001"
          - "9000:9000"
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
        command: minio server /minio_data --console-address ":9001"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
          interval: 30s
          timeout: 20s
          retries: 3
        ulimits:
          nofile:
            soft: 655360
            hard: 655360
        mem_limit: 16g
        cpus: 4.0
        logging:
          driver: "json-file"
          options:
            max-size: "100m"
            max-file: "3"
      standalone:
        container_name: milvus
        image: milvusdb/milvus:v2.6.0-rc1
        command: ["milvus", "run", "standalone"]
        security_opt:
        - seccomp:unconfined
        environment:
          ETCD_ENDPOINTS: etcd:2379
          MINIO_ADDRESS: minio:9000
          MQ_TYPE: woodpecker
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/_milvus:/var/lib/milvus
          - ./milvus.yaml:/milvus/configs/milvus.yaml
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
          interval: 30s
          start_period: 90s
          timeout: 20s
          retries: 3
        ports:
          - "xxxx:19530"
          - "xxxx:9091"
        depends_on:
          - "etcd"
          - "minio"
        ulimits:
          nofile:
            soft: 655360
            hard: 655360
        mem_limit: 1024g
        cpus: 32.0
        logging:
          driver: "json-file"
          options:
            max-size: "100m"
            max-file: "3"
    networks:
      default:
        name: milvus
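
The byte-valued etcd settings in the compose file are easier to reason about in binary units. The small Python sketch below converts the two values copied from the `environment` block; the helper name `human_bytes` is made up for this example.

```python
# Values copied from the etcd service's environment block in the compose file.
ETCD_ENV = {
    "ETCD_QUOTA_BACKEND_BYTES": 8589934592,
    "ETCD_MAX_REQUEST_BYTES": 33554432,
}


def human_bytes(n):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


if __name__ == "__main__":
    for name, raw in ETCD_ENV.items():
        print(f"{name} = {raw} bytes = {human_bytes(raw)}")
```

So the backend quota is 8 GiB and the maximum request size is 32 MiB. As far as I know, the etcd documentation lists 2 GiB as the default quota with roughly 8 GiB as the suggested maximum, and about 1.5 MiB as the default request limit, so both values here are already set well above the defaults. It is worth checking them against the documentation for the etcd version in use.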

Machine specs: 4 TB RAM, 256 CPUs

milvus.yaml: https://raw.githubusercontent.com/milvus-io/milvus/v2.6.0-rc1/configs/milvus.yaml

Partial logs:

    milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_8:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307846, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_8:rw@3 fenced"]
    milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_14:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307848, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_14:rw@3 fenced"]
    milvus        | [2025/07/18 15:41:28.119 +00:00] [WARN] [timetick/timetick_sync_operator.go:85] ["send time tick sync message failed"] [module=streamingnode] [component=timetick-sync] [pchannel=by-dev-rootcoord-dml_7:rw@3] [error="append time tick msg to wal failed, timestamp: 459499972358307847, previous message counter: 8: code: STREAMING_CODE_CHANNEL_FENCED, cause: by-dev-rootcoord-dml_7:rw@3 fenced"]
    milvus        | [2025/07/18 15:41:28.120 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=7587888197442225626] [error="etcdserver: requested lease not found"]
    milvus        | [2025/07/18 15:41:28.121 +00:00] [ERROR] [querynodev2/server.go:188] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/workspace/source/internal/querynodev2/server.go:188"]
    milvus        | [2025/07/18 15:41:28.121 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=mixcoord] [LeaseID=7587888197442225598] [error="etcdserver: requested lease not found"]
    milvus        | [2025/07/18 15:41:28.122 +00:00] [ERROR] [coordinator/mix_coord.go:107] ["MixCoord disconnected from etcd, process will exit"] [serverID=2] [stack="github.com/milvus-io/milvus/internal/coordinator.(*mixCoordImpl).Register.(*mixCoordImpl).Register.func1.func3\n\t/workspace/source/internal/coordinator/mix_coord.go:107"]
    milvus        | [2025/07/18 15:41:28.122 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=proxy] [LeaseID=7587888197442225923] [error="etcdserver: requested lease not found"]
    milvus        | [2025/07/18 15:41:28.122 +00:00] [ERROR] [proxy/proxy.go:181] ["Proxy disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/proxy.(*Proxy).Register.func1\n\t/workspace/source/internal/proxy/proxy.go:181"]
    milvus        | [2025/07/18 15:41:28.122 +00:00] [WARN] [handler/handler_client_impl.go:178] ["create handler failed"] [pchannel=by-dev-rootcoord-dml_10] [handler=producer] [assignment=by-dev-rootcoord-dml_10:rw@3>2@172.23.0.4:22222] [error="/milvus.proto.streaming.StreamingNodeHandlerService/Produce; streaming error: code = STREAMING_CODE_CHANNEL_NOT_EXIST, cause = by-dev-rootcoord-dml_10 not exist; rpc error: code = FailedPrecondition, desc = "]
    milvus        | [2025/07/18 15:41:28.123 +00:00] [INFO] [handler/handler_client_impl.go:183] ["report assignment error"] [pchannel=by-dev-rootcoord-dml_10] [handler=producer] [assignmentError="/milvus.proto.streaming.StreamingNodeHandlerService/Produce; streaming error: code = STREAMING_CODE_CHANNEL_NOT_EXIST, cause = by-dev-rootcoord-dml_10 not exist; rpc error: code = FailedPrecondition, desc = "] []
    milvus        | [2025/07/18 15:41:28.120 +00:00] [ERROR] [streamingnode/service.go:389] ["StreamingNode disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/distributed/streamingnode.(*Server).registerSessionToETCD.func1\n\t/workspace/source/internal/distributed/streamingnode/service.go:389"]
    milvus        | [2025/07/18 15:41:28.120 +00:00] [ERROR] [datanode/data_node.go:200] ["Data Node disconnected from etcd, process will exit"] ["Server Id"=2] [stack="github.com/milvus-io/milvus/internal/datanode.(*DataNode).Register.func1\n\t/workspace/source/internal/datanode/data_node.go:200"]
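
With many entries landing within a few milliseconds of each other, it can help to group them by level and message to see what failed first. The Python sketch below does that for entries in the format shown in the log excerpt above. The regular expression is inferred from those entries only and may not cover every Milvus log shape; `SAMPLE` holds two entries abbreviated from the excerpt (the stack field is dropped), and `summarize` is a name made up for this example.

```python
import re
from collections import defaultdict

# Two entries abbreviated from the log excerpt (stack field dropped).
SAMPLE = '''\
milvus        | [2025/07/18 15:41:28.120 +00:00] [WARN] [sessionutil/session_util.go:593] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=7587888197442225626] [error="etcdserver: requested lease not found"]
milvus        | [2025/07/18 15:41:28.121 +00:00] [ERROR] [querynodev2/server.go:188] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=2]
'''

# Pattern inferred from the excerpt: [timestamp zone] [LEVEL] [file:line] ["message"]
ENTRY = re.compile(
    r'\[(?P<ts>\d{4}/\d{2}/\d{2} [\d:.]+) [+-]\d{2}:\d{2}\] '
    r'\[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
)


def summarize(text):
    """Group entries by (level, message); keep a count and the earliest timestamp."""
    summary = defaultdict(lambda: {"count": 0, "first": None})
    # finditer scans the whole text, so it also works when several
    # entries have been run together on one line.
    for m in ENTRY.finditer(text):
        entry = summary[(m["level"], m["msg"])]
        entry["count"] += 1
        # Timestamps share one fixed-width format, so string order is time order.
        if entry["first"] is None or m["ts"] < entry["first"]:
            entry["first"] = m["ts"]
    return dict(summary)


if __name__ == "__main__":
    rows = sorted(summarize(SAMPLE).items(), key=lambda kv: kv[1]["first"])
    for (level, msg), info in rows:
        print(f'{info["first"]}  {level:5}  x{info["count"]}  {msg}')
```

Running it over the full `docker compose logs` output, rather than the short excerpt, should show whether the etcd keepalive failures precede the STREAMING_CODE_CHANNEL_FENCED warnings or follow them, which is the ordering question the excerpt alone cannot settle.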
