GraphRAG Installation and Pitfalls


This document describes in detail how to install and configure GraphRAG in a local environment for offline knowledge-graph retrieval. It covers the complete workflow: cloning the code repository, installing dependencies, initializing a project, configuring the `settings.yaml` file, and running indexing and queries. Particular attention is paid to the `tiktoken` problem that arises in offline environments, with two solutions provided: manually downloading the encoder file, or modifying the `tiktoken` package's file-reading logic, so that GraphRAG can run without network access.

🛠️ Installing GraphRAG locally starts by cloning the code repository, then installing the project dependencies with `pip install -e .`, and initializing a project with `python -m graphrag.index --init --root ./ragtest`.

⚙️ Configuring the `settings.yaml` file is essential: change the model names and endpoints to your local services, or configure remote calls instead (e.g., Alibaba Cloud Bailian). The key parameters are `llm.api_base` and `embeddings.llm.api_base`, which point at the local Ollama service.

⚠️ The key to installing GraphRAG offline is working around `tiktoken`'s encoder-download step. This post provides two methods: manually download the `cl100k_base.tiktoken` file and place it in the expected cache directory, or modify `load.py` in the `tiktoken` package so it reads from a local path instead of downloading.

🚀 Once configured, run `python -m graphrag.index --root ./ragtest` to build the graph index, then run `python -m graphrag.query --root ./ragtest --method global "What is machine learning?"` to verify that GraphRAG works end to end.

Installation Steps

    Clone the graphrag-local-ollama repository:

    git clone https://github.com/TheAiSingularity/graphrag-local-ollama.git

    Enter the project directory:

    cd graphrag-local-ollama/

    Install the dependencies:

    pip install -e .

    Create a GraphRAG project and populate its input folder:

    mkdir -p ./ragtest/input
    cp input/* ./ragtest/input

    Initialize the project:

    python -m graphrag.index --init --root ./ragtest

    Copy the prepared configuration file into the project:

    cp settings.yaml ./ragtest

    Edit the configuration file:

    encoding_model: cl100k_base
    skip_workflows: []
    llm:
      api_key: ${GRAPHRAG_API_KEY}
      type: openai_chat # or azure_openai_chat
      model: qwen-32b-instruct-fp16:latest
      model_supports_json: true # recommended if this is available for your model.
      # max_tokens: 4000
      # request_timeout: 180.0
      api_base: http://localhost:7889/v1
      # api_version: 2024-02-15-preview
      # organization: <organization_id>
      # deployment_name: <azure_model_deployment_name>
      # tokens_per_minute: 150_000 # set a leaky bucket throttle
      # requests_per_minute: 10_000 # set a leaky bucket throttle
      # max_retries: 10
      # max_retry_wait: 10.0
      # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
      # concurrent_requests: 25 # the number of parallel inflight requests that may be made

    parallelization:
      stagger: 0.3
      # num_threads: 50 # the number of threads to use for parallel processing

    async_mode: threaded # or asyncio

    embeddings:
      ## parallelization: override the global parallelization settings for embeddings
      async_mode: threaded # or asyncio
      llm:
        api_key: ${GRAPHRAG_API_KEY}
        type: openai_embedding # or azure_openai_embedding
        model: bge-large-zh-v1.5:f16
        api_base: http://localhost:7889/api
        # api_version: 2024-02-15-preview
        # organization: <organization_id>
        # deployment_name: <azure_model_deployment_name>
        # tokens_per_minute: 150_000 # set a leaky bucket throttle
        # requests_per_minute: 10_000 # set a leaky bucket throttle
        # max_retries: 10
        # max_retry_wait: 10.0
        # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
        # concurrent_requests: 25 # the number of parallel inflight requests that may be made
        # batch_size: 16 # the number of documents to send in a single request
        # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
        # target: required # or optional

    chunks:
      size: 300
      overlap: 100
      group_by_columns: [id] # by default, we don't allow chunks to cross documents

    input:
      type: file # or blob
      file_type: text # or csv
      base_dir: "input"
      file_encoding: utf-8
      file_pattern: ".*\.txt$"

    cache:
      type: file # or blob
      base_dir: "cache"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>

    storage:
      type: file # or blob
      base_dir: "output/${timestamp}/artifacts"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>

    reporting:
      type: file # or console, blob
      base_dir: "output/${timestamp}/reports"
      # connection_string: <azure_blob_storage_connection_string>
      # container_name: <azure_blob_storage_container_name>

    entity_extraction:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/entity_extraction.txt"
      entity_types: [organization, person, geo, event]
      max_gleanings: 0

    summarize_descriptions:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/summarize_descriptions.txt"
      max_length: 500

    claim_extraction:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      # enabled: true
      prompt: "prompts/claim_extraction.txt"
      description: "Any claims or facts that could be relevant to information discovery."
      max_gleanings: 0

    community_report:
      ## llm: override the global llm settings for this task
      ## parallelization: override the global parallelization settings for this task
      ## async_mode: override the global async_mode settings for this task
      prompt: "prompts/community_report.txt"
      max_length: 2000
      max_input_length: 8000

    cluster_graph:
      max_cluster_size: 10

    embed_graph:
      enabled: false # if true, will generate node2vec embeddings for nodes
      # num_walks: 10
      # walk_length: 40
      # window_size: 2
      # iterations: 3
      # random_seed: 597832

    umap:
      enabled: false # if true, will generate UMAP embeddings for nodes

    snapshots:
      graphml: yes
      raw_entities: yes
      top_level_nodes: yes

    local_search:
      # text_unit_prop: 0.5
      # community_prop: 0.1
      # conversation_history_max_turns: 5
      # top_k_mapped_entities: 10
      # top_k_relationships: 10
      # max_tokens: 12000

    global_search:
      # max_tokens: 12000
      # data_max_tokens: 12000
      # map_max_tokens: 1000
      # reduce_max_tokens: 2000
      # concurrency: 32

A note on the configuration file: change the model names and endpoints to your local services, or point them at a remote provider instead (e.g., Alibaba Cloud Bailian).
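Before running the indexer, it can save time to confirm that both local endpoints actually respond. Below is a minimal sketch, assuming an Ollama server on port 7889 with the model names from the settings.yaml above; /v1/chat/completions is Ollama's OpenAI-compatible chat route and /api/embeddings is its native embedding route. Adjust the port and model names to your setup.

# Minimal connectivity check for the two endpoints configured above
import requests

BASE = "http://localhost:7889"

# Chat model (llm.api_base points at the OpenAI-compatible /v1 route)
chat = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "qwen-32b-instruct-fp16:latest",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=120,
)
chat.raise_for_status()
print("chat ok:", chat.json()["choices"][0]["message"]["content"][:40])

# Embedding model (embeddings.llm.api_base points at the native /api route)
emb = requests.post(
    f"{BASE}/api/embeddings",
    json={"model": "bge-large-zh-v1.5:f16", "prompt": "ping"},
    timeout=120,
)
emb.raise_for_status()
print("embedding dim:", len(emb.json()["embedding"]))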

    Run the indexer to build the graph (running the command below directly in an offline environment will not succeed; see the pitfalls section):

    python -m graphrag.index --root ./ragtest

Results

Once all of the above commands have been executed, the indexing run should complete successfully.

Querying

python -m graphrag.query --root ./ragtest --method global "What is machine learning?"

Run the query and you should get a result back.

Pitfalls

About tiktoken

When the encoder file is not present locally, tiktoken automatically tries to download it. In an offline environment that download can never succeed, so the file has to be supplied manually.

Method 1 (not verified)

In tiktoken_ext/openai_public.py, the blobpath for cl100k_base is https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken, and the corresponding cache file name is 9b5ad71b2ce5302211f9c61530b329a4922fc6a4 (the SHA-1 hash of that URL).

def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
        # expected_hash is a SHA-256 of the file contents, not the cache file name
        expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
    )
    # ... (rest of the function unchanged)
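The cache file name tiktoken looks for is just the SHA-1 hash of this blobpath URL, the same computation tiktoken performs in load.py, which you can reproduce directly:

import hashlib

blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
# tiktoken names the cached file after the SHA-1 of the URL it downloads from
print(hashlib.sha1(blobpath.encode()).hexdigest())
# 9b5ad71b2ce5302211f9c61530b329a4922fc6a4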

In tiktoken/load.py you can see the cache storage path. If nothing is set in the environment, the default on Linux is /tmp/data-gym-cache (you can verify the path yourself). So on an internet-connected machine that has already run a tiktoken program, find the file /tmp/data-gym-cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4 and copy it to the same path on the offline machine.
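A variation on the same idea that avoids depending on the /tmp default: load.py checks the TIKTOKEN_CACHE_DIR environment variable before falling back to the data-gym-cache directory, so you can populate a portable cache on the networked machine and copy it over. A minimal sketch:

# On the machine WITH network access: populate a portable tiktoken cache
import os

os.environ["TIKTOKEN_CACHE_DIR"] = "/tmp/tiktoken-cache"  # any writable directory

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # downloads and caches the encoder file
print(enc.encode("hello"))

# Copy /tmp/tiktoken-cache to the offline machine and set TIKTOKEN_CACHE_DIR to
# the same path there; tiktoken will then load the encoder without network access.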

For details, see the original post: blog.csdn.net/zhilaizhiwa…

Method 2 (the one I used)

In the directory where the tiktoken package is installed there is a load.py file; modify its file-reading logic as follows:

from __future__ import annotations

import base64
import hashlib
import json
import os
import tempfile
import uuid
from typing import Optional

import requests

# The commented-out block below is the original read_file() logic
# def read_file(blobpath: str) -> bytes:
#     if not blobpath.startswith("http://") and not blobpath.startswith("https://"):
#         try:
#             import blobfile
#         except ImportError as e:
#             raise ImportError(
#                 "blobfile is not installed. Please install it by running `pip install blobfile`."
#             ) from e
#         with blobfile.BlobFile(blobpath, "rb") as f:
#             return f.read()
#     # avoiding blobfile for public files helps avoid auth issues, like MFA prompts
#     resp = requests.get(blobpath)
#     resp.raise_for_status()
#     return resp.content

def read_file(blobpath: str) -> bytes:
    # Force the cl100k_base URL so the SHA-1 cache key below matches the staged file
    blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    # Same cache-key computation tiktoken normally uses: SHA-1 of the URL
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    # Local directory where the downloaded encoder file was placed
    cache_dir = "/data/inspur/product/lbk-tests"
    cache_path = os.path.join(cache_dir, cache_key)
    with open(cache_path, "rb") as f:
        data = f.read()
    return data

All this does is change read_file() from downloading remotely to reading a local file. It rests on the same underlying mechanism as Method 1; the difference is that Method 1 lets tiktoken resolve the cache directory itself (via the environment variable or the default path), while here the directory is hard-coded.
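Once load.py is patched and the encoder file is in place under its SHA-1 name in the hard-coded cache_dir, a quick sanity check on the offline machine:

import tiktoken

# With the patched read_file(), this loads the encoder entirely from the local
# file instead of attempting any network request
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("GraphRAG offline test")))  # prints a token count on success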

