Installation Process
Clone the graphrag-local-ollama repository:
git clone https://github.com/TheAiSingularity/graphrag-local-ollama.git
Enter the project directory:
cd graphrag-local-ollama/
Install the dependencies:
pip install -e .
Create a GraphRAG project, set up the input folder, and copy the input files into it:
mkdir -p ./ragtest/input
cp input/* ./ragtest/input
Initialize the project:
python -m graphrag.index --init --root ./ragtest
Copy the prepared configuration file into the project:
cp settings.yaml ./ragtest
Edit the configuration file:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: qwen-32b-instruct-fp16:latest
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:7889/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: bge-large-zh-v1.5:f16
    api_base: http://localhost:7889/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: yes
  raw_entities: yes
  top_level_nodes: yes

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
In this configuration file, change the model names and API addresses to your local services; you can also point them at a remote provider instead (for example, Alibaba Cloud Bailian).
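Before starting the indexing run it can be worth confirming that the two api_base endpoints in settings.yaml actually respond. The following is only a minimal sketch: it assumes Ollama is listening on port 7889, exposes its OpenAI-compatible chat endpoint and native embeddings endpoint, and has both models from the config already pulled; adjust the host, port, and model names to your setup.

import requests

# Chat model behind the OpenAI-compatible endpoint (matches llm.api_base in settings.yaml).
chat = requests.post(
    "http://localhost:7889/v1/chat/completions",
    json={
        "model": "qwen-32b-instruct-fp16:latest",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=120,
)
chat.raise_for_status()
print(chat.json()["choices"][0]["message"]["content"])

# Embedding model behind Ollama's native API (matches embeddings.llm.api_base).
emb = requests.post(
    "http://localhost:7889/api/embeddings",
    json={"model": "bge-large-zh-v1.5:f16", "prompt": "ping"},
    timeout=120,
)
emb.raise_for_status()
print(len(emb.json()["embedding"]))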
Run the indexing step to build the graph (running the command below directly will not succeed in an offline environment; see the Pitfalls section for details):
python -m graphrag.index --root ./ragtest
Run Results
After all of the commands above have been executed, a successful indexing run looks like the figure below:
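Besides the console log, you can also confirm the run by checking the artifacts it wrote. With the storage settings above (base_dir: "output/${timestamp}/artifacts"), a successful run leaves a set of parquet files under a timestamped output directory. A minimal check, noting that the exact artifact file names depend on the GraphRAG version:

import glob

# List the parquet artifacts produced by the indexing run.
for path in sorted(glob.glob("./ragtest/output/*/artifacts/*.parquet")):
    print(path)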
Query
python -m graphrag.query --root ./ragtest --method global "What is machine learning?"
Run the query and get the result:
Pitfalls
About tiktoken
When the required encoding is not cached locally, tiktoken automatically tries to download the corresponding file. In an offline environment this download cannot succeed, so the file has to be provided manually.
Method 1 (not verified)
In tiktoken_ext/openai_public.py, if the blobpath is https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken, then the hash of the cached file is 9b5ad71b2ce5302211f9c61530b329a4922fc6a4:
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
        expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
    )
In tiktoken/load.py you can see the cache storage path. If it is not set via an environment variable, the default on Linux is /tmp/data-gym-cache (you can verify the path yourself). So, on a machine with internet access that has already run a tiktoken program, locate the file /tmp/data-gym-cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4 and copy it to the same path on the offline machine.
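For reference, the cache file name above is simply the SHA-1 hex digest of the blobpath URL, which is how tiktoken/load.py computes its cache key:

import hashlib

blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
print(cache_key)  # 9b5ad71b2ce5302211f9c61530b329a4922fc6a4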
The above comes from this post: blog.csdn.net/zhilaizhiwa…
Method 2 (the one I used)
In the directory where the tiktoken package is installed there is a file load.py; modify its file-reading logic as follows:
from __future__ import annotations

import base64
import hashlib
import json
import os
import tempfile
import uuid
from typing import Optional

import requests

# The original read_file() logic is commented out below:
# def read_file(blobpath: str) -> bytes:
#     if not blobpath.startswith("http://") and not blobpath.startswith("https://"):
#         try:
#             import blobfile
#         except ImportError as e:
#             raise ImportError(
#                 "blobfile is not installed. Please install it by running `pip install blobfile`."
#             ) from e
#         with blobfile.BlobFile(blobpath, "rb") as f:
#             return f.read()
#     # avoiding blobfile for public files helps avoid auth issues, like MFA prompts
#     resp = requests.get(blobpath)
#     resp.raise_for_status()
#     return resp.content


def read_file(blobpath: str) -> bytes:
    # Always load the cl100k_base encoding from a local copy instead of downloading it.
    blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    # Directory where the manually downloaded .tiktoken file was placed.
    cache_dir = "/data/inspur/product/lbk-tests"
    cache_path = os.path.join(cache_dir, cache_key)
    with open(cache_path, "rb") as f:
        data = f.read()
    return data
In short, the change above replaces read_file()'s remote-download logic with reading from a local file. The underlying idea is the same as Method 1; the difference is that Method 1 lets tiktoken read the file from its default cache directory (or the one set via the environment variable), while this method hard-codes the path.