MarkTechPost@AI
A Coding Tutorial of Model Context Protocol Focusing on Semantic Chunking, Dynamic Token Management, and Context Relevance Scoring for Efficient LLM Interactions

This article presents an approach to managing large language model context effectively in resource-constrained environments such as Google Colab. It builds a ModelContextManager that automatically chunks text, generates semantic embeddings with Sentence-Transformers, and scores each chunk by recency, importance, and relevance. It also shows how to integrate this manager with Hugging Face's FLAN-T5 model to add, optimize, and retrieve context dynamically, so the model always operates on the most relevant information. Along the way, it covers token counting with the GPT-2 tokenizer, context-window optimization strategies, and an interactive session for querying and visualizing the dynamic context in real time.

🧰**ModelContextManager**: A manager for implementing the Model Context Protocol with LLMs, designed with the Google Colab environment in mind. It handles context-window optimization, token management, and relevance scoring so the model runs efficiently under tight resource limits.

✨**ContextChunk dataclass**: Encapsulates a text segment and its metadata, including the text itself, its embedding vector, an importance score, a timestamp, and custom metadata. Each chunk is automatically timestamped on creation and can be assigned an importance value for later prioritization.

⚖️**Context optimization strategy**: When the token count exceeds the configured maximum context length, the `optimize_context` method is invoked. It calls `score_chunks` to score each chunk on recency, importance, and semantic relevance, removes the lowest-scoring chunks, and keeps the most relevant ones, trimming the context window back within budget.

🔍**Relevance scoring mechanism**: The `score_chunks` method scores each chunk using recency decay, user-assigned importance, and semantic similarity. Recency decay gives newer chunks higher scores, the importance score lets users manually raise or lower a chunk's priority, and semantic similarity is computed as the cosine similarity between the chunk's embedding and the query embedding. A simplified sketch of this scoring-and-trimming pass follows below.
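As a rough, self-contained sketch of that scoring-and-trimming pass (the per-chunk values, token counts, and token budget below are purely illustrative; the real logic lives in the `score_chunks` and `optimize_context` methods shown later):

import numpy as np

# Illustrative per-chunk signals, all in [0, 1]; weights mirror the defaults (0.3 / 0.3 / 0.4).
recency    = np.array([1.0, 0.6, 0.1])   # 1.0 = newest chunk
importance = np.array([1.0, 0.8, 0.5])   # user-assigned priority
similarity = np.array([0.9, 0.3, 0.7])   # normalized cosine similarity to the query

scores = 0.3 * recency + 0.3 * importance + 0.4 * similarity
print(scores)                             # [0.96 0.54 0.46]

# Greedy trimming: keep the highest-scoring chunks until the token budget is exhausted.
chunk_tokens = [120, 300, 250]            # hypothetical token counts per chunk
budget, used, kept = 400, 0, []
for idx in np.argsort(scores)[::-1]:
    if used + chunk_tokens[idx] <= budget:
        kept.append(int(idx))
        used += chunk_tokens[idx]
print(kept, used)                         # [0, 2] 370 -> the middle chunk is dropped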

Managing context effectively is a critical challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed available token windows. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using Sentence-Transformers, and scores each chunk based on recency, importance, and relevance. You’ll learn how to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most pertinent pieces of context. Along the way, we’ll cover token counting with a GPT-2 tokenizer, context-window optimization strategies, and interactive sessions that let you query and visualize your dynamic context in real time.

import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm

We import the essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. The utility modules time and gc support timestamping and memory cleanup, and tqdm.notebook supplies interactive progress bars for chunk processing in Colab.

@dataclass
class ContextChunk:
    """A chunk of text with metadata for the Model Context Protocol."""
    text: str
    embedding: Optional[torch.Tensor] = None
    importance: float = 1.0
    timestamp: float = 0.0
    metadata: Dict[str, Any] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}
        if self.timestamp == 0.0:
            self.timestamp = time.time()

The ContextChunk dataclass encapsulates a single segment of text along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that each chunk is stamped with the current time upon creation and that metadata defaults to an empty dictionary if none is provided.
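A minimal usage sketch (purely illustrative; in practice the manager computes and attaches the embedding, which is omitted here):

chunk = ContextChunk(
    text="MCP keeps prompts within the token budget.",
    importance=0.8,
    metadata={"source": "notes"},
)
print(chunk.timestamp > 0)   # True: stamped automatically by __post_init__
print(chunk.metadata)        # {'source': 'notes'}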

class ModelContextManager:
    """
    Manager for implementing Model Context Protocol in LLMs on Google Colab.
    Handles context window optimization, token management, and relevance scoring.
    """

    def __init__(
        self,
        max_context_length: int = 8192,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        relevance_threshold: float = 0.7,
        recency_weight: float = 0.3,
        importance_weight: float = 0.3,
        semantic_weight: float = 0.4,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the Model Context Manager.

        Args:
            max_context_length: Maximum number of tokens in context window
            embedding_model: Model to use for text embeddings
            relevance_threshold: Threshold for chunk relevance to be included
            recency_weight: Weight for recency in relevance calculation
            importance_weight: Weight for importance in relevance calculation
            semantic_weight: Weight for semantic similarity in relevance calculation
            device: Device to run computations on
        """
        self.max_context_length = max_context_length
        self.device = device
        self.chunks = []
        self.current_token_count = 0
        self.relevance_threshold = relevance_threshold

        self.recency_weight = recency_weight
        self.importance_weight = importance_weight
        self.semantic_weight = semantic_weight

        try:
            from sentence_transformers import SentenceTransformer
            print(f"Loading embedding model {embedding_model}...")
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
        except ImportError:
            print("Installing sentence-transformers...")
            import subprocess
            subprocess.run(["pip", "install", "sentence-transformers"])
            from sentence_transformers import SentenceTransformer
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")

        try:
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
        """
        Add a new chunk of text to the context manager.

        Args:
            text: The text content to add
            importance: Importance score (0-1)
            metadata: Additional metadata for the chunk
        """
        with torch.no_grad():
            embedding = self.embedding_model.encode(text, convert_to_tensor=True)

        chunk = ContextChunk(
            text=text,
            embedding=embedding,
            importance=importance,
            timestamp=time.time(),
            metadata=metadata or {}
        )

        self.chunks.append(chunk)
        self.current_token_count += len(self.tokenizer.encode(text))

        if self.current_token_count > self.max_context_length:
            self.optimize_context()

    def optimize_context(self) -> None:
        """Optimize context by removing less relevant chunks to fit within token limit."""
        if not self.chunks:
            return

        print("Optimizing context window...")

        scores = self.score_chunks()

        sorted_indices = np.argsort(scores)[::-1]

        new_chunks = []
        new_token_count = 0

        for idx in sorted_indices:
            chunk = self.chunks[idx]
            chunk_tokens = len(self.tokenizer.encode(chunk.text))

            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            else:
                if scores[idx] > self.relevance_threshold * 1.5:
                    for i, included_chunk in enumerate(new_chunks):
                        included_idx = sorted_indices[i]
                        if scores[included_idx] < self.relevance_threshold:
                            included_tokens = len(self.tokenizer.encode(included_chunk.text))
                            if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                                new_chunks.remove(included_chunk)
                                new_token_count -= included_tokens
                                new_chunks.append(chunk)
                                new_token_count += chunk_tokens
                                break

        removed_count = len(self.chunks) - len(new_chunks)
        self.chunks = new_chunks
        self.current_token_count = new_token_count

        print(f"Context optimized: Removed {removed_count} chunks, {len(new_chunks)} remaining, using {new_token_count}/{self.max_context_length} tokens")

        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def score_chunks(self, query: str = None) -> np.ndarray:
        """
        Score chunks based on recency, importance, and semantic relevance.

        Args:
            query: Optional query to calculate semantic relevance against

        Returns:
            Array of scores for each chunk
        """
        if not self.chunks:
            return np.array([])

        current_time = time.time()
        max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
        recency_scores = np.array([
            1.0 - ((current_time - chunk.timestamp) / max_age)
            for chunk in self.chunks
        ])

        importance_scores = np.array([chunk.importance for chunk in self.chunks])

        if query is not None:
            query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
            similarity_scores = np.array([
                torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                for chunk in self.chunks
            ])

            similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
        else:
            similarity_scores = np.ones(len(self.chunks))

        final_scores = (
            self.recency_weight * recency_scores +
            self.importance_weight * importance_scores +
            self.semantic_weight * similarity_scores
        )

        return final_scores

    def retrieve_context(self, query: str = None, k: int = None) -> str:
        """
        Retrieve the most relevant context for a given query.

        Args:
            query: The query to retrieve context for
            k: The maximum number of chunks to return (None = all relevant chunks)

        Returns:
            String containing the combined relevant context
        """
        if not self.chunks:
            return ""

        scores = self.score_chunks(query)

        relevant_indices = np.where(scores >= self.relevance_threshold)[0]

        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]

        if k is not None:
            relevant_indices = relevant_indices[:k]

        relevant_texts = [self.chunks[i].text for i in relevant_indices]
        return "\n\n".join(relevant_texts)

    def get_stats(self) -> Dict[str, Any]:
        """Get statistics about the current context state."""
        return {
            "chunk_count": len(self.chunks),
            "token_count": self.current_token_count,
            "max_tokens": self.max_context_length,
            "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
            "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
            "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
        }

    def visualize_context(self):
        """Visualize the current context window distribution."""
        try:
            import matplotlib.pyplot as plt
            import pandas as pd

            if not self.chunks:
                print("No chunks to visualize")
                return

            scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]

            df = pd.DataFrame({
                'Size (tokens)': chunk_sizes,
                'Age (seconds)': relative_times,
                'Importance': importance,
                'Score': scores
            })

            fig, axs = plt.subplots(2, 2, figsize=(14, 10))

            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')

            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')

            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')

            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')

            plt.tight_layout()
            plt.show()

        except ImportError:
            print("Please install matplotlib and pandas for visualization")
            print('!pip install matplotlib pandas')

The ModelContextManager class orchestrates the end-to-end handling of context for LLMs by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most pertinent chunks, and convenient utilities for monitoring and visualizing context statistics.
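For orientation, here is a minimal usage sketch of that API (the sample strings and the k=1 cutoff are illustrative, not taken from the tutorial):

manager = ModelContextManager(max_context_length=2048)
manager.add_chunk("MCP scores chunks by recency, importance, and similarity.", importance=1.0)
manager.add_chunk("Unrelated background note kept at low priority.", importance=0.3)

print(manager.get_stats()["token_count"])                              # current token usage
print(manager.retrieve_context(query="How are chunks scored?", k=1))   # single best-matching chunk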

class MCPColabDemo:
    """Demonstration of Model Context Protocol in Google Colab with a Language Model."""

    def __init__(
        self,
        model_name: str = "google/flan-t5-base",
        max_context_length: int = 2048,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the MCP Colab demo with a specified model.

        Args:
            model_name: Hugging Face model name
            max_context_length: Maximum context length for the MCP manager
            device: Device to run the model on
        """
        self.device = device
        self.context_manager = ModelContextManager(
            max_context_length=max_context_length,
            device=device
        )

        try:
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            print(f"Loading model {model_name}...")
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")

    def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
        """
        Add a document to the context by chunking it appropriately.

        Args:
            text: Document text
            chunk_size: Size of each chunk in characters
            overlap: Overlap between chunks in characters
        """
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
            if len(chunk) > 20:
                chunks.append(chunk)

        print(f"Adding {len(chunks)} chunks to context...")
        for i, chunk in enumerate(tqdm(chunks)):
            pos = i / len(chunks)
            importance = 1.0 - 0.5 * min(pos, 1 - pos)

            self.context_manager.add_chunk(
                text=chunk,
                importance=importance,
                metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
            )

    def process_query(self, query: str, max_new_tokens: int = 256) -> str:
        """
        Process a query using the context manager and model.

        Args:
            query: The query to process
            max_new_tokens: Maximum number of tokens in response

        Returns:
            Model response
        """
        self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})

        relevant_context = self.context_manager.retrieve_context(query=query)

        prompt = f"Context: {relevant_context}\n\nQuestion: {query}\n\nAnswer:"

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        print("Generating response...")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        self.context_manager.add_chunk(
            response,
            importance=0.9,
            metadata={"type": "response", "query": query}
        )

        return response

    def interactive_session(self):
        """Run an interactive session in the notebook."""
        from IPython.display import clear_output

        print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []

        while True:
            query = input("\nYour query: ")

            if query.lower() == 'exit':
                break

            if query.lower() == 'stats':
                print("\nContext Statistics:")
                stats = self.context_manager.get_stats()
                for key, value in stats.items():
                    print(f"{key}: {value}")
                self.context_manager.visualize_context()
                continue

            if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
                clear_output(wait=True)
                print("Context cleared!")
                continue

            response = self.process_query(query)
            conversation_history.append((query, response))

            print("\nResponse:")
            print(response)
            print("\n" + "-"*50)

            stats = self.context_manager.get_stats()
            print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens ({stats['usage_percentage']:.1f}%)")

The MCPColabDemo class ties the context manager to a seq2seq LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device, and provides utility methods for chunking and ingesting entire documents, processing user queries by prepending only the most relevant context, and running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
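A hedged usage sketch of that workflow (my_document.txt is a placeholder file name, not part of the tutorial):

demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)

with open("my_document.txt") as f:        # placeholder: any long document of your own
    long_text = f.read()

demo.add_document(long_text, chunk_size=512, overlap=50)
print(demo.process_query("Summarize the key points of the document."))

# demo.interactive_session()              # optional: the notebook loop with 'stats'/'clear'/'exit' commands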

def run_mcp_demo():
    """Run a simple demo of the Model Context Protocol."""
    print("Running Model Context Protocol Demo...")

    context_manager = ModelContextManager(max_context_length=4096)

    print("Adding sample chunks...")

    context_manager.add_chunk(
        "The Model Context Protocol (MCP) is a framework for managing context "
        "windows in large language models. It helps optimize token usage and improve relevance.",
        importance=1.0
    )

    context_manager.add_chunk(
        "Context management involves techniques like sliding windows, chunking, "
        "and relevance filtering to handle large documents efficiently.",
        importance=0.8
    )

    for i in range(10):
        context_manager.add_chunk(
            f"This is test chunk {i} with some filler content to simulate a larger context "
            f"window that needs optimization. This helps demonstrate the MCP functionality "
            f"for context window management in language models on Google Colab.",
            importance=0.5 - (i * 0.02)
        )

    stats = context_manager.get_stats()
    print("\nInitial Statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")

    query = "How does the Model Context Protocol work?"
    print(f"\nRetrieving context for: '{query}'")
    context = context_manager.retrieve_context(query)
    print(f"\nRelevant context:\n{context}")

    print("\nVisualizing context:")
    context_manager.visualize_context()

    print("\nDemo complete!")

The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.

if __name__ == "__main__":
    run_mcp_demo()

Finally, this standard Python entry-point guard ensures that the run_mcp_demo() function executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.

In conclusion, we now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes the context fragments that truly matter for your queries. The ModelContextManager equips you with tools to balance semantic relevance, temporal freshness, and user-assigned importance, while the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating alternative LLM backends to tailor the setup to your domain-specific workflows. Ultimately, this approach yields concise yet highly relevant prompts and, in turn, more accurate and efficient responses from your language models.
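For instance, one possible reconfiguration, shown only as a sketch (the alternative encoder and the shifted weights are our own choices, not prescribed by the tutorial):

manager = ModelContextManager(
    max_context_length=4096,
    embedding_model="sentence-transformers/all-mpnet-base-v2",  # a heavier alternative encoder
    relevance_threshold=0.6,   # admit more chunks into retrieved context
    recency_weight=0.2,
    importance_weight=0.3,
    semantic_weight=0.5,       # weights still sum to 1.0
)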


Here is the Colab Notebook.
