MarkTechPost@AI · February 17
A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python

This article shows how to create a custom BPE tokenizer with the tiktoken library. First, load a pre-trained tokenizer model; then define the base vocabulary and special tokens; next, initialize the tokenizer with a specific regular expression for splitting text; finally, test the tokenizer by encoding and decoding sample text. This process is essential for NLP tasks that require precise control over tokenization. By the end, readers will know the basic steps for building a custom tokenizer, laying a solid foundation for subsequent NLP projects.

🔑 Outlines the steps for creating a custom BPE tokenizer with the tiktoken library: loading a pre-trained model, defining base and special tokens, and initializing the tokenizer with a regular expression.

🧰 Shows how to initialize the tokenizer with tiktoken.Encoding, where name sets the tokenizer's name, pat_str defines the splitting rules via a regular expression, mergeable_ranks loads the base vocabulary, and special_tokens maps each special token to a unique ID.

🧪 Verifies the custom tokenizer by encoding and decoding sample text, confirming that text converts correctly to token IDs and can be recovered from them.

📚 Notes that in NLP projects, a custom tokenizer is a key step toward tailored text processing and tokenization, giving readers a foundation for later work.

In this tutorial, we’ll learn how to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.
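Before running anything, make sure the required packages are installed. A minimal setup, assuming a pip-based environment (blobfile is the package tiktoken's loader typically relies on to read local model files):

pip install tiktoken blobfile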

from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import json

Here, we import the libraries needed for text processing and tokenization: Path from pathlib for easy file-path management, and tiktoken together with load_tiktoken_bpe for loading and working with a Byte Pair Encoding tokenizer.

tokenizer_path = "./content/tokenizer.model"
num_reserved_special_tokens = 256

mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
num_base_tokens = len(mergeable_ranks)

special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
]

Here, we set the path to the tokenizer model and reserve 256 slots for special tokens. We then load the mergeable ranks, which form the base vocabulary, count the base tokens, and define a list of special tokens for marking text boundaries and other reserved purposes.
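For orientation, the loaded mergeable_ranks is simply a dictionary mapping byte sequences to integer ranks. A quick, illustrative inspection (the exact entries depend on the model file you load):

# Peek at the base vocabulary: each entry maps raw bytes to a rank,
# which doubles as the token ID and the merge priority.
for token_bytes, rank in list(mergeable_ranks.items())[:3]:
    print(rank, token_bytes)

print("Base vocabulary size:", num_base_tokens)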

reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens

tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

Now, we dynamically create enough additional reserved tokens to bring the special-token count to 256 and append them to the predefined list. We then initialize the tokenizer with tiktoken.Encoding, passing a regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping that assigns each special token a unique ID just past the base vocabulary.
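As a quick sanity check on that mapping (a minimal sketch; the printed values depend on the size of your model's base vocabulary):

# Special tokens occupy the IDs immediately after the base vocabulary.
print("Total vocabulary size:", tokenizer.n_vocab)
print("<|begin_of_text|> ID:", num_base_tokens + special_tokens.index("<|begin_of_text|>"))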

# -------------------------------------------------------------------------
# Test the tokenizer with a sample text
# -------------------------------------------------------------------------
sample_text = "Hello, this is a test of the updated tokenizer!"

encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)

print("Sample Text:", sample_text)
print("Encoded Tokens:", encoded)
print("Decoded Text:", decoded)

We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text, printing the original text, the encoded tokens, and the decoded text to confirm that the round trip works correctly.
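Note that tiktoken deliberately refuses to encode special tokens found in ordinary input unless you opt in. A short sketch using the allowed_special parameter of tiktoken's encode method:

# Special tokens in the input must be explicitly allowed; otherwise
# encode() raises an error to guard against accidental injection.
wrapped = "<|begin_of_text|>" + sample_text + "<|end_of_text|>"
ids = tokenizer.encode(wrapped, allowed_special="all")
print("With special tokens:", ids)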

Next, we encode the short string “Hey” into its corresponding token IDs using the tokenizer’s encode method, as sketched below.
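A minimal sketch of that call:

# Encode a short string; the result is a list of integer token IDs.
print(tokenizer.encode("Hey"))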

In conclusion, this tutorial showed how to set up a custom BPE tokenizer using the tiktoken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer’s functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.


