The GitHub Blog · December 18, 2024
So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer

This post describes how GitHub created a new algorithm that improves the BPE implementation to address scaling challenges for language models. It explains why tokenization matters, where existing algorithms fall short in complexity and flexibility, the advantages of the improved BPE algorithm, a performance comparison, and the underlying principles.

Tokenization is the process of turning bytes into tokens; existing algorithms have complexity and flexibility limitations.

To address this, GitHub created a new algorithm that improves on existing BPE implementations with better performance and scalability.

The post explains why RAG matters for GitHub Copilot and the role tokenization plays in it.

A performance comparison demonstrates the new algorithm's practical advantages as well as its worst-case behavior.

Large language models (LLMs), such as those used by GitHub Copilot, do not operate directly on bytes but on tokens, which can complicate scaling. In this post, we explain how we solved that challenge at GitHub to support the growing number of Copilot users and features since first launching Copilot two years ago.

Tokenization is the process of turning bytes into tokens. The byte-pair encoding (BPE) algorithm is such a tokenizer, used (for example) by the OpenAI models we rely on at GitHub. Even the fastest existing implementations not only have at least O(n log n) complexity, they are also not incremental, and are therefore badly suited for our use cases, which go beyond encoding a single, upfront-known input. These limitations resulted in a number of scaling challenges that led us to create a novel algorithm to address them. Our algorithm not only scales linearly, but also comfortably outperforms popular libraries for all inputs.

Read on to find out more about why and how we created the open source bpe algorithm that substantially improves on state-of-the-art BPE implementations in order to address our broader set of use cases.

The importance of fast, flexible tokenization

Retrieval augmented generation (RAG) is essential for GitHub Copilot’s capabilities. RAG is used to improve model output by augmenting the user’s prompt with relevant snippets of text or code. A typical RAG approach works as follows:

1. Ahead of time, split documents into chunks that fit within the embedding model's token limit, compute an embedding for each chunk, and store the embeddings in a vector database.
2. At query time, retrieve the chunks most relevant to the user's request and use them to augment the prompt.

Tokenization is important for both of these steps. Most code files exceed the number of tokens that can be encoded into a single embedding, so we need to split the files into chunks that stay within the token limit. When building a prompt, there are also limits on the number of tokens that can be used, and the number of tokens impacts response time and cost. It is therefore common to have some kind of budgeting strategy, which requires being able to track how many tokens each component of the prompt contributes. In both cases, we are dynamically constructing a text and constantly need the updated token count during that process to decide how to proceed. However, most tokenizers provide only a single operation: encode a complete input text into tokens.

When scaling to millions of repositories and billions of embeddings, the efficiency of token calculation really starts to matter. Additionally, we need to consider the worst-case performance of tokenization for the stability of our production systems. A system that processes untrusted user input in the form of billions of source code files cannot allow data that, intentionally or not, causes pathological run times and threatens availability. (See this discussion of potential denial-of-service issues in the context of OpenAI’s tiktoken tokenizer.)

Some of the features that can help address these needs are:

- Counting the number of tokens for a given text.
- Splitting a text into chunks that each stay within a given token limit.
- Incrementally updating the token count while text is appended to a chunk or prompt.

Implementing these operations on top of current tokenization algorithms would result in at least quadratic runtime, whereas we would like the runtime to be linear.
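To make this concrete, here is a minimal sketch of the chunk-splitting loop described above, written against a hypothetical count_tokens helper (a stand-in for any conventional, non-incremental tokenizer; it is not the crate's API). Because the helper has to re-encode the growing chunk from scratch after every appended line, the loop performs on the order of n full encodings over text of length n, which is where the quadratic behavior comes from.

// Hypothetical stand-in for a conventional tokenizer's "encode and count"
// operation; a real tokenizer would re-encode `text` from scratch each call.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Split `lines` into chunks whose token count stays within `limit`.
fn chunk_by_token_limit(lines: &[&str], limit: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in lines {
        let candidate = format!("{current}{line}\n");
        // A non-incremental tokenizer forces a full re-encode of `candidate`
        // on every single iteration.
        if count_tokens(&candidate) > limit && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            current = format!("{line}\n");
        } else {
            current = candidate;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let lines = ["fn add(a: u32, b: u32) -> u32 {", "    a + b", "}"];
    println!("{:?}", chunk_by_token_limit(&lines, 8));
}

An incremental tokenizer removes the repeated work by updating the token count as each line is appended, instead of re-encoding the whole chunk.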

bpe: A fast tokenizer

We were able to make substantial improvements over state-of-the-art BPE implementations in order to address our broader set of use cases. Not only do we support more features, we do so with much better performance and scalability than existing libraries provide.

Our implementation is open source under the MIT license and can be found at https://github.com/github/rust-gems. The Rust crates are also published to crates.io as bpe and bpe-openai. The former contains the BPE implementation itself. The latter exposes convenience tokenizers (including pre-tokenization) for recent OpenAI token models.
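For orientation, using the convenience crate is intended to look roughly like the sketch below. Note that the function and method names (o200k_base, count, encode) are assumptions based on the crate's description rather than verified API, so check the documentation on crates.io for the actual interface.

// Assumed API sketch only: the names below are illustrative and may differ
// from the actual bpe-openai interface; consult the crate documentation.
use bpe_openai::o200k_base;

fn main() {
    let tokenizer = o200k_base();
    // Count tokens without materializing them (assumed method name).
    let n = tokenizer.count("So many tokens, so little time");
    // Encode into token ids (assumed method name).
    let tokens = tokenizer.encode("So many tokens, so little time");
    println!("{n} tokens: {tokens:?}");
}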

Read on for benchmark results and an introduction to the algorithm itself.

Performance comparison

We compare performance with two benchmarks, both of which measure tokenization throughput on randomly generated inputs of different sizes.

We used OpenAI’s o200k_base token model and compared our implementation with tiktoken-rs, a wrapper for OpenAI’s tiktoken library, and Huggingface’s tokenizers. All benchmarks were run single-threaded on an Apple M1 MacBook Pro.

Here are the results, showing single-threaded throughput in MiB/s:

The first figure shows the results for the benchmark that includes pre-tokenization. Our tokenizer outperforms tiktoken by almost 4x and Huggingface by about 10x. (These numbers are in line with tiktoken’s reported performance results; tiktoken only matches our single-threaded throughput when it runs with eight threads.) The second figure shows the worst-case complexity difference between our linear implementation, Huggingface’s heap-based implementation, and tiktoken’s quadratic implementation.

The rest of this post details how we achieved this result. We explain the basic principle of byte-pair encoding, the insight that allows the faster algorithm, and a high-level description of the algorithm itself.

Byte-pair encoding

BPE is a technique to encode text as a sequence of tokens from a token dictionary. The token dictionary is just an ordered list of tokens. Each token is either a single byte, or the concatenation of a pair of previously defined tokens. A string is encoded by replacing bytes with single-byte tokens and token pairs by concatenated tokens, in dictionary order.

Let’s see how the string abacbb is tokenized using the following dictionary:

1. a
2. b
3. c
4. ac (= a + c)
5. bb (= b + b)
6. ab (= a + b)
7. acbb (= ac + bb)

Initially, the string is tokenized into the single-byte tokens. Next, all occurrences (left to right) of the token pair a c are replaced by the token ac. This procedure is repeated until no more replacements are possible. For our input string abacbb, tokenization proceeds as follows:

1. a b a c b b
2. a b ac b b
3. a b ac bb
4. ab ac bb
5. ab acbb

Note that initially we have several pairs of single-byte tokens that appear in the dictionary, such as a b and a c. Even though ab appears earlier in the string, ac is chosen because the token appears first in the token dictionary. It is this behavior that makes BPE non-incremental with respect to string operations such as slicing or appending. For example, the substring abacb is tokenized as ab ac b, but if another b is added, the resulting string abacbb is tokenized as ab acbb. Two tokens from the prefix abacb are gone, and the encoding for the longer string even ends up being shorter.
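The replacement procedure is simple enough to write down directly. The following is a deliberately naive sketch that follows the description above (it is not the crate's implementation, and it works on characters rather than bytes to keep the toy example short). It reproduces the non-incremental behavior: abacb encodes to ab ac b, while abacbb encodes to ab acbb.

use std::collections::HashMap;

/// Naive BPE: repeatedly find the concatenated pair that appears earliest in
/// the dictionary and replace all of its occurrences, left to right.
fn naive_bpe(input: &str, dict: &[&str]) -> Vec<String> {
    // Map each dictionary token to its rank (its position in the dictionary).
    let rank: HashMap<&str, usize> =
        dict.iter().enumerate().map(|(i, t)| (*t, i)).collect();

    // Start with single-character tokens (single bytes in our ASCII example).
    let mut tokens: Vec<String> = input.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair whose concatenation has the lowest rank.
        let best = tokens
            .windows(2)
            .filter_map(|w| rank.get(format!("{}{}", w[0], w[1]).as_str()).copied())
            .min();
        let Some(best_rank) = best else { break };
        let merged = dict[best_rank];

        // Replace all occurrences of that pair, left to right.
        let mut next = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && format!("{}{}", tokens[i], tokens[i + 1]) == merged {
                next.push(merged.to_string());
                i += 2;
            } else {
                next.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = next;
    }
    tokens
}

fn main() {
    let dict = ["a", "b", "c", "ac", "bb", "ab", "acbb"];
    println!("{:?}", naive_bpe("abacb", &dict));  // ["ab", "ac", "b"]
    println!("{:?}", naive_bpe("abacbb", &dict)); // ["ab", "acbb"]
}

Each call re-runs the whole procedure, which is exactly why slicing or appending text forces a complete re-encode.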

The two main strategies for implementing BPE are:

1. Repeatedly scan the token sequence for the pair that appears earliest in the dictionary and replace it, which leads to quadratic worst-case runtime (this is essentially tiktoken's approach).
2. Keep the candidate token pairs in a heap ordered by their dictionary position, which brings the complexity down to O(n log n) (this is essentially the approach of Huggingface's tokenizers).

However, all tokenizers require the full text up front. Tokenizing a substring or an extended string means starting the encoding from scratch. For that reason, the more interesting use cases from above quickly become very expensive (at least O(n² log n)). So, how can we do better?

Composing valid encodings

The difficulty of the byte-pair encoding algorithm (as described above) is that token pair replacements can happen anywhere in the string and can influence the final tokens at the beginning of the string. However, it turns out that there is a property, which we call compatibility, that allows us to build up tokenizations left-to-right:

Given a valid encoding, we can append an additional token to produce a new valid encoding if the pair formed by the last token and the appended token is itself a valid encoding.

A valid encoding means that the original BPE algorithm produces that same encoding. We’ll show what this means with an example, and refer to the crate’s README for a detailed proof.

The sequence ab ac is a valid encoding for our example token dictionary. Can we append the token b? Retokenizing the pair ac b yields ac b again, so the pair is valid and ab ac b is a valid encoding; it is exactly what BPE produces for abacb. Can we append the token bb instead? Retokenizing the pair ac bb yields the single token acbb, so the pair is not valid and ab ac bb is not a valid encoding; BPE instead encodes abacbb as ab acbb.
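The check itself can be written as a small retokenization test. Here is a minimal sketch, reusing the naive_bpe helper from the earlier sketch (again purely illustrative, not the crate's implementation): two adjacent tokens are compatible exactly when re-encoding their concatenated bytes reproduces the same two tokens.

/// Two adjacent tokens form a valid encoding exactly when the original BPE
/// procedure, applied to their concatenated bytes, reproduces the same pair.
/// (Reuses the naive_bpe sketch from above; illustrative only.)
fn is_compatible(left: &str, right: &str, dict: &[&str]) -> bool {
    naive_bpe(&format!("{left}{right}"), dict) == [left, right]
}

// With dict = ["a", "b", "c", "ac", "bb", "ab", "acbb"]:
//   is_compatible("ac", "b",  &dict) == true   // "acb"  retokenizes to ac b
//   is_compatible("ac", "bb", &dict) == false  // "acbb" retokenizes to acbb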

The next section explains how we can go from building valid encodings to finding the encoding for a given input string.

Linear encoding

Using the compatibility rule, we can implement linear encoding with a dynamic programming algorithm.

The algorithm works by checking for each of the possible last tokens whether it is compatible with the tokenization of the remaining prefix. As we saw in the previous section, we only need the last token of the prefix’s encoding to decide this.

Let’s apply this idea to our example abacbb and write down the full encodings for every prefix:

a      → a
ab     → ab
aba    → ab a
abac   → ab ac
abacb  → ab ac b
abacbb → ab acbb

We only store the last token for every prefix. This gives us a, ab, a, ac, b for the first five prefixes. We can find the last token for a prefix with a simple lookup in this list. For example, the last token for ab is ab, and the last token for abac is ac.

For the last token of abacbb we have three token candidates: b, bb, and acbb. For each of these we must check whether it is compatible with the last token of the remaining prefix: b b, ac bb, or ab acbb. Retokenizing these combinations gives bb, acbb, and ab acbb, which means acbb is the only valid choice here. The algorithm works forward by computing the last token for every position in the input, using the last tokens for previous positions in the way we just described.

The resulting algorithm looks roughly like this:

// `last_tokens[i]` holds the last token of the valid encoding of the prefix
// of length i + 1. Helpers such as all_potential_tokens_for_suffix, token_len,
// and is_compatible are left abstract here.
let mut last_tokens = vec![];
for pos in 0..text.len() {
    // Candidates are all dictionary tokens that are a suffix of text[0..pos + 1].
    for candidate in all_potential_tokens_for_suffix(&text[0..pos + 1]) {
        if token_len(candidate) == pos + 1 {
            // The candidate covers the entire prefix and is trivially valid.
            last_tokens.push(candidate);
            break;
        } else if is_compatible(
            // Last token of the remaining prefix of length pos + 1 - token_len(candidate).
            last_tokens[pos - token_len(candidate)],
            candidate,
        ) {
            last_tokens.push(candidate);
            break;
        }
    }
}

How do we implement this efficiently?

We find the candidate tokens that end at each position with a string matching automaton built from the token dictionary; running the automaton is linear in the input length. Both the number of overlapping tokens ending at any position and the cost of the retokenization step in the compatibility check are bounded by constants determined by the token dictionary. Together, this gives us a linear runtime.
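To picture the candidate lookup, here is a sketch using the off-the-shelf aho-corasick crate (version 1). The bpe crate builds its own specialized automaton, so this is only an illustration of the idea: a single linear scan reports every dictionary token that ends at each position of the input, which is exactly the candidate set the dynamic program needs.

// Illustration only; requires the aho-corasick crate (version 1) as a dependency.
use aho_corasick::AhoCorasick;

fn main() {
    let dict = ["a", "b", "c", "ac", "bb", "ab", "acbb"];
    let text = "abacbb";

    // Build an automaton over all dictionary tokens.
    let ac = AhoCorasick::new(dict).expect("valid patterns");

    // One linear scan reports every (overlapping) token occurrence, so we can
    // collect the candidate "last tokens" for each position of the input.
    let mut candidates: Vec<Vec<&str>> = vec![Vec::new(); text.len()];
    for m in ac.find_overlapping_iter(text) {
        candidates[m.end() - 1].push(dict[m.pattern().as_usize()]);
    }

    for (pos, tokens) in candidates.iter().enumerate() {
        println!("tokens ending at position {pos}: {tokens:?}");
    }
}

For abacbb this prints, among others, the three candidates b, bb, and acbb at the final position, matching the worked example above.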

Putting it together

The Rust crate contains several different encoders based on this approach.

We have explained our algorithm at a high level so far. The crate’s README contains more technical details and is a great starting point for studying the code itself.

