ByteByteGo · July 22, 23:31
How Nubank Uses AI Models to Analyze Transaction Data for 100M Users

Nubank is overturning the traditional way financial institutions understand and exploit customer data by adopting an advanced approach built on foundation models. The method abandons time-consuming, brittle manual feature engineering in favor of self-supervised learning directly on massive volumes of raw transaction data, automatically extracting general-purpose, expressive embeddings of user behavior. These embeddings power downstream tasks such as credit modeling, personalized recommendations, and anomaly detection, unifying modeling work, reducing duplicated effort, and markedly improving predictive performance. The article details Nubank's system design for building and deploying these foundation models, covering the full lifecycle from data representation and model architecture to pretraining, fine-tuning, and fusion with traditional tabular systems.

💡 **Goodbye manual feature engineering, hello foundation models**: Nubank has abandoned the financial industry's traditional reliance on manually transforming raw transaction data into structured features. Manual feature engineering is not only time-consuming and error-prone; features designed for one task also struggle to generalize to other applications. Borrowing the foundation-model concept from natural language processing and computer vision, Nubank runs self-supervised learning directly on raw transaction data to automatically learn general-purpose, efficient representations (embeddings) of user behavior, greatly improving both efficiency and model performance.

🚀 **A hybrid encoding strategy for efficiency and generalization**: To cope with transaction data's mix of data types, high cardinality, and cold-start problems, Nubank developed an innovative hybrid encoding strategy. Each transaction is decomposed into a sequence of discrete tokens: the sign of the amount, a bucketed amount, date parts (month, day, weekday), and a BPE-tokenized merchant description. This keeps the representation compact while preserving the key information, avoids the computational cost of converting everything to full text, and lets the model process data efficiently while generalizing well to unseen or rare transactions.

🔄 **Self-supervised learning powers the model**: Nubank trains its foundation models on massive volumes of unlabeled transaction data using self-supervised objectives such as Next Token Prediction (NTP) and Masked Language Modeling (MLM). By predicting the next token in a sequence or filling in masked tokens, the model develops a deep understanding of spending behavior, including consumption cycles, recurring payments, and anomalous transactions. The approach exploits each user's complete transaction history without any manual labeling, dramatically improving learning efficiency and the model's ability to capture patterns in financial behavior.

🔗 **End-to-end joint fusion improves synergy**: To make full use of both the sequence embeddings produced by the foundation model and existing structured tabular features (such as user profiles and credit bureau data), Nubank adopts a "joint fusion" strategy. Using deep architectures such as DCNv2, the transformer is trained end to end together with the tabular model. The model thereby learns sequence information that complements the tabular data, and both components are optimized for the same prediction task, yielding better performance than independently trained "late fusion".

🏢 **A centralized AI platform accelerates adoption**: Nubank built a centralized AI platform that stores pretrained foundation models and provides standardized fine-tuning pipelines. Teams across the company can easily access and combine these models and deploy fine-tuned versions for their own needs, without retraining from scratch. The platform offers both embedding models and fusion models, ensuring flexibility and scalability, speeding up the adoption of AI in financial products, and promoting the sharing and iteration of model capabilities.

Free NoSQL Training – and a Book by Discord Engineer Bo Ingram (Sponsored)

Get practical experience with the strategies used at Discord, Disney, Zillow, Tripadvisor & other game-changers.
July 30, 2025

Whether you’re just getting started with NoSQL or looking to optimize your NoSQL performance, this event is a fast way to learn more and get your questions answered by experts.

You can choose from two tracks:

This is live instructor-led training, so bring your toughest questions. You can interact with speakers and connect with fellow attendees throughout the event.

Register for Free

Bonus: Registrants get the ScyllaDB in Action ebook by Discord Staff Engineer Bo Ingram.


Disclaimer: The details in this post have been derived from the articles shared online by the Nubank Engineering Team. All credit for the technical details goes to the Nubank Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Understanding customer behavior at scale is one of the core challenges facing modern financial institutions. With millions of users generating billions of transactions, the ability to interpret and act on this data is critical for offering relevant products, detecting fraud, assessing risk, and improving user experience.

Historically, the financial industry has relied on traditional machine learning techniques built around tabular data. In these systems, raw transaction data is manually transformed into structured features (such as income levels, spending categories, or transaction counts) that serve as inputs to predictive models. 

While this approach has been effective, it suffers from two major limitations:

- Manual feature engineering is slow and error-prone, demanding significant human effort for every new model.
- Features are designed for specific tasks, so they rarely generalize; work done for one application usually has to be redone for the next.

To address these constraints, Nubank adopted a foundation model-based approach, which has already been transforming domains like natural language processing and computer vision. Instead of relying on hand-crafted features, foundation models are trained directly on raw transaction data using self-supervised learning. This allows them to automatically learn general-purpose embeddings that represent user behavior in a compact and expressive form.

The objective is ambitious: to process trillions of transactions and extract universal user representations that can power a wide range of downstream tasks such as credit modeling, personalization, anomaly detection, and more. By doing so, Nubank aims to unify its modeling efforts, reduce repetitive feature work, and improve predictive performance across the board.

In this article, we look at Nubank’s system design for building and deploying these foundation models. We will trace the complete lifecycle from data representation and model architecture to pretraining, fine-tuning, and integration with traditional tabular systems. 

Overall Architecture of Nubank’s System

Nubank’s foundation model system is designed to handle massive volumes of financial data and extract general-purpose user representations from it. These representations, also known as embeddings, are later used across many business applications such as credit scoring, product recommendation, and fraud detection. 

The architecture is built around a transformer-based foundation model and is structured into several key stages, each with a specific purpose.

1 - Transaction Ingestion

The system starts by collecting raw transaction data for each customer. This includes information like the transaction amount, timestamp, and merchant description. The volume is enormous, covering trillions of transactions across more than 100 million users. 

Each user has a time-ordered sequence of transactions, which is essential for modeling spending behavior over time. In addition to transactions, other user interaction data, such as app events, can also be included.

2 - Embedding Interface

Before transactions can be fed into a transformer model, they need to be converted into a format the model can understand. This is done through a specialized encoding strategy. Rather than converting entire transactions into text, Nubank uses a hybrid method that treats each transaction as a structured sequence of tokens.

Each transaction is broken into smaller elements:

- a token for the sign of the amount (money in or money out),
- a token for the amount, quantized into a discrete bucket,
- tokens for the date (month, day, and weekday),
- the merchant description, split into subword tokens with byte-pair encoding (BPE).

See the diagram below:

This tokenized sequence preserves both the structure and semantics of the original data, and it keeps the input length compact, which is important because attention computation in transformers scales with the square of the input length.
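For illustration (the actual token vocabulary is not public), a R$32.40 Netflix charge might become a handful of tokens along the lines of `[NEG] [AMT_BUCKET] [MONTH] [DAY] [WEEKDAY] net flix`, rather than a long free-text string.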

3 - Transformer Backbone

Once transactions are tokenized, they are passed into a transformer model. Nubank uses several transformer variants for experimentation and performance optimization. 

The models are trained using self-supervised learning, which means they do not require labeled data. Instead, they are trained to solve tasks like:

- Next Token Prediction (NTP): predicting the next token in a user's transaction sequence.
- Masked Language Modeling (MLM): filling in tokens that have been masked out of the sequence.

More on these in later sections. The output of the transformer is a fixed-length user embedding, usually taken from the final token's hidden state.
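As a minimal sketch of this idea, assuming an illustrative vocabulary size, width, and depth (this is not Nubank's code; their architecture details are not public beyond what is described here):

```python
import torch
import torch.nn as nn

class UserEmbedder(nn.Module):
    """Toy causal transformer: tokenized history in, one user embedding out."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), a user's transaction history as tokens
        causal_mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        hidden = self.encoder(self.tok(token_ids), mask=causal_mask)
        return hidden[:, -1, :]  # final token's hidden state = the user embedding
```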

4 - Self-Supervised Training

The model is trained on large-scale, unlabeled transaction data using self-supervised learning objectives. Since no manual labeling is required, the system can leverage the full transaction history for each user. The models learn useful patterns about financial behavior, such as spending cycles, recurring payments, and anomalies, by simply trying to predict missing or future parts of a user’s transaction sequence. As a simplified example, the model might see “Coffee, then Lunch, then…” and try to guess “Dinner”.

The size of the training data and model parameters plays a key role. As the model scales in size and the context window increases, performance improves significantly. By constantly guessing and correcting itself across billions of transactions, the model starts to notice patterns in how people spend money.

For instance, switching from a basic MLM model to a large causal transformer with optimized attention layers resulted in a performance gain of over 7 percentage points in downstream tasks.

5 - Downstream Fine-Tuning and Fusion

After the foundation model is pre-trained, it can be fine-tuned for specific tasks. This involves adding a prediction head on top of the transformer and training it with labeled data. For example, a credit default prediction task would use known labels (whether a customer defaulted) to fine-tune the model.
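A hedged sketch of that step, reusing the illustrative `UserEmbedder` from earlier (the 256-dimensional width, learning rate, and task framing are assumptions):

```python
import torch
import torch.nn as nn

backbone = UserEmbedder(vocab_size=32_000)  # pretrained weights would be loaded here
head = nn.Linear(256, 1)                    # prediction head: embedding -> default logit
optimizer = torch.optim.AdamW(
    [*backbone.parameters(), *head.parameters()], lr=1e-5
)

def fine_tune_step(token_ids: torch.Tensor, defaulted: torch.Tensor) -> float:
    """One supervised step; `defaulted` holds known labels (1 = defaulted)."""
    logits = head(backbone(token_ids)).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, defaulted.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```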

To integrate with existing systems, the user embedding is combined with manually engineered tabular features. This fusion is done in two ways:

- Late fusion: frozen embeddings are concatenated with tabular features and fed into a traditional model.
- Joint fusion: the transformer and the tabular model are trained together, end to end.

More on these in a later section.

6 - Centralized Model Repository

To make this architecture usable across the company, Nubank has built a centralized AI platform. 

This platform stores pretrained foundation models and provides standardized pipelines for fine-tuning them. Internal teams can access these models, combine them with their features, and deploy fine-tuned versions for their specific use cases without needing to retrain everything from scratch.

This centralization accelerates development, reduces redundancy, and ensures that all teams benefit from improvements made to the core models.

Transforming Transactions into Model-Ready Sequences

To train foundation models on transaction data, it is critical to convert each transaction into a format that transformer models can process. 

There are two main challenges when trying to represent transactions for use in transformer models:

- Generalization: merchant descriptions have very high cardinality, and new or rare transactions (the cold-start problem) must still map to something meaningful.
- Efficiency: attention computation scales with the square of the input length, so each transaction must be encoded in as few tokens as possible.

Approach 1: ID-Based Representation (from Recommender Systems)

In this method, each unique transaction is assigned a numerical ID, similar to techniques used in sequential recommendation models. These IDs are then converted into embeddings using a lookup table.

While this approach is simple and efficient in terms of token length, it has two major weaknesses:

- It cannot generalize to unseen transactions: a merchant or amount combination that never appeared during training has no ID to look up, which is exactly the cold-start problem.
- Given the high cardinality of transaction data, the ID vocabulary and its embedding table grow enormous, with most entries backed by very few observations.

Approach 2: Text-Is-All-You-Need

This method treats each transaction as a piece of natural language text. The transaction fields are converted into strings by joining the attribute name and its value, such as "description=NETFLIX amount=32.40 date=2023-05-12", and then tokenized using a standard NLP tokenizer.

This representation can handle arbitrary transaction formats and unseen data, which makes it highly generalizable. 

However, it comes at a high computational cost. 

Transformers process data using self-attention, and the computational cost of attention increases with the square of the input length. Turning structured fields into long text sequences causes unnecessary token inflation, making training slower and less scalable.
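As a rough worked example: if rendering a transaction as text takes three times as many tokens as a structured encoding, the attention cost for a user's history grows by roughly 3² = 9×, since it scales with the square of the total sequence length.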

Approach 3: Hybrid Encoding Scheme (Chosen by Nubank)

To balance generalization and efficiency, Nubank developed a hybrid encoding strategy that preserves the structure of each transaction without converting everything into full text.

Each transaction is tokenized into a compact set of discrete fields:

- Amount sign: a token indicating whether money flowed in or out.
- Amount bucket: the magnitude, quantized into a discrete bucket token.
- Date: separate tokens for the month, day, and weekday.
- Description: the merchant text, tokenized with byte-pair encoding (BPE) so that even unseen merchants decompose into known subwords.

By combining these elements, a transaction is transformed into a short and meaningful sequence of tokens. See the diagram below:

This hybrid approach retains the key structured information in a compact format while allowing generalization to new inputs. It also avoids long text sequences, which keeps attention computation efficient.

Once each transaction is tokenized in this way, the full transaction history of a user can be concatenated into a sequence and used as input to the transformer. Separator tokens are inserted between transactions to preserve boundaries, and the sequence is truncated at a fixed context length to stay within computational limits.
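Here is a small Python sketch of a tokenizer in this spirit. The bucket edges, token spellings, `[SEP]` marker, and 2,048-token context limit are illustrative assumptions, not Nubank's actual vocabulary:

```python
from datetime import date

AMOUNT_BUCKETS = [1, 5, 10, 25, 50, 100, 250, 500, 1000]  # hypothetical edges

def bucket_amount(amount: float) -> str:
    """Quantize the magnitude of an amount into a discrete bucket token."""
    for i, edge in enumerate(AMOUNT_BUCKETS):
        if abs(amount) < edge:
            return f"[AMT_{i}]"
    return f"[AMT_{len(AMOUNT_BUCKETS)}]"

def tokenize_transaction(amount: float, when: date, description: str,
                         bpe_tokenize) -> list[str]:
    """One transaction -> a short structured token sequence."""
    sign = "[POS]" if amount >= 0 else "[NEG]"
    date_tokens = [f"[M_{when.month}]", f"[D_{when.day}]", f"[WD_{when.weekday()}]"]
    return [sign, bucket_amount(amount), *date_tokens, *bpe_tokenize(description)]

def tokenize_history(transactions, bpe_tokenize, max_len: int = 2048) -> list[str]:
    """Concatenate a user's (time-ordered) transactions with separator tokens,
    truncating to a fixed context length."""
    sequence: list[str] = []
    for amount, when, description in transactions:
        sequence += tokenize_transaction(amount, when, description, bpe_tokenize)
        sequence.append("[SEP]")
    return sequence[-max_len:]  # keep the most recent tokens

# A toy stand-in for a real BPE tokenizer, just for the example below:
toy_bpe = lambda text: text.lower().split()
print(tokenize_transaction(-32.40, date(2023, 5, 12), "NETFLIX COM", toy_bpe))
# -> ['[NEG]', '[AMT_4]', '[M_5]', '[D_12]', '[WD_4]', 'netflix', 'com']
```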

Training the Foundation Models

Once transactions are converted into sequences of tokens, the next step is to train a transformer model that can learn patterns from this data. 

As mentioned, the Nubank engineering team uses self-supervised learning to achieve this, which means the model learns directly from the transaction sequences without needing any manual labels. This approach allows the system to take full advantage of the enormous volume of historical transaction data available across millions of users.

Two main objectives are used:

- Next Token Prediction (NTP): a causal objective in which the model reads the sequence up to a point and predicts the next token.
- Masked Language Modeling (MLM): random tokens in the sequence are masked, and the model predicts them from the surrounding context.

See the diagram below:
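In code, the difference between the two is mostly about which positions are predicted. A minimal sketch, assuming `logits` of shape `(batch, seq_len, vocab)` from a transformer forward pass (masking details simplified for illustration):

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits: torch.Tensor,
                               token_ids: torch.Tensor) -> torch.Tensor:
    """NTP: position i predicts token i+1, so targets are shifted left by one."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at 0..n-2
        token_ids[:, 1:].reshape(-1),                 # targets at 1..n-1
    )

def masked_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor,
                   mask_prob: float = 0.15) -> torch.Tensor:
    """MLM: only randomly chosen positions contribute to the loss. (In a real
    pipeline, those positions are replaced by a [MASK] token in the *input*
    before the forward pass; that step is omitted here.)"""
    masked = torch.rand(token_ids.shape) < mask_prob
    targets = token_ids.masked_fill(~masked, -100)    # -100 = ignore in the loss
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```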

Blending Sequential Embeddings with Tabular Data

While foundation models trained on transaction sequences can capture complex behavioral patterns, many financial systems still rely on structured tabular data for critical features. 

For example, information from credit bureaus, user profiles, or application forms often comes in tabular format. To make full use of both sources (sequential embeddings from the transformer and existing tabular features), it is important to combine them in a way that maximizes predictive performance.

This process of combining different data modalities is known as fusion. 

Nubank explored two main fusion strategies: late fusion, which is easier to implement but limited in effectiveness, and joint fusion, which is more powerful and trains all components together in a unified system.

See the diagram below:

Late Fusion (Baseline Approach)

In the late fusion setup, tabular features are combined with the frozen embeddings produced by a pretrained foundation model. Think of it as taking the model's frozen "impression" of a user (the embeddings) and setting it next to a checklist of facts such as age, credit score, and profile details.

These combined inputs are then passed into a traditional machine learning model, such as LightGBM or XGBoost. While this method is simple and leverages well-established tools, it has an important limitation.

Since the foundation model is frozen and trained separately, the embeddings do not adapt to the specific downstream task or interact meaningfully with the tabular data during training. As a result, there is no synergy between the two input sources, and the overall model cannot fully optimize performance.
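A minimal late-fusion sketch with synthetic data (LightGBM and the 256-dimensional embedding width stand in for whatever a given team actually uses):

```python
import numpy as np
import lightgbm as lgb

n_users = 10_000
user_embeddings = np.random.randn(n_users, 256)    # frozen foundation-model output
tabular_features = np.random.randn(n_users, 40)    # e.g. bureau scores, profile fields
labels = np.random.randint(0, 2, size=n_users)     # e.g. defaulted / did not default

# "Fusion" here is just concatenation: embeddings become extra columns.
X = np.hstack([tabular_features, user_embeddings])
model = lgb.LGBMClassifier(n_estimators=200).fit(X, labels)

# The limitation in code form: nothing above ever updates the embeddings,
# so they cannot adapt to this task or interact with the tabular columns.
```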

Joint Fusion (Proposed Method)

To overcome this limitation, Nubank developed a joint fusion architecture. 

This approach trains the transformer and the tabular model together in a single end-to-end system. By doing this, the model can learn to extract information from transaction sequences that complements the structured tabular data, and both components are optimized for the same prediction task.

To implement this, Nubank selected DCNv2 (Deep and Cross Network v2) as the architecture for processing tabular features. DCNv2 is a deep neural network specifically designed to handle structured inputs. It combines deep layers with cross layers that capture interactions between features efficiently.

See the diagram below:
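For intuition, here is a hedged sketch of such a joint-fusion model. The cross layer follows the published DCNv2 formulation, x_{l+1} = x_0 * (W x_l + b) + x_l (elementwise product); the backbone is the illustrative `UserEmbedder` from earlier, and all sizes and the single-logit head are assumptions:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCNv2 cross layer: explicit feature interactions with the input x0."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl

class JointFusionModel(nn.Module):
    def __init__(self, backbone: nn.Module, emb_dim: int = 256,
                 tab_dim: int = 40, n_cross: int = 3):
        super().__init__()
        self.backbone = backbone  # pretrained transformer, NOT frozen here
        dim = emb_dim + tab_dim
        self.cross = nn.ModuleList(CrossLayer(dim) for _ in range(n_cross))
        self.deep = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 64))
        self.head = nn.Linear(dim + 64, 1)  # e.g. a default-probability logit

    def forward(self, token_ids: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        embedding = self.backbone(token_ids)          # (batch, emb_dim)
        x0 = torch.cat([embedding, tabular], dim=-1)  # fuse both modalities
        x = x0
        for layer in self.cross:
            x = layer(x0, x)                          # explicit feature crosses
        return self.head(torch.cat([x, self.deep(x0)], dim=-1)).squeeze(-1)

# Because backbone, cross, deep, and head sit in one graph, one task loss
# back-propagates into all of them: the "joint" in joint fusion.
```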

Conclusion

Nubank’s initiative to use foundation models represents a significant leap forward in how financial institutions can understand and serve their customers. By moving away from manually engineered features and embracing self-supervised learning on raw transaction data, Nubank has built a modeling system that is both scalable and expressive.

A key part of this success lies in how the system has been integrated into Nubank’s broader AI infrastructure. Rather than building isolated models for each use case, Nubank has developed a centralized AI platform where teams can access pretrained foundation models. These models are trained on massive volumes of user transaction data and are stored in a shared model repository.

Teams across the company can choose between two types of models based on their needs:

- Embedding models, which output pretrained user embeddings that can be plugged into existing, often tabular, pipelines.
- Fusion models, which combine the transformer with tabular features end to end and are fine-tuned for a specific task.

This flexibility is critical. 

Some teams may already have strong tabular models in place and can plug in the user embeddings with minimal changes. Others may prefer to rely entirely on the transformer-based sequence model, especially for new tasks where historical tabular features are not yet defined. 

The architecture is also forward-compatible with new data sources. While the current models are primarily trained on transactions, the design allows for the inclusion of other user interaction data, such as app usage patterns, customer support chats, or browsing behavior. 

In short, Nubank’s system is not just a technical proof-of-concept. It is a production-ready solution that delivers measurable gains across core financial prediction tasks.

References:


Shipping late? DevStats shows you why. (Sponsored)

Still pretending your delivery issues are a mystery? They’re not. You’re just not looking in the right place.

DevStats gives engineering leaders brutal clarity on where delivery breaks down, so you can fix the process instead of pointing fingers.

✅ Track DORA and flow metrics like a grown-up

✅ Spot stuck work, burnout risks, and aging issues

✅ Cut cycle time without cutting corners

✅ Ship faster. With fewer surprises.

More AI tools won’t fix your delivery. More Clarity will.

👉 Try DevStats for free


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
