AWS Machine Learning Blog | July 3, 2024
The future of productivity agents with NinjaTech AI and AWS Trainium

This is a guest post by Arash Sadrieh, Tahir Azim, and Tengfei Xue from NinjaTech AI.

NinjaTech AI’s mission is to make everyone more productive by taking care of time-consuming complex tasks with fast and affordable artificial intelligence (AI) agents. We recently launched MyNinja.ai, one of the world’s first multi-agent personal AI assistants, to drive towards our mission. MyNinja.ai is built from the ground up using specialized agents that are capable of completing tasks on your behalf, including scheduling meetings, conducting deep research from the web, generating code, and helping with writing. These agents can break down complicated, multi-step tasks into branched solutions, and are capable of evaluating the generated solutions dynamically while continually learning from past experiences. All of these tasks are accomplished in a fully autonomous and asynchronous manner, freeing you up to continue your day while Ninja works on these tasks in the background, and engaging when your input is required.

Because no single large language model (LLM) is perfect for every task, we knew that building a personal AI assistant would require multiple LLMs optimized specifically for a variety of tasks. In order to deliver the accuracy and capabilities to delight our users, we also knew that we would require these multiple models to work together in tandem. Finally, we needed scalable and cost-effective methods for training these various models—an undertaking that has historically been costly to pursue for most startups. In this post, we describe how we built our cutting-edge productivity agent NinjaLLM, the backbone of MyNinja.ai, using AWS Trainium chips.

Building a dataset

We recognized early that to deliver on the mission of tackling tasks on a user’s behalf, we needed multiple models optimized for specific tasks. Examples include our Deep Researcher, Deep Coder, and Advisor models. After testing the available open source models, we felt that their out-of-the-box capabilities and responses were insufficient, even with prompt engineering, to meet our needs. Specifically, we wanted to make sure each model was optimized for a ReAct/chain-of-thought style of prompting. Additionally, we wanted to make sure that, when deployed as part of a Retrieval Augmented Generation (RAG) system, each model would accurately cite its sources and be biased towards saying “I don’t know” rather than generating false answers. For these reasons, we chose to fine-tune the models for their various downstream tasks.
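To make the prompting style concrete, the following is a minimal sketch of a ReAct-style RAG prompt with explicit citation and “I don’t know” instructions. The template wording and helper names are illustrative placeholders, not our production prompt.

```python
# Hypothetical ReAct-style RAG prompt template; wording is illustrative only.
REACT_RAG_TEMPLATE = """You are a research assistant. Answer using ONLY the
numbered sources below. Cite every claim as [n]. If the sources do not
contain the answer, say "I don't know" instead of guessing.

Sources:
{sources}

Question: {question}

Thought: reason step by step about which sources are relevant.
Answer: the final answer, with [n] citations."""

def build_prompt(question: str, passages: list[str]) -> str:
    # Number the retrieved passages so the model can cite them as [1], [2], ...
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return REACT_RAG_TEMPLATE.format(sources=sources, question=question)
```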

In constructing our training dataset, our goal was twofold: adapt each model to its intended downstream task and persona (Researcher, Advisor, Coder, and so on), and adapt the models to follow a specific output structure. To that end, we followed the LIMA approach for fine-tuning: a diverse but relatively small training set of roughly 20 million tokens, focused on the format and tone of the output. To construct our supervised fine-tuning dataset, we began by creating initial seed tasks for each model. With these seed tasks, we generated an initial synthetic dataset using Meta’s Llama 2 model and used it to perform an initial round of fine-tuning. To evaluate the performance of this fine-tuned model, we crowd-sourced user feedback and used it to iteratively create more samples. We also used a series of benchmarks, internal and public, to assess model performance and continued to iterate.
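The following is a minimal sketch of this seed-task-to-synthetic-data loop. The seed tasks and the `complete()` helper are illustrative placeholders; in practice `complete()` would wrap calls to the teacher model (Llama 2 in our case).

```python
import json
import random

# Hypothetical seed tasks for two personas; real seed sets would be curated.
SEED_TASKS = [
    {"persona": "Researcher", "instruction": "Summarize the key findings of the attached paper."},
    {"persona": "Coder", "instruction": "Write a Python function that deduplicates a list."},
]

def complete(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. a hosted Llama 2 endpoint).
    Returns a canned string here so the sketch runs end to end."""
    return f"<model output for: {prompt[:40]}...>"

def synthesize(seed_tasks, n_samples=1000):
    samples = []
    for _ in range(n_samples):
        seed = random.choice(seed_tasks)
        # Ask the teacher model for a new task in the same style, then for a
        # response formatted the way the fine-tuned model should answer.
        instruction = complete(f"Write a new task similar to: {seed['instruction']}")
        response = complete(f"As a {seed['persona']}, respond to: {instruction}")
        samples.append({"persona": seed["persona"],
                        "instruction": instruction,
                        "response": response})
    return samples

if __name__ == "__main__":
    with open("synthetic_sft.jsonl", "w") as f:
        for s in synthesize(SEED_TASKS, n_samples=10):
            f.write(json.dumps(s) + "\n")
```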

Fine-tuning on Trainium

We elected to start with the Llama models as our pre-trained base for several reasons: most notably their strong out-of-the-box performance, broad ecosystem support from various libraries, and a truly open source and permissive license. At the time, we began with Llama 2, testing across the various sizes (7B, 13B, and 70B). For training, we chose a cluster of 32 trn1.32xlarge instances to take advantage of Trainium chips and to efficiently parallelize the training, with AWS ParallelCluster managing cluster orchestration. With this cluster, each fine-tuning iteration took less than 3 hours at a cost of less than $1,000. This quick iteration time and low cost allowed us to rapidly tune and test our models and improve their accuracy. To achieve the accuracies discussed in the following sections, we only had to spend around $30,000, saving hundreds of thousands, if not millions, of dollars compared to training on traditional accelerators.
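On Trn1 instances, training runs through the torch-neuronx XLA backend, so a fine-tuning loop follows standard PyTorch/XLA idioms. The following single-worker sketch illustrates the pattern; the model identifier, data loading, and hyperparameters are placeholders rather than our actual pipeline.

```python
import torch
# torch-neuronx exposes Trainium NeuronCores as XLA devices, so the loop
# below uses standard torch_xla idioms.
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder base model

def fine_tune(dataloader, epochs=1, lr=1e-5):
    device = xm.xla_device()  # a NeuronCore on trn1 instances
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without one
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for batch in dataloader:  # batch: {"text": list[str]}
            optimizer.zero_grad()
            inputs = tokenizer(batch["text"], return_tensors="pt",
                               truncation=True, padding=True).to(device)
            loss = model(**inputs, labels=inputs["input_ids"]).loss
            loss.backward()
            # Steps the optimizer and (with barrier=True) marks the XLA
            # step so the accumulated graph actually executes.
            xm.optimizer_step(optimizer, barrier=True)
    return model
```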

The following diagram illustrates our training architecture.

After we had established our fine-tuning pipeline on top of Trainium, the Neuron Distributed training libraries let us refine our models quickly. This proved exceptionally useful and timely, because Meta’s Llama 3 models were released in the lead-up to the launch of MyNinja.ai. Llama 3 and Llama 2 share a similar architecture, so we were able to rapidly upgrade to the newer model. This switching velocity let us take advantage of the newer model’s inherent gains in accuracy, quickly run another round of fine-tuning with the Llama 3 weights, and prepare for launch.
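Because both generations load as the same “llama” architecture in common frameworks, the checkpoint swap can be as small as changing a model identifier. The following sketch illustrates this with Hugging Face Transformers; the gated model identifiers are examples and require access approval.

```python
from transformers import AutoModelForCausalLM

# Llama 2 and Llama 3 are both decoder-only "llama"-type models in
# transformers, so the same fine-tuning entrypoint can load either
# checkpoint unchanged; only the identifier differs.
for model_id in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    print(model_id, type(model).__name__, model.config.model_type)
```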

Model evaluation

For evaluating the model, there were two objectives: evaluate the model’s ability to answer user questions, and evaluate the system’s ability to answer questions with provided sources, because this is our personal AI assistant’s primary interface. We selected the HotPotQA and Natural Questions (NQ) Open datasets, both of which are a good fit because they are open benchmark datasets with public leaderboards.

We calculated accuracy by matching the model’s answer to the expected answer, using the top 10 passages retrieved from a Wikipedia corpus. We performed content filtering and ranking using ColBERTv2, a BERT-based retrieval model. We achieved accuracies of 62.22% on the NQ Open dataset and 58.84% on HotPotQA by using our enhanced Llama 3 RAG model, demonstrating notable improvements over other baseline models. The following figure summarizes our results.
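The following is a minimal sketch of this style of RAG evaluation loop. The `retrieve` and `answer` callables are placeholders (for example, a ColBERTv2 index over Wikipedia and the fine-tuned model), and the answer normalization follows the common SQuAD-style exact-match convention rather than our exact scoring code.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return normalize(prediction) in {normalize(g) for g in gold_answers}

def evaluate(dataset, retrieve, answer, k=10):
    """dataset: iterable of {"question": str, "answers": [str]}.
    retrieve(question, k) -> top-k passages (e.g. ColBERTv2 over Wikipedia);
    answer(question, passages) -> str (the fine-tuned RAG model)."""
    correct, total = 0, 0
    for ex in dataset:
        passages = retrieve(ex["question"], k)
        pred = answer(ex["question"], passages)
        correct += exact_match(pred, ex["answers"])
        total += 1
    return correct / max(total, 1)
```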

Future work

Looking ahead, we’re working on several developments to continue improving our model’s performance and user experience. First, we intend to use ORPO to fine-tune our models. ORPO combines traditional fine-tuning with preference alignment, while using a single preference alignment dataset for both. We believe this will allow us to better align models to achieve better results for users.
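One possible route is Hugging Face TRL, which added an ORPOTrainer in 2024. The following sketch shows the shape of such a run; the model identifier, dataset path, and hyperparameters are placeholders, and this is not necessarily our production stack. ORPO consumes a single preference dataset with prompt, chosen, and rejected columns.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # placeholder base model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# ORPO uses one preference dataset with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

config = ORPOConfig(
    output_dir="orpo-out",
    beta=0.1,                       # weight of the odds-ratio preference term
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```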

Additionally, we intend to build a custom ensemble model from the various models we have fine-tuned thus far. Inspired by Mixture of Experts (MoE) model architectures, we intend to introduce a routing layer to our various models. We believe this will radically simplify our model serving and scaling architecture, while maintaining the quality across the various tasks that our users have come to expect from our personal AI assistant.
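The following is a minimal sketch of what such a routing layer could look like: a lightweight classifier dispatches each request to one of the fine-tuned specialists. The categories and the keyword-based `classify()` are illustrative placeholders; a production router would more likely be a small learned model.

```python
from typing import Callable

# Placeholder specialists; in practice these would call the fine-tuned
# Deep Researcher, Deep Coder, and Advisor model endpoints.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "research": lambda q: f"[Deep Researcher] {q}",
    "code": lambda q: f"[Deep Coder] {q}",
    "advice": lambda q: f"[Advisor] {q}",
}

def classify(query: str) -> str:
    """Toy router; a real one could be a small fine-tuned classifier
    or an embedding-similarity lookup."""
    lowered = query.lower()
    if any(w in lowered for w in ("code", "function", "bug")):
        return "code"
    if any(w in lowered for w in ("research", "find", "sources")):
        return "research"
    return "advice"

def route(query: str) -> str:
    expert = SPECIALISTS[classify(query)]
    return expert(query)

print(route("Write a function to parse dates"))
```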

Conclusion

Building next-gen AI agents to make everyone more productive is NinjaTech AI’s pathway to achieving its mission. To democratize access to this transformative technology, it is critical to have access to high-powered compute, open source models, and an ecosystem of tools that make training each new agent affordable and fast. AWS’s purpose-built AI chips, access to the top open source models, and its training architecture make this possible.

To learn more about how we built NinjaTech AI’s multi-agent personal AI, you can read our whitepaper. You can also try these AI agents for free at MyNinja.ai.


About the authors

Arash Sadrieh is the Co-Founder and Chief Science Officer at Ninjatech.ai. Arash co-founded Ninjatech.ai with a vision to make everyone more productive by taking care of time-consuming tasks with AI agents. This vision was shaped during his tenure as a Senior Applied Scientist at AWS, where he drove key research initiatives that significantly improved infrastructure efficiency over six years, earning him multiple patents for optimizing core infrastructure. His academic background includes a PhD in computer modeling and simulation, including collaborations with esteemed institutions such as Oxford University, Sydney University, and CSIRO. Prior to industry, Arash completed postdoctoral research marked by publications in high-impact journals, including Nature Communications.

Tahir Azim is a Staff Software Engineer at NinjaTech. Tahir focuses on NinjaTech’s Inf2- and Trn1-based training and inference platforms, its unified gateway for accessing these platforms, and its RAG-based research skill. He previously worked at Amazon as a senior software engineer, building data-driven systems for optimal utilization of Amazon’s global Internet edge infrastructure, driving down cost, congestion and latency. Before moving to industry, Tahir earned an M.S. and Ph.D. in Computer Science from Stanford University, taught for three years as an assistant professor at NUST (Pakistan), and did a post-doc in fast data analytics systems at EPFL. Tahir has authored several publications presented at top-tier conferences such as VLDB, USENIX ATC, MobiCom and MobiHoc.

Tengfei Xue is an Applied Scientist at NinjaTech AI. His current research interests include natural language processing and multimodal learning, particularly using large language models and large multimodal models. Tengfei completed his PhD studies at the School of Computer Science, University of Sydney, where he focused on deep learning for healthcare using various modalities. He was also a visiting PhD candidate at the Laboratory of Mathematics in Imaging (LMI) at Harvard University, where he worked on 3D computer vision for complex geometric data.
