MarkTechPost@AI, December 18, 2024
ProteinZen: An All-Atom Protein Structure Generation Method Using Machine Learning

 

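ProteinZen is a novel two-stage generative framework that combines flow matching with latent-space modeling to generate precise all-atom protein structures. It first uses SE(3) flow matching to build protein backbone frames while generating a latent representation for each residue. A hybrid VAE-MLM model then converts these latent representations into atomic-level structures, predicting sidechain torsion angles and sequence identities. By separating amino acid identity from atomic positions, ProteinZen attains atomic-level precision while preserving diversity and computational efficiency. The model is trained on the AFDB512 dataset and performs well on sequence-structure consistency, setting a new standard for generative protein modeling.

🧬 ProteinZen uses a two-stage generative framework. The first stage applies SE(3) flow matching to generate the protein backbone and, at the same time, a latent representation for each residue, which decouples atomic positions from amino acid identities.

⚛️ In the second stage, a hybrid VAE-MLM model decodes the latent representations into atomic-level structures, predicting sidechain torsion angles and sequence information, and uses passthrough losses to improve the accuracy of the generated structures.

📊 ProteinZen reaches 46% sequence-structure consistency, outperforming existing models while maintaining high structural and sequence diversity. The generated structures align with known fold space and show generalization toward novel folds.

⚙️ The model uses Tensor-Field Networks (TFN) for encoding and modified IPMP layers for decoding, which ensures SE(3) equivariance and computational efficiency. Training data comes from the carefully constructed AFDB512 dataset.

🚀 ProteinZen's approach sets a new standard for all-atom protein generation. Future research will focus on long-range structural modeling, the interaction between the latent space and the decoder, and conditional protein design tasks.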

Generating all-atom protein structures is a significant challenge in de novo protein design. Generative models have improved markedly at backbone generation, but atomic precision remains difficult because discrete amino acid identities are entangled with the continuous placement of atoms in 3D space. The problem matters most when designing functional proteins such as enzymes and molecular binders, where even minor inaccuracies at the atomic scale can prevent practical use. Overcoming it requires a strategy that handles both the discrete and the continuous aspects while preserving precision and computational efficiency.

Current models such as RFDiffusion and Chroma concentrate mainly on backbone configurations and offer limited atomic resolution. Extensions such as RFDiffusion-AA and LigandMPNN attempt to capture atomic-level detail but cannot represent all-atom configurations exhaustively. Superposition-based methods like Protpardelle and Pallatom model atomic structures more directly but suffer from high computational costs and have difficulty handling discrete-continuous interactions. These approaches also struggle to balance sequence-structure consistency against diversity, which makes them less useful for realistic applications that require exact protein design.

Researchers from UC Berkeley and UCSF introduce ProteinZen, a two-stage generative framework that combines flow matching for backbone frames with latent space modeling to achieve precise all-atom protein generation. In the first stage, ProteinZen uses flow matching to construct protein backbone frames in SE(3) space while concurrently generating a latent representation for each residue. This abstraction avoids direct entanglement between atomic positions and amino acid identities, which streamlines the generation process. In the second stage, a hybrid VAE-MLM model decodes the latent representations into atomic-level structures, predicting sidechain torsion angles as well as sequence identities. Passthrough losses improve the alignment of the generated structures with actual atomic properties, increasing accuracy and consistency. The framework addresses the limitations of existing approaches by achieving atomic-level accuracy without sacrificing diversity or computational efficiency.

ProteinZen employs SE(3) flow matching for backbone frame generation and Euclidean flow matching for latent features, minimizing losses for rotation, translation, and latent representation prediction. A hybrid VAE-MLM autoencoder encodes atomic details into latent variables and decodes them into a sequence and atomic configurations. The architecture uses Tensor-Field Networks (TFN) for encoding and modified IPMP layers for decoding, ensuring SE(3) equivariance and computational efficiency. The model is trained on the AFDB512 dataset, which was built by combining PDB-clustered monomers with representatives from the AlphaFold Database, covering proteins of up to 512 residues. Training uses a mix of real and synthetic data to improve generalization.
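To make the Euclidean flow matching step on latent features more concrete, here is a minimal sketch in plain Python. It is not ProteinZen's implementation: the article does not state which probability path or parameterization the authors use, so the sketch assumes the common straight-line path from Gaussian noise to data, and the function names (`sample_training_pair`, `flow_matching_loss`, `euler_sample`) are hypothetical. The SE(3) part for backbone frames is not covered here.

```python
import random

def sample_training_pair(x1, rng):
    """Build one flow matching training example for a latent vector x1.

    Assumption (not from the article): a straight-line path
    x_t = (1 - t) * x0 + t * x1 from Gaussian noise x0 to data x1,
    whose target velocity is the constant x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # noise sample
    t = rng.random()                              # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return t, xt, target_v

def flow_matching_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(target_v)

def euler_sample(velocity_fn, x0, n_steps=100):
    """Integrate dx/dt = velocity_fn(t, x) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity_fn(t, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would take `(t, xt)` and be fit to `target_v` by minimizing `flow_matching_loss`. At generation time, `euler_sample` would be called with the trained network as `velocity_fn` and a fresh noise vector as `x0`.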

ProteinZen achieves a sequence-structure consistency (SSC) of 46%, outperforming existing models while maintaining high structural and sequence diversity. It balances accuracy with novelty, producing protein structures that are both diverse and novel at competitive precision. Performance analysis indicates that ProteinZen works well on shorter proteins and leaves room for improvement in long-range modeling. The generated samples span a variety of secondary structures, with a slight bias toward alpha-helices. Structural evaluation confirms that most generated proteins align with known fold space while showing generalization toward novel folds. Together, the results show that ProteinZen can produce accurate and diverse all-atom protein structures, a significant advance over existing generative approaches.

In conclusion, ProteinZen introduces a method for generating all-atom proteins that integrates SE(3) flow matching for backbone synthesis with latent flow matching for reconstructing atomic structures. By separating discrete amino acid identities from the continuous positioning of atoms, the technique attains atomic-level precision while preserving diversity and computational efficiency. With a sequence-structure consistency of 46% and demonstrated structural novelty, ProteinZen sets a new standard for generative protein modeling. Future work will include improving long-range structural modeling, refining the interaction between the latent space and the decoder, and exploring conditional protein design tasks. The work is a meaningful step toward precise, efficient, and practical all-atom protein design.


All credit for this research goes to the researchers of this project.


The post ProteinZen: An All-Atom Protein Structure Generation Method Using Machine Learning appeared first on MarkTechPost.
