Using deep learning to annotate the protein universe

Bileschi, Maxwell L.; Belanger, David; Bryant, Drew H.; Sanderson, Theo; Carter, Brandon; Sculley, D.; Bateman, Alex; DePristo, Mark A.; Colwell, Lucy J.

doi:10.1038/s41587-021-01179-w

Article
Published: 21 February 2022

Nature Biotechnology volume 40, pages 932–937 (2022)

Abstract

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.

**Fig. 1: Model performance on Pfam-seed.**

**Fig. 3: Clustered split performance of ProtCNN, ProtENN, TPHMM and ProtREP, which uses the learned representation of sequence space.**

**Fig. 4: A combination of ProtENN and TPHMM improves performance on the remote homology task.**

Data availability

The data splits described in this manuscript are available for download at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/random_split and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/clustered_split, and an interactive notebook for data loading is available at https://www.kaggle.com/googleai/pfam-seed-random-split. Model predictions for Pfam-N are freely available to download as part of the Pfam v.34.0 release from http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/.
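
The two data-split links above are Cloud Console browser pages rather than direct download paths. As a minimal sketch, assuming the usual Console URL layout of `/storage/browser/<bucket>/<path>`, the helper below derives the matching `gs://` locations and prints `gsutil` copy commands. The function name `console_url_to_gs_uri` and the local target directories are illustrative choices, not part of the published material, and the sketch does not check whether the bucket is still publicly readable.

```python
from urllib.parse import urlparse

# Path prefix used by Cloud Console bucket-browser URLs (assumed layout).
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url: str) -> str:
    """Convert a Cloud Console bucket-browser URL into a gs:// URI."""
    parsed = urlparse(url)
    if parsed.netloc != "console.cloud.google.com":
        raise ValueError(f"not a Cloud Console URL: {url}")
    if not parsed.path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {url}")
    # Everything after the prefix is "<bucket>/<object path>".
    return "gs://" + parsed.path[len(CONSOLE_PREFIX):].rstrip("/")


# URLs copied from the data availability statement.
SPLIT_URLS = {
    "random_split": "https://console.cloud.google.com/storage/browser/"
                    "brain-genomics-public/research/proteins/pfam/random_split",
    "clustered_split": "https://console.cloud.google.com/storage/browser/"
                       "brain-genomics-public/research/proteins/pfam/clustered_split",
}

GS_URIS = {name: console_url_to_gs_uri(url) for name, url in SPLIT_URLS.items()}

for name, uri in GS_URIS.items():
    # Hypothetical local target directory; adjust as needed.
    print(f"gsutil -m cp -r {uri} ./pfam_{name}")
```

The printed commands can be pasted into a shell that has the Google Cloud SDK installed.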

Code availability

The TensorFlow API, specifically tensorflow-gpu v.1.15.4, was used to implement and train all deep models using the architectures described in the Methods. Code that documents model training using Python v.3.7 is available on GitHub at https://github.com/google-research/google-research/tree/master/using_dl_to_annotate_protein_universe. The training and validation datasets used for creating each model are available as described in the preceding section. Trained models are available in Google Cloud Storage at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/models/single_domain_per_sequence_zipped_models, including the ensembles trained on the Pfam-seed random split, Pfam-seed clustered split, Pfam-full random split (all Pfam v.32.0) and the models used to generate Pfam-N v.34.0. ProtCNN inference was run using a custom Python script that (1) read in FASTA records and (2) ran inference with ProtCNN loaded as a TensorFlow SavedModel. An interactive notebook that demonstrates inference using ProtCNN is available at https://colab.research.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/neural_network/Neural_network_accuracy_on_random_seed_split.ipynb. An interactive notebook showing use of the trained models to produce Pfam class predictions as well as embeddings is available from GitHub via Colab at https://colab.sandbox.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/Using_Deep_Learning_to_Annotate_the_Protein_Universe.ipynb.
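
The two-step inference script mentioned above (read FASTA records, then run the model) can be outlined as follows. This is only a sketch of the general pattern, not the authors' script: `read_fasta`, `annotate` and `placeholder_predict` are hypothetical names, and the model call is injected as a plain Python callable because the SavedModel's input encoding and signature are not described here. In practice `predict_fn` would wrap the loaded ProtCNN SavedModel.

```python
import io
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence name, amino acid sequence)


def read_fasta(handle: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the identifier, dropping any free-text description.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def annotate(records: Iterable[Record],
             predict_fn: Callable[[List[str]], List[str]],
             batch_size: int = 32) -> List[Tuple[str, str]]:
    """Run predict_fn over sequences in batches; return (name, label) pairs."""
    results: List[Tuple[str, str]] = []
    batch: List[Record] = []

    def flush() -> None:
        labels = predict_fn([seq for _, seq in batch])
        results.extend(zip((n for n, _ in batch), labels))
        batch.clear()

    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            flush()
    if batch:
        flush()
    return results


def placeholder_predict(sequences: List[str]) -> List[str]:
    """Stand-in for the real model: returns a dummy label per sequence."""
    return [f"unlabeled(len={len(s)})" for s in sequences]


# Tiny in-memory FASTA file used only for demonstration.
DEMO_FASTA = ">seq1 example protein\nMKV\nLAA\n>seq2\nGGH\n"
records = list(read_fasta(io.StringIO(DEMO_FASTA)))
results = annotate(records, placeholder_predict, batch_size=1)
print(results)
```

Passing the model as a callable keeps the FASTA handling and batching independent of the TensorFlow version used to load the trained network.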

Acknowledgements

We thank J. Smith for countless conversations and guidance throughout this project; E. Bixby for an implementation of ragged tensor processing that sped up our ProtCNN implementation substantially on GPU; C. McClean, B. Alipanahi and S. Kearnes for extensive proofreading and feedback and Z. Nado for programming advice. L.J.C. gratefully acknowledges support from the Simons Foundation.

Author information

Authors and Affiliations

Google Research, Cambridge, MA, USA
Maxwell L. Bileschi, David Belanger, Drew H. Bryant, Theo Sanderson, D. Sculley, Mark A. DePristo & Lucy J. Colwell
The Francis Crick Institute, London, UK
Theo Sanderson
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
Brandon Carter
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
Alex Bateman
BigHat Biosciences, San Mateo, CA, USA
Mark A. DePristo
Department of Chemistry, University of Cambridge, Cambridge, UK
Lucy J. Colwell

Contributions

M.L.B., D.B., M.A.D. and L.J.C. conceived the study. All authors designed, implemented and used machine learning models to annotate protein domain sequences, analyzed the data and developed the approach used for Pfam-N. M.L.B., D.B. and L.J.C. wrote the paper, with input from all authors.

Corresponding authors

Correspondence to Maxwell L. Bileschi or Lucy J. Colwell.

Ethics declarations

Competing interests

M.L.B., D.B., D.H.B., T.S., B.C., D.S., M.A.D. and L.J.C. performed research as part of their employment at Google LLC. Google is a technology company that sells machine learning services as part of its business. Portions of this work are covered by US patent WO2020210591A1, filed by Google.

Peer review

Peer review information

Nature Biotechnology thanks Christian Dallago for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13, Methods and Tables 1–19.

Reporting Summary.

About this article

Cite this article

Bileschi, M.L., Belanger, D., Bryant, D.H. et al. Using deep learning to annotate the protein universe. Nat Biotechnol 40, 932–937 (2022). https://doi.org/10.1038/s41587-021-01179-w

Received: 06 May 2021
Accepted: 02 December 2021
Published: 21 February 2022
Issue Date: June 2022
DOI: https://doi.org/10.1038/s41587-021-01179-w
