Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Using deep learning to annotate the protein universe

Abstract

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Model performance on Pfam-seed.
Fig. 2: ProtCNN architecture.
Fig. 3: Clustered split performance of ProtCNN, ProtENN, TPHMM and ProtREP, which uses the learned representation of sequence space.
Fig. 4: A combination of ProtENN and TPHMM improves performance on the remote homology task.

Similar content being viewed by others

Data availability

The data splits described in this manuscript are available for download at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/random_split and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/clustered_split, and an interactive notebook for data loading is available at https://www.kaggle.com/googleai/pfam-seed-random-split. Model predictions for Pfam-N are freely available to download as part of the Pfam v.34.0 release from http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/.

Code availability

The TensorFlow API, specifically tensorflow-gpu v.1.15.4, was used to implement and train all deep models using the architectures described in the Methods. Code that documents model training using Python v.3.7 is available on GitHub at https://github.com/google-research/google-research/tree/master/using_dl_to_annotate_protein_universe. The training and validation datasets used for creating each model are available as described in the preceding section. Trained models are available in Google Cloud Storage at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/models/single_domain_per_sequence_zipped_models, including the ensembles trained on the Pfam-seed random split, Pfam-seed clustered split, Pfam-full random split (all Pfam v.32.0) and the models used to generate Pfam-N v.34.0. ProtCNN inference was run using a custom Python script that (1) read in FASTA records and (2) ran inference of the ProtCNN as a TensorFlow SavedModel. An interactive notebook that demonstrates inference using ProtCNN is available at https://colab.research.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/neural_network/Neural_network_accuracy_on_random_seed_split.ipynb. An interactive notebook showing use of the trained models to produce Pfam class predictions as well as embeddings is available in GitHub at https://colab.sandbox.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/Using_Deep_Learning_to_Annotate_the_Protein_Universe.ipynb.

References

  1. Steinegger, M. & Söding, J. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

    Article  CAS  Google Scholar 

  2. Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).

    Article  Google Scholar 

  3. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2004).

    Article  Google Scholar 

  4. Biegert, A. & Söding, J. Sequence context-specific profiles for homology searching. Proc. Natl Acad. Sci. USA 106, 3770–3775 (2009).

    Article  CAS  Google Scholar 

  5. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011).

    Article  CAS  Google Scholar 

  6. Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).

    Article  CAS  Google Scholar 

  7. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

    Article  CAS  Google Scholar 

  8. Price, M. N. et al. Mutant phenotypes for thousands of bacterial genes of unknown function. Nature 557, 503–509 (2018).

  9. Chang, Y.-C. et al. COMBREX-DB: an experiment centered database of protein function: knowledge, predictions and knowledge gaps. Nucleic Acids Res. 44, D330–D335 (2015).

    Article  Google Scholar 

  10. UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).

  11. Hou, J., Adhikari, B. & Cheng, J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 34, 1295–1303 (2017).

    Article  Google Scholar 

  12. Kulmanov, M., Khan, M. A. & Hoehndorf, R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34, 660–668 (2017).

    Article  Google Scholar 

  13. Cao, R. et al. ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22, 1732 (2017).

    Article  Google Scholar 

  14. Li, Y. et al. DEEPre: sequence-based enzyme ec number prediction by deep learning. Bioinformatics 34, 760–769 (2017).

    Article  Google Scholar 

  15. Szalkai, B. & Grolmusz, V. Near perfect protein multi-label classification with deep neural networks. Methods 132, 50–56 (2018).

    Article  CAS  Google Scholar 

  16. Zou, Z., Tian, S., Gao, X. & Li, Y. mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front. Genet. 9, 714 (2019).

  17. Schwartz, A. S. et al. Deep semantic protein representation for annotation, discovery, and engineering. Preprint at bioRxiv https://doi.org/10.1101/365965 (2018).

  18. Zhang, D. and Kabuka, M. R. Protein family classification with multi-layer graph convolutional networks. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2390–2393 (IEEE, 2018).

  19. Liu, X. Deep recurrent neural network for protein function prediction from sequence. Preprint at https://arxiv.org/abs/1701.08318 (2017).

  20. Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS ONE 10, e0141287 (2015).

    Article  Google Scholar 

  21. Sinai, S., Kelsic, E., Church, G. M. & Nowak, M. A. Variational auto-encoding of protein sequences. Preprint at https://arxiv.org/abs/1712.03346 (2017).

  22. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

    Article  CAS  Google Scholar 

  23. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  24. Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160 (2021).

    Article  CAS  Google Scholar 

  25. El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2018).

    Article  Google Scholar 

  26. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

    Article  CAS  Google Scholar 

  27. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

    Article  CAS  Google Scholar 

  28. Johnson, L. S., Eddy, S. R. & Portugaly, E. Hidden Markov model speed heuristic and iterative hmm search procedure. BMC Bioinformatics 11, 431 (2010).

    Article  Google Scholar 

  29. Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992).

    Article  CAS  Google Scholar 

  30. Campen, A. et al. TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept. Lett. 15, 956–963 (2008).

    Article  CAS  Google Scholar 

  31. Pace, C. N. & Scholtz, J. M. A helix propensity scale based on experimental studies of peptides and proteins. Biophysical J. 75, 422–427 (1998).

    Article  CAS  Google Scholar 

  32. Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic Acids Res. 34, D247–D251 (2006).

    Article  CAS  Google Scholar 

  33. Bateman, A. What are these new families with 2, 3, 4 endings? Xfam Blog https://xfam.wordpress.com/2012/01/19/what-are-these-new-families-with-_2-_3-_4-endings/ (2012).

  34. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2015).

    Article  Google Scholar 

  35. Bateman, A. Google research team bring deep learning to Pfam. Xfam Blog https://xfam.wordpress.com/2021/03/24/google-research-team-bring-deep-learning-to-pfam/ (2021).

  36. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2014).

  37. Li, Y., Jourdain, A. A., Calvo, S. E., Liu, J. S. & Mootha, V. K. CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets. PLoS Comput. Biol. 13, e1005653 (2017).

    Article  Google Scholar 

  38. Hausrath, A. C., Ramirez, N. A., Ly, A. T. & McEvoy, M. M. The bacterial copper resistance protein CopG contains a cysteine-bridged tetranuclear copper cluster. J. Biol. Chem. 295, 11364–11376 (2020).

    Article  Google Scholar 

  39. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).

  40. L.L. Sonnhammer, E., Eddy, S. R. & Durbin, R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420 (1997).

    Article  Google Scholar 

  41. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  42. Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. Preprint at https://arxiv.org/abs/1511.07122 (2015).

  43. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    Article  CAS  Google Scholar 

  44. El-Gebali, S., Richardson, L. & Finn, R. Repeats in Pfam. EMBL-EBI Training https://doi.org/10.6019/TOL.Pfam_repeats-t.2018.00001.1 (2018).

  45. UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46, 2699 (2018).

Download references

Acknowledgements

We thank J. Smith for countless conversations and guidance throughout this project; E. Bixby for an implementation of ragged tensor processing that sped up our ProtCNN implementation substantially on GPU; C. McClean, B. Alipanahi and S. Kearnes for extensive proofreading and feedback and Z. Nado for programming advice. L.J.C. gratefully acknowledges support from the Simons Foundation.

Author information

Authors and Affiliations

Authors

Contributions

M.L.B., D.B., M.A.D. and L.J.C. conceived the study. All authors designed, implemented and used machine learning models to annotate protein domain sequences, analyzed the data and developed the approach used for Pfam-N. M.L.B., D.B. and L.J.C. wrote the paper, with input from all authors.

Corresponding authors

Correspondence to Maxwell L. Bileschi or Lucy J. Colwell.

Ethics declarations

Competing interests

M.L.B., D.B., D.H.B., T.S., B.C., D.S., M.A.D. and L.J.C. performed research as part of their employment at Google LLC. Google is a technology company that sells machine learning services as part of its business. Portions of this work are covered by US patent WO2020210591A1, filed by Google.

Peer review

Peer review information

Nature Biotechnology thanks Christian Dallago for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13, Methods and Tables 1–19.

Reporting Summary.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Bileschi, M.L., Belanger, D., Bryant, D.H. et al. Using deep learning to annotate the protein universe. Nat Biotechnol 40, 932–937 (2022). https://doi.org/10.1038/s41587-021-01179-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41587-021-01179-w

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing