Proteins are central to our existence. Among other roles, they replicate DNA, respond to stimuli, catalyze reactions, and transport molecules. A protein is determined by its sequence of constituent amino acids, which folds into a specific 3D structure that determines much of its bioactivity. A key objective of computational protein design is to automate the invention of new protein molecules with specified structural and functional properties. Efforts in this direction have led to the design of new 3D folds, enzymes, and complexes. However, current methodology demands significant effort, as it relies on hand-crafted energy functions that describe the underlying physics and on efficient sampling algorithms.
We instead introduce a top-down framework and learn a generative model for protein sequences conditioned on a specified target structure. Going significantly beyond previous approaches, we encode the 3D structure as a graph and build structured self-attention, which allows our model to capture higher-order dependencies between structure and sequence. We are thus able to model long-range dependencies in sequence space that are localized in 3D space, which allows us to generalize to sequences of protein folds absent from the training data. Besides being more accurate than Rosetta fixbb (a state-of-the-art framework for computational protein design) at recovering native sequences, our model is significantly (~20,000 times) faster than Rosetta. Our work thus opens up the possibility of efficiently designing and engineering new proteins to solve problems in fields such as biomedicine, energy, and materials science.
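To make the architectural idea concrete, the sketch below illustrates the core mechanism in miniature: residues are connected in a k-nearest-neighbor graph derived from their 3D coordinates, and an attention layer lets each residue attend only to its spatial neighbors, so residues far apart in the sequence but close in space interact directly. This is a minimal, hypothetical illustration in NumPy; the function names, the identity query/key projections, and the toy dimensions are our own simplifications, not the paper's implementation.

```python
import numpy as np

def knn_graph(coords, k=3):
    """Build a k-nearest-neighbor graph from 3D residue coordinates.

    coords: (N, 3) array. Returns an (N, k) array of neighbor indices."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a residue is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def structured_self_attention(h, neighbors):
    """One attention layer restricted to spatial neighbors.

    h: (N, D) residue features; neighbors: (N, k) index array.
    Each residue attends only to its k nearest neighbors in 3D, so
    sequence-distant but spatially adjacent residues exchange information.
    Projections are omitted (identity) to keep the sketch short."""
    N, D = h.shape
    k_feats = h[neighbors]                              # (N, k, D)
    scores = np.einsum('nd,nkd->nk', h, k_feats) / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over neighbors
    return np.einsum('nk,nkd->nd', w, k_feats)          # weighted neighbor mix

# Toy example: 5 residues evenly spaced on a line, 4-dim features.
rng = np.random.default_rng(0)
coords = np.arange(5, dtype=float)[:, None] * np.array([1.0, 0.0, 0.0])
h = rng.normal(size=(5, 4))
nbrs = knn_graph(coords, k=2)
out = structured_self_attention(h, nbrs)
print(out.shape)  # (5, 4): one updated feature vector per residue
```

In the full model, stacked layers of such structure-conditioned attention feed an autoregressive decoder over amino acids; the sparsity of the graph is what keeps long-range (in sequence) but local (in space) dependencies cheap to model.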