SUPPLEMENTARY INFORMATION
Dataset for assessment of designed sequences
The gzipped tar archives of the four databases used to assess the applicability of designed sequences in remote homology detection can be downloaded from the links given below:
(1) CONTROL database
This sequence database contains protein sequences of known structure (SCOP 1.75) and their homologues obtained from a non-redundant sequence database (4,694,921 sequences).
(2) AUGMENTED database
This database contains the protein sequences of the CONTROL database, augmented with computationally designed intermediate sequences (4,694,921 natural sequences + 3,611,010 designed intermediate sequences).
Designed intermediate sequences are annotated as "Int", and the annotation line also gives the two parent SCOP domain families, separated by pipes. For example: "Int_1|a.1.1.1_1|a.1.1.2_1". Here "Int" denotes a designed intermediate sequence, and a.1.1.1 and a.1.1.2 are the two parent SCOP domain families between which this sequence was designed.
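The annotation format described above can be split into its parts with a few lines of Python. This is a minimal sketch, not part of the distributed data: the function name parse_designed_header and the DesignedHeader type are hypothetical, the optional leading ">" assumes FASTA-style headers, and the trailing "_1" on each parent field is treated as an opaque suffix because its meaning is not stated here.

```python
from typing import NamedTuple, Optional


class DesignedHeader(NamedTuple):
    """Fields recovered from a designed-sequence annotation (hypothetical type)."""
    seq_id: str    # e.g. "Int_1"
    parent_a: str  # first parent SCOP family, e.g. "a.1.1.1"
    parent_b: str  # second parent SCOP family, e.g. "a.1.1.2"


def parse_designed_header(header: str) -> Optional[DesignedHeader]:
    """Return the parsed fields, or None if this is not a designed sequence."""
    # Assumption: headers may carry a FASTA-style leading ">".
    text = header.lstrip(">").strip()
    fields = text.split("|")
    if len(fields) != 3 or not fields[0].startswith("Int"):
        return None
    # Drop the trailing "_<suffix>" to recover the SCOP family identifier.
    # The suffix is treated as opaque; its meaning is not assumed.
    parents = [field.rsplit("_", 1)[0] for field in fields[1:]]
    return DesignedHeader(fields[0], parents[0], parents[1])


if __name__ == "__main__":
    print(parse_designed_header("Int_1|a.1.1.1_1|a.1.1.2_1"))
```

Returning None for any header that does not start with "Int" lets the same function separate natural from designed sequences when scanning the AUGMENTED or Seq_AUGMENTED database.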
(3) Seq_CONTROL database
This database contains non-redundant protein sequences from 14,831 Pfam families, downloaded from the Pfam FTP site (10,626,097 sequences).
(4) Seq_AUGMENTED database
This database contains the protein sequences of the Seq_CONTROL database and computationally designed intermediate sequences generated using multiple PSSMs of SCOP domain families (10,626,097 sequences + 3,611,010 designed intermediate sequences).
A detailed description of these databases can be found in the following paper:
Filling-in void and sparse regions in protein sequence space by protein-like artificial sequences enables remarkable enhancement in remote homology detection capability.
R Mudgal, R Sowdhamini, N Chandra, N Srinivasan and S Sandhya
Journal of Molecular Biology (under revision).