We have developed a naming scheme to uniquely identify all units (amino acids, nucleotides, ligands, atoms, etc.) in any 3D structure from the PDB. This allows unambiguous identification and naming not only of individual components, but also of collections of them, such as loops and helices. A clear naming scheme makes it simpler for researchers to share data and annotations and to provide powerful web services. These IDs are fundamental to our structural annotations.
Each ID is a string of ordered fields separated by vertical bars (‘|’). Below we describe how to create each field.
Unit IDs are based on the data in mmCIF files. Note that we use author-assigned chain and unit number values. Unit IDs may include symmetry operators when these must be applied to generate all relevant coordinates in the structure. An introduction to symmetry operators and biological assemblies is here.
Several fields in the format are optional and take default values when absent. Optional fields are marked ‘(Optional)’. If an optional field is included then all fields must be included, with the exception of symmetry operators.
For the sake of consistency, all case-insensitive fields should be written in uppercase.
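The uppercase convention can be sketched in Python as follows. This is a minimal illustration, not the reference parser: the function name `normalize_case` and the constant `CHAIN_FIELD_INDEX` are hypothetical, and the sketch relies on the field order given in the specification below, where the chain ID (the third field) is the only case-sensitive field.

```python
# Hypothetical helper: position of the chain ID, the only case-sensitive field.
CHAIN_FIELD_INDEX = 2


def normalize_case(unit_id: str) -> str:
    """Uppercase every field of a unit ID except the chain ID."""
    fields = unit_id.split("|")
    return "|".join(
        field if index == CHAIN_FIELD_INDEX else field.upper()
        for index, field in enumerate(fields)
    )
```

For example, `normalize_case("1abc|1|a|u|10")` leaves the chain `a` untouched and uppercases the PDB ID and the nucleotide.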
Unit Identifier Specification
We describe the type and case sensitivity of each field in the list below. In addition, we list the mmCIF item that supplies the data for each field. We also show several examples of the IDs and their interpretation and use at the end.
Unit IDs can also be used to identify atoms. When identifying entire residues, the atom field is left blank.
- PDB ID Code
  - From PDBx/mmCIF item: _entry.id
  - 4 characters, case-insensitive
- Model Number
  - From PDBx/mmCIF item: _atom_site.pdbx_PDB_model_num
  - integer, range 1-99
- Chain ID
  - From PDBx/mmCIF item: _atom_site.auth_asym_id
  - string, case-sensitive
- Residue/Nucleotide/Component Identifier
  - From PDBx/mmCIF item: _atom_site.label_comp_id
  - 1-3 characters, case-insensitive
- Residue/Nucleotide/Component Number
  - From PDBx/mmCIF item: _atom_site.auth_seq_id
  - integer, range: -999..9999 (there are negative residue numbers)
- Atom Name (Optional, default: blank)
  - From PDBx/mmCIF item: _atom_site.label_atom_id
  - 0-4 characters, case-insensitive
  - blank means all atoms
- Alternate ID (Optional, default: blank)
  - From PDBx/mmCIF item: _atom_site.label_alt_id
  - One of ['A', 'B', 'C', '0'], case-insensitive
  - This represents alternate coordinates for the model of one or more atoms
- Insertion Code (Optional, default: blank)
  - From PDBx/mmCIF item: _atom_site.pdbx_PDB_ins_code
  - 1 character, case-insensitive
- Symmetry Operation (Optional, default: 1_555)
  - As defined in PDBx/mmCIF item: _pdbx_struct_oper_list.name
  - 5-6 characters, case-insensitive
  - For viral icosahedral structures, use “P_” + model number instead of symmetry operators. For example, 1A34|1|A|VAL|88|||P_1.
Examples
- Chain A in model 1 of 1ABC: 1ABC|1|A
- Nucleotide U(10) in chain B of 1ABC: 1ABC|1|B|U|10
- Residue ARG 188 in chain B of 6TQA with alternate ID A: 6TQA|1|B|ARG|188||A. View ARG 188 at this link. You can also see atoms with alternate ID B when you click Show neighborhood.
- Nucleotide G 190 in chain A of 1J5E with insertion code G: 1J5E|1|A|G|190|||G. View G190G at this link.
- Nucleotide C(25) in chain D of 1ABC subject to symmetry operation 2_655: 1ABC|1|D|C|25||||2_655
Unit IDs for entire residues can contain 4, 7, or 8 separators (|).
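As a rough illustration of the field order and defaults described above, a unit ID can be split on the vertical bar and mapped to named fields. This is a minimal sketch, not the UnitParser reference implementation: the function name `parse_unit_id`, the dictionary keys, and the choice to return `None` for a missing component are all assumptions made for this example. It assigns fields purely by position and does not treat the “P_” form for viral structures specially.

```python
# Field order as listed in the specification; the key names are hypothetical.
FIELD_NAMES = (
    "pdb", "model", "chain", "component", "number",
    "atom", "alt_id", "insertion_code", "symmetry",
)

# Defaults for the optional fields, taken from the specification.
OPTIONAL_DEFAULTS = {
    "atom": "",
    "alt_id": "",
    "insertion_code": "",
    "symmetry": "1_555",
}


def parse_unit_id(unit_id: str) -> dict:
    """Split a unit ID into named fields, filling in optional defaults."""
    parts = unit_id.split("|")
    if not 3 <= len(parts) <= len(FIELD_NAMES):
        raise ValueError(f"expected 3 to 9 fields, got {len(parts)}: {unit_id!r}")
    # Pad omitted trailing fields with empty strings.
    parts += [""] * (len(FIELD_NAMES) - len(parts))
    fields = dict(zip(FIELD_NAMES, parts))
    # An empty optional field takes its default value.
    for name, default in OPTIONAL_DEFAULTS.items():
        if fields[name] == "":
            fields[name] = default
    fields["model"] = int(fields["model"])
    # Chain-level IDs such as 1ABC|1|A carry no component or number.
    fields["component"] = fields["component"] or None
    fields["number"] = int(fields["number"]) if fields["number"] else None
    return fields
```

With this sketch, `parse_unit_id("1ABC|1|D|C|25||||2_655")["symmetry"]` gives `"2_655"`, while `parse_unit_id("1ABC|1|B|U|10")["symmetry"]` falls back to the default `"1_555"`.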
Atom Identifier Format
To be added later.
Tools for Unit IDs
We have developed some tools to generate and work with these formats.
- UnitParser is a Python module to parse new-style unit IDs. It is the reference parser for the Unit IDs.
- UnitIdTranslation is a Python tool which will generate all new-style IDs for the units in a PDB file given an mmCIF file.
- Translator is a web service to translate between the two ID formats.
Web services that use Unit IDs
Updated: 05/23/2022 10:43PM