Unit Ids

We are developing a naming scheme to uniquely identify all units (amino acids, nucleotides, ligands, atoms, etc.) in any 3D structure. This allows unambiguous identification and naming not only of individual components, but also of collections of them, such as loops and helices. A clear naming scheme makes it simpler for researchers to share data and annotations and to provide powerful web services. These IDs are fundamental to our structural annotations.

Each ID is a string of ordered fields separated by vertical bars (‘|’). Below we describe how to create each field and its meaning for the two types of unit IDs.

Unit Id Format

These IDs are based on the data in mmCIF files and may contain symmetry operators. An introduction to symmetry operators and biological assemblies is here. This format will uniquely identify all units and atoms in a structure.

Several fields in the format are optional and take default values when absent. Optional fields are marked ‘(Optional)’. If any optional field is included, then all fields must be included (left blank where the default applies), with the exception of the symmetry operator, which may still be omitted.

For the sake of consistency, all case-insensitive fields should be written in upper case.

Unit Identifier Specification

We describe the type and case sensitivity of each field in the list below. In addition, we list which item in the mmCIF the data for each field comes from. We also show several examples of the IDs and their interpretation at the end.

Unit IDs can also be used to identify atoms. When identifying entire residues, the atom field is left blank.

  1. PDB ID Code
    • From PDBx/mmCIF item: _entry.id
    • 4 characters, case-insensitive
  2. Model Number
    • From PDBx/mmCIF item: _atom_site.pdbx_PDB_model_num
    • integer, range 1-99
  3. Chain ID
    • From PDBx/mmCIF item: _atom_site.auth_asym_id
    • string, case-sensitive
  4. Residue/Nucleotide/Component Identifier
    • From PDBx/mmCIF item: _atom_site.label_comp_id
    • 1-3 characters, case-insensitive
  5. Residue/Nucleotide/Component Number
    • From PDBx/mmCIF item: _atom_site.auth_seq_id
    • integer, range: -999..9999 (there are negative residue numbers)
  6. Atom Name (Optional, default: blank)
    • From PDBx/mmCIF item: _atom_site.label_atom_id
    • 0-4 characters, case-insensitive
    • blank means all atoms
  7. Alternate ID (Optional, default: blank)
    • From PDBx/mmCIF item: _atom_site.label_alt_id
    • Default value: blank
    • One of ['A', 'B', '0'], case-insensitive
  8. Insertion Code (Optional, default: blank)
    • From PDBx/mmCIF item: _atom_site.pdbx_PDB_ins_code
    • 1 character, case-insensitive
  9. Symmetry Operation (Optional, default: 1_555)
    • As defined in PDBx/mmCIF item: _pdbx_struct_oper_list.name
    • 5-6 characters, case-insensitive
    • For viral icosahedral structures, use “P_” + model number instead of symmetry operators. For example, 1A34|1|A|VAL|88||||P_1
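
As a rough illustration of how the fields above combine, the sketch below builds a unit ID from a dictionary keyed by the mmCIF item names listed for each field. The function name build_unit_id and the dictionary layout are assumptions made for this example; they are not part of UnitParser or of any other tool described here. The sketch treats the mmCIF placeholders "?" and "." as blank, upper-cases every field except the chain ID, and emits the short five-field form when every optional field has its default value.

```python
# Hypothetical helper for illustration only; not part of any tool on this page.
MMCIF_ITEMS = [
    "_entry.id",                      # 1. PDB ID code
    "_atom_site.pdbx_PDB_model_num",  # 2. model number
    "_atom_site.auth_asym_id",        # 3. chain ID (case-sensitive)
    "_atom_site.label_comp_id",       # 4. component identifier
    "_atom_site.auth_seq_id",         # 5. component number
    "_atom_site.label_atom_id",       # 6. atom name (optional)
    "_atom_site.label_alt_id",        # 7. alternate ID (optional)
    "_atom_site.pdbx_PDB_ins_code",   # 8. insertion code (optional)
]
CASE_SENSITIVE = {"_atom_site.auth_asym_id"}
DEFAULT_SYMMETRY = "1_555"


def build_unit_id(row, symmetry=DEFAULT_SYMMETRY):
    """Join the mmCIF values in `row` into a '|'-separated unit ID."""
    values = []
    for item in MMCIF_ITEMS:
        value = str(row.get(item, "")).strip()
        if value in ("?", "."):  # mmCIF markers for missing values
            value = ""
        if item not in CASE_SENSITIVE:
            value = value.upper()
        values.append(value)

    symmetry = symmetry.upper()
    if symmetry != DEFAULT_SYMMETRY:
        # A non-default symmetry operator needs all nine fields.
        values.append(symmetry)
    elif not any(values[5:]):
        # Every optional field is at its default: use the short form.
        values = values[:5]
    return "|".join(values)


row = {
    "_entry.id": "1abc",
    "_atom_site.pdbx_PDB_model_num": 1,
    "_atom_site.auth_asym_id": "B",
    "_atom_site.label_comp_id": "U",
    "_atom_site.auth_seq_id": 15,
    "_atom_site.pdbx_PDB_ins_code": "A",
}
print(build_unit_id(row))  # 1ABC|1|B|U|15|||A
```

To identify an entire residue, leave the atom name out of the dictionary so that the atom field stays blank.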

Examples

  • Chain A in 1ABC = “1ABC|1|A”
  • Nucleotide U(10) chain B of 1ABC = “1ABC|1|B|U|10”
  • Nucleotide U(15A) chain B, default symmetry operator = “1ABC|1|B|U|15|||A”
  • Nucleotide C(25) chain D subject to symmetry operation 2_655 = “1ABC|1|D|C|25||||2_655”
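
The examples above can be read back into their fields with a few lines of Python. This is only a minimal sketch and not the UnitParser reference parser: the function name parse_unit_id, the field names in FIELD_NAMES, and the choice to return a plain dictionary are all assumptions made for illustration. It fills in the documented defaults for omitted or blank optional fields, and it accepts the three-field chain form shown in the first example.

```python
# Hypothetical field names chosen for this sketch, in the order given by the specification.
FIELD_NAMES = (
    "pdb", "model", "chain", "component", "number",
    "atom", "alt_id", "insertion_code", "symmetry",
)
# Defaults for the optional fields, as documented in the specification.
DEFAULTS = {"atom": "", "alt_id": "", "insertion_code": "", "symmetry": "1_555"}


def parse_unit_id(unit_id):
    """Split a unit ID on '|' and return its fields as a dictionary."""
    parts = unit_id.split("|")
    if not 3 <= len(parts) <= len(FIELD_NAMES):
        raise ValueError(f"expected 3 to 9 fields, got {len(parts)}: {unit_id!r}")

    # Start with None for every field, then apply the optional-field defaults.
    fields = {name: None for name in FIELD_NAMES}
    fields.update(DEFAULTS)

    for name, value in zip(FIELD_NAMES, parts):
        if name != "chain":  # the chain ID is the only case-sensitive field
            value = value.upper()
        if value == "" and name in DEFAULTS:
            continue  # a blank optional field keeps its default
        fields[name] = value

    fields["model"] = int(fields["model"])
    if fields["number"] is not None:
        fields["number"] = int(fields["number"])  # may be negative
    return fields


parsed = parse_unit_id("1ABC|1|B|U|15|||A")
print(parsed["number"], parsed["insertion_code"], parsed["symmetry"])  # 15 A 1_555
```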

Unit IDs for entire residues contain 4, 7, or 8 separators (|).
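
This rule gives a cheap sanity check to run before full parsing. The sketch below simply counts separators; the function name has_residue_separator_count is an assumption made for this example and does not come from any tool described here. Passing the check does not guarantee that the individual fields are valid.

```python
# Separator counts allowed for residue-level unit IDs, per the rule above.
ALLOWED_SEPARATOR_COUNTS = {4, 7, 8}


def has_residue_separator_count(unit_id):
    """Return True if `unit_id` has a separator count allowed for a residue ID."""
    return unit_id.count("|") in ALLOWED_SEPARATOR_COUNTS


print(has_residue_separator_count("1ABC|1|B|U|10"))           # True (4 separators)
print(has_residue_separator_count("1ABC|1|D|C|25||||2_655"))  # True (8 separators)
print(has_residue_separator_count("1ABC|1|B|U|10|"))          # False (5 separators)
```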

Atom Identifier Format

To be added later.

Tools

We have developed some tools to generate and work with these formats.

  • UnitParser is a Python module that parses new-style unit IDs. It is the reference parser for unit IDs.
  • UnitIdTranslation is a Python tool that generates all new-style IDs for the units in a PDB file, given an mmCIF file.
  • Translator is a web service that translates between the two ID formats.
  • Coordinates is a web service that retrieves the coordinates of a unit, given its new- or old-style ID.