APIs
Several APIs are available that take unit ids, loop ids, or chain ids as input and return annotations, alignments, coordinates, or structure quality data. Note that our site uses author-assigned chain identifiers (auth_asym_id). Read about unit ids that we use to uniquely specify each nucleotide or amino acid in each 3D structure.
Many of the URLs below need to be copied and pasted; they aren't active links because this university page won't allow active links.
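Because every API on this page is driven by pipe-delimited identifiers, a small helper for taking them apart can be handy. The following is a minimal sketch in Python, not part of the site itself. The first five field positions (PDB ID, model, chain, unit, number) and the trailing symmetry field are read off the examples on this page; the names given to the remaining positions (atom, alt_id, insertion) are assumptions made for illustration.

```python
# Field names are hypothetical labels; positions 6-8 in particular are
# assumptions, since this page only illustrates fields 1-5 and field 9.
FIELDS = ("pdb", "model", "chain", "unit", "number",
          "atom", "alt_id", "insertion", "symmetry")


def parse_unit_id(unit_id):
    """Split a pipe-delimited ID into named fields; absent trailing fields become ''."""
    parts = unit_id.split("|")
    if len(parts) > len(FIELDS):
        raise ValueError(f"too many fields in {unit_id!r}")
    parts += [""] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))


def chain_id(unit_id):
    """Return the chain-level ID (PDB|model|chain) that a unit ID belongs to."""
    return "|".join(unit_id.split("|")[:3])
```

For example, chain_id("5J7L|1|AA|C|186") gives 5J7L|1|AA, which is the form of chain ID that the correspondence APIs below take as input.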
Correspondence APIs
Unit ID to unit ID mapping between two chains in an RNA equivalence class
This API takes as input two RNA chain IDs and returns a listing of the corresponding unit IDs, as determined by the alignment of the experimental sequences for the chains. Note that only complete rows are listed; when a nucleotide is not resolved or does not exist in one structure, neither unit ID is listed.
- http://rna.bgsu.edu/correspondence/pairwise_structure?chain1=5J7L|1|AA&chain2=5J7L|1|BA (E. coli 16S)
- http://rna.bgsu.edu/correspondence/pairwise_structure?chain1=4V88|1|A6&chain2=5TBW|1|sR (Yeast 18S)
- http://rna.bgsu.edu/correspondence/pairwise_structure?chain1=3A3A|1|A&chain2=4ZDP|1|E (Human selenocysteine tRNA)
- http://rna.bgsu.edu/correspondence/pairwise_structure?chain1=4V9F|1|0&chain2=1S72|1|0 (H. marismortui LSU, where 1S72|1|0 has unobserved nucleotides from 971 to 998)
Corresponding nucleotides and pairwise interactions across an equivalence class
This API takes as input one or more unit ids from an RNA 3D structure and returns unit ids of corresponding nucleotides across the equivalence class, along with FR3D-annotated pairwise interactions between them. The header line indicates the order in which the nucleotides and the basepairs are listed.
- http://rna.bgsu.edu/correspondence/pairwise_interactions?resolution_threshold=4.0&exp_method=xray&selection_type=unit_id&selection=5J7L|1|AA|C|186 Individual nucleotide, x-ray experimental method
- http://rna.bgsu.edu/correspondence/pairwise_interactions?resolution_threshold=4.0&exp_method=em&selection_type=unit_id&selection=5J7L|1|AA|C|186,5J7L|1|AA|G|191 Basepair, cryo-em experimental method
- http://rna.bgsu.edu/correspondence/pairwise_interactions?resolution_threshold=3.0&exp_method=all&selection_type=unit_id&selection=6ZMI|1|S2|A|11,6ZMI|1|S2|A|1200,6ZMI|1|S2|A|1357 Base triple, all experimental methods, lower resolution cutoff
- http://rna.bgsu.edu/correspondence/pairwise_interactions?resolution_threshold=3.0&exp_method=all&chain=6ZMI|1|S2&selection_type=res_num&selection=11,1200,1357 Base triple using short format
- http://rna.bgsu.edu/correspondence/pairwise_interactions?resolution_threshold=all&exp_method=all&selection_type=loop_id&selection=HL_5J7L_001 Hairpin loop, all resolutions, all methods
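The example URLs above can also be assembled programmatically. Here is a minimal sketch using only the Python standard library. The function name and its default argument values are illustrative choices, not defaults documented for the server; the parameter names and their order are copied from the examples above.

```python
from urllib.parse import urlencode

BASE = "http://rna.bgsu.edu/correspondence/pairwise_interactions"


def pairwise_interactions_url(selection, selection_type="unit_id",
                              resolution_threshold="4.0", exp_method="all",
                              chain=None):
    """Build a request URL; `selection` is a list of unit ids, residue numbers, or one loop id."""
    params = {"resolution_threshold": resolution_threshold,
              "exp_method": exp_method}
    if chain is not None:
        # Only needed for the short format (selection_type="res_num").
        params["chain"] = chain
    params["selection_type"] = selection_type
    params["selection"] = ",".join(str(item) for item in selection)
    # safe="|," leaves pipes and commas unescaped, matching the URLs shown above.
    return BASE + "?" + urlencode(params, safe="|,")
```

Calling pairwise_interactions_url([11, 1200, 1357], selection_type="res_num", resolution_threshold="3.0", chain="6ZMI|1|S2") reproduces the short-format base triple URL above.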
Sequence position to unit ID mapping
This API takes as input a PDB ID, model, and RNA chain and gives as output a listing of all experimental sequence positions and the corresponding unit ID, if the coordinates of the unit were observed in the PDB file.
- http://rna.bgsu.edu/rna3dhub/rest/SeqtoUnitMapping?ife=1S72|1|0 (shows unobserved nucleotides and different numbering)
- http://rna.bgsu.edu/rna3dhub/rest/SeqtoUnitMapping?ife=5J7L|1|AA
Nucleotide to nucleotide alignments between RNA chains from 3D structures
This API takes as input two or more PDB chains of the same molecule and returns a pairwise or multiple alignment of the nucleotides, by presenting unit ids on the same line to indicate alignment. Alignments are produced by Infernal alignment to the Rfam covariance model, so alignments between different species can be produced. Our sense is that the alignments are generally good.
- http://rna.bgsu.edu/correspondence/align_chains?chains=5J7L|1|AA,1J5E|1|A (alignment between E. coli and Thermus thermophilus small ribosomal subunits, numbering is designed to be the same, but each structure has a number of unobserved nucleotides, indicated by NULL)
- http://rna.bgsu.edu/correspondence/align_chains?chains=5J7L|1|DA,7RQB|1|1A,7A0S|1|X,4WF9|1|X (multiple alignment of E. coli, T. thermophilus, D. radiodurans, S. aureus large subunit ribosomal RNA)
- http://rna.bgsu.edu/correspondence/align_chains?chains=5J7L|1|AA,1J5E|1|A,6TH6|1|Aa,4V88|1|A6,6ZMI|1|S2 (cross-domain alignment of E. coli, T. thermophilus, yeast, human by aligning all sequences to the eukaryotic SSU Rfam family; these alignments may not be as reliable as those within one domain)
Pairwise alignment diagnostic and basepair bar diagram
This API takes as input two PDB chains of the same molecule and returns a PDF which shows the alignment by listing nucleotides on the same horizontal line, a per-nucleotide score that reflects how well the alignment conserves basepairs and other pairwise interactions, and circular arcs that show the pairwise interactions in each structure. Symmetry of the interaction arcs indicates a correct alignment. Nested cWW basepairs are colored dark blue, long-range cWW basepairs (pseudoknots) are colored red, nested non-cWW basepairs are colored cyan, long-range non-cWW basepairs are colored green, base-phosphate interactions are colored purple, base-ribose interactions are colored orange, and stacking interactions are colored yellow. Pairwise interactions made by modified nucleotides are not shown as of 2/2/2024. Horizontal bars are colored darker for larger per-nucleotide scores. The per-nucleotide score is larger when long-range basepairs are aligned; dark bars indicate strong evidence that the nucleotides are correctly aligned.
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=5J7L|1|AA,1J5E|1|A (E. coli and T. thermophilus SSU rRNA; very good alignment and identical numbering by design; changes in helix length around E. coli number C188 and U209 are well known but lead to small deficiencies in the alignment)
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=5J7L|1|DA,4WF9|1|X (E. coli and S. aureus LSU rRNA)
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=5J7L|1|AA,8GLP|1|S2 (E. coli and human SSU rRNA cross domain alignment; human has many insertions compared to E. coli)
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=8GLP|1|S2,4V88|1|A6 (human and yeast SSU rRNA)
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=4V9F|1|9,8GLP|1|L7 (H. marismortui and human 5S rRNA; note the scoring of the G-bulge at H.m. G78, U79)
- http://rna.bgsu.edu/correspondence/basepair_bar_diagram?chains=4TNA|1|A,3Q1Q|1|C (tRNAs)
Annotations
Pairwise interactions for given nucleotides or loop
This API takes as input a loop or individual nucleotides and returns a list of nucleotides and the pairwise interactions they make.
- http://rna.bgsu.edu/correspondence/pairwise_interactions_single?selection_type=loop_id&selection=HL_5J7L_001 Hairpin loop
- http://rna.bgsu.edu/correspondence/pairwise_interactions_single?selection_type=unit_id&selection=5J7L|1|AA|A|320,5J7L|1|AA|A|321,5J7L|1|AA|C|322 Unit IDs
- http://rna.bgsu.edu/correspondence/pairwise_interactions_single?chain=5J7L|1|AA&selection_type=res_num&selection=320,321,322 Shorthand notation using residue numbers
Helix annotations
This API maps the given chain to a reference chain in the same Rfam family where helix numbers have been annotated. Coverage includes ribosomal small subunit, large subunit, 5S, eukaryotic 5.8S, and certain other small RNAs.
- http://rna.bgsu.edu/correspondence/nucleotide_annotation?chain=8C3A|1|3 5S ribosomal RNA
- http://rna.bgsu.edu/correspondence/nucleotide_annotation?chain=8GLP|1|L5 Human LSU
Coordinate and data quality APIs
Coordinate API
Coordinates is a web service that retrieves the coordinates of a unit given its unit ID, or of a loop given its loop id. The response is in mmCIF format, with the selected units in Model 1 and units within 16 Angstroms in Model 2.
- http://rna.bgsu.edu/rna3dhub/rest/getCoordinates?coord=2QBG|1|A|G|69,2QBG|1|A|G|107 Using unit IDs to get coordinates
- http://rna.bgsu.edu/rna3dhub/rest/getCoordinates?coord=IL_1S72_009 Using loop ID to get coordinates
3D Coordinate Viewer
This API shows the 3D coordinates of selected units by constructing a URL in one of the following forms:
- http://rna.bgsu.edu/rna3dhub/display3D/unitid/1S72|1|0|A|965,1S72|1|0|U|1003,1S72|1|H|HIS|92 (unit ids including amino acids)
- http://rna.bgsu.edu/rna3dhub/display3D/chain/1QTQ|1|B (entire chain)
- http://rna.bgsu.edu/rna3dhub/display3D/chain/7VFT|1|A (entire chain, default symmetry operator)
- http://rna.bgsu.edu/rna3dhub/display3D/chain/7VFT|1|A||||||12_555 (entire chain with given symmetry)
- http://rna.bgsu.edu/rna3dhub/display3D/chain/7VFT|1|A||||||. (entire chain with all symmetry operators)
- http://rna.bgsu.edu/rna3dhub/display3D/multiple/8B0X|1|a|G|581,8B0X|1|a|U|582,8B0X|1|a|C|583,8B0X|1|a|G|1261,8B0X|1|a|A|1262,8B0X|1|a|C|1263;4P43|1|B|G|28,4P43|1|B|U|29,4P43|1|B|C|30,4P43|1|A|G|19,4P43|1|A|U|20,4P43|1|A|C|21;5ED1|1|B|G|13,5ED1|1|B|A|14,5ED1|1|B|C|15,5ED1|1|C|G|9,5ED1|1|C|A|10,5ED1|1|C|C|11;4M4O|1|B|G|12,4M4O|1|B|G|13,4M4O|1|B|C|14,4M4O|1|B|G|46,4M4O|1|B|A|47,4M4O|1|B|C|48 (two or more sets of unit ids, separated by semicolons)
In the first link, note that individual units are listed, separated by commas. This example shows a UA cWW basepair interacting with an amino acid. One can provide such links to easily direct readers of a website or article to a view of the coordinates.
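Links of this kind are easy to generate from lists of unit ids. The sketch below is one way to do it in Python; the function names are hypothetical, and the only conventions used are the ones visible in the examples above (commas between units, semicolons between sets).

```python
BASE = "http://rna.bgsu.edu/rna3dhub/display3D"


def display3d_units_url(unit_ids):
    """Link to a view of individual units, listed with commas between them."""
    return f"{BASE}/unitid/" + ",".join(unit_ids)


def display3d_multiple_url(sets_of_unit_ids):
    """Link to a view of two or more sets of units, with semicolons between sets."""
    groups = [",".join(group) for group in sets_of_unit_ids]
    if len(groups) < 2:
        raise ValueError("the 'multiple' form expects two or more sets of unit ids")
    return f"{BASE}/multiple/" + ";".join(groups)
```

For instance, display3d_units_url(["1S72|1|0|A|965", "1S72|1|0|U|1003", "1S72|1|H|HIS|92"]) reproduces the first link in the list above.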
RSR / RSRZ API
This API takes as input unit ids or loop ids and returns RSR and RSRZ values for specified units or loops, in JSON format. This allows programmatic access to these structure quality data for individual units or sets of units. Construct a request using the following forms:
- http://rna.bgsu.edu/rna3dhub/rest/getRSR?quality=1FJG|1|A|A|16,1FJG|1|A|C|18 Using unit ids to get RSR
- http://rna.bgsu.edu/rna3dhub/rest/getRSRZ?quality=1FJG|1|A|A|16,1FJG|1|A|C|18 Use unit ids to get RSRZ
- http://rna.bgsu.edu/rna3dhub/rest/getRSR?quality=HL_1FJG_001 Use a loop id to get RSR
- http://rna.bgsu.edu/rna3dhub/rest/getRSRZ?quality=HL_1FJG_001 Using a loop id to get RSRZ
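For programmatic access, a request along these lines could be made from Python with the standard library alone. This is a sketch under assumptions: the function names are hypothetical, and because the layout of the returned JSON is not described on this page, the decoded object is returned as-is rather than unpacked.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

REST = "http://rna.bgsu.edu/rna3dhub/rest"


def quality_url(ids, measure="RSR"):
    """Build a getRSR or getRSRZ URL from unit ids or a single loop id."""
    if measure not in ("RSR", "RSRZ"):
        raise ValueError("measure must be 'RSR' or 'RSRZ'")
    if isinstance(ids, str):
        ids = [ids]
    # Keep pipes and commas literal so the URL matches the forms shown above.
    return f"{REST}/get{measure}?quality=" + quote(",".join(ids), safe="|,")


def fetch_quality(ids, measure="RSR", timeout=30):
    """Request the values and return the decoded JSON without interpreting it."""
    with urlopen(quality_url(ids, measure), timeout=timeout) as response:
        return json.load(response)
```

A call such as fetch_quality("HL_1FJG_001", "RSRZ") requires network access to the server; quality_url alone can be used to check the URL first.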
Representative sets and Rfam mappings
Representative set download in csv format
This API has long been available from the representative set pages. Example URL:
- http://rna.bgsu.edu/rna3dhub/nrlist/download/NR/3.348/3.5A/csv
Here, 3.348 is the representative set release identifier, 3.5A is the resolution threshold, and csv requests the data in csv format, which is the only available format. Each line of the file has the equivalence class identifier like NR_3.5_26150.3, then the representative IFE like 4V9F|1|0, then all of the equivalence class members in a comma-separated list, in order from best structure quality indicators to worst. By taking the representative IFE from each line, you get a representative set of RNA chains. Note that an IFE is an integrated functional element; an IFE can combine two or more chains that are inextricably linked by Watson-Crick pairs.
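Reading this file takes only a few lines of Python. The sketch below assumes the three-part line layout described above and makes no assumption about whether the member list is quoted as one field or spills into several comma-separated fields; it accepts both. The function name is hypothetical, and the sample line in the usage note is made up from identifiers that appear on this page, not copied from a real download.

```python
import csv
import io


def read_representatives(csv_text):
    """Return a list of (ec_id, representative_ife, members) tuples, one per line."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ec_id, representative = row[0], row[1]
        # The member list may arrive as one quoted field or as several fields.
        members = [member.strip()
                   for field in row[2:]
                   for member in field.split(",")
                   if member.strip()]
        records.append((ec_id, representative, members))
    return records
```

With the illustrative line "NR_3.5_26150.3","4V9F|1|0","4V9F|1|0,1S72|1|0" as input, the second element of the single returned tuple is the representative IFE 4V9F|1|0; collecting that element from every tuple gives the representative set.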
Representative set download with Rfam mapping and more
This API is new in August 2024. The example URLs indicate the representative set release identifier and the resolution threshold:
- http://rna.bgsu.edu/rna3dhub/nrlist/download/NR/3.348/3.5A/full for comma-separated value output
- http://rna.bgsu.edu/rna3dhub/nrlist/download/NR/3.348/3.5A/full_csv for comma-separated value output
- http://rna.bgsu.edu/rna3dhub/nrlist/download/NR/3.348/3.5A/full_tsv for tab-separated value output
The first line is a header line which labels the columns. Each line represents a single IFE. Here we describe each data entry.
- ec_id is the equivalence class identifier, like NR_3.5_26150.3
- ife_id is the ife identifier, like 4V9F|1|0
- pdb_resolution is the structure resolution in Angstroms, like 2.4
- na_type is the type of nucleic acid, like RNA; other values are DNA, hybrid
- ec_rank is the rank of the IFE within its equivalence class starting at 1
- ec_cqs is the composite quality score as computed across entries in the equivalence class, like 7.96673; lower is better. Note: cryo-em and NMR structures do not currently have all the same structure quality factors as x-ray structures, so the "worst" values are used for those, with the result that within an equivalence class, the x-ray structures are ranked above the cryo-em and NMR structures. The calculation of Composite Quality Score is explained briefly at http://rna.bgsu.edu/rna3dhub/nrlist and is based on resolution, RSR, RSCC, percent clash, rfree, and the fraction of nucleotides observed with xyz coordinates.
- average_rsr, average_rscc, percent_clash, and rfree are obtained from the PDB validation reports, applied to the units in the chains in the IFE
- ec_fraction_observed compares the number of observed nucleotides in the given IFE to the maximum number of observed nucleotides across the equivalence class. Complications occur when the class contains things like stapled ribosomes, joint 5S and 23S molecules, and similar constructions that have more nucleotides than usual.
- nts_observed is the number of nucleotides in the IFE with xyz coordinates. In August 2024 we realized that this only counts A, C, G, U nucleotides and not modified nucleotides or DNA nucleotides. We'll fix that and make a note of it here, hopefully by October 2024.
- rfam_reference_nts is the largest number of observed nucleotides among all IFEs mapping to the indicated Rfam families. Here again, complications arise, so this number is manually adjusted in some cases to improve the comparison across Rfam families.
- rfam_fraction_observed is the ratio of nts_observed over rfam_reference_nts and capped at 1. This number will be used to compute the rfam_cqs. Capping it at 1 prevents stapled ribosomes or other joined molecules from getting a higher ranking when they have more than the usual number of nucleotides for the molecule.
- rfam_cqs is a version of composite quality score that allows the comparison of IFEs across the Rfam families that they map to. As with ec_cqs, it uses resolution, RSR, RSCC, percent clash, Rfree, but it uses the rfam_fraction_observed instead of ec_fraction_observed. Lower rfam_cqs is better. 4V9F|1|0 has the lowest rfam_cqs among chains mapping to Rfam family RF02540, the Archaeal LSU family.
- source can be archaea, bacteria, eukarya, mitochondria, chloroplast, or virus. The biological domain is determined by the taxonomy id and NCBI lineage, but is overridden when the chain description, structure title, or protein chains make it clear that the source is mitochondrion or chloroplast. NA is used when the source is not available. When the IFE consists of multiple chains, the sources of each chain are separated by + characters.
- rfam shows the Rfam family that each chain maps to, either by the Rfam.pdb file posted on the Rfam FTP site or by running cmsearch on the PDB chain sequences against Rfam covariance models that already have PDB chains mapped to them. Ask Craig Zirbel at zirbel@bgsu.edu for more information. IFEs with multiple chains will have multiple Rfam families, separated by + signs.
- standardized_name is a manually-curated name that attempts to accurately name the molecule in a consistent fashion across Rfam families and more. In some cases there is a long form and short form separated by a semicolon character. Suggestions for improvements to the naming are welcome; please contact Craig Zirbel at zirbel@bgsu.edu. Approximately half of the IFEs have standardized names; the other 6,500 will be a challenge to identify and name manually! See also pdb_description below.
- pdb_species is the species of each chain as retrieved from PDB
- pdb_taxid is the NCBI taxonomy id of each chain as retrieved from PDB
- species_taxid is the result of mapping pdb_taxid to a species-level taxonomy id. For example, there are many strains of E. coli, but they all have species_taxid equal to 562.
- pdb_description is the author-supplied text for each chain as retrieved from PDB. If you want to type up standard versions of each one, please do and send that to Craig Zirbel at zirbel@bgsu.edu.
- pdb_release_date is downloaded from PDB
- pdb_experimental_technique is downloaded from PDB
- pdb_title is downloaded from PDB. These last fields are provided as a convenience.
The output of this API could be used to produce a set of high-quality RNA chains that are minimally redundant. That might be helpful for training machine learning models. Using Rfam families can also help to separate training and testing data. Here are some suggestions on the procedure to do so:
- Keep only lines that have a non-trivial entry in the rfam column
- Discard lines whose rfam entry is a single family such as RF00002 or RF02543 on its own, because it is better to keep only lines from IFEs that have both chains and thus the entry RF00002+RF02543 in the rfam column
- Sort by rfam and then by increasing value of rfam_cqs
- Keep the first entry from each Rfam family, or if you like, the first k entries that have different species_taxid values, to get a little more coverage of each molecule.
- Optionally keep only one IFE from the following sets of Rfam families, to reduce redundancy across domains:
- SSU: RF00177,RF01959,RF01960,RF02542,RF02545
- LSU: RF02540,RF02541,RF02543,RF02546
On the other hand, it might be nice to keep chains from each domain from Rfam families RF00001 (5S rRNA) and RF00005 (tRNA).
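The first four steps of the procedure above can be sketched in Python as follows. This is one possible reading of the steps, not a reference implementation. It assumes that the header labels in the full_csv output match the field names listed above (ife_id, rfam, rfam_cqs, species_taxid), that an empty or NA entry is what "trivial" means in the rfam column, and that the lines to discard are exactly those whose rfam entry equals one of the listed single families. The function name and its parameters are hypothetical, and the optional cross-domain step is left out.

```python
import csv
import io


def select_chains(full_csv_text, k=1, discard_rfam=("RF00002", "RF02543")):
    """Return ife_ids of the best k IFEs per Rfam entry, each from a different species."""
    rows = []
    for row in csv.DictReader(io.StringIO(full_csv_text)):
        rfam = (row.get("rfam") or "").strip()
        if rfam in ("", "NA"):        # assumption: these mean "no Rfam mapping"
            continue
        if rfam in discard_rfam:      # lone chains better covered by a joint IFE
            continue
        try:
            cqs = float(row["rfam_cqs"])
        except (KeyError, TypeError, ValueError):
            continue                  # no usable score, so the line cannot be ranked
        rows.append((rfam, cqs, row))

    # Sort by Rfam entry, then by increasing rfam_cqs (lower is better).
    rows.sort(key=lambda item: (item[0], item[1]))

    kept = []
    species_seen = {}
    for rfam, _cqs, row in rows:
        seen = species_seen.setdefault(rfam, set())
        taxid = (row.get("species_taxid") or "").strip()
        if len(seen) < k and taxid not in seen:
            seen.add(taxid)
            kept.append(row["ife_id"])
    return kept
```

Setting k=1 keeps only the top-ranked IFE for each Rfam entry, while a larger k admits additional IFEs only when their species_taxid has not already been kept for that entry.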
Updated: 08/21/2024 01:36PM