A3 Specification
Amino Acid Annotation (A3) — language-agnostic specification
Purpose
A3 is a structured format for annotating amino acid sequences with site, region, post-translational modification, processing, and variant information, alongside sequence metadata. It is designed for:
- Exchange between analysis tools and visualization applications
- Long-term storage of curated protein annotation data
- 100% round-trip fidelity through JSON serialization
Wire Format
JSON is the canonical serialization format. TOML is a secondary target for human-authoring workflows.
The canonical file extension for A3 JSON files is .a3.json.
All five annotation families are always present in serialized output,
even when empty. The type field is always present on annotation
entries (empty string when unset).
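The always-present rule can be illustrated with a minimal Python sketch. The function name `canonical_annotations` and the annotation name `active` are hypothetical and not part of the specification; the sketch only fills in missing families and missing type fields and performs no validation.

```python
import json

# The four named-map families; "variant" is a list and is handled separately.
NAMED_FAMILIES = ("site", "region", "ptm", "processing")


def canonical_annotations(annotations: dict) -> dict:
    """Return a copy with all five families present and `type` on every entry."""
    out = {}
    for family in NAMED_FAMILIES:
        entries = annotations.get(family, {})
        out[family] = {
            name: {"index": entry["index"], "type": entry.get("type", "")}
            for name, entry in entries.items()
        }
    out["variant"] = list(annotations.get("variant", []))
    return out


# Hypothetical input: one site entry without a type, no other families.
partial = {"site": {"active": {"index": [5]}}}
print(json.dumps(canonical_annotations(partial), indent=2))
```

Under these assumptions, an input with no annotations at all still serializes with four empty maps and an empty variant list.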
Example — MAPT (P10636)
Data Model
A3
├── sequence: string
├── annotations:
│   ├── site: map<string, { index: integer[], type: string }>
│   ├── region: map<string, { index: [integer,integer][], type: string }>
│   ├── ptm: map<string, { index: integer[] | [integer,integer][], type: string }>
│   ├── processing: map<string, { index: integer[] | [integer,integer][], type: string }>
│   └── variant: list of { position: integer, [key: string]: any }
└── metadata:
    ├── uniprot_id: string
    ├── description: string
    ├── reference: string
    └── organism: string
Field Definitions
sequence
- String of at least 2 characters
- Characters: [A-Z*] (standard IUPAC amino acid codes plus * for a stop codon)
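As a rough illustration, both sequence constraints can be checked with one regular expression. This is a sketch, not a reference implementation; the name `is_valid_sequence` is an assumption.

```python
import re

# Two or more characters, each an uppercase letter or "*".
SEQUENCE_PATTERN = re.compile(r"[A-Z*]{2,}")


def is_valid_sequence(sequence: object) -> bool:
    """Check the length and alphabet constraints on the sequence field."""
    return isinstance(sequence, str) and SEQUENCE_PATTERN.fullmatch(sequence) is not None
```

`fullmatch` is used so that the whole string must satisfy the pattern, not just a prefix.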
position (integer[])
An ordered collection of 1-based residue positions.
- All values are positive integers (≥ 1)
range ([integer, integer][])
An ordered collection of inclusive [start, end] range pairs.
- All values are positive integers (≥ 1)
- Each pair: start < end (strict; degenerate single-position ranges are not permitted, use a position-indexed family instead)
- No two ranges may overlap: ranges [a, b] and [c, d] (where c > a) overlap when c ≤ b. Adjacent ranges (c = b + 1) are permitted.
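The range rules above can be sketched as a small Python checker. The function name `validate_ranges` is hypothetical. Two points are assumptions beyond the text: ranges are sorted by start before the overlap test, so input order is not enforced, and two ranges with the same start are reported as overlapping.

```python
def validate_ranges(ranges: list) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    well_formed = []
    for pair in ranges:
        is_pair = (
            isinstance(pair, (list, tuple))
            and len(pair) == 2
            # bool is a subclass of int in Python, so exclude it explicitly
            and all(isinstance(v, int) and not isinstance(v, bool) for v in pair)
        )
        if not is_pair:
            errors.append(f"{pair!r}: not an [integer, integer] pair")
            continue
        start, end = pair
        if start < 1:
            errors.append(f"[{start}, {end}]: values must be >= 1")
        elif start >= end:
            errors.append(f"[{start}, {end}]: start must be strictly less than end")
        else:
            well_formed.append((start, end))

    # After sorting by start, any overlap shows up between neighbours.
    well_formed.sort()
    for (a, b), (c, d) in zip(well_formed, well_formed[1:]):
        if c <= b:
            errors.append(f"[{a}, {b}] overlaps [{c}, {d}]")
    return errors
```

For example, `[[1, 5], [6, 10]]` passes because the ranges are merely adjacent, while `[[1, 5], [5, 10]]` is reported as an overlap because position 5 belongs to both.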
Annotation families
Five fixed families — no others are permitted:
| Family | Index type | Semantics |
|---|---|---|
| site | positions only | Individual residues of interest |
| region | ranges only | Contiguous spans |
| ptm | positions or ranges | Post-translational modifications |
| processing | positions or ranges | Signal peptides, cleavage, maturation |
| variant | (see below) | Sequence variants |
Each entry within site, region, ptm, and processing is a named
object with two fields:
- index: positions or ranges (as defined above)
- type: string label; optional on input, always present in output (default "")
Annotation names (map keys) are non-empty strings; there are no further constraints on their characters.
Bare index arrays (without the { index, type } wrapper) are not
permitted. The canonical object form is the only accepted input.
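One way to apply the per-family index rules and the type default is sketched below. The names `index_kind` and `normalize_entry` are hypothetical. The sketch checks only the shape of the index (positions versus ranges), not the value constraints. It also makes two assumptions the text does not settle: an empty index is classified as positions, and keys other than index and type are silently dropped.

```python
# Which index shapes each named family accepts.
INDEX_KINDS = {
    "site": {"positions"},
    "region": {"ranges"},
    "ptm": {"positions", "ranges"},
    "processing": {"positions", "ranges"},
}


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def index_kind(index: object):
    """Classify an index as 'positions', 'ranges', or None if it is neither."""
    if not isinstance(index, list):
        return None
    if all(_is_int(v) for v in index):
        return "positions"  # note: an empty list lands here (assumption)
    if all(isinstance(v, list) and len(v) == 2 and all(_is_int(x) for x in v) for v in index):
        return "ranges"
    return None  # mixed or malformed


def normalize_entry(family: str, name: str, entry: object) -> dict:
    """Validate one named entry and return it in canonical { index, type } form."""
    if family not in INDEX_KINDS:
        raise ValueError(f"unknown named family: {family!r}")
    if not isinstance(name, str) or not name:
        raise ValueError("annotation name must be a non-empty string")
    if not isinstance(entry, dict) or "index" not in entry:
        raise ValueError(f"{family}.{name}: bare index arrays are not permitted")
    kind = index_kind(entry["index"])
    if kind not in INDEX_KINDS[family]:
        raise ValueError(f"{family}.{name}: index shape not allowed for this family")
    type_label = entry.get("type", "")
    if not isinstance(type_label, str):
        raise ValueError(f"{family}.{name}: type must be a string")
    return {"index": entry["index"], "type": type_label}
```

With this sketch, `normalize_entry("site", "active", {"index": [5, 9]})` gains an empty type, while passing the bare list `[5, 9]` raises a `ValueError`.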
variant
An ordered list (not a named map) of variant records. Each record:
- position: required, 1-based positive integer
- All other fields: optional, must be JSON-compatible (no functions, symbols, class instances, or undefined)
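A possible Python reading of the variant rules is sketched below. The excluded kinds in the text (functions, symbols, undefined) are phrased in JavaScript terms, so mapping "JSON-compatible" to None, booleans, finite numbers, strings, lists, and string-keyed dicts is an assumption. The names `is_json_compatible` and `validate_variant`, and the field names `note` and `callback`, are hypothetical.

```python
import math


def is_json_compatible(value: object) -> bool:
    """Recursively check that a value is built only from JSON-representable types."""
    if value is None or isinstance(value, (bool, int, str)):
        return True
    if isinstance(value, float):
        return math.isfinite(value)  # NaN and infinities have no JSON form
    if isinstance(value, list):
        return all(is_json_compatible(v) for v in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_json_compatible(v) for k, v in value.items())
    return False  # functions, arbitrary objects, sets, bytes, ...


def validate_variant(record: object) -> list[str]:
    """Return a list of problems with one variant record; empty means valid."""
    if not isinstance(record, dict):
        return ["variant record must be an object"]
    errors = []
    position = record.get("position")
    if not (isinstance(position, int) and not isinstance(position, bool) and position >= 1):
        errors.append("position is required and must be a positive integer")
    for key, value in record.items():
        if key != "position" and not is_json_compatible(value):
            errors.append(f"field {key!r} is not JSON-compatible")
    return errors
```

A record such as `{"position": 12, "note": "example"}` passes, whereas a record without position, or one carrying a function value, yields one error each.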
metadata
Four string fields, all optional (default ""):
- uniprot_id: UniProt accession
- description: human-readable protein description
- reference: citation or URL
- organism: species name
Unknown metadata fields are not permitted. Unknown top-level keys are not permitted.
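The closed-schema rules for metadata and top-level keys can be sketched as follows. The function names are hypothetical, and emitting all four metadata fields in the output is an assumption drawn from the stated default of "".

```python
METADATA_FIELDS = ("uniprot_id", "description", "reference", "organism")
TOP_LEVEL_KEYS = {"sequence", "annotations", "metadata"}


def normalize_metadata(metadata: dict) -> dict:
    """Reject unknown fields and fill missing ones with the empty string."""
    unknown = set(metadata) - set(METADATA_FIELDS)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    out = {}
    for field in METADATA_FIELDS:
        value = metadata.get(field, "")
        if not isinstance(value, str):
            raise ValueError(f"metadata.{field} must be a string")
        out[field] = value
    return out


def check_top_level_keys(document: dict) -> None:
    """Raise if the document carries any key outside the data model."""
    unknown = set(document) - TOP_LEVEL_KEYS
    if unknown:
        raise ValueError(f"unknown top-level keys: {sorted(unknown)}")
```

For instance, `normalize_metadata({"uniprot_id": "P10636"})` returns all four fields with the three missing ones set to "", while a key such as `gene` (not one of the four) raises a `ValueError`.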