UATU: A Universal Blockchain Database with a Natural-Language Query API
One API for every kind of blockchain data. Two ways to ask: a typed query language, or plain English. The databases were never the hard part. The hard part was one query contract that an LLM can write, a developer can read, and a router can fan out across a document store, a wide-column store, and a time-series store at once. All three audiences served, none of them leaking into the other two.
Snapshot
| Role | Single-endpoint data API for blockchain, on-chain and off-chain, hiding node RPC, indexing, and metadata behind one query contract |
| Domain | Web3 / blockchain data infrastructure |
| Built | December 2022 |
| Interface | Two endpoints only: /ql (UATU QL) and /nl (natural language) |
| Core stack | Node.js (data services) · Python (query + NLP processors) · React (dashboards, chat plugin) · ethers.js (EVM ingestion) |
| Datastores | MongoDB + Cassandra + InfluxDB (read plane) · Redis (cache) · PostgreSQL / MySQL (control plane) |
| Footprint | 20+ repositories; custom CI/CD; modular plug-in data services; a React chat plugin shipped as a live demo |
| Team | 5 backend + 1 frontend engineer; 4 founders (1 leading product and architecture, 3 on business) |
| Status | Pivoted. Architecture and codebase built; did not reach sustained production scale |
The problem
Blockchain data is not one shape. It does not live in one place.
On-chain, you get an append-only firehose: blocks, transactions, event logs. You read it through RPC calls to a node, one at a time, and it is miserable to query in aggregate. Off-chain sits everything that gives that data meaning. Token and project metadata. Prices. Historical metrics. Project descriptions in plain text. Now ask one ordinary question: the average daily price of this wallet's three biggest holdings last month. You just joined four different worlds without noticing.
Late 2022 had tools for this. Each took one slice.
| Tool | Query model | Data plane | Front door |
|---|---|---|---|
| The Graph | GraphQL over per-contract subgraphs you author and deploy yourself | On-chain | GraphQL |
| Covalent | Fixed REST endpoints, normalized across chains | On-chain (historical data lake) | REST |
| Dune | SQL for dashboards and analysis | On-chain | SQL |
| Bitquery | One GraphQL schema across many chains | On-chain | GraphQL |
None of them gave you two things. A natural-language entry point. And on-chain plus off-chain in one query. Each one also locked you into a single query style and, mostly, a single data plane. UATU bet the other way. One endpoint. One query contract. Every data type. Ask in typed QL, or ask in English.
The obvious build is to normalize all of it into one schema in one database and put GraphQL on top. It collapses under its own generality. Wallet activity is a graph. Token metrics are a time series. Project descriptions are text. Force all three into one store and every query is a compromise. The range scans crawl. The text search is weak. The wallet reads pull more than you asked for.
There is a second obvious build: serve reads straight off the same database your indexer writes to. That one fails differently. The chain never stops writing. So reads fight ingestion for the same locks, and tail latency turns unpredictable at exactly the moment traffic peaks. UATU is what you get when you refuse both defaults.
The architecture
Strip it to first principles. The system is a read API sitting on top of a slow, hostile, append-only source. Two rules fall out of that, and the rest of the design follows them.
- Reads must never touch the write path.
- No single store serves every access pattern well.
So UATU runs as three planes, with hard walls between them.
| Plane | Owns | Tech | Consistency |
|---|---|---|---|
| Write (ingestion) | Following the chain; turning raw blocks and events into structured records | ethers.js full-node / watcher services, indexer services (Node.js) | Source of truth |
| Read (serving) | Answering queries, one store per access pattern | Single-purpose data services (Node.js); Python query + NLP layer | Eventually consistent (as of last sync) |
| Control | Identity, metering, billing, dashboards | React front-ends, PostgreSQL / MySQL, Redis | Strongly consistent (ACID) |
A periodic sync sits between the write and read planes. The read stores are projections of the indexed data, rebuilt on a schedule. The read services never take a live write from the chain. That is what makes their reads lockless. No writer to contend with. No read locks to wait behind. Latency you can predict under load. The pattern has a name: CQRS, command query responsibility segregation, with read-optimized projections. Here it is pointed at a blockchain firehose.
The trade is real, so say it plainly. Reads are as fresh as the last sync, not the last block. For "average balance last month," that costs you nothing. For "did this transaction confirm two seconds ago," it hands you the wrong answer. UATU is built for the first kind of question. The trap is to dress synced data up as real-time and hope nobody asks the second kind. What you owe the user is a freshness guarantee, not a latency number.
What the diagram shows: the chain firehose is walled off in the write plane. The read services answer queries off their own stores and never wait on ingestion. Identity and money sit on their own consistent store. A request comes in through one of two doors, typed QL or English, and both land on the same query processor.
Polyglot persistence: one store per access pattern
Each read service does one job and owns exactly one datastore. You pick the store for how the data gets read, not for neatness.
| Data domain | Store | Why this store wins | Dominant access pattern |
|---|---|---|---|
| Wallet-address data | MongoDB | Flexible nested documents (holdings, history) keyed by address | Fetch the document for one address |
| Token / project data | Cassandra | High write throughput, partition-by-key, wide rows | All rows under a token or project partition |
| Historical off-chain metrics | InfluxDB | Purpose-built time-series DB: time-range scans, windowed aggregates, retention and downsampling | Range scan plus aggregate over a window |
| Project off-chain text | MongoDB | Document model with text indexing | Text search and document fetch |
| Hot / repeated queries | Redis | Sub-millisecond cache, keyed on the canonicalized query | Get and set by query hash |
| Users, keys, credits, billing | PostgreSQL / MySQL | ACID; usage and money must be transactional | Joins and ledgers |
This has a cost, and you should name it. You now run five database technologies. Each has its own failure mode, its own backup story, its own scaling curve. So why pay that? Because the alternative, one store for everything, taxes every single query instead of taxing your operations once. Most teams get this trade backwards. They take the easy operations and quietly accept that no query is ever fast.
UATU QL: the contract everything hangs on
UATU QL is the spine. The NLP layer emits it. The developer hand-writes it. The processor routes on it. Underneath, it is an abstract syntax tree: one operation over one or more sources, with filters and transforms on top. The brief calls its design the project's major feat, and that is not a throwaway. A query language a machine can reliably generate and a human can still read is rare. Those two audiences usually pull a grammar in opposite directions.
UATU's answer: serialize the same tree two ways.
| JSON form | Lisp / S-expression form | |
|---|---|---|
| Best for | Machine generation (NLP output), transport, validation | Nested operations, composition, hand-authoring |
| Strength | Ubiquitous tooling; easy to validate against a schema; easy to diff and cache | Expression trees map one to one onto nested transforms, like (sum (avg ...)) |
| Cost | Verbose once nesting gets deep | Needs its own parser; less familiar syntax |
| Role in UATU | The wire format the NLP layer emits and the edge validates | The algebra for multi-source operations |
Same meaning, two feels. A /nl request produces JSON, because that is what a generator should emit and a validator should check. A power user stacking nested aggregations reaches for the S-expression, because an operator tree is what an S-expression already is.
Representative shape, not the production grammar. It shows the three properties the brief names: many sources in one query, transforms over the result, and a choice of serialization for the same tree.
{
"select": ["wallet.balances", "token.metadata", "historical.price"],
"where": { "wallet": "vitalik.eth", "window": "last_30d" },
"transform": [
{ "op": "top", "n": 3, "by": "wallet.balances.usd" },
{ "op": "avg", "of": "historical.price", "by": "day" }
],
"format": "table"
}
(format table
(avg :by day
(top 3 :by usd
(join wallet/balances token/metadata historical/price
:where (= wallet "vitalik.eth") (window last-30d)))))
Canonicalization before routing
The processor resolves ambiguous values into indexes before it fans anything out. "vitalik.eth," "USDC," a project name: each one becomes the canonical address, token id, or partition key the stores actually hold. This is the blockchain version of schema linking in text-to-SQL. It buys you two things at once. The downstream sub-queries become deterministic. And the cache key becomes stable, because canonicalization runs before the cache lookup, so two users who phrase the same entity differently hit the same Redis entry. Skip this step and your cache quietly splinters along the surface wording of the query.
Reading the sequence: the /nl path is a thin wrapper. It turns English into QL, and from there it is just a /ql call. Two cache layers guard two different expensive things. One guards the NL translation. One guards the cross-store fan-out. Metering lands per service, after the result returns.
The NL pipeline, and a note on timing
The /nl endpoint turns English into UATU QL through a hybrid pipeline: semantic parsing, dependency graphs, and an LLM. Not a single model call.
That hybrid shape was the right call for December 2022, and the reason is a date. ChatGPT went public on 30 November 2022. UATU was being designed in the same few weeks. At that moment you could not trust an LLM to emit production-grade structured output in one shot. The state of the art in text-to-SQL was constrained, grammar-aware decoding, PICARD and its peers, sitting on top of sequence-to-sequence parsers. UATU's pipeline follows the same logic. Use linguistic structure to narrow the problem. Let the LLM draft. Then validate that draft against the QL grammar before it runs. UATU QL is a strict, parseable tree, so a bad draft gets caught instead of executed.
That gate matters because of how natural-language interfaces fail, then and now. The dangerous query is not the one that errors out. It is the one that parses, runs, and hands you a confident wrong number. A grammar you can validate against turns some of those silent errors into loud ones. It will not catch the subtle miss, "average" when the user meant "median," and no architecture fully does. Watch that gap. It will hurt you long before a syntax error does.
Data model
There is no single schema. That is the whole point of the read plane, and the store map above is the logical model. One part of UATU is relational through and through, though, and it is where schema design earns its keep as the backbone of the system: the control plane. Identity, metering, money.
Illustrative control-plane model. It matches the billing and usage system the brief describes.
What this maps: one query can charge against several services, and SERVICE_PRICE carries a credits_per_unit per service. That single relationship is what lets a Cassandra wide-row read and an InfluxDB windowed aggregate cost different amounts, which is the pricing structure the brief asks for. QUERY_LOG is what backs the dashboard view of your queries, responses, usage, and billing, and the reporting built on top of it. Usage and credits live in an ACID store on purpose. You can drop a cache. You cannot drop a charge.
Infrastructure and operations
Two endpoints on the outside become 20+ repositories on the inside. A repo per data service. Plus indexers, watchers and full-node services, the query and NLP processors, the dashboards, the front-ends, and the chat plugin. Two languages run side by side, picked per job. Python sits where the parsing and the ML pipeline live. Node.js sits on the I/O-bound data and chain services. The services are modular on purpose, so a new data domain plugs in as a new service without touching the ones already running.
The unglamorous part was shipping it. A 20-repo polyglot estate is a mammoth to move, which is why UATU ran a custom CI/CD pipeline. That pipeline's real job is not any one deploy. Its job is to give very different repos one uniform path to build, test, and ship on their own. The trap is to let each repo grow its own pipeline. By the twentieth, no two deploy the same way, and a new engineer has to learn twenty conventions before shipping a line.
Multi-tenancy works in an unusual way here, and the odd part is where the risk sits. In a normal SaaS, tenant isolation lives in the data plane. My rows must never leak into your account. UATU is different. The queryable data is public chain data, the same for everyone. So isolation moves almost entirely onto the control plane. A tenant should only ever see their own keys, usage, credits, and billing. Auth is an API key, checked at the edge against the control store. The user dashboard and the admin panel are separate front-ends across a role boundary. Get the control-plane isolation right and the data-plane leak you would normally fear barely exists. That is the opposite of where most teams point their security budget.
Where load bites, and how the design takes it
| Pressure point | Naive failure | How UATU absorbs it |
|---|---|---|
| Chain firehose (constant new blocks and events) | Reads contend with ingestion; tail latency spikes under traffic | Writes quarantined in the watcher and indexer plane; read stores are synced projections, so reads stay lockless |
| Hot / repeated queries | The same expensive fan-out runs again and again | Redis cache keyed on the canonicalized QL, at both the processor and NLP layers |
| NL translation cost and latency | Every /nl call pays the full pipeline plus LLM cost | NLP-layer cache on similar queries; /nl is a thin wrapper over the cheaper /ql path |
| Cross-store fan-out latency | The slowest store gates the whole response | Each store tuned to its access pattern; the processor fans out in parallel and aggregates on return |
| Ambiguous identifiers | Inconsistent results; cache fragments by phrasing | Upstream canonicalization, so sub-queries stay deterministic and the cache key stays stable |
| Billing integrity | Lost or double-charged usage | Usage and credits in an ACID relational store, metered per service |
Outcome
- It was built with 20+ repositories, covering data services, indexers, watcher and full-node services, the query and NLP processors, the dashboards, and the front-ends.
- A working React chat plugin shipped as a live demo of the natural-language API, running the English to QL to data path end to end.
- A modular, plug-in service architecture, held together by a custom CI/CD pipeline.
- UATU QL, designed so both the NLP layer and a person could generate and read it, in JSON or S-expression form.
What I'd watch
This design gets four things right that most teams get wrong. Each one can still bite you. Here is what I would watch, and what I would do first.
The QL turns into the org's bottleneck. One query contract sits between the NL layer, every data service, and every client. So every team is downstream of its grammar. That centralization is the strength. It is also the one place a careless schema change ripples through twenty repos at once. Version the QL like a public API from day one. Pay for it later if you do not.
Eventual consistency is a contract, not a footnote. The lockless-read win is real. The bill arrives the first time someone treats synced data as live. Publish a freshness number. Refuse the queries that need real-time confirmation. Do not quietly serve them stale and hope.
Polyglot persistence is a tax you pay up front. Five datastores means five backup stories, five scaling curves, five things to get woken up for. It is worth it only while each store earns its access pattern. The day a service that could have lived in MongoDB gets its own Cassandra "to be safe," the tax stops buying you anything. Cut it.
The silent wrong answer is the real risk on the NL side. Grammar validation catches the query that will not run. It does not catch the one that runs and answers a slightly different question than the user asked. That gap never fully closes. The best you can do is keep the query log honest, so a user can see exactly what was asked on their behalf. That is what QUERY_LOG is for.