// LIVE: uatu.xyz

UATU: A Universal Blockchain Database with a Natural-Language Query API

One API for every kind of blockchain data. Two ways to ask: a typed query language, or plain English. The databases were never the hard part. The hard part was one query contract that an LLM can write, a developer can read, and a router can fan out across a document store, a wide-column store, and a time-series store at once. All three audiences served, none of them leaking into the other two.

Snapshot


Role	Single-endpoint data API for blockchain, on-chain and off-chain, hiding node RPC, indexing, and metadata behind one query contract
Domain	Web3 / blockchain data infrastructure
Built	December 2022
Interface	Two endpoints only: `/ql` (UATU QL) and `/nl` (natural language)
Core stack	Node.js (data services) · Python (query + NLP processors) · React (dashboards, chat plugin) · ethers.js (EVM ingestion)
Datastores	MongoDB + Cassandra + InfluxDB (read plane) · Redis (cache) · PostgreSQL / MySQL (control plane)
Footprint	20+ repositories; custom CI/CD; modular plug-in data services; a React chat plugin shipped as a live demo
Team	5 backend + 1 frontend engineer; 4 founders (1 leading product and architecture, 3 on business)
Status	Pivoted. Architecture and codebase built; did not reach sustained production scale

The problem

Blockchain data is not one shape. It does not live in one place.

On-chain, you get an append-only firehose: blocks, transactions, event logs. You read it through RPC calls to a node, one at a time, and it is miserable to query in aggregate. Off-chain sits everything that gives that data meaning. Token and project metadata. Prices. Historical metrics. Project descriptions in plain text. Now ask one ordinary question: the average daily price of this wallet's three biggest holdings last month. You just joined four different worlds without noticing.

Late 2022 had tools for this. Each took one slice.

Tool	Query model	Data plane	Front door
The Graph	GraphQL over per-contract subgraphs you author and deploy yourself	On-chain	GraphQL
Covalent	Fixed REST endpoints, normalized across chains	On-chain (historical data lake)	REST
Dune	SQL for dashboards and analysis	On-chain	SQL
Bitquery	One GraphQL schema across many chains	On-chain	GraphQL

None of them gave you two things. A natural-language entry point. And on-chain plus off-chain in one query. Each one also locked you into a single query style and, mostly, a single data plane. UATU bet the other way. One endpoint. One query contract. Every data type. Ask in typed QL, or ask in English.

The obvious build is to normalize all of it into one schema in one database and put GraphQL on top. It collapses under its own generality. Wallet activity is a graph. Token metrics are a time series. Project descriptions are text. Force all three into one store and every query is a compromise. The range scans crawl. The text search is weak. The wallet reads pull more than you asked for.

There is a second obvious build: serve reads straight off the same database your indexer writes to. That one fails differently. The chain never stops writing. So reads fight ingestion for the same locks, and tail latency turns unpredictable at exactly the moment traffic peaks. UATU is what you get when you refuse both defaults.

The architecture

Strip it to first principles. The system is a read API sitting on top of a slow, hostile, append-only source. Two rules fall out of that, and the rest of the design follows them.

Reads must never touch the write path.
No single store serves every access pattern well.

So UATU runs as three planes, with hard walls between them.

Plane	Owns	Tech	Consistency
Write (ingestion)	Following the chain; turning raw blocks and events into structured records	ethers.js full-node / watcher services, indexer services (Node.js)	Source of truth
Read (serving)	Answering queries, one store per access pattern	Single-purpose data services (Node.js); Python query + NLP layer	Eventually consistent (as of last sync)
Control	Identity, metering, billing, dashboards	React front-ends, PostgreSQL / MySQL, Redis	Strongly consistent (ACID)

A periodic sync sits between the write and read planes. The read stores are projections of the indexed data, rebuilt on a schedule. The read services never take a live write from the chain. That is what makes their reads lockless. No writer to contend with. No read locks to wait behind. Latency you can predict under load. The pattern has a name: CQRS, command query responsibility segregation, with read-optimized projections. Here it is pointed at a blockchain firehose.

The trade is real, so say it plainly. Reads are as fresh as the last sync, not the last block. For "average balance last month," that costs you nothing. For "did this transaction confirm two seconds ago," it hands you the wrong answer. UATU is built for the first kind of question. The trap is to dress synced data up as real-time and hope nobody asks the second kind. What you owe the user is a freshness guarantee, not a latency number.

What the diagram shows: the chain firehose is walled off in the write plane. The read services answer queries off their own stores and never wait on ingestion. Identity and money sit on their own consistent store. A request comes in through one of two doors, typed QL or English, and both land on the same query processor.

Polyglot persistence: one store per access pattern

Each read service does one job and owns exactly one datastore. You pick the store for how the data gets read, not for neatness.

Data domain	Store	Why this store wins	Dominant access pattern
Wallet-address data	MongoDB	Flexible nested documents (holdings, history) keyed by address	Fetch the document for one address
Token / project data	Cassandra	High write throughput, partition-by-key, wide rows	All rows under a token or project partition
Historical off-chain metrics	InfluxDB	Purpose-built time-series DB: time-range scans, windowed aggregates, retention and downsampling	Range scan plus aggregate over a window
Project off-chain text	MongoDB	Document model with text indexing	Text search and document fetch
Hot / repeated queries	Redis	Sub-millisecond cache, keyed on the canonicalized query	Get and set by query hash
Users, keys, credits, billing	PostgreSQL / MySQL	ACID; usage and money must be transactional	Joins and ledgers

This has a cost, and you should name it. You now run five database technologies. Each has its own failure mode, its own backup story, its own scaling curve. So why pay that? Because the alternative, one store for everything, taxes every single query instead of taxing your operations once. Most teams get this trade backwards. They take the easy operations and quietly accept that no query is ever fast.

UATU QL: the contract everything hangs on

UATU QL is the spine. The NLP layer emits it. The developer hand-writes it. The processor routes on it. Underneath, it is an abstract syntax tree: one operation over one or more sources, with filters and transforms on top. The brief calls its design the project's major feat, and that is not a throwaway. A query language a machine can reliably generate and a human can still read is rare. Those two audiences usually pull a grammar in opposite directions.

UATU's answer: serialize the same tree two ways.

	JSON form	Lisp / S-expression form
Best for	Machine generation (NLP output), transport, validation	Nested operations, composition, hand-authoring
Strength	Ubiquitous tooling; easy to validate against a schema; easy to diff and cache	Expression trees map one to one onto nested transforms, like `(sum (avg ...))`
Cost	Verbose once nesting gets deep	Needs its own parser; less familiar syntax
Role in UATU	The wire format the NLP layer emits and the edge validates	The algebra for multi-source operations

Same meaning, two feels. A /nl request produces JSON, because that is what a generator should emit and a validator should check. A power user stacking nested aggregations reaches for the S-expression, because an operator tree is what an S-expression already is.

Representative shape, not the production grammar. It shows the three properties the brief names: many sources in one query, transforms over the result, and a choice of serialization for the same tree.

{
  "select": ["wallet.balances", "token.metadata", "historical.price"],
  "where":  { "wallet": "vitalik.eth", "window": "last_30d" },
  "transform": [
    { "op": "top", "n": 3, "by": "wallet.balances.usd" },
    { "op": "avg", "of": "historical.price", "by": "day" }
  ],
  "format": "table"
}

(format table
  (avg :by day
    (top 3 :by usd
      (join wallet/balances token/metadata historical/price
            :where (= wallet "vitalik.eth") (window last-30d)))))

Canonicalization before routing

The processor resolves ambiguous values into indexes before it fans anything out. "vitalik.eth," "USDC," a project name: each one becomes the canonical address, token id, or partition key the stores actually hold. This is the blockchain version of schema linking in text-to-SQL. It buys you two things at once. The downstream sub-queries become deterministic. And the cache key becomes stable, because canonicalization runs before the cache lookup, so two users who phrase the same entity differently hit the same Redis entry. Skip this step and your cache quietly splinters along the surface wording of the query.

Reading the sequence: the /nl path is a thin wrapper. It turns English into QL, and from there it is just a /ql call. Two cache layers guard two different expensive things. One guards the NL translation. One guards the cross-store fan-out. Metering lands per service, after the result returns.

The NL pipeline, and a note on timing

The /nl endpoint turns English into UATU QL through a hybrid pipeline: semantic parsing, dependency graphs, and an LLM. Not a single model call.

That hybrid shape was the right call for December 2022, and the reason is a date. ChatGPT went public on 30 November 2022. UATU was being designed in the same few weeks. At that moment you could not trust an LLM to emit production-grade structured output in one shot. The state of the art in text-to-SQL was constrained, grammar-aware decoding, PICARD and its peers, sitting on top of sequence-to-sequence parsers. UATU's pipeline follows the same logic. Use linguistic structure to narrow the problem. Let the LLM draft. Then validate that draft against the QL grammar before it runs. UATU QL is a strict, parseable tree, so a bad draft gets caught instead of executed.

That gate matters because of how natural-language interfaces fail, then and now. The dangerous query is not the one that errors out. It is the one that parses, runs, and hands you a confident wrong number. A grammar you can validate against turns some of those silent errors into loud ones. It will not catch the subtle miss, "average" when the user meant "median," and no architecture fully does. Watch that gap. It will hurt you long before a syntax error does.

Data model

There is no single schema. That is the whole point of the read plane, and the store map above is the logical model. One part of UATU is relational through and through, though, and it is where schema design earns its keep as the backbone of the system: the control plane. Identity, metering, money.

Illustrative control-plane model. It matches the billing and usage system the brief describes.

What this maps: one query can charge against several services, and SERVICE_PRICE carries a credits_per_unit per service. That single relationship is what lets a Cassandra wide-row read and an InfluxDB windowed aggregate cost different amounts, which is the pricing structure the brief asks for. QUERY_LOG is what backs the dashboard view of your queries, responses, usage, and billing, and the reporting built on top of it. Usage and credits live in an ACID store on purpose. You can drop a cache. You cannot drop a charge.

Infrastructure and operations

Two endpoints on the outside become 20+ repositories on the inside. A repo per data service. Plus indexers, watchers and full-node services, the query and NLP processors, the dashboards, the front-ends, and the chat plugin. Two languages run side by side, picked per job. Python sits where the parsing and the ML pipeline live. Node.js sits on the I/O-bound data and chain services. The services are modular on purpose, so a new data domain plugs in as a new service without touching the ones already running.

The unglamorous part was shipping it. A 20-repo polyglot estate is a mammoth to move, which is why UATU ran a custom CI/CD pipeline. That pipeline's real job is not any one deploy. Its job is to give very different repos one uniform path to build, test, and ship on their own. The trap is to let each repo grow its own pipeline. By the twentieth, no two deploy the same way, and a new engineer has to learn twenty conventions before shipping a line.

Multi-tenancy works in an unusual way here, and the odd part is where the risk sits. In a normal SaaS, tenant isolation lives in the data plane. My rows must never leak into your account. UATU is different. The queryable data is public chain data, the same for everyone. So isolation moves almost entirely onto the control plane. A tenant should only ever see their own keys, usage, credits, and billing. Auth is an API key, checked at the edge against the control store. The user dashboard and the admin panel are separate front-ends across a role boundary. Get the control-plane isolation right and the data-plane leak you would normally fear barely exists. That is the opposite of where most teams point their security budget.

Where load bites, and how the design takes it

Pressure point	Naive failure	How UATU absorbs it
Chain firehose (constant new blocks and events)	Reads contend with ingestion; tail latency spikes under traffic	Writes quarantined in the watcher and indexer plane; read stores are synced projections, so reads stay lockless
Hot / repeated queries	The same expensive fan-out runs again and again	Redis cache keyed on the canonicalized QL, at both the processor and NLP layers
NL translation cost and latency	Every `/nl` call pays the full pipeline plus LLM cost	NLP-layer cache on similar queries; `/nl` is a thin wrapper over the cheaper `/ql` path
Cross-store fan-out latency	The slowest store gates the whole response	Each store tuned to its access pattern; the processor fans out in parallel and aggregates on return
Ambiguous identifiers	Inconsistent results; cache fragments by phrasing	Upstream canonicalization, so sub-queries stay deterministic and the cache key stays stable
Billing integrity	Lost or double-charged usage	Usage and credits in an ACID relational store, metered per service

Outcome

It was built with 20+ repositories, covering data services, indexers, watcher and full-node services, the query and NLP processors, the dashboards, and the front-ends.
A working React chat plugin shipped as a live demo of the natural-language API, running the English to QL to data path end to end.
A modular, plug-in service architecture, held together by a custom CI/CD pipeline.
UATU QL, designed so both the NLP layer and a person could generate and read it, in JSON or S-expression form.

What I'd watch

This design gets four things right that most teams get wrong. Each one can still bite you. Here is what I would watch, and what I would do first.

The QL turns into the org's bottleneck. One query contract sits between the NL layer, every data service, and every client. So every team is downstream of its grammar. That centralization is the strength. It is also the one place a careless schema change ripples through twenty repos at once. Version the QL like a public API from day one. Pay for it later if you do not.

Eventual consistency is a contract, not a footnote. The lockless-read win is real. The bill arrives the first time someone treats synced data as live. Publish a freshness number. Refuse the queries that need real-time confirmation. Do not quietly serve them stale and hope.

Polyglot persistence is a tax you pay up front. Five datastores means five backup stories, five scaling curves, five things to get woken up for. It is worth it only while each store earns its access pattern. The day a service that could have lived in MongoDB gets its own Cassandra "to be safe," the tax stops buying you anything. Cut it.

The silent wrong answer is the real risk on the NL side. Grammar validation catches the query that will not run. It does not catch the one that runs and answers a slightly different question than the user asked. That gap never fully closes. The best you can do is keep the query log honest, so a user can see exactly what was asked on their behalf. That is what QUERY_LOG is for.