// LIVE: sigil.nxtcurve.com
SiGiL / NxtCurve PMS: An architecture teardown
A multi-tenant platform that defines, measures, and closes workforce skill gaps with AI. Generating good questions turned out to be the easy part. The hard problem was trusting a score enough to base a promotion on it.
Snapshot
| Field | Detail |
|---|---|
| Role | B2B workforce skill-intelligence and performance-management platform (UI brand: SiGiL; deployment: NxtCurve PMS) |
| Domain | HR-tech · learning and development · talent measurement |
| Primary users | Org Admin · Manager · Employee, each on a role-scoped dashboard |
| Core stack | TypeScript end-to-end · React 18 + Wouter + TanStack Query + Radix/shadcn (client) · Node/Express (server) · MongoDB + Mongoose, 31 schemas (data) · Redis + BullMQ (queue) · OpenAI + Anthropic (AI) |
| Async / AI work | Job state in MongoDB, with the queue externalized to a Redis-backed BullMQ setup and run by isolated worker processes |
| Infra | Ubuntu · Nginx · PM2 · MongoDB · Redis · GitHub Actions CI/CD · Docker Compose available |
| Status | Live |
The problem
The platform has to answer three questions for an organization, at scale, across many tenants, with no human grading in the loop: what does each role require, what does each person actually have, and how do you close the difference?
The middle question is the one that breaks naive approaches.
| Obvious approach | Why it fails as a skill measure |
|---|---|
| Fixed-form quiz (everyone gets the same items) | A beginner and an expert sit the same questions, which makes the test long, demoralizing, and wasteful. A small fixed bank also leaks: people memorize the popular items and pass them around, so you end up measuring recall rather than skill. |
| Self-assessment | Confidence barely tracks competence, so the signal is mostly noise. |
| Manager rating | Encodes the manager's bias and recent memory rather than the employee's actual ability. |
| "Ask an LLM to score them" | Ungrounded, not reproducible, and impossible to audit. Fine for a rough draft, but you can't base someone's pay on it. |
The defensible answer comes from psychometrics: Item Response Theory (IRT) combined with computerized adaptive testing (CAT). The test adapts to each person and estimates a separate ability per skill, efficiently, with a precision you can actually measure.
This is where most teams who reach for IRT run into trouble. The system can report a precise score long before it can report a correct one, and keeping those two apart is most of the work. It gets harder, not easier, once an AI is writing the questions and estimating their difficulty. Most of this teardown circles back to that one problem.
There is a second, quieter problem. Many organizations' data sits in one MongoDB cluster, so a single query that forgets its tenant filter lets Company A read Company B's skill matrix. For a tool that holds HR data, a leak like that is roughly the worst thing that can happen.
The architecture
The core decision is a modular TypeScript monolith. One codebase, split firmly into three layers (client/, server/, shared/), sitting on MongoDB, now run as separate web and worker processes over a Redis queue. The less common move is what goes in shared/. It holds not only the shared types but the Mongoose schemas and the IRT math too. The data contract and the measurement logic live in one place and compile into the browser, the API, and the workers, so none of them can disagree about what a skill, a question, or a proficiency actually is. There is one definition of each.
The rest followed from a few priorities. One language, so a small team moves quickly. One codebase, because a coherent product beats a tidy distributed diagram. The internal seams were still cut ahead of time (modular routers, a storage abstraction, an externalized job tier) so the runtime can fan out later without a rewrite.
| Decision | Chosen | Over | Because | Cost / what to watch |
|---|---|---|---|---|
| App shape | Modular TS monolith, 3 layers (client/server/shared) | Microservices | One team, one language; schemas and IRT math single-sourced; the quickest route to a coherent product | One codebase can hide a god-file that quietly becomes the architecture |
| Datastore | MongoDB + Mongoose | (migrated from) PostgreSQL | Flexible shapes for evolving roles and skills and for AI-generated nested docs; no migration per field tweak | No DB-enforced foreign keys or joins, so integrity and tenant isolation fall to application code |
| Persistence seam | IStorage interface | Mongoose calls scattered everywhere | A clean swap point during the Postgres to Mongo migration; one place to mock and test | Grew to 300+ methods; the indirection now costs more than it saves |
| Async work | Job state in Mongo, run on BullMQ/Redis workers | In-process fire-and-forget | Isolates long AI work, lets the web tier scale, and prevents double-runs through atomic job claiming | Redis now has to be kept highly available; jobs must be idempotent |
| Item generation | LLM-generated, real-data-anchored questions | A fixed human-authored bank | Acts as a question multiplier against memorization and leakage; anchoring transfers difficulty and style | Generated difficulty runs on an LLM prior until production responses re-fit it; needs dedup to stay diverse |
| Measurement | IRT adaptive (CAT) | Fixed quiz / self-rating | Efficient, per-skill ability with a defensible precision | Needs response data to calibrate; complex stop logic; worthless if the difficulty is wrong |
| Client routing/state | Wouter + TanStack Query | React Router + hand-rolled fetching | ~1.5 KB router; the query cache handles loading, caching, and background refetch, so the SPA can treat the API as a cache instead of wiring fetches by hand | Smaller ecosystem than React Router |
System topology
The web tier handles requests and the fast, synchronous IRT scoring that runs during a live assessment. Anything slow, like pulling skills out of an uploaded role doc, generating a question bank, or writing summaries, gets handed to Redis and run by isolated workers. Both tiers run from the same codebase. Because each worker claims a job atomically, running several copies of the web tier no longer means the same job fires twice.
The adaptive assessment engine
The engine runs on Item Response Theory (IRT). IRT models each answer as a function of two quantities: how able the person is at a skill (θ, theta) and how hard the item is (b). After each answer, the engine re-estimates θ for that skill and picks the next question whose difficulty sits at the current estimate. That is the point where a right or wrong outcome carries the most information, since information per item peaks when the person has about a 50/50 chance of getting it right. Get one right and the next gets harder; miss and it eases off. The test settles on each skill in a handful of items instead of marching everyone through forty fixed ones. The precision is computable rather than assumed: the standard error of θ is approximately 1 / √(total information), so the engine always knows how confident it is.
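The loop described above can be sketched in a few functions. This is a minimal illustration and not the platform's code: it assumes the one-parameter (Rasch) model, since only θ and b are mentioned, and it estimates θ as a posterior mean over a grid with a standard-normal prior, which is one common choice among several. Every name here (`pCorrect`, `itemInformation`, `estimateTheta`, `standardError`, `pickNextItem`) is hypothetical.

```typescript
// Rasch (1PL) sketch. All names are hypothetical, not the platform's API.
interface Item { id: string; b: number }            // b = item difficulty
interface Answer { b: number; correct: boolean }    // one scored response

// Probability of a correct answer for ability theta on an item of difficulty b.
function pCorrect(theta: number, b: number): number {
  return 1 / (1 + Math.exp(-(theta - b)));
}

// Fisher information of one item at theta: P(1 - P).
// It peaks at 0.25 when theta equals b, i.e. at a 50/50 chance.
function itemInformation(theta: number, b: number): number {
  const p = pCorrect(theta, b);
  return p * (1 - p);
}

// Ability estimate: posterior mean of theta on a grid from -4 to 4,
// under a standard-normal prior (an assumption of this sketch).
function estimateTheta(answers: Answer[]): number {
  let weightedSum = 0;
  let total = 0;
  for (let i = -40; i <= 40; i++) {
    const t = i / 10;
    let w = Math.exp(-0.5 * t * t); // prior weight
    for (const a of answers) {
      const p = pCorrect(t, a.b);
      w *= a.correct ? p : 1 - p;   // likelihood of each response
    }
    weightedSum += t * w;
    total += w;
  }
  return weightedSum / total;
}

// Standard error from the test information accumulated at theta.
function standardError(theta: number, answers: Answer[]): number {
  let info = 0;
  for (const a of answers) info += itemInformation(theta, a.b);
  return info > 0 ? 1 / Math.sqrt(info) : Infinity;
}

// Next item: the unused one with the most information at theta.
// Under Rasch that is the item whose b is closest to theta.
function pickNextItem(theta: number, pool: Item[], usedIds: string[]): Item | undefined {
  let best: Item | undefined;
  for (const item of pool) {
    if (usedIds.indexOf(item.id) !== -1) continue;
    if (!best || itemInformation(theta, item.b) > itemInformation(theta, best.b)) {
      best = item;
    }
  }
  return best;
}
```

A live session would call `estimateTheta` after each answer, check `standardError` against the stop rules, and then call `pickNextItem` for the next question.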
The item format scales with the level being tested: MCQ for recall, SJT (situational judgment) for applied decisions, and caselets for multi-step scenario reasoning.
One assessment, taken skill by skill. Each answer updates the ability estimate and gets written to the response log, and three stop rules decide when a skill has been measured well enough to lock. The note at the bottom is the loop that keeps a predicted difficulty honest against real data.
| Stop rule | Fires when | Protects against | Psychometric basis |
|---|---|---|---|
| Precision | A skill's standard error drops below threshold (SE = 1/√information) | Over-testing once the estimate is already tight | Standard SE stopping |
| Fatigue | Item count per skill or session hits a cap | Test-taker exhaustion, item over-exposure, and the noisy data both produce | Maximum-length stopping |
| Mercy | The person sits clearly and stably below the skill floor | Demoralizing someone with ever-harder items they can't reach | Product-level early exit on a low, settled estimate |
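The three rules in the table could be combined into one check per skill, roughly as below. This is a sketch under stated assumptions: the thresholds, the field names, and the way "clearly and stably below the floor" is turned into a test (the last few estimates all under the floor, and the current estimate under it by more than one standard error) are illustrative choices, not values taken from the platform.

```typescript
// Hypothetical stop-rule check; names and thresholds are assumptions.
type StopReason = "precision" | "fatigue" | "mercy" | null;

interface SkillState {
  theta: number;          // current ability estimate
  se: number;             // current standard error of theta
  itemsAsked: number;     // items served for this skill so far
  thetaHistory: number[]; // estimate after each answer, oldest first
}

interface StopConfig {
  seThreshold: number;    // precision rule fires below this SE
  maxItems: number;       // fatigue cap per skill
  floorTheta: number;     // lowest ability the skill is designed to measure
  mercyWindow: number;    // how many recent estimates must sit below the floor
}

function checkStop(state: SkillState, cfg: StopConfig): StopReason {
  // Precision: the estimate is already tight enough.
  if (state.se < cfg.seThreshold) return "precision";

  // Mercy: recent estimates are all below the floor (stable), and the
  // current one is below it by more than one standard error (clear).
  const recent = state.thetaHistory.slice(-cfg.mercyWindow);
  const stable =
    recent.length >= cfg.mercyWindow &&
    recent.every((t) => t < cfg.floorTheta);
  const clear = state.theta + state.se < cfg.floorTheta;
  if (stable && clear) return "mercy";

  // Fatigue: the item cap has been reached.
  if (state.itemsAsked >= cfg.maxItems) return "fatigue";

  return null; // keep testing
}
```

Returning the reason rather than a boolean lets the response log record why a skill locked, which is useful when auditing a score later.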
Why an LLM writes the bank
Two things wear an assessment down over time: too few items, so questions recur and get memorized or passed around, and a stale bank, so "skill" quietly turns into recall of leaked answers. The platform handles both by using an LLM as a question multiplier. For each skill, a background job generates a large pool of items, but not blindly. It gets fed real questions for that skill from historical assessment data, so the generated items pick up the style, framing, and difficulty band of items real people have already sat. Embedding-based deduplication then strips near-duplicates so the pool stays varied. The result is a deep, fresh bank that resists gaming, rather than a small fixed set everyone eventually memorizes.
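The deduplication step can be illustrated with a greedy filter over cosine similarity. This is a minimal sketch, assuming each generated item already carries an embedding vector; the `Candidate` shape, the `dedupe` name, and the 0.92 threshold are all assumptions for illustration, and a real bank of any size would likely use a vector index rather than this pairwise scan.

```typescript
// Hypothetical near-duplicate filter over item embeddings.
interface Candidate { id: string; embedding: number[] }

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Keep a candidate only if it is not too similar to anything already kept.
// The threshold is an assumed value that would need tuning per embedding model.
function dedupe(candidates: Candidate[], threshold: number = 0.92): Candidate[] {
  const kept: Candidate[] = [];
  for (const c of candidates) {
    const isDuplicate = kept.some(
      (k) => cosineSimilarity(k.embedding, c.embedding) >= threshold
    );
    if (!isDuplicate) kept.push(c);
  }
  return kept;
}
```

Because the filter is greedy, the order of `candidates` decides which member of a near-duplicate cluster survives, so sorting by some quality signal first would be a reasonable refinement.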
Where irt_b comes from, and the line that still matters
Difficulty is not guessed. The same real-data anchors that steer generation also let the LLM predict an irt_b for each new item: a few-shot estimate calibrated against items whose empirical difficulty is already known. This is the standard answer to the cold-start problem. A brand-new question has no response data, and pretesting it on hundreds of people is slow and expensive, so a prediction anchored in real data lets the item go live right away with a usable difficulty. It makes a solid prior. Anchored predictions like this track empirical difficulty moderately to strongly, far better than guessing by eye.
On its own, though, a predicted irt_b is a prior, not a calibration, and the gap between those two is exactly where high-stakes scoring gets decided. There are two reasons the prediction can't be the final number.
- An LLM's sense of difficulty is not the same as your workforce's. The two correlate, but they are different signals: something that stumps a person can be trivial for the model, and the gap can grow rather than shrink as the model reasons harder. The prediction is about the item, not the people answering it.
- Difficulty is population-relative by definition. In IRT, b lives on the same scale as the ability (θ) of the population that produced it. The real difficulty of an item for this organization's employees is something only their own responses can reveal.
So the loop is closed in production. The LLM prediction is the prior, the logged responses are the data, and the re-fit produces the calibrated posterior. It runs continuously, not as a one-off. Items whose empirical difficulty drifts from the prediction get flagged, which catches a bad question and a bad prediction with the same check.
Calibration happens by degrees rather than flipping on all at once. A freshly generated item runs on its LLM prior until enough people have answered it, so at any moment the bank holds well-calibrated items next to newly seeded ones still converging on their true difficulty. The divergence flag stays on as a standing monitor. Drift from a curriculum change, leakage, or over-exposure is exactly what it exists to catch.
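One way to picture the re-fit is a per-item update that treats the LLM's irt_b as the center of a Gaussian prior and the logged responses as data. The sketch below assumes a Rasch model, assumes each logged response is paired with the respondent's ability estimate, and uses Newton's method to find the most probable difficulty. The names (`refitDifficulty`, `isDivergent`), the prior width, the minimum response count, and the tolerance are all hypothetical. A production calibration would more likely fit items and abilities jointly.

```typescript
// Hypothetical re-fit of one item's difficulty from logged responses.
interface LoggedResponse {
  theta: number;    // respondent's ability estimate (assumed available in the log)
  correct: boolean; // whether they answered this item correctly
}

// MAP estimate of b under a Rasch likelihood and a Normal(priorB, priorSd) prior.
// With no responses the result is the prior; with many, the data dominates.
function refitDifficulty(
  priorB: number,
  responses: LoggedResponse[],
  priorSd: number = 1
): number {
  const priorPrecision = 1 / (priorSd * priorSd);
  let b = priorB;
  for (let iter = 0; iter < 50; iter++) {
    // Gradient and second derivative of the log posterior with respect to b.
    let grad = -(b - priorB) * priorPrecision;
    let hess = -priorPrecision;
    for (const r of responses) {
      const p = 1 / (1 + Math.exp(-(r.theta - b)));
      grad += p - (r.correct ? 1 : 0);
      hess -= p * (1 - p);
    }
    const step = grad / hess; // hess is always negative, so this is well defined
    b -= step;
    if (Math.abs(step) < 1e-8) break;
  }
  return b;
}

// Divergence flag: enough data, and the fitted value has moved far from the prior.
function isDivergent(
  priorB: number,
  fittedB: number,
  responseCount: number,
  minResponses: number = 30,
  tolerance: number = 0.5
): boolean {
  return responseCount >= minResponses && Math.abs(fittedB - priorB) > tolerance;
}
```

The prior width controls how quickly a new item leaves its LLM-seeded value: a narrow prior needs more responses to move, which matches the idea of calibration happening by degrees.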
The trap here is subtle. A number that comes from real data and a capable model looks empirical, and a number that looks empirical is more dangerous when it's wrong than a figure someone openly admits they estimated. Running the re-fit, rather than trusting the prior, is what lets a score stand behind a real decision. Any system that measures people faces the same test eventually. The question is not whether the score is precise but whether you'd trust it enough to base your own promotion on it. For an item with the responses behind it, the honest answer here can be yes.
Data model
Thirty-one Mongoose schemas. What follows is the core set, the entities the main workflows turn on. The rest (settings, flags, cooldowns, approval records, embedding caches, audit trails) hangs off this skeleton.
The domain reads as a single chain. An organization defines roles, roles expect skills, skills generate questions, employees answer them in sessions, sessions lock a per-skill proficiency, the gap between assessed and expected proficiency drives a goal, and a learning journey closes the goal. The organizationId on tenant-scoped collections is the key the whole isolation story rests on.
Three schema decisions matter more than the rest:
- organizationId is stamped on every tenant-scoped document. It's the partition key for the whole platform. Having it there is the easy part. Enforcing it on every read is the hard part (see Operations).
- The response log is what turns a prediction into a measurement. The LLM gives each item a difficulty prior. The logged responses (item, correctness, latency) are what the production re-fit reads to turn that prior into a population-calibrated irt_b. Without this table the engine would be stuck on the model's guess, which makes it a first-class asset and not an afterthought. It's worth designing carefully up front, because you will lean on it.
- Questions carry an embedding for deduplication. This is the other half of the question-multiplier approach: generate a large pool, then use vector similarity to strip near-duplicates that would quietly skew item exposure and difficulty coverage.
The expected-versus-assessed proficiency pair (both on the 1 to 5 scale, on SKILL_PROFICIENCY) is what the back half of the product is built on. The gap between the two drives everything downstream: it feeds the AI gap summary, seeds the SMART goal, and scopes the learning journey.
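As a small illustration of how that pair might feed the downstream steps, the sketch below computes the gap per skill and ranks the shortfalls. The field names and the `rankGaps` helper are assumptions; only the 1 to 5 scale and the expected/assessed pairing come from the description above.

```typescript
// Hypothetical shape of one SKILL_PROFICIENCY record; field names are assumed.
interface SkillProficiency {
  skill: string;
  expected: number; // role's expected level, 1 to 5
  assessed: number; // locked assessed level, 1 to 5
}

interface SkillGap { skill: string; gap: number }

// Return only the skills where the person falls short, largest gap first.
function rankGaps(rows: SkillProficiency[]): SkillGap[] {
  return rows
    .map((r) => ({ skill: r.skill, gap: r.expected - r.assessed }))
    .filter((g) => g.gap > 0)
    .sort((x, y) => y.gap - x.gap);
}
```

The top of that list is the natural input to a gap summary or a goal, while skills at or above the expected level drop out.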
Infrastructure and operations
| Layer | Choice | What it does in production |
|---|---|---|
| Reverse proxy | Nginx | TLS termination, static assets, and fronting the web tier |
| Process management | PM2 | Keeps processes alive, restarts on crash, and runs the web tier in cluster mode (multiple replicas) |
| Database | MongoDB | Source of state, including durable job records |
| Queue / broker | Redis + BullMQ | Holds the job queue; workers claim jobs atomically |
| CI/CD | GitHub Actions | Build, test, and deploy pipeline |
| Packaging | Docker Compose | Containerized deployment option alongside the bare-Ubuntu path |
Where load and failure bite
Externalizing the queue removed the main thing that used to block scaling. Long AI work (role extraction, question generation, summaries) no longer runs inside the web process. It gets enqueued to Redis and claimed by isolated workers, one job per worker. That gives two scaling axes that move independently: the stateless web tier can run as many PM2 replicas as traffic needs, and the worker pool can grow or shrink with AI load, without the double-execution problem that in-process jobs would have caused in cluster mode.
The new things to watch come with the queue, not from going without one. Redis is now a dependency that needs persistence and a high-availability plan, since it holds in-flight work. Jobs have to be idempotent, because at-least-once delivery means a retried job can run twice. And with both the web and worker tiers scaled out, MongoDB becomes the next likely bottleneck under contention.
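The two properties named here, atomic claiming and idempotency, can be shown with an in-memory stand-in for the durable job records. This is a sketch only: `JobStore` and `handleDelivery` are hypothetical, the work function is synchronous for brevity, and in a real deployment the claim would need to be a single conditional write against the database (for example a Mongoose `findOneAndUpdate` that filters on the queued status) rather than a check in process memory.

```typescript
// Hypothetical in-memory model of durable job records.
type JobStatus = "queued" | "running" | "done";

interface JobRecord<R> { id: string; status: JobStatus; result?: R }

class JobStore<R> {
  private jobs: { [id: string]: JobRecord<R> } = {};

  enqueue(id: string): void {
    if (!this.jobs[id]) this.jobs[id] = { id, status: "queued" };
  }

  get(id: string): JobRecord<R> | undefined {
    return this.jobs[id];
  }

  // Compare-and-set: succeeds only if the job is still queued.
  claim(id: string): boolean {
    const job = this.jobs[id];
    if (!job || job.status !== "queued") return false;
    job.status = "running";
    return true;
  }

  complete(id: string, result: R): void {
    const job = this.jobs[id];
    if (job) {
      job.status = "done";
      job.result = result;
    }
  }
}

// Handle one delivery of a job. A redelivered job that already finished
// returns the stored result instead of running the work again.
// Note: if work() throws, the job stays "running"; a real system would
// need a lease timeout or retry policy, which this sketch leaves out.
function handleDelivery<R>(store: JobStore<R>, id: string, work: () => R): R | undefined {
  const existing = store.get(id);
  if (existing && existing.status === "done") return existing.result;
  if (!store.claim(id)) return undefined; // another worker holds it
  const result = work();
  store.complete(id, result);
  return result;
}
```

Keying the stored result on the job id is what makes a retry safe: the expensive AI call happens once, and every later delivery sees the same output.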
Security posture
| Concern | Implementation | Read this as |
|---|---|---|
| Multi-tenancy | organizationId filter applied by hand in every query | Forget one predicate and a tenant's data leaks. Isolation is currently a habit and not a guarantee |
| RBAC | Three roles (Admin / Manager / Employee), role-scoped dashboards and routes; managers gain role-description and question-bank powers only if permitted | Authority is least-privilege and graduated, not a single on/off switch |
| Onboarding control | Org approval flow: new organizations sit pending until approved | A gate against spam and unvetted tenants before anyone can act inside a tenant |
| Assessment integrity | Cooldowns, forced retakes, question flagging, and support tickets, plus a generated, deduplicated bank that resists leakage | The anti-gaming and quality controls a measurement system depends on |
| AI safety | Human-in-the-loop verify step before extracted skills count; model-seeded item difficulty corrected by production responses | The model proposes and a person signs off; nothing AI-generated is load-bearing until a human approves it or real data has corrected it |
The one to internalize is structural tenant isolation. Behavioral isolation, where everyone remembers to add the org filter to all 300-odd storage methods, works right up until the single query where someone forgets, and that one omission is a cross-tenant breach. The fix is to make isolation structural: a Mongoose plugin that injects the tenant predicate automatically, or a per-tenant query helper that no query can bypass. The principle is simple to state and hard to keep: don't rely on every engineer remembering the most important filter on every query for the life of the codebase.
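The per-tenant query helper option might look like the sketch below. It uses an in-memory collection so the example stays self-contained; `Collection`, `InMemoryCollection`, and `forTenant` are hypothetical names, and real code would wrap a Mongoose model or register a query plugin instead. The point it illustrates is that the tenant id is bound once, when the handle is created, and always overrides whatever the caller passes in.

```typescript
// Hypothetical tenant-scoped query helper over a minimal collection interface.
type Filter = { [field: string]: unknown };
interface Doc { organizationId: string; [field: string]: unknown }

interface Collection {
  find(filter: Filter): Doc[];
}

// In-memory stand-in with exact-match filtering, for illustration only.
class InMemoryCollection implements Collection {
  private docs: Doc[];

  constructor(docs: Doc[]) {
    this.docs = docs;
  }

  find(filter: Filter): Doc[] {
    const keys = Object.keys(filter);
    return this.docs.filter((d) => keys.every((k) => d[k] === filter[k]));
  }
}

// The scoped handle is the only thing application code should receive.
// organizationId is spread last, so a caller cannot override it.
function forTenant(collection: Collection, organizationId: string) {
  return {
    find(filter: Filter = {}): Doc[] {
      return collection.find({ ...filter, organizationId });
    },
  };
}
```

With this shape, forgetting the filter is no longer possible at the call site, and an attempt to pass another tenant's id is silently replaced by the bound one.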
The architectural debt ledger
| Mandate | State | If left unaddressed | Fix | Priority |
|---|---|---|---|---|
| Tenant isolation | organizationId filtered by hand per query | One missed predicate equals a cross-tenant breach | Mongoose plugin or per-tenant query helper, for structural isolation | 1: irreversible blast radius |
| IRT calibration | Resolved. irt_b LLM-seeded, re-fit against production responses, divergent items flagged | n/a | Done; now a standing monitor (new-item convergence, item drift) | n/a |
| Queue externalization | Resolved. BullMQ/Redis, isolated workers | n/a | Done; now harden idempotency and Redis HA | n/a |
| routes.ts god-file | ~5,500 lines spanning skills/goals/learning/legacy | Becomes the architecture by default; change-risk compounds | Split by domain to match the modular routers; deprecate by a fixed date | 2: entropy |
| IStorage interface | 300+ methods on one surface | Hardens into something nobody dares touch | Split along the same domain seams as the routers | 2: entropy |
| Legacy /dashboard + dual surfaces | A personal flow running parallel to enterprise | Two products to maintain forever | Set a deprecation date; collapse to the enterprise surface | 3: focus |
Outcome
What the system clearly is, based on the brief and nothing more:
- A working end-to-end pipeline across the full skill-intelligence loop: define (AI extraction from PDF/Word/Excel role docs) → verify (adaptive IRT assessment) → diagnose (assessed-versus-expected gap analysis) → plan (AI SMART goals) → learn (AI learning journeys, with optional Coursera/Moodle hooks), all under a real multi-tenant org/team/role structure.
- A 31-schema domain model that came through a real PostgreSQL to MongoDB migration. The 300-method IStorage interface is the scar tissue from that move. It did its job during the cutover and is now correctly flagged as indirection to retire. Abstractions that earn their keep during a migration rarely keep earning it afterward, and the trick is noticing when the scaffolding has quietly become the building.
- A horizontally scalable runtime: the background queue is externalized to Redis/BullMQ with isolated workers, so the web and worker tiers scale independently and cluster mode no longer double-runs jobs.
- An assessment engine that solves the item cold-start problem with LLM-seeded difficulty, uses that generation deliberately as a question multiplier against leakage, and then re-fits the difficulty against production responses on a continuous loop. The model's estimate is a starting prior that real data corrects, not a number trusted on faith. Calibrating against your own population rather than the model's guess is what lets a skill score stand behind a promotion.
Maturity, stated plainly: feature-complete, horizontally scalable, and running a live calibration loop, with model-seeded difficulty re-fit against the organization's own responses. The open items are structural and security hygiene (tenant isolation first, then the routes.ts and IStorage decomposition and legacy deprecation) rather than measurement integrity or scale.