V.Trivedy_

// LIVE: sigil.nxtcurve.com

SiGiL / NxtCurve PMS: An architecture teardown

A multi-tenant platform that defines, measures, and closes workforce skill gaps with AI. Generating good questions turned out to be the easy part. The hard problem was trusting a score enough to base a promotion on it.


Snapshot

Field | Detail
Role | B2B workforce skill-intelligence and performance-management platform (UI brand: SiGiL; deployment: NxtCurve PMS)
Domain | HR-tech · learning and development · talent measurement
Primary users | Org Admin · Manager · Employee, each on a role-scoped dashboard
Core stack | TypeScript end-to-end · React 18 + Wouter + TanStack Query + Radix/shadcn (client) · Node/Express (server) · MongoDB + Mongoose, 31 schemas (data) · Redis + BullMQ (queue) · OpenAI + Anthropic (AI)
Async / AI work | Job state in MongoDB, with the queue externalized to a Redis-backed BullMQ setup and run by isolated worker processes
Infra | Ubuntu · Nginx · PM2 · MongoDB · Redis · GitHub Actions CI/CD · Docker Compose available
Status | Live

The problem

The platform has to answer three questions for an organization, at scale, across many tenants, with no human grading in the loop: what does each role require, what does each person actually have, and how do you close the difference?

The middle question is the one that breaks naive approaches.

Obvious approach | Why it fails as a skill measure
Fixed-form quiz (everyone gets the same items) | A beginner and an expert sit the same questions, which makes the test long, demoralizing, and wasteful. A small fixed bank also leaks: people memorize the popular items and pass them around, so you end up measuring recall rather than skill.
Self-assessment | Confidence barely tracks competence, so the signal is mostly noise.
Manager rating | Encodes the manager's bias and recent memory rather than the employee's actual ability.
"Ask an LLM to score them" | Ungrounded, not reproducible, and impossible to audit. Fine for a rough draft, but you can't base someone's pay on it.

The defensible answer comes from psychometrics: Item Response Theory (IRT) combined with computerized adaptive testing (CAT). The test adapts to each person and estimates a separate ability per skill, efficiently, with a precision you can actually measure.

This is where most teams who reach for IRT run into trouble. The system can report a precise score long before it can report a correct one, and keeping those two apart is most of the work. It gets harder, not easier, once an AI is writing the questions and estimating their difficulty. Most of this teardown circles back to that one problem.

There is a second, quieter problem. Many organizations' data sits in one MongoDB cluster, so a single query that forgets its tenant filter lets Company A read Company B's skill matrix. For a tool that holds HR data, a leak like that is roughly the worst thing that can happen.


The architecture

The core decision is a modular TypeScript monolith. One codebase, split firmly into three layers (client/, server/, shared/), sitting on MongoDB, now run as separate web and worker processes over a Redis queue. The less common move is what goes in shared/. It holds not only the shared types but the Mongoose schemas and the IRT math too. The data contract and the measurement logic live in one place and compile into the browser, the API, and the workers, so none of them can disagree about what a skill, a question, or a proficiency actually is. There is one definition of each.

The rest followed from a few priorities. One language, so a small team moves quickly. One codebase, because a coherent product beats a tidy distributed diagram. The internal seams, though, are cut ahead of time (modular routers, a storage abstraction, an externalized job tier) so the runtime can fan out later without a rewrite.

Decision | Chosen | Over | Because | Cost / what to watch
App shape | Modular TS monolith, 3 layers (client/server/shared) | Microservices | One team, one language; schemas and IRT math single-sourced; the quickest route to a coherent product | One codebase can hide a god-file that quietly becomes the architecture
Datastore | MongoDB + Mongoose | (migrated from) PostgreSQL | Flexible shapes for evolving roles and skills and for AI-generated nested docs; no migration per field tweak | No DB-enforced foreign keys or joins, so integrity and tenant isolation fall to application code
Persistence seam | IStorage interface | Mongoose calls scattered everywhere | A clean swap point during the Postgres to Mongo migration; one place to mock and test | Grew to 300+ methods; the indirection now costs more than it saves
Async work | Job state in Mongo, run on BullMQ/Redis workers | In-process fire-and-forget | Isolates long AI work, lets the web tier scale, and prevents double-runs through atomic job claiming | Redis now has to be kept highly available; jobs must be idempotent
Item generation | LLM-generated, real-data-anchored questions | A fixed human-authored bank | Acts as a question multiplier against memorization and leakage; anchoring transfers difficulty and style | Generated difficulty runs on an LLM prior until production responses re-fit it; needs dedup to stay diverse
Measurement | IRT adaptive (CAT) | Fixed quiz / self-rating | Efficient, per-skill ability with a defensible precision | Needs response data to calibrate; complex stop logic; worthless if the difficulty is wrong
Client routing/state | Wouter + TanStack Query | React Router + hand-rolled fetching | ~1.5 KB router; the query cache handles loading, caching, and background refetch, so the SPA can treat the API as a cache instead of wiring fetches by hand | Smaller ecosystem than React Router

System topology

The web tier handles requests and the fast, synchronous IRT scoring that runs during a live assessment. Anything slow, like pulling skills out of an uploaded role doc, generating a question bank, or writing summaries, gets handed to Redis and run by isolated workers. Both tiers run from the same codebase. Because each worker claims a job atomically, running several copies of the web tier no longer means the same job fires twice.

The adaptive assessment engine

The engine runs on Item Response Theory (IRT). IRT models each answer as the outcome of two quantities: how able the person is at a skill (θ, theta) and how hard the item is (b). The probability of a correct answer depends on the difference between the two. After each answer, the engine re-estimates θ for that skill and picks the next question whose difficulty sits at the current estimate. That is the point where a right or wrong outcome carries the most information, since information per item peaks when the person has about a 50/50 chance of getting it right. Get one right and the next gets harder; miss and it eases off. The test settles on each skill in a handful of items instead of marching everyone through forty fixed ones. The precision is computable rather than assumed: the standard error of θ is approximately 1 / √(total information), so the engine always knows how confident it is.
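The loop described above can be sketched in a few functions. This is a minimal illustration and not the platform's code: it assumes the one-parameter (Rasch) logistic model, an expected-a-posteriori estimate on a fixed grid with a standard normal prior, and hypothetical names throughout (pCorrect, itemInformation, estimateTheta, pickNextItem). The source does not say which IRT model or estimator the engine actually uses.

```typescript
// Minimal 1PL (Rasch) sketch. Model choice, grid, prior, and names are assumptions.

interface AnsweredItem {
  b: number;        // item difficulty, on the same scale as theta
  correct: boolean; // whether the person answered correctly
}

// Probability of a correct answer for ability theta on an item of difficulty b.
function pCorrect(theta: number, b: number): number {
  return 1 / (1 + Math.exp(-(theta - b)));
}

// Fisher information of one item at theta. For the 1PL model this is p(1 - p),
// which peaks at 0.25 when p = 0.5, that is, when b equals theta.
function itemInformation(theta: number, b: number): number {
  const p = pCorrect(theta, b);
  return p * (1 - p);
}

// Standard error of theta from the items answered so far: 1 / sqrt(total information).
function standardError(theta: number, answered: AnsweredItem[]): number {
  const total = answered.reduce((sum, it) => sum + itemInformation(theta, it.b), 0);
  return total > 0 ? 1 / Math.sqrt(total) : Infinity;
}

// Expected-a-posteriori estimate of theta on a grid from -4 to 4,
// using an unnormalised standard normal prior.
function estimateTheta(answered: AnsweredItem[]): number {
  let weighted = 0;
  let norm = 0;
  for (let i = 0; i <= 80; i++) {
    const theta = -4 + i * 0.1;
    let weight = Math.exp(-0.5 * theta * theta);
    for (const it of answered) {
      const p = pCorrect(theta, it.b);
      weight *= it.correct ? p : 1 - p;
    }
    weighted += theta * weight;
    norm += weight;
  }
  return weighted / norm;
}

// Pick the candidate difficulty closest to the current estimate. Under the
// 1PL model that is also the item with the most information at theta.
function pickNextItem(theta: number, candidateDifficulties: number[]): number | undefined {
  let best: number | undefined;
  for (const b of candidateDifficulties) {
    if (best === undefined || Math.abs(b - theta) < Math.abs(best - theta)) best = b;
  }
  return best;
}
```

With an empty answer history the estimate sits at the prior mean, so the first item comes from the middle of the difficulty range; each correct answer then pulls the estimate, and the next item's difficulty, upward.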

The item format scales with the level being tested: MCQ for recall, SJT (situational judgment) for applied decisions, and caselets for multi-step scenario reasoning.

One assessment, taken skill by skill. Each answer updates the ability estimate and gets written to the response log, and three stop rules decide when a skill has been measured well enough to lock. The response log in turn feeds the loop that keeps a predicted difficulty honest against real data.

Stop rule | Fires when | Protects against | Psychometric basis
Precision | A skill's standard error drops below threshold (SE = 1/√information) | Over-testing once the estimate is already tight | Standard SE stopping
Fatigue | Item count per skill or session hits a cap | Test-taker exhaustion, item over-exposure, and the noisy data both produce | Maximum-length stopping
Mercy | The person sits clearly and stably below the skill floor | Demoralizing someone with ever-harder items they can't reach | Product-level early exit on a low, settled estimate
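
The three stop rules reduce to one small decision function. The sketch below is illustrative only: the thresholds in defaultStopConfig are made-up values, the field names are hypothetical, and "clearly and stably below the floor" is interpreted here as "an upper confidence bound on θ is under the floor after a minimum number of items", which is an assumption rather than the platform's definition.

```typescript
// Hypothetical stop-rule check. Every threshold below is an assumed value.
type StopReason = "precision" | "fatigue" | "mercy";

interface SkillProgress {
  theta: number;         // current ability estimate for the skill
  standardError: number; // current standard error of theta
  skillItems: number;    // items answered for this skill
  sessionItems: number;  // items answered in the whole session
}

interface StopConfig {
  seThreshold: number;     // precision rule: stop once SE is below this
  maxSkillItems: number;   // fatigue rule: per-skill cap
  maxSessionItems: number; // fatigue rule: per-session cap
  floorTheta: number;      // mercy rule: the skill floor on the theta scale
  mercyMinItems: number;   // mercy rule: items needed before an estimate counts as settled
  mercyZ: number;          // mercy rule: width of the upper confidence bound
}

const defaultStopConfig: StopConfig = {
  seThreshold: 0.35,
  maxSkillItems: 12,
  maxSessionItems: 40,
  floorTheta: -1.5,
  mercyMinItems: 4,
  mercyZ: 1.645,
};

// Returns the rule that fired, or null if the skill should keep going.
function checkStop(p: SkillProgress, c: StopConfig = defaultStopConfig): StopReason | null {
  // Precision: the estimate is already tight enough.
  if (p.standardError < c.seThreshold) return "precision";

  // Mercy: even the optimistic end of the interval is below the floor.
  const upperBound = p.theta + c.mercyZ * p.standardError;
  if (p.skillItems >= c.mercyMinItems && upperBound < c.floorTheta) return "mercy";

  // Fatigue: a per-skill or per-session cap has been reached.
  if (p.skillItems >= c.maxSkillItems || p.sessionItems >= c.maxSessionItems) return "fatigue";

  return null;
}
```

Checking precision first means a skill that is both well measured and at its cap is recorded as a clean precision stop, which is the more useful label when auditing sessions later.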

Why an LLM writes the bank

Two things wear an assessment down over time: too few items, so questions recur and get memorized or passed around, and a stale bank, so "skill" quietly turns into recall of leaked answers. The platform handles both by using an LLM as a question multiplier. For each skill, a background job generates a large pool of items, but not blindly. It gets fed real questions for that skill from historical assessment data, so the generated items pick up the style, framing, and difficulty band of items real people have already sat. Embedding-based deduplication then strips near-duplicates so the pool stays varied. The result is a deep, fresh bank that resists gaming, rather than a small fixed set everyone eventually memorizes.
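
The deduplication step can be sketched as a greedy pass over cosine similarity. This is an illustration under assumptions, not the platform's pipeline: the CandidateItem shape, the dedupe function, and the 0.92 threshold are all hypothetical, and a production bank would likely use a vector index rather than pairwise comparison.

```typescript
// Greedy near-duplicate removal over item embeddings.
// The item shape and the default threshold are assumptions for illustration.
interface CandidateItem {
  id: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Keep an item only if it is not too similar to anything already kept.
// Earlier items win, so ordering the input by quality decides which copy survives.
function dedupe(items: CandidateItem[], threshold = 0.92): CandidateItem[] {
  const kept: CandidateItem[] = [];
  for (const item of items) {
    const isNearDuplicate = kept.some(
      (k) => cosineSimilarity(k.embedding, item.embedding) >= threshold,
    );
    if (!isNearDuplicate) kept.push(item);
  }
  return kept;
}
```

The threshold is the tuning knob: set it too low and legitimately distinct questions on the same skill are dropped, set it too high and reworded copies slip through and skew exposure.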

Where irt_b comes from, and the line that still matters

Difficulty is not guessed. The same real-data anchors that steer generation also let the LLM predict an irt_b for each new item: a few-shot estimate calibrated against items whose empirical difficulty is already known. This is the standard answer to the cold-start problem. A brand-new question has no response data, and pretesting it on hundreds of people is slow and expensive, so a prediction anchored in real data lets the item go live right away with a usable difficulty. It makes a solid prior. Anchored predictions like this track empirical difficulty moderately to strongly, far better than guessing by eye.

On its own, though, a predicted irt_b is a prior, not a calibration, and the gap between those two is exactly where high-stakes scoring gets decided. There are two reasons the prediction can't be the final number.

  • An LLM's sense of difficulty is not the same as your workforce's. The two correlate, but they are different signals: something that stumps a person can be trivial for the model, and the gap can grow rather than shrink as the model reasons harder. The prediction is about the item, not the people answering it.
  • Difficulty is population-relative by definition. In IRT, b lives on the same scale as the ability (θ) of the population that produced it. The real difficulty of an item for this organization's employees is something only their own responses can reveal.

So the loop is closed in production. The LLM prediction is the prior, the logged responses are the data, and the re-fit produces the calibrated posterior. It runs continuously, not as a one-off. Items whose empirical difficulty drifts from the prediction get flagged, which catches a bad question and a bad prediction with the same check.

Calibration happens by degrees rather than flipping on all at once. A freshly generated item runs on its LLM prior until enough people have answered it, so at any moment the bank holds well-calibrated items next to newly seeded ones still converging on their true difficulty. The divergence flag stays on as a standing monitor. Drift from a curriculum change, leakage, or over-exposure is exactly what it exists to catch.
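
One way to picture the prior-to-posterior loop is a maximum-a-posteriori re-fit of a single item's difficulty. The sketch below is an assumption-laden illustration, not the platform's calibration job: it uses the 1PL model, treats each respondent's θ as known, puts a normal prior centred on the LLM's predicted irt_b, and uses hypothetical names and defaults (refitDifficulty, priorSd, minResponses, tolerance).

```typescript
// MAP re-fit of one item's difficulty under a 1PL model.
// Prior width, minimum sample size, and tolerance are assumed values.
interface LoggedResponse {
  theta: number;    // respondent's ability estimate, treated as known here
  correct: boolean; // outcome recorded in the response log
}

interface Recalibration {
  irtB: number;          // re-fitted difficulty
  responseCount: number; // how much data backed the re-fit
  divergent: boolean;    // true when data moved the item far from its prior
}

function refitDifficulty(
  priorB: number,
  log: LoggedResponse[],
  priorSd = 1.0,
  minResponses = 30,
  tolerance = 0.5,
): Recalibration {
  const priorPrecision = 1 / (priorSd * priorSd);
  let b = priorB;

  // Newton-Raphson on the log-posterior, with the step clamped for stability.
  for (let iter = 0; iter < 100; iter++) {
    let gradient = -(b - priorB) * priorPrecision; // pull toward the LLM prior
    let curvature = priorPrecision;
    for (const r of log) {
      const p = 1 / (1 + Math.exp(-(r.theta - b)));
      gradient += p - (r.correct ? 1 : 0); // more misses than expected pushes b up
      curvature += p * (1 - p);
    }
    const step = Math.max(-1, Math.min(1, gradient / curvature));
    b += step;
    if (Math.abs(step) < 1e-8) break;
  }

  // Flag only once there is enough data for the disagreement to mean something.
  const divergent = log.length >= minResponses && Math.abs(b - priorB) > tolerance;
  return { irtB: b, responseCount: log.length, divergent };
}
```

With no responses the function returns the prior unchanged, which matches the behaviour described above: a new item runs on its LLM estimate, and the data gradually outweighs the prior as the log fills up.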

The trap here is subtle. A number that comes from real data and a capable model looks empirical, and a number that looks empirical is more dangerous when it's wrong than a figure someone openly admits they estimated. Running the re-fit, rather than trusting the prior, is what lets a score stand behind a real decision. Any system that measures people faces the same test eventually. The question is not whether the score is precise but whether you'd trust it enough to base your own promotion on it. For an item with the responses behind it, the honest answer here can be yes.


Data model

Thirty-one Mongoose schemas. What follows is the core set, the entities the main workflows turn on. The rest (settings, flags, cooldowns, approval records, embedding caches, audit trails) hangs off this skeleton.

The domain reads as a single chain. An organization defines roles, roles expect skills, skills generate questions, employees answer them in sessions, sessions lock a per-skill proficiency, the gap between assessed and expected proficiency drives a goal, and a learning journey closes the goal. The organizationId on tenant-scoped collections is the key the whole isolation story rests on.

Three schema decisions matter more than the rest:

  • organizationId is stamped on every tenant-scoped document. It's the partition key for the whole platform. Having it there is the easy part. Enforcing it on every read is the hard part (see Operations).
  • The response log is what turns a prediction into a measurement. The LLM gives each item a difficulty prior. The logged responses (item, correctness, latency) are what the production re-fit reads to turn that prior into a population-calibrated irt_b. Without this collection the engine would be stuck on the model's guess, so the log is a first-class asset and not an afterthought. It's worth designing carefully up front, because you will lean on it.
  • Questions carry an embedding for deduplication. This is the other half of the question-multiplier approach: generate a large pool, then use vector similarity to strip near-duplicates that would quietly skew item exposure and difficulty coverage.

The expected-versus-assessed proficiency pair (both on the 1 to 5 scale, on SKILL_PROFICIENCY) is what the back half of the product is built on. The gap between the two drives everything downstream: it feeds the AI gap summary, seeds the SMART goal, and scopes the learning journey.
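
As a small illustration of how that pair drives the downstream steps, the sketch below ranks skills by the shortfall between expected and assessed proficiency. The row shape, the rankGaps name, and the choice to drop skills with no shortfall are assumptions; only the 1 to 5 scale and the expected/assessed pairing come from the description above.

```typescript
// Rank skills by how far assessed proficiency falls short of expected.
// Field names and filtering behaviour are assumptions for illustration.
interface SkillProficiencyRow {
  skillId: string;
  expected: number; // required level for the role, 1 to 5
  assessed: number; // level locked by the assessment, 1 to 5
}

interface SkillGap {
  skillId: string;
  gap: number; // expected minus assessed, always positive in the output
}

function rankGaps(rows: SkillProficiencyRow[]): SkillGap[] {
  for (const r of rows) {
    for (const level of [r.expected, r.assessed]) {
      if (level < 1 || level > 5) {
        throw new Error(`proficiency out of range for ${r.skillId}`);
      }
    }
  }
  return rows
    .map((r) => ({ skillId: r.skillId, gap: r.expected - r.assessed }))
    .filter((g) => g.gap > 0)         // skills at or above expectation need no goal
    .sort((a, b) => b.gap - a.gap);   // largest shortfall first
}
```

The ordered list is the natural input for the later stages: the largest gap is the obvious candidate to summarize first, turn into a goal, and scope a learning journey around.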


Infrastructure and operations

Layer | Choice | What it does in production
Reverse proxy | Nginx | TLS termination, static assets, and fronting the web tier
Process management | PM2 | Keeps processes alive, restarts on crash, and runs the web tier in cluster mode (multiple replicas)
Database | MongoDB | Source of state, including durable job records
Queue / broker | Redis + BullMQ | Holds the job queue; workers claim jobs atomically
CI/CD | GitHub Actions | Build, test, and deploy pipeline
Packaging | Docker Compose | Containerized deployment option alongside the bare-Ubuntu path

Where load and failure bite

Externalizing the queue removed the main thing that used to block scaling. Long AI work (role extraction, question generation, summaries) no longer runs inside the web process. It gets enqueued to Redis and claimed by isolated workers, one job per worker. That gives two scaling axes that move independently: the stateless web tier can run as many PM2 replicas as traffic needs, and the worker pool can grow or shrink with AI load, without the double-execution problem that in-process jobs would have caused in cluster mode.

The new things to watch come with the queue, not from going without one. Redis is now a dependency that needs persistence and a high-availability plan, since it holds in-flight work. Jobs have to be idempotent, because at-least-once delivery means a retried job can run twice. And with both the web and worker tiers scaled out, MongoDB becomes the next likely bottleneck under contention.
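
Atomic claiming and idempotency fit together as shown in the sketch below. It is an in-memory stand-in written for illustration, with hypothetical names (InMemoryJobStore, runJob); it does not use BullMQ or Mongoose. In a MongoDB-backed job collection, the claim would typically be a single conditional update such as findOneAndUpdate with a filter on the queued status, so that only one worker can win.

```typescript
// In-memory illustration of atomic claim plus idempotent execution.
// The store and function names are hypothetical, not the platform's API.
type JobStatus = "queued" | "running" | "done";

interface JobRecord {
  id: string;
  status: JobStatus;
  result?: string;
}

class InMemoryJobStore {
  private jobs = new Map<string, JobRecord>();

  enqueue(id: string): void {
    if (!this.jobs.has(id)) this.jobs.set(id, { id, status: "queued" });
  }

  get(id: string): JobRecord | undefined {
    return this.jobs.get(id);
  }

  // Compare-and-set: succeeds only for a job that is still queued.
  claim(id: string): boolean {
    const job = this.jobs.get(id);
    if (!job || job.status !== "queued") return false;
    job.status = "running";
    return true;
  }

  complete(id: string, result: string): void {
    const job = this.jobs.get(id);
    if (job) { job.status = "done"; job.result = result; }
  }

  release(id: string): void {
    const job = this.jobs.get(id);
    if (job && job.status === "running") job.status = "queued";
  }
}

// Safe to call more than once for the same job id (at-least-once delivery).
function runJob(store: InMemoryJobStore, id: string, work: () => string): string | undefined {
  const existing = store.get(id);
  if (existing?.status === "done") return existing.result; // redelivery is a no-op
  if (!store.claim(id)) return undefined;                  // another worker holds it
  try {
    const result = work();
    store.complete(id, result);
    return result;
  } catch (err) {
    store.release(id); // put the job back so a retry can pick it up
    throw err;
  }
}
```

The early return on a finished job is what makes a redelivered message harmless: the expensive AI call runs once, and every later delivery just reads the stored result.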

Security posture

Concern | Implementation | Read this as
Multi-tenancy | organizationId filter applied by hand in every query | Forget one predicate and a tenant's data leaks. Isolation is currently a habit and not a guarantee
RBAC | Three roles (Admin / Manager / Employee), role-scoped dashboards and routes; managers gain role-description and question-bank powers only if permitted | Authority is least-privilege and graduated, not a single on/off switch
Onboarding control | Org approval flow: new organizations sit pending until approved | A gate against spam and unvetted tenants before anyone can act inside a tenant
Assessment integrity | Cooldowns, forced retakes, question flagging, and support tickets, plus a generated, deduplicated bank that resists leakage | The anti-gaming and quality controls a measurement system depends on
AI safety | Human-in-the-loop verify step before extracted skills count; model-seeded item difficulty corrected by production responses | The model proposes and a person signs off; nothing AI-generated is load-bearing until a human approves it or real data has corrected it

The one to internalize is structural tenant isolation. Behavioral isolation, where everyone remembers to add the org filter to all 300-odd storage methods, works right up until the single query where someone forgets, and that one omission is a cross-tenant breach. The fix is to make isolation structural: a Mongoose plugin that injects the tenant predicate automatically, or a per-tenant query helper that no query can bypass. The principle is simple to state and hard to keep: don't rely on every engineer remembering the most important filter predicate on every query for the life of the codebase.
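
The per-tenant query helper option can be sketched without any database at all. Everything in the block below is hypothetical (QueryableCollection, TenantScoped, the in-memory collection standing in for a real one); it shows the shape of the idea, not the platform's implementation. A Mongoose plugin would pursue the same goal by adding the predicate in query middleware instead.

```typescript
// Structural tenant isolation via a wrapper that always adds the tenant predicate.
// All names here are hypothetical; the in-memory collection is only a stand-in.
type QueryFilter = Record<string, unknown>;
type StoredDoc = Record<string, unknown>;

interface QueryableCollection {
  find(filter: QueryFilter): StoredDoc[];
}

class InMemoryCollection implements QueryableCollection {
  private docs: StoredDoc[];

  constructor(docs: StoredDoc[]) {
    this.docs = docs;
  }

  // Simple equality match on every key in the filter.
  find(filter: QueryFilter): StoredDoc[] {
    return this.docs.filter((d) =>
      Object.entries(filter).every(([key, value]) => d[key] === value),
    );
  }
}

class TenantScoped {
  private inner: QueryableCollection;
  private organizationId: string;

  constructor(inner: QueryableCollection, organizationId: string) {
    if (!organizationId) throw new Error("organizationId is required");
    this.inner = inner;
    this.organizationId = organizationId;
  }

  // The tenant predicate is spread last, so a caller-supplied
  // organizationId cannot override it.
  find(filter: QueryFilter = {}): StoredDoc[] {
    return this.inner.find({ ...filter, organizationId: this.organizationId });
  }
}
```

The design point is that request handlers only ever receive a TenantScoped instance built from the authenticated session, never the raw collection, so forgetting the filter stops being something a query author can do.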

The architectural debt ledger

Mandate | State | If left unaddressed | Fix | Priority
Tenant isolation | organizationId filtered by hand per query | One missed predicate equals a cross-tenant breach | Mongoose plugin or per-tenant query helper, for structural isolation | 1: irreversible blast radius
IRT calibration | Resolved. irt_b LLM-seeded, re-fit against production responses, divergent items flagged | n/a | Done; now a standing monitor (new-item convergence, item drift) | n/a
Queue externalization | Resolved. BullMQ/Redis, isolated workers | n/a | Done; now harden idempotency and Redis HA | n/a
routes.ts god-file | ~5,500 lines spanning skills/goals/learning/legacy | Becomes the architecture by default; change-risk compounds | Split by domain to match the modular routers; deprecate by a fixed date | 2: entropy
IStorage interface | 300+ methods on one surface | Hardens into something nobody dares touch | Split along the same domain seams as the routers | 2: entropy
Legacy /dashboard + dual surfaces | A personal flow running parallel to enterprise | Two products to maintain forever | Set a deprecation date; collapse to the enterprise surface | 3: focus

Outcome

What the system clearly is, based on the brief and nothing more:

  • A working end-to-end pipeline across the full skill-intelligence loop: define (AI extraction from PDF/Word/Excel role docs) → verify (adaptive IRT assessment) → diagnose (assessed-versus-expected gap analysis) → plan (AI SMART goals) → learn (AI learning journeys, with optional Coursera/Moodle hooks), all under a real multi-tenant org/team/role structure.
  • A 31-schema domain model that came through a real PostgreSQL to MongoDB migration. The IStorage interface, now at 300+ methods, is the scar tissue from that move. It did its job during the cutover and is now correctly flagged as indirection to retire. Abstractions that earn their keep during a migration rarely keep earning it afterward, and the trick is noticing when the scaffolding has quietly become the building.
  • A horizontally scalable runtime: the background queue is externalized to Redis/BullMQ with isolated workers, so the web and worker tiers scale independently and cluster mode no longer double-runs jobs.
  • An assessment engine that solves the item cold-start problem with LLM-seeded difficulty, uses that generation deliberately as a question multiplier against leakage, and then re-fits the difficulty against production responses on a continuous loop. The model's estimate is a starting prior that real data corrects, not a number trusted on faith. Calibrating against your own population rather than the model's guess is what lets a skill score stand behind a promotion.

Maturity, stated plainly: feature-complete, horizontally scalable, and running a live calibration loop, with model-seeded difficulty re-fit against the organization's own responses. The open items are structural and security hygiene (tenant isolation first, then the routes.ts and IStorage decomposition and legacy deprecation) rather than measurement integrity or scale.