Building Reliable Evals to Solve Hiring

Naveen Govindaraju, Harry Gao, Karan Parikh
AI Engineering team

January 27, 2026

Hiring is hard.


The industry struggles not because great candidates don’t exist, but because the traditional process fails to surface the right people as roles and expectations constantly change.


Paraform’s recruiting marketplace centralizes hiring activity and outcomes, giving us far more context than traditional recruiting. But context alone isn’t enough. Without strong evaluation signals, we can’t tell whether the systems we build are improving candidate utilization or quietly introducing noise.


Throwing LLMs and traditional recommendation systems at the problem isn’t sufficient:

  • Hiring decisions are subjective and noisy. There’s no fixed decision rule that leads to consistent outcomes.
  • Hiring data for both roles and candidates is sparse and mostly unstructured.
  • Hiring manager requirements and preferences constantly change over time.


Our core challenge isn’t building more sophisticated models. It’s knowing whether any system we build is improving hiring outcomes or quietly making them worse, and being able to evaluate that reliably.

Building evals for the subjective

Instead of treating hiring as a single prediction problem, we’ve shifted toward breaking it into smaller, more tractable subproblems. Each subproblem has its own evaluation framework grounded in historical hiring data, allowing us to benchmark progress without pretending there’s a single objective truth. The rest of this post focuses on two of those subproblems.


The systems we've built around these ideas are already driving recruiter efficiency and revenue. But more importantly, they give us a way to reason about whether we’re actually improving hiring.

1. Learning from past rejections

At a baseline, an intelligent hiring platform needs to learn from previous rejections. If a hiring manager has consistently said no to a certain type of profile for a specific reason, we shouldn’t resurface similar profiles.


The challenge is that rejection reasons rarely map to rigid rules. A hiring manager might reject a candidate for lacking a top-school pedigree, yet later interview someone without even a CS degree. In hiring, what looks like a contradiction usually comes down to context.


To handle this, we built a context engine that models hiring manager decisions as nuanced signals rather than hard filters. It accumulates prior feedback and learns as new data arrives. Interview feedback, candidate interactions, and ratings are all captured rather than falling into the void.
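
A minimal sketch of the idea (the names and schema below are illustrative, not our production code): feedback accumulates per role as weighted signals, and downstream evaluators consume those signals as soft evidence rather than hard filters.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a per-role context store; field names are
# assumptions for this post, not the actual schema.
@dataclass
class FeedbackSignal:
    candidate_id: str
    decision: str          # e.g. "rejected", "advanced"
    reason: str            # free-text rejection reason or interview notes
    weight: float = 1.0    # recency- or source-based confidence

@dataclass
class RoleContext:
    role_id: str
    signals: list[FeedbackSignal] = field(default_factory=list)

class ContextEngine:
    def __init__(self) -> None:
        self._contexts: dict[str, RoleContext] = {}

    def record(self, role_id: str, signal: FeedbackSignal) -> None:
        # Accumulate feedback instead of discarding it after each decision.
        ctx = self._contexts.setdefault(role_id, RoleContext(role_id))
        ctx.signals.append(signal)

    def context_for(self, role_id: str) -> list[FeedbackSignal]:
        # Evaluators treat these as soft evidence, not exclusion rules.
        ctx = self._contexts.get(role_id)
        return ctx.signals if ctx else []
```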


Because the system depends on role-specific historical context, standard train/test splits don’t reflect how it’s actually used. Instead, we evaluate it using role-conditioned masking over historical hiring data.


For each role, we repeatedly mask a single candidate and predict their outcome using the remaining candidates’ profiles to seed the context engine. This produces realistic, out-of-sample predictions while maximizing the number of evaluation points available.
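
In pseudocode, the masking procedure looks roughly like this, where `predict_outcome` stands in for the context-engine-backed evaluator and the record fields are illustrative:

```python
# Role-conditioned masking: leave-one-out evaluation within a single role.
def evaluate_role(candidates, predict_outcome):
    """candidates: list of {'profile': ..., 'outcome': ...} dicts for one role."""
    predictions, labels = [], []
    for i, held_out in enumerate(candidates):
        # Seed the context with every *other* candidate's profile and outcome,
        context = [c for j, c in enumerate(candidates) if j != i]
        # then predict the masked candidate's outcome out-of-sample.
        predictions.append(predict_outcome(held_out["profile"], context))
        labels.append(held_out["outcome"])
    return predictions, labels
```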


The primary objective here is to confidently exclude candidates who aren’t a fit based on accumulated historical hiring manager decisions. As a result, we optimize for high negative precision while ensuring negative recall remains within acceptable bounds, so exclusions are both accurate and meaningful.
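
Concretely, with rejection treated as the class of interest, these metrics reduce to standard precision and recall with the labels flipped. A toy example using scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

# Negative precision: of the candidates we excluded, how many the hiring
# manager actually rejected. Negative recall: of all true rejections, how
# many we caught. Labels here are made up for illustration (0 = rejected).
y_true = [0, 0, 1, 0, 1, 1, 0]  # historical outcomes
y_pred = [0, 0, 1, 1, 1, 1, 0]  # context-engine predictions

neg_precision = precision_score(y_true, y_pred, pos_label=0)  # 1.0 here
neg_recall = recall_score(y_true, y_pred, pos_label=0)        # 0.75 here
```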


We ran ablations comparing a baseline evaluator against one powered by our context engine. The context engine improves negative precision without compromising recall, which lets us confidently reduce repetitive rejections.

Figure: Context Turns Noisy Rejections Into Reliable Signals

2. Surfacing the best

The previous section discusses how we approach “blocking the bad”. This section covers how we approach discovery, or “surfacing the best”.


Standard approaches like embedding similarity largely optimize for semantic relevance, but that isn’t a good proxy for a candidate’s fit for a role. For example, two candidates may both match a backend SWE role semantically, but only one aligns with the team’s seniority bar, domain experience, or the hiring manager’s past preferences.
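
For reference, that baseline amounts to ranking candidates by cosine similarity between role and profile embeddings. A sketch using sentence-transformers (the model choice is illustrative):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_by_similarity(role_description: str, candidate_profiles: list[str]):
    role_vec = model.encode([role_description])
    cand_vecs = model.encode(candidate_profiles)
    scores = cosine_similarity(role_vec, cand_vecs)[0]
    # Highest semantic similarity first -- relevant, but blind to the
    # seniority bar, domain depth, or the hiring manager's past preferences.
    order = scores.argsort()[::-1]
    return [(candidate_profiles[i], float(scores[i])) for i in order]
```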


Using an LLM to evaluate candidates against a role is an improvement, but we still found this to be noisy. Without the context of how the candidate pool is distributed, models struggle with calibration. To improve on this, we needed an evaluation framework that’s more robust than eyeballing the top results.
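
One simple way to give scores pool context (shown here as an illustration, not our exact approach) is to calibrate each raw LLM score against the distribution of the pool, for example by converting it to a within-pool percentile:

```python
import numpy as np

def calibrate_to_pool(raw_scores: list[float]) -> np.ndarray:
    """Map raw scores to percentiles within the candidate pool, so an
    '8/10' means something relative to *this* pool's distribution."""
    scores = np.asarray(raw_scores, dtype=float)
    # Rank each score (ties get arbitrary but distinct ranks in this sketch),
    ranks = scores.argsort().argsort()
    # then normalize ranks into [0, 1].
    return ranks / max(len(scores) - 1, 1)
```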


We built a backtesting pipeline that replays historical hiring scenarios and measures how well our rankers performed. We measure performance using Precision@K and nDCG. Precision@K ensures that the candidates surfaced at the top are accurate, while nDCG captures whether the best candidates are ranked high enough to actually be seen.
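
Both metrics are easy to state precisely. A from-scratch sketch, where `relevance` is the ranker's ordering labeled with historical outcomes (e.g. 1 if the candidate was interviewed or hired, else 0):

```python
import numpy as np

def precision_at_k(relevance: list[int], k: int) -> float:
    # Fraction of the top K slots filled by relevant candidates.
    return sum(1 for r in relevance[:k] if r > 0) / k

def ndcg_at_k(relevance: list[int], k: int) -> float:
    # Discounted gain rewards placing relevant candidates near the top.
    def dcg(rels):
        return sum(r / np.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0

ranked = [1, 0, 1, 1, 0, 0]        # illustrative backtest for one role
print(precision_at_k(ranked, 5))   # 0.6
print(ndcg_at_k(ranked, 5))        # ~0.91
```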


While this isn’t a perfect proxy for recommending entirely new candidates, it mirrors real hiring behavior closely enough to be useful, especially when aggregated across many roles.


We compared off-the-shelf embedding models, a basic LLM scoring approach, and a thoroughly optimized agentic LLM ranker across hundreds of roles. Moving from embedding similarity to an LLM evaluator produced a 19.2% jump in P@5. Further optimizations to our LLM ranker, mainly shifting to a more agentic architecture that helps the model better contextualize candidates, improved P@5 by an additional 7.2%.

Figure: Agentic LLM Rankers Use Context to Improve Candidate Ordering

What's next

Everything above is just a starting point: the evals we’ve built help us move forward with more confidence, but there’s more to do before we solve hiring at scale.

~

The ideas, systems, and evaluations in this post were developed by Naveen Govindaraju, Harry Gao, and Karan Parikh. We’re looking to expand the team. If you’re excited about ambitious problems at the intersection of product, engineering, and data, we’re hiring.
