How We Use ELO Scores to Build Better Legal AI – Harvey
Turning human preference into measurable, meaningful improvements across the Harvey platform.
In BigLaw Bench: Arena, we described our system for collecting human preference judgments at scale. The core artifacts produced by that system are ELO scores for Harvey systems and non-Harvey baselines. These scores help us understand how likely one system's output is to be preferred over another's when both outputs are presented to lawyers.
To make this concrete, the Assistant ELO scores from BLB: Arena suggest that lawyers prefer the responses of Harvey Assistant over those of generic foundation models more than 70% of the time.
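As a rough illustration of how a rating gap maps to a preference rate, here is a minimal sketch that assumes the scores follow the standard Elo model (a base-10 logistic curve with a scale of 400). That parameterization is an assumption, not something this post confirms, and the function names and the 1500 baseline rating are hypothetical.

```python
import math


def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A's output is preferred over B's under the standard Elo model.

    Assumption: base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def rating_gap_for_preference(p: float, scale: float = 400.0) -> float:
    """Rating gap (A minus B) implied by a preference rate p, with 0 < p < 1."""
    return scale * math.log10(p / (1.0 - p))


# Gap implied by a 70% preference rate, then the round trip back to a probability.
gap = rating_gap_for_preference(0.70)
print(round(gap, 1))                                     # 147.2
print(round(expected_preference(1500 + gap, 1500), 2))   # 0.7
```

Under these assumptions, a preference rate above 70% corresponds to a rating gap of roughly 147 points or more. Only the gap matters in this model, so the choice of baseline rating does not affect the result.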
Importantly, we use ELO not just to celebrate the results of applied AI work, but also to help shape that work. In the rest of this post, we use specific, recent examples from across Harvey’s products to explain how we use ELO to understand and improve AI systems.



