
Algorithms were supposed to make Virginia judges fairer. What happened was far more complicated.

Analysis by Staff writer
November 19, 2019 at 7:00 a.m. EST
The Accomack County Courthouse in February of this year. (Timothy C. Wright for The Washington Post)

We tend to assume the near-term future of automation will be built on man-machine partnerships. Our robot sidekicks will compensate for the squishy inefficiencies of the human brain, while human judgment will sand down their cold, mechanical edges.

But what if the partnership, especially in the early stages, instead accentuates the flaws of both? For example, a formula designed to reduce prison populations in Virginia led some judges to impose harsher sentences for young or black defendants, and more lenient ones for rapists.

In an age when artificial intelligence is widely expected to eat what’s left of the world, simple sentencing algorithms are a preview of the economy’s tool-assisted future. A working paper released Tuesday by Megan Stevenson of George Mason University and Jennifer Doleac of Texas A&M provides one of the first examinations of the unintended consequences that arise when algorithms and humans team up in the wild.

The algorithms are intended to remove some of the guesswork from judges’ sentencing decisions by assigning a simple risk score to defendants. In Virginia, the score included data such as offense type, age, prior convictions and employment status. Larceny scores higher than drug offenses, men score higher than women, and unmarried folks score higher than their married peers. (Marriage and employment were removed from the score in 2013.)

Similar algorithms are now used in 28 states (and parts of seven more). In Virginia, they were adopted statewide in 2002 to help keep prison populations down after discretionary parole was abolished. Stevenson and Doleac’s analysis relies on tens of thousands of felony convictions, with a particular focus on the period between 2000 and 2004.

Judges were supposed to use risk scores to identify felons who were least likely to reoffend and either give them shorter jail sentences or send them to a program such as probation or substance-abuse treatment. Rather than focus on patterns of discrimination by algorithms or judges acting alone, as others have done, Stevenson and Doleac measured how the two interacted.

After controlling for the effects of time, geography and demography, Stevenson and Doleac found the ambitious new nonviolent risk assessment system, often held up as a model for other states to follow, did not change the rate at which people were incarcerated, the length of their sentences or the rate at which they reoffended after release. But that doesn’t mean it had no effect.

Judges followed the algorithm’s suggestions a bit less than half of the time. People whom the algorithm deemed high-risk received longer sentences than they otherwise would have, and those it assessed as low-risk got shorter ones. The two adjustments offset each other, so the overall numbers didn’t change, but the interaction between algorithmic sentencing recommendations and judges’ discretion nonetheless produced perverse effects.

In a statement, Meredith Farrar-Owens, director of the Virginia Criminal Sentencing Commission, said the system’s goal was to avoid increasing crime or recidivism rates while diverting the lowest-risk nonviolent offenders from prison and freeing up space for violent offenders, who were expected to serve longer terms under the sentencing reforms that took effect in 1995.

“While the Commission’s risk assessment instruments were developed based on empirical study of recidivism rates and patterns among Virginia felons, use of risk assessment takes place within the context of a complex and dynamic criminal justice system,” Farrar-Owens said. “Risk assessment recommendations are advisory and only one of many factors judges will consider when sentencing defendants.”

She added that judges’ ability to follow the algorithm’s suggestions may also have been limited by plea agreements, and by a lack of alternative options in their particular circuit.

The sentencing of young offenders is one example of the algorithm’s surprising side effects. Stevenson said some judges may not realize it, but a defendant’s risk score is largely a reflection of how old they are, because age is such a strong predictor of recidivism. Being younger than 30 adds substantially more to a risk score (13 points) than, for example, having been incarcerated five or more times previously as an adult (9 points).
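To make that arithmetic concrete, here is a minimal sketch of how an additive point score of this kind behaves. The 13-point and 9-point weights are the ones reported above; every other factor, weight and cutoff is a hypothetical stand-in for illustration, not the commission’s actual worksheet.

    # Illustrative sketch only: a toy additive risk score in the spirit of
    # Virginia's worksheet. The 13-point and 9-point weights are the ones
    # reported in this article; every other factor and weight here is a
    # hypothetical stand-in, not the commission's actual instrument.
    def toy_risk_score(age, prior_adult_incarcerations, offense_is_larceny,
                       is_male, is_unmarried):
        score = 0
        if age < 30:
            score += 13  # reported weight for being younger than 30
        if prior_adult_incarcerations >= 5:
            score += 9   # reported weight for five or more prior adult incarcerations
        if offense_is_larceny:
            score += 5   # hypothetical: larceny scores higher than drug offenses
        if is_male:
            score += 3   # hypothetical: men score higher than women
        if is_unmarried:
            score += 2   # hypothetical: marriage was dropped from the real score in 2013
        return score

    # A 19-year-old first-time offender can outscore a 45-year-old with a long
    # record, which is the dynamic Stevenson describes.
    print(toy_risk_score(19, 0, True, True, True))   # 23
    print(toy_risk_score(45, 6, True, True, False))  # 17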

In a 2018 analysis, Stevenson and Vanderbilt’s Christopher Slobogin calculated that 58 percent of the widely used COMPAS algorithm’s violent recidivism risk score can be attributed to age.

“People are getting this very stigmatic label — high risk for violent recidivism — largely because they’re 19 years old,” she said.

Judges tended to be more merciful toward young defendants than the algorithm recommended. Nonetheless, defendants younger than 23 were 4 percentage points more likely to be incarcerated after risk assessment was adopted, and their sentences were 12 percent longer than those of their older peers.

“Based on the Commission’s recidivism study, age is one of the most heavily weighted factors on the risk assessment tool. It makes sense that, once judges had an objective, research-based risk assessment tool, we would see sentences of young adult offenders increase relative to older offenders,” Farrar-Owens said.

Racial disparities also increased among those circuits that used risk assessment most. Although computers can’t explicitly use prohibited variables like race in their sentencing calculations, black defendants were 4 percentage points more likely to be incarcerated after risk assessment was adopted, compared with otherwise equivalent whites. Black defendants’ sentences were also 17 percent longer.

“This is partially explained by the fact that black defendants have higher risk scores, and partially because black defendants are sentenced more harshly than white defendants with the same risk score,” Stevenson and Doleac write.

The authors also studied a similar risk-assessment program for sex offenders. Out of an abundance of caution, the program was built so the algorithm could only authorize longer-than-baseline sentences — yet its net effect was to decrease how often sex offenders were imprisoned by 5 percentage points, and to shorten their sentences by about 24 percent.

Stevenson and Doleac suggest that by assigning a sex offender a low risk score, the algorithm may help shield a judge from backlash if the offender goes on to commit another crime. That cover empowers judges to hand down shorter sentences than they otherwise would have meted out.

“If they sentence someone leniently and that person goes out and commits a heinous crime, all fingers are pointed at them,” Stevenson said. “If they make a mistake in the other direction — failing to release someone who wouldn’t have done anything if released — nobody sees that. There are no consequences to the judge.”

The University of Maryland’s Frank Pasquale, who made the case for oversight of algorithms in 2015’s “The Black Box Society,” said most judges aren’t adequately trained to evaluate the claims made by sentencing systems such as the one in Virginia. Because judges have the option to reject the algorithm’s conclusion, Pasquale said, they may follow it only when it provides a convenient excuse.

“You can have a lot of scenarios where the AI algorithms end up being a rationalization for what the judge wants to do,” Pasquale said.

Stevenson and Doleac write that, especially when the goals of the algorithm’s human partners differ from those of its designers, we should expect unexpected results.

“Virginia’s nonviolent risk assessment reduced neither incarceration nor recidivism; its use disadvantaged a vulnerable group (the young); and failed to reduce racial disparities,” Stevenson and Doleac write. “Virginia’s sex offender risk assessment lowered sentences for those convicted of rape: a group that the Sentencing Commission had targeted for increased sentences.”

Why did the Virginia algorithm struggle? It turns out future crime is hard to predict. “Even under ideal conditions,” Stevenson and Doleac write, “predictions of future offending are unable to explain more than a tiny fraction of the variation in recidivism.”

Duke University professor Brandon Garrett, who has separately studied the Virginia system, said: “You can’t just adopt a tool and expect it to magically solve problems. You have to put real thought and information into implementation.”

Garrett and his collaborators spoke with judges from across Virginia. Some said they weren’t trained or didn’t trust the formula. Some said they didn’t even have the programs necessary to divert people from prisons. One judge even dryly equated using the algorithm to visiting a psychic.

In an analysis forthcoming in the California Law Review, Garrett and collaborator John Monahan of the University of Virginia write that judges didn’t follow risk assessments consistently — and that some didn’t use them at all.

“Nor should that be a surprise,” they write, “given that judges and other decisionmakers typically receive almost no training in risk assessment, and their discretion to ignore risk assessment is virtually unchecked.”

Farrar-Owens said that newly appointed judges are taught about the risk-assessment program during orientation but that “judicial philosophy in regards to risk assessment certainly varies across the Commonwealth.”

Aurélie Ouss, who researches criminal sentencing, recidivism and related subjects at the University of Pennsylvania, praised Stevenson and Doleac’s work, and said it showed Virginia’s algorithms had neither been the panacea some hoped for nor the nightmare others feared. Like Garrett, she said it might come down to implementation.

“There are so many ways in which you can present the information,” Ouss pointed out. “It may be a case that a different tool that’s designed differently — that judges use differently — would yield different results.”

Yet in the case of both race and age, Stevenson and Doleac find that if judges had fully complied with the algorithm’s recommendations, the disparities would have been even greater.

“If you’re sentencing purely on a risk rationale, then you want to lock up all the teenagers,” Stevenson said. “But if you’re sentencing on other rationales, like mercy for the vulnerable or evaluating the level of culpability that individuals have, then people are young. There are a lot of reasons why we don’t necessarily want to punish them more harshly than older people.”