Today the Supreme Court of the United States (SCOTUS) ruled 6-to-3 (https://www.supremecourt.gov/opinions/22pdf/20-1199_hgdj.pdf) that two instances of using race as a factor in admission decisions to universities are in violation of the constitution.

As with most rulings I’ve read from the court, both those I agree with and those I don’t, this decision hinges on a fairly deep layer of interpretations stacked on the constitution. It cites vision statements from historical proponents of amendments and explanatory language accompanying past court decisions; it extrapolates from statements about law to requirements for private policies; and so on. I’m both used to and confused by these structures, so I’m not going to comment on the decision itself. Instead, I want to comment on the concept of affirmative action.

Three tiers of affirmative action

Affirmative action refers to using membership in a protected class as part of a decision process that is intended to benefit members of that class. Protected classes are generally groups that have had systemic or official biases against them, commonly defined by self-reported race, ethnicity, gender, disability, or religion.

I have seen three broad categories of arguments for affirmative action:

1. Past discrimination moved resources from the protected class to others. Resources are inherited, so class-wide imbalances persist unless resolved. Affirmative action helps reverse this flow and rebalance inherited resources.
2. Some disadvantages are inherited, but can be interrupted. For example, parents’ education predicts children’s success in college, and colleges try to predict success as part of admissions. Sending a generation to college can interrupt a multi-generation pattern of reduced educational opportunity.
3. Some measures are biased against members of protected classes. Considering protected class membership can help remove those biases and make those measures statistically more accurate.

In my experience, relatively few people believe in argument 1 enough to act on it. Some people believe in argument 2, but evidence of interruption working is a bit fuzzy: one-time interventions don’t seem to work as well as we might wish. Most people I’ve talked to aren’t even aware of argument 3; those I’ve met who understand it generally agree with it, though protected class membership might not be the best bias-correcting factor in general. Let’s explore argument 3 a bit more.

Hiring good orators

Suppose you want to hire a team of orators to make speeches on your behalf. You want the best speakers you can get, which means you need a measure of speaker potential. You decide to host speech competitions around the country; but because you want speakers that ordinary people like, not ones that trained judges appreciate, you decide to measure the volume and duration of applause after each speech and use that as a measure of speaker potential. You send out people to record applause and have them correct for audience size by normalizing all the applause measures across a single competition and then report back a set of speaker names paired with normalized applause intensity.

But when you try to use this number to hire orators, you’re not getting very consistent quality. For some reason, you find yourself hiring a lot of speakers whose names are early in the alphabet but who aren’t that good at speaking. When you consult a few of your competition recorders, they observe that while applause is correlated with speech quality, it’s also correlated with audience energy, and that tends to drop after the first few speeches in each competition. Most competitions have speakers go in alphabetical order, so people with alphabetically-early names are generally getting more applause than equally-good speakers with alphabetically-late names.

Fortunately, you know how to solve this. You perform a little statistical regression (see post 708 for more on regression) to find out how much of applause is predicted by name and then undo that. For example, if your regression found that $\text{applause} = f(\text{name}) + \text{potential} + \text{noise}$ where $f$ is the function you got through regression, you’d make a new measure $\text{corrected applause} = \text{applause} - f(\text{name})$ and use that instead.

Assuming you didn’t make a mistake in the regression, this corrected measure will result in you recruiting orators with higher and more consistent average quality. On average, it will remove a name-based bias in your measure.
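To make the correction concrete, here is a minimal sketch on made-up data. Everything numeric in it is an assumption chosen for illustration, not a measurement: ten speakers per competition, audience energy falling by 0.3 per speaking slot, a straight-line $f$, and a `name_rank` between 0 (start of the alphabet) and 1 (end) standing in for the name. It also skips the audience-size normalization described earlier. The sketch fits $f$ by least squares, subtracts it, and compares how well the raw and corrected measures track the hidden potential.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(competitions=2000, speakers=10, seed=0):
    """Return parallel lists (name_rank, potential, applause) of invented data."""
    rng = random.Random(seed)
    ranks, potentials, applause = [], [], []
    for _ in range(competitions):
        # Speakers go in alphabetical order, so sort the name ranks.
        names = sorted(rng.random() for _ in range(speakers))
        for order, rank in enumerate(names):
            potential = rng.gauss(0, 1)
            energy = -0.3 * order  # assumed: the audience tires over time
            ranks.append(rank)
            potentials.append(potential)
            applause.append(potential + energy + rng.gauss(0, 0.5))
    return ranks, potentials, applause


ranks, potentials, applause = simulate()

# f(name) = intercept + slope * name_rank, estimated from applause alone.
intercept, slope = fit_line(ranks, applause)
corrected = [a - (intercept + slope * r) for a, r in zip(applause, ranks)]

print("slope of f:", round(slope, 2))
print("raw applause vs potential:      ", round(pearson(applause, potentials), 2))
print("corrected applause vs potential:", round(pearson(corrected, potentials), 2))
```

The regression never sees potential or speaking order; it only sees names and applause, which is the situation the recruiter is in. In a simulation like this the potential is known, so the two correlations can be compared directly, something that is not possible with real applause data.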

But that’s only on average. Somewhere there’s the lucky person whose name starts with W but was still the first to speak and the unlucky person whose name starts with A but was still the last to speak. Those people are getting the combined impact of both the biased original applause metric and the name-based counter-bias.
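A small worked example, again with invented numbers, shows how the two effects stack for those individuals. It assumes ten speakers, 0.3 units of applause lost per speaking slot, and an $f$ that is exactly what a regression would recover if names always matched speaking order; noise is ignored so the arithmetic stays visible.

```python
import math

SPEAKERS = 10        # assumed competition size
DROP_PER_SLOT = 0.3  # assumed applause lost per later speaking slot


def energy(order):
    """Audience-energy boost for slot `order` (0 = first), centered on zero."""
    return -DROP_PER_SLOT * (order - (SPEAKERS - 1) / 2)


def f(name_rank):
    """Name effect found if names always matched speaking order.

    name_rank runs from 0.0 (start of alphabet) to 1.0 (end).
    """
    return energy(name_rank * (SPEAKERS - 1))


def error_after_correction(name_rank, order):
    """How far corrected applause lands from true potential (noise ignored)."""
    return energy(order) - f(name_rank)


typical_a = error_after_correction(name_rank=0.0, order=0)
lucky_w = error_after_correction(name_rank=1.0, order=0)
unlucky_a = error_after_correction(name_rank=0.0, order=SPEAKERS - 1)
print(typical_a, lucky_w, unlucky_a)
```

Under these assumptions the typical early-alphabet speaker ends up with no error, while the late-named speaker who went first is credited with the first-slot energy boost plus the late-name adjustment, and the early-named speaker who went last is penalized twice in the same way.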

So, is the corrected measure more fair than the uncorrected measure or not? Is doing something that’s more fair for most people but less fair for some people a net win? This question is, ultimately, a matter of opinion, not of objective truth.

This conundrum is caused by our inability to measure what we really want to know. We want to correct applause intensity for audience energy, but we don’t have a reliable measure for audience energy. We think order of presentation in the competition is a good approximation for audience energy, but while that could in principle be measured, we didn’t measure it either. So we’re using the measure we have that comes closest to the one we want.

Admitting students

College admissions is trying to measure many things about prospective students: alignment with the college’s mission, likelihood the student will accept the offered admission, student’s impact on college finances, and many other factors as well. But among those factors is likelihood that the student will do well in their college studies.

Likelihood to do well is estimated by the combined influence of many factors, too many to list here. One that many schools consider is scores on standardized tests, meaning tests produced by companies who are trying to build tests that measure students’ future success in college.

Unfortunately, those tests don’t only measure that. They also measure how much you’ve prepared for the test itself (as opposed to for college). And they also measure how comfortable you feel when taking tests because stress and anxiety will tend to distract test takers and reduce their ability to do their best.

To make the standardized tests a more accurate measure of how well students are likely to perform in college we’d want to measure and correct for test-specific preparation and test anxiety. But those aren’t things we can measure directly, so to do the best job of admissions that we can we should look for data that we do have that is as correlated with those factors as we can find.

Among the data college admissions always has access to is a student’s self-reported race. The US Department of Education requires colleges to report this, so every college asks for it on their admissions forms so they have it to report. And it turns out that in the US, race is correlated with home cultures, which are in turn correlated with test prep and test anxiety. On average, a student who identifies as Asian American is more likely to come from a home that prioritizes test prep and describes tests as a path to future opportunity, while a student who identifies as African American is more likely to come from a home that does not prioritize test prep and describes tests as a capricious source of stress and judgment. These are just trends; many individuals do not fit the mold; but they are trends, which means that considering race can help correct for biases in the standardized tests.

But this only works on average. Because it’s not the factors we really meant to measure, just an available measure that is correlated with them, there will be some students it over-helps and others it over-hurts.

Aside: social side-effects

All forms of affirmative action, whether they be correcting past injustices, correcting for present biases, or even in the public consciousness but not actually used at all, have a negative side-effect on the common perception of the qualifications and ability of members of protected classes. I have heard people talk about “a diversity hire” or dismiss people as “only let in because they’re an X” in systems where I had enough visibility into the selection criteria to know there was nothing like affirmative action going on, not even the statistical bias correction form. Public opinion is hard to control.

What gets banned

The recent SCOTUS ruling bans two examples of using race in college admissions, and by extension is seen to ban most similar race-aware admissions processes.

Notably, that is all it does. It does not ban using other factors that are correlated with race. It does not ban the general practice of attempting to correct the biases in one metric using another metric. It does not ban asymmetric outcomes in admissions decisions. (I was worried it might try; worried because I doubt very much they could have defined asymmetric outcomes in a universally-sensible way; I’ve never seen that done before.) As is often (though not always) the case in court rulings, it bans only the activities present in the specific case in question. Surprisingly, I’ve heard it reported (though not yet found in the text myself) that the ruling also weakens its own bans, suggesting that there may be some institutions which can consider race in admissions decisions.

This is far from the first such ban. Various state legislatures, state courts, and boards of universities have implemented similar bans in the past. Although I’ve not verified this by looking at the data myself, I understand the common pattern to be a more-or-less immediate drop in the number of students admitted from protected classes and a corresponding drop in the actual quality of students in the school, followed by a slow return to the pre-ban rates as the people who work in admissions find other signals they can use instead of those they’ve been prohibited from considering.

Comments

Wm

I am curious about the main point you make in section 3 where you suggested that standardized tests would tend to underestimate the scholastic aptitude of demographics that are more likely to feel anxious about tests and less likely to spend a lot of time preparing for them.

I agree that this would likely be true if the goal of the test were to measure job aptitude or general intelligence or anything else that doesn’t typically involve taking a lot of tests, but what the standardized tests are trying to measure is future success in college – and success in college, particularly at the undergraduate level, is largely a matter of doing well on tests. If someone gets a low SAT score because, despite high intelligence, he doesn’t prepare much for tests and experiences a lot of test-related anxiety, I would assume that such a person would also tend to do poorly on exams in college for the same reasons. If the SAT predicts that such people are less likely to succeed in college, that seems like a useful feature, not a bug to be corrected for.

Going back to your orator example, real-world oratory rarely involves being one in a series of speakers who speak in alphabetical order, which is why you want to correct for the effect of names. If you wanted to hire orators for a job that did involve speaking at events where people speak in alphabetical order, then you wouldn’t want to correct for names at all. People named Zykowski really would tend to be less effective orators than people named Abrams in this scenario, and you would want to preferentially hire the latter.

Of course, it still wouldn’t be fair that someone should have a disadvantage just because he happens to be named Zykowski, so it’s important to be clear on what the goal is: to choose the people who are (for any reason) most likely to succeed, or to help people who are (for unfair reasons) less likely to succeed. If the former, I don’t find your argument for affirmative action in college admissions to be convincing.

Luther

Thanks for pointing this out! Having spent significant effort reducing the importance of tests in the courses I teach (for many reasons; I might post more about this later) and being part of a field that typically puts most grade weight on projects, not tests, I sometimes forget that there are fields in which and instructors for whom test anxiety is likely to dramatically impact student performance.

That said, the general principle holds. Even in a test-heavy college career, having spent a hundred hours taking test prep courses will still inflate test scores in a way that is unlikely to be good for college (assuming college is busy enough that the student can’t continue to pay for hundreds of hours of test prep for every test they take), and there are many other admissions factors that also suffer from some form of inflation and also benefit from correction. A few examples:

Winning prizes at science fairs measures funding as well as science potential.
Well-written application essays measure access to editors as well as writing ability, and access to essay content mentors as well as fit for the college.
Self-reported achievements measure cultural expectations for boasting vs modesty as well as actual achievement.
High school GPA measures the grading scales used by the school as well as student performance. In many high schools in the US, courses you have to pay extra for use a 5-point scale while free courses use a 4-point scale, so it also often measures finances.

And so on. As you point out, some colleges might value some of these also-measured characteristics and want to select for them, but often they don’t value them and want to control for them instead.