I am currently serving on a committee in my department with a multi-part charge which includes, among other things, arriving at a better way of evaluating the quality of teaching in the department. (I’ve written about a broader perspective of this work in post 741.) Broadly speaking, three approaches have been proposed by those we have spoken with and in the publications on the topic which we have read.

Option 1 is subjective, and hence given to error and bias. (Not purely subjective: there are many objective things known about good and bad teaching, but not everything is known, and human judgment is subjective even when looking for objective things.) It also presupposes access to people who know what good teaching is when they encounter it. It’s by far the most common approach that I’ve encountered in the institutions of higher learning of which I am aware, and as such has many variations of implementation.

Option 2 is at least theoretically objective, but relies on access to a trusted and accurate list of components of good teaching. It also presupposes that adding more good components leads to better teaching and that good components are always good, regardless of the teacher and context. Although no one has yet provided counter-evidence to those suppositions, counter-instinct is widespread and there seems to be little stomach for option 2 among our faculty, nor am I familiar with another institution that uses it.

Option 3 is at least theoretically objective, and indeed a reasonable definition of what we mean by teaching quality, but it begs the question “how do you know how much the students learned?” Throughout education, student learning is measured by (a) asking the students to do something, (b) evaluating how well they do it, and (c) inferring from that how much they’ve learned. But that process has problems, and not just the “did they really do it?” problem we call academic dishonesty. (I’ve posted about academic dishonesty before in posts 578 and 580.)

Filter or teach

I’ve seen instructors have students emerge from their class with strong understanding of course material that appeared to come from something other than quality teaching. They may have been quality teachers as well, but that was not sufficient to explain their students’ success.

The first of these that I noticed was an instructor who actively encouraged a reputation of their elective course as the hardest course there was, one that only the very best students should even consider taking, and encouraged those students who did enroll to drop the course while they still had the chance. Although student self-perception of ability to handle such a course is not perfectly aligned with their actual academic ability, they are correlated and it is not surprising that students who completed that course had noticeably more ability than other students. Even mediocre instruction aimed at the most skilled students will yield a great deal of learning.

At first I thought that kind of filter for some type of previously-acquired ability was restricted to elective, not required, courses – but then I became aware of a required course whose reputation for difficulty led students to start studying course material from online videos and private tutors and form study groups for the course before the school term began; and another course that the student tradition said you took twice, once up to the drop deadline and then again the next term through to the end. Even mediocre instruction coupled with significantly more student time will yield a great deal of learning.

These all seem rather extreme, but there’s another case that I’ve found is commonplace: a course that students understand will require much more of their time than other courses. If my students invest 6 credit-hours of work studying for my 3-credit-hour course, how could I fail to achieve higher outcomes than other courses do?

These various course designs have two features in common. First, they achieve higher student outcomes through a mechanism unrelated to teaching quality. Second, that mechanism is based in reputation among students, which grows slowly in the student rumor mill, creating the illusion of quality increasing from one semester to the next.
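The filtering effect can be illustrated with a small simulation. This is only a sketch under a toy model that is an assumption, not anything measured: a student's final score is their prior ability plus a fixed teaching gain plus noise, and only the top `keep_fraction` of students by prior ability stay past the drop deadline. The function name, the parameters, and all numbers are hypothetical.

```python
import random
from statistics import mean

def simulate_course(n_students, teaching_gain, keep_fraction, seed=0):
    """Return the mean final score of the students who finish the course.

    Assumed toy model: final score = prior ability + teaching_gain + noise.
    keep_fraction is the share of enrolled students, ranked by prior
    ability, who stay past the drop deadline; 1.0 means no filtering.
    """
    rng = random.Random(seed)
    abilities = sorted((rng.gauss(50, 10) for _ in range(n_students)),
                       reverse=True)
    kept = abilities[:max(1, int(n_students * keep_fraction))]
    return mean(a + teaching_gain + rng.gauss(0, 5) for a in kept)

# Same teaching gain, with and without a reputation that scares students off.
unfiltered = simulate_course(1000, teaching_gain=10, keep_fraction=1.0)
filtered = simulate_course(1000, teaching_gain=10, keep_fraction=0.3)

# Weaker teaching, but only the strongest 30% of students remain.
filtered_mediocre = simulate_course(1000, teaching_gain=5, keep_fraction=0.3)

print(f"unfiltered:          {unfiltered:.1f}")
print(f"filtered:            {filtered:.1f}")
print(f"filtered + mediocre: {filtered_mediocre:.1f}")
```

In this toy model the filtered course reports higher average scores even when its teaching gain is half that of the unfiltered one. Real self-selection is noisier than ranking by true ability, so the size of the effect here should not be read as an estimate.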

Raised bar

Most students do not work on learning until they have learned all they can; rather, they work on learning until the perceived yield of additional work is lower than the perceived urgency and importance of other demands on their time, such as other classes, employment, personal hygiene, a social life, and so on.

If I change nothing about my instruction but increase the grade threshold needed to do something the students want (land a job, continue in the degree program, keep a scholarship, or the like), then student performance (or at least the part of it that grades measure) will improve. The quality of teaching has not gone up; the relative priority of the course has. The value of the course has gone down, offering the same quality of education but less access to the jobs, degrees, scholarships, or the like.

Raised bars are often also accompanied by a filtering effect, further exacerbating their impact.

Overfitting

When I was a child, standardized tests were being rolled out to many government-owned schools in the state where I lived. A goal of these tests was to measure all students’ learning equally so that the quality of teachers could be compared. But it soon became evident that what standardized tests actually encouraged was teaching only the subset of content that was going to be tested. In some cases this meant actually teaching about the test itself, rather than the topic the test was measuring, drilling tricks like “the correct answer on the reading comprehension question never quotes the text” that had nothing to do with the topic being measured.

I have never yet encountered an assessment that measures all of the learning a university course ought to offer. (I do have colleagues who assert that such assessments exist, but I’ve not yet looked into their claims in detail.) Perspectives and attitude are rarely measured, but even ability and understanding are hard to measure when the topic at hand is more involved than arithmetic. (I’ve written about the outcomes of knowledge, understanding, ability, attitude, and perspective before in post 741.) If I optimize my instruction to maximize assessed ability, I will over-fit to the specific parts of the course outcomes that I can assess and generally sacrifice time on the other outcomes to make that happen.

Assessable subset

One of the students I have worked with whom I thought (and still think) most highly of got middling to low grades in the courses she took with me. She was a skilled teaching assistant, a dedicated and successful researcher, a good communicator, a quick learner, creative and determined. She struck me as the student most likely to make a positive impact on the world, and so far she’s still on that trajectory. But she did not ace my classes.

Why not?

Because there are many things I value that I have no idea how to assess. Creativity. The ability to continue in tasks past the first sign of trouble. Prioritization that realizes that not every course topic matters to every person. Integration of ideas across multiple fields. None of these showed up in the assessments I had in my several-hundred-student computer hardware and systems programming course. I don’t even know how I could have assessed them.

I had several excellent teachers that helped me develop my curiosity, my sense of the connectedness of diverse topics, my desire to go forth and do something more than cram for a test. They helped me a great deal, they taught me a great deal, but they didn’t all teach me the assessed course material in noteworthily-good ways. There are qualities in a teacher which I want to select for, want to encourage and develop in the teachers in my department, which are not qualities that I expect will ever show up in student learning assessments.

Moving baseline

When I first taught Introduction to Programming, most students had only a vague idea what programming was and no prior experience with anything like it. When I last taught Introduction to Programming, almost all had a reasonable concept of what programming was and most had written multiple small programs before coming to class. If the programming ability of students exiting my class was taken as an indication of my teaching ability, I was in luck! I had less to teach them every year.

When I first taught Computer Graphics, students came in excited to learn how to get a computer to make pictures for them. As game engines and rendering software became more common, I had to work harder to convince the students that it was worth learning how those tools worked and not just how to run them. As AI-driven techniques like stable diffusion became popular, I started having to convince students that the basic model-to-image pipeline that computer graphics studies was even relevant. If the graphics programming ability of students exiting my class was taken as an indication of my teaching ability, I was doomed! They were showing up less interested and less engaged every year.

Deferred problem

If I control the assessment but not the instruction, I can get any grade distribution I want. I can focus on the easiest pieces, give generous partial credit, allow many tries, support guess-and-check, or match the assessment very closely to the examples in the instruction, and get very high grades. I can focus on the hardest pieces, give all-or-nothing credit, have tight time limits, have questions that require a chain of correct steps to solve, or make assessments unlike any example in the instruction, and get very low grades.
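As a toy illustration of two of those levers, the sketch below scores the same student work under generous partial credit and under all-or-nothing credit. The `score` function, the rule names, and the sample work are all hypothetical, made up only to show how much the grading rule alone can move the distribution.

```python
from statistics import mean

def score(steps_correct, rule):
    """Score one multi-step question.

    steps_correct is a list of booleans, one per step of the solution.
    rule is "partial" (credit per correct step) or "all_or_nothing".
    """
    if rule == "partial":
        return sum(steps_correct) / len(steps_correct)
    if rule == "all_or_nothing":
        return 1.0 if all(steps_correct) else 0.0
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical work from four students on one four-step question.
work = [
    [True, True, True, True],
    [True, True, True, False],
    [True, True, False, False],
    [True, False, True, True],
]

generous = [score(w, "partial") for w in work]
strict = [score(w, "all_or_nothing") for w in work]

print("partial credit:", generous, "mean", mean(generous))
print("all or nothing:", strict, "mean", mean(strict))
```

The students did exactly the same work in both cases; only the rule for turning that work into a number changed.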

If I control the instruction but not the assessment, I can also get most grade distributions by deciding how much I want to focus narrowly on the answers to the assessment prompts or ignore the kinds of tasks that the assessments ask students to do and discuss only the broader topic and equally-relevant but not-assessed tasks.

If I control both instruction and assessment, as most university instructors in my field do, I can get any grade distribution I wish. It’s entirely within my control.

Thus, using an assessment of student learning as evidence of instructor quality generally relies on some other, challenging calibration. I’ve seen the following calibrations proposed:

Evaluate the alignment between the assessments and the learning outcomes and the coverage of the learning outcomes by the assessments.
Have someone else evaluate student quality subjectively, such as teachers of follow-on courses commenting on how prepared students leaving a course are.
Have someone else evaluate student quality objectively, such as by creating a fair standardized test or by tracking grades in a follow-on course.

These are all promising ideas insofar as they go. But they’ve deferred one hard problem to another also-hard problem; they’re not solutions for measuring the quality of education until we have a good, fair, reliable way of doing this other calibrating evaluation.

Showing good teaching

I have colleagues and friends who will need to demonstrate that their teaching is good in order to be promoted in, or in some cases even retain, their job. Several of them feel that the primary evidence of their good teaching is the large number of students who learn their course material well. How should they, or anyone else in a similar situation, make that case?

Different people may have different opinions, but here are mine.

First, “I worked on the course and the scores went up” is not compelling to me. It is better if it is accompanied by evidence that the assessments are stable and have good coverage of learning outcomes, but even then I’d be hesitant to accept this as evidence of good teaching. The change in scores could also be caused by a moving baseline, or overfitting, or a raised bar, or more filtering. Even if it is caused by the changed instruction, it might have worked from blind luck instead of reasoned improvement and thus not be evidence that the instructor will continue to have good results going forward.

More compelling to me is evidence of a reasoned process of improvement. I’d be quite happy with the following outline:

Objective (but possibly over-fit) assessments and subjective (but harder to bypass) observations agreed on a specific flaw in student understanding or difficulty in student learning.
A change to the course was designed that was expected to fix that. That expectation was based on these reasons, but also had these risks.
After the change was implemented, the assessment and observation agreed that the specific problem was improved, the risks didn’t materialize in harm, and learning in the course overall was not hurt. Or, if it didn’t work that well, new insight was gained and a new change planned.

That story suggests that the teacher not only successfully improved student learning, but that they did so intentionally and can be expected to do so again; and hence that that teacher is one to retain and promote as a likely future asset to the school and its students. It doesn’t technically say that they are good teachers now, but it does suggest that if they are not good now they will get better going forward. (I have tried to compare teachers of the same course by their students’ outcomes and been frustrated by how messy that was to interpret, but that’s a topic for another post.)