I’ve been spending some time at work trying to grapple with how our department should measure and improve teaching effectiveness. I find that I have more ideas come up in this process than I have the ability to express within the context of our department, so I decided to try venting some of them in this probably-too-long blog post instead.

Outline of this post

Teaching effectiveness is the degree to which teachers assist students in gaining the attributes the course is trying to provide. Attributes a course is expected to produce or reinforce generally include some mix of:

knowledge
understanding
ability
attitude
perspective

Student attainment of these attributes depends on many factors, including attributes provided in previous classes, attributes acquired outside of the university, and intrinsic attributes. Effective teachers

help students who arrive with a diversity of attributes
help students attain desirable attributes
do not create or reinforce undesirable attributes

Teaching effectiveness is difficult to measure for many reasons, including

a course’s target attributes are generally not enumerated
multiple attributes can contribute to most actions
measuring students’ attributes is challenging

Current practice in many institutions is to use one or more of five measures that do not measure the attainment of student attributes but are used as proxies for teaching effectiveness anyway. These are, in decreasing order of frequency of use:

popularity votes by students
student opinions of specific aspects of the course
student open-ended responses
peer observation of small portions of instruction
instructor provision of a teaching portfolio

Some possible alternatives that might be more related to student attribute acquisition, but also have some challenges, include

standardized tests
correlation of performance in subsequent courses
satisfaction of postreq instructors and employers
expert focus group evaluation
checklist of teaching practices
compliance with an improvement process

Desired student attributes

Courses are generally expected to produce knowledge, understanding, ability, attitude, and perspective.

Knowledge is ingrained instantly-available facts. If I say 3 + 7 and you know the answer is 10 without any additional work, that’s knowledge. If I say 3 + 7 and you have to count it up or otherwise compute it, that’s not yet knowledge. If you think it’s probably 10 but want to double-check to be sure, that’s partially-acquired knowledge.

Understanding is a mental model of something with sufficient complexity to use to answer novel questions about it. If you understand place-value arithmetic I can change the order or base of the places and you can figure out how to apply existing methods in that new context. Partially-acquired understanding can handle some forms of novelty but not others.

Ability is the knowledge of recipes and sequences of steps, the understanding of processes, and the habit or muscle memory needed to execute them well. Note that ability needn’t be coupled with a knowledge or understanding of why it works: I can unlock the door to my house without any deeper understanding of keys and locks, and I can understand how jewellery is made without having the dexterity or skill needed to make it.

Attitude informs opinions and decisions. Though rarely explicit, most courses I’ve taught have attitudes they wish to produce: attitudes of confidence like “coding’s not scary” or “security’s more complicated than it sounds”; attitudes of priority like “accessibility matters” or “desire elegant code”; and more complicated attitudes that balance multiple conflicting objectives.

Perspective is the basis for wisdom and instinct that arises from having considered many different but related examples. Understanding allows novel experiences to be fit into a model and reasoned about; perspective allows novel experiences to be related to several similar past experiences and reasonable conclusions drawn without spending effort in reasoning. Perspective can also help identify flaws in understanding and inform the development of new or improved models. Like attitudes, perspectives are rarely identified as learning objectives of courses but most courses I’ve taught have perspective outcomes they aim to produce.

Prerequisite attributes

I believe that any desired attribute may be attained by any learner, but not necessarily through any teaching.

Most instruction depends on sight, hearing, and linguistic proficiency.

Most instruction explains new ideas by analogy with old ideas and only works for those who are familiar with those old ideas, which may be taught in earlier coursework or may be assumed to have been encountered externally and thus depend on the learner’s background.

Instruction that introduces several ideas concurrently or in close succession requires a high cognitive load to follow and will be less effective for students who feel unsafe, unwelcome, ill, tired, anxious, or uninterested because those all reduce the availability of or compete for cognitive resources.

Poorly organized instruction requires students to guess where to apply their focus and is more effective for students similar enough to the instructor to guess where the instructor thought they should focus.

Courses that provide only negative feedback work best for students with high self-esteem. Courses that provide little or late feedback tend to compound early misunderstandings and work best for students who were lucky enough to have their misunderstandings happen later in the course instead.

Courses with only one path through the material work best for students whose attitude and preferences align with the instructor’s and whose time availability is compatible with course timing.

Courses with loose structure and few deadlines work best for students with strong time management skills.

Creating undesirable attributes

Many courses and instructors accidentally foster in students attributes that are likely to impede learning. These are too many to list in anything like a representative sampling, but a few scattered examples might give some idea of the scope of these problems.

Instructors who are too proud to admit they don’t know can confuse students and both turn trust in the instructor into incorrect beliefs and create anxiety, uncertainty, and distrust.

Strong positive reactions to one student’s comment can make other students who did not get that kind of reaction feel unwelcome and anxious.

Using fixed-mindset language by referring to what students are (e.g. smart) instead of what they have done (e.g. did the reading) can create an attitude that students’ current attributes are their permanent descriptors and undermine their confidence and interest in learning.

Asking questions with right and wrong answers for verbal responses in class either conveys the idea that the instructor thinks students are stupid for not knowing the right answer (if some questions are ones the student cannot answer) or the idea that the instructor thinks the class doesn’t know even basic things (if all questions are ones the student can answer). Both undermine trust and willingness to engage in the course.

And so on. Most instruction has some undesirable side-effects; effective teachers strive to minimize them while producing more good effects to counter them.

Difficult to measure

Various measures of teaching effectiveness are proposed that roughly approximate how many of the course attributes the students gain. While this is a reasonable idea, and one worth attempting, these measures tend to face various challenges.

A course’s target attributes are generally not enumerated. Most courses are defined by a short topic list with no defined attributes included. Some have defined learning goals, which are usually restricted to a subset of knowledge, understanding, and ability attributes and are generally specified at a level that leaves much room for individual interpretation. In my experience most instructors can’t articulate their attitude and perspective goals even if they are making significant course design decisions based on them.

Multiple attributes can contribute to most actions. A homework problem may yield to first-principles understanding, or to knowledge of and ability to apply the right template, or to the perseverance to try dozens of answers until stumbling upon the right one, or to convincing someone else to give you the answer, either in whole or by walking you through it. Similarly, a single piece of instruction might help one student build an understanding of the system being discussed while another student builds ability to apply a recipe for bypassing one type of problem. Ambiguity is inherent in most learning activities.

Measuring students’ attributes is challenging. Common instruments equate success in doing some task to attainment of some attribute, but success on most tasks could actually reflect other attributes, including some the instructor never considered, such as test-taking skills, or even blind luck. Other instruments rely on students thinking you are measuring one attribute while you are actually measuring something else and are fragile to students who see through the deceit. Observation protocols and think-aloud exercises are likewise confounded by students’ ability to articulate what they are thinking and assessors’ ability to understand what they are observing.

Common proxy measures

When you want to measure, reward, and improve something that is difficult to measure it is common to use some proxy measure: something you can measure that you hope is correlated with what you can’t.

Popularity votes by students are very widespread: all students are asked to rank each course they take. These tend to reflect how entertaining the instructor was with strong biases towards people who resemble the local ruling class. They’re increasingly being seen as ineffective in proxying teaching effectiveness.

Student opinions of specific aspects of the course could theoretically be better, but they seem to have little actual signal. Students tend to give high marks to every question about a course they liked overall, even things the instructor self-identifies as having done poorly or not at all, and likewise uniformly-low marks to classes they disliked. Thus, these tend to roughly mirror the popularity vote in all aspects.
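
One rough way to check for this pattern in a department's own survey data is to correlate each specific question with the overall rating; if every question tracks the overall rating closely, the specific questions are adding little signal. The sketch below is only an illustration of that idea: the question names (`clear_goals`, `timely_feedback`), the `overall` key, and the four responses are all made up, and a real analysis would need far more responses and more care.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def item_overall_correlations(responses, overall_key="overall"):
    """Correlate every specific survey item with the overall rating.

    `responses` is a list of dicts, one per student, mapping question
    names to numeric ratings. All names here are hypothetical.
    """
    overall = [r[overall_key] for r in responses]
    items = [key for key in responses[0] if key != overall_key]
    return {
        key: pearson([r[key] for r in responses], overall) for key in items
    }


# Made-up responses on a 1-5 scale, for illustration only.
responses = [
    {"overall": 5, "clear_goals": 5, "timely_feedback": 5},
    {"overall": 4, "clear_goals": 4, "timely_feedback": 4},
    {"overall": 2, "clear_goals": 2, "timely_feedback": 1},
    {"overall": 1, "clear_goals": 1, "timely_feedback": 1},
]
correlations = item_overall_correlations(responses)
```

In this invented data every item correlates almost perfectly with the overall rating, which is the "mirrors the popularity vote" pattern described above.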

Student open-ended responses tend to have overall affect that mirrors the popularity vote but sometimes have nuances that hint at things the instructor is doing well or poorly. They’re not a good tool for telling if teaching is effective overall but they can be mined to find potential areas of improvement.

Standardized tests are designed to measure student knowledge, with all of the attendant challenges that make that attribute difficult to measure. Additionally, standardized tests can create perverse incentives for instructors to leak answer keys or otherwise undermine the accuracy of the measure. (I had an instructor who rambled off-topic for all of every lecture, then in the last two lectures read us the final exam questions and his answers to them. It wasn’t stated explicitly, but I assume this was an effort to fool the school into thinking he had taught us something.)

Peer observation is more rarely done than student surveys in part because it takes instructor time instead of student time to complete. That burden of labor means observations are almost always restricted to small portions of instruction, such as visiting a single class session or reviewing a single take-home exercise. They also tend to be done by just one or two observers who often have minimal training and experience in performing observations. That said, they can be used to identify positive and negative practices to the degree that the observers can recognize such and happen to observe them within their sample.

Teaching portfolios are a way of directing what peer observers see to more closely mirror what the instructor believes best reflects their teaching. They have an even higher cost in instructor time than peer observations, with similar variation in peer experience performing reviews, plus additional variation in instructors’ ability to present their work well. I’ve seen portfolios used so rarely in higher education that I can’t speak to their effectiveness. I understand they are more common in primary and secondary education and have heard strong and contradictory opinions about them as a measure of teaching effectiveness in that space.

Possibly better proxy measures

How else might we measure the effectiveness of teaching in a college setting? Here are some of the ideas I’ve heard.

Correlation of performance in subsequent courses is an appealing approach: if XYZ 101 is supposed to prepare you for XYZ 201, we can measure how well students who took a given instructor’s flavor of 101 did in 201. But in practice this has many problems; two of the biggest are that it depends on 201’s measures being good (which simply defers the measurement problem to a later course) and that it fails to account for selection bias. (At UVA we saw this in practice: the least popular instructor for one course with several instructors only kept students who didn’t need teacher help to learn, and who did better in their subsequent courses because of that. When that instructor gained more experience and improved as a teacher, that difference went away.)
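
The selection-bias problem can be seen in a toy model. The sketch below assumes, purely for illustration, that a student's 201 score is their incoming preparation plus whatever the 101 instructor added, and that students below some preparation cutoff leave a section and never show up in the 201 data. The function name, the cutoff mechanism, and every number are made up; nothing here is real course data.

```python
from statistics import mean


def mean_201_score(preparation, teaching_gain, retention_cutoff):
    """Mean 201 score among students who stayed in one 101 section.

    Toy model (an assumption, not a measured relationship):
    201 score = incoming preparation + gain from 101 teaching.
    Students with preparation below `retention_cutoff` leave the
    section, so they are missing from the 201 comparison.
    Raises statistics.StatisticsError if nobody stayed.
    """
    stayed = [p for p in preparation if p >= retention_cutoff]
    return mean([p + teaching_gain for p in stayed])


# The same made-up incoming class for both instructors.
preparation = [50, 60, 70, 80, 90]

# Instructor A adds 5 points for everyone and keeps every student.
section_a = mean_201_score(preparation, teaching_gain=5, retention_cutoff=0)

# Instructor B adds nothing, and only well-prepared students stay.
section_b = mean_201_score(preparation, teaching_gain=0, retention_cutoff=70)

print(section_a)  # 75
print(section_b)  # 80
```

In this toy model the instructor who taught nothing looks better in the 201 numbers (80 versus 75) only because the weaker students left that section before the comparison was made.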

Satisfaction of postreq instructors and employers is not something I see discussed explicitly very often, but it often serves as a proxy for the accuracy of other methods. If students who do well on the XYZ 101 test are seen as inadequately prepared by the XYZ 201 instructor, that’s taken as a sign that the 101 test is not an effective measure of student learning. I’ve not yet seen a concrete proposal for how to use this idea to measure teaching effectiveness directly, and I expect that if it were used it would be polluted by internal faculty power dynamics and biases.

Focus groups of students can be a useful tool for discovering details about a course’s operations and impact, particularly if the focus group is conducted by someone who has expertise in learning that the students themselves lack. However, running effective focus groups is a challenging skill in its own right and extracting useful data from the groups’ replies is also nontrivial. I’ve never heard of a school that invested in enough trained qualitative researchers to conduct these across more than a few courses.

A checklist of teaching practices has been recommended by some as a useful proxy measure. For example, there’s fairly strong research that classroom practices that cause all students to engage in doing something simultaneously (often called active learning) are generally effective, so we could check to see how often that is part of a given instructor’s operation. We could likewise check for other practices that we have a consensus of evidence are either good or bad, and use those counts to create a score for the instruction. This isn’t perfect: we know that good practices can be misused, that usually-bad practices may be appropriate in some contexts, and that effective teaching relies on pacing and order of content as well as the practices used to convey it. But it likely has different kinds of shortcomings than other measures discussed here, and is related to some other successful measures in academia like citation counts and Hirsch numbers.
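
As a minimal sketch of what "use those counts to create a score" might look like, the Python below computes a weighted count of observed practices per observed session. The practice names, the +1/-1 weights, and the per-session normalization are all my own assumptions for illustration; choosing a real checklist and real weights is the hard part and is not addressed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Practice:
    name: str
    weight: float  # positive if believed helpful, negative if believed harmful


# Hypothetical checklist; names and weights are illustrative only.
CHECKLIST = [
    Practice("active_learning", 1.0),
    Practice("timely_feedback", 1.0),
    Practice("fixed_mindset_language", -1.0),
]


def checklist_score(observed_counts, sessions_observed, checklist=CHECKLIST):
    """Weighted count of observed practices per observed session.

    `observed_counts` maps practice names to how many times an observer
    saw them; practices missing from the mapping count as zero.
    """
    if sessions_observed <= 0:
        raise ValueError("sessions_observed must be positive")
    total = sum(
        practice.weight * observed_counts.get(practice.name, 0)
        for practice in checklist
    )
    return total / sessions_observed


# Made-up observation tallies from two class sessions.
tallies = {
    "active_learning": 4,
    "timely_feedback": 2,
    "fixed_mindset_language": 1,
}
print(checklist_score(tallies, sessions_observed=2))  # 2.5
```

A single number like this hides exactly the things noted above: whether a good practice was misused, whether a usually-bad one fit the context, and how the content was paced and ordered.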

Compliance with a process is a common tool in business management to proxy for hard-to-measure ideals. For teaching effectiveness, this might look like instructors creating a plan for improvement, implementing that plan, and reporting back that they did so. Processes used in this way have a tendency to become more complicated and onerous over time as a reaction to those being measured for compliance trying to bypass their intent or those doing the measuring trying to better fit the process to what they care about. However, when that temptation can be tamed and when those under the process agree with its intent, there is some evidence that such processes can help improve, if not directly measure, their subject. I’ve not yet seen such systems used to improve teaching effectiveness.