# The Statistically Valid Way to Analyze Student Evaluations of Teaching (SETs)

In my last post, I reviewed some problems with doing math on “Likert items” or “Likert scales”, because they are fundamentally ordinal data: categorical data that can be ranked.

Adhering to the adage that one should not criticize without offering a solution, I now will describe how these data can, and should, be analyzed.

Pick up an introductory statistics text and it will probably tell you, within the first chapter (in five of the six books in my library; the latest any of them got to it was page 17), about data scales (nominal, ordinal, interval and ratio) and the types of measures you can calculate from them: for nominal data, the mode and the percentage in each category; for ordinal data, the median and mode plus the same percentages as for nominal; for interval data, the mean, median, mode, standard error, etc. Chances are that later in the book they will talk about how to analyze these types of data.

Hopefully, by now I have convinced you that *student evaluations of teaching are ordinal scale data*, and thus you can calculate the frequency of responses in the different categories, determine the median response (the one in the middle of the observed data), and the mode (the response given most commonly). But a promotion and tenure committee might not know what to do with these. You can also **look at the distribution of the responses across all the categories and compare those distributions to other real or hypothetical distributions**, and therein lies some power. I will illustrate this process with some of my own data.

I just went through my 5-year, post-tenure review. In it, I had to review 2 courses and comment on my teaching evaluations. I chose a first-year introductory biology course which has been, for years, regarded as a hard class that students don’t particularly want to take. If you don’t believe me, here are some data taken from 5 years of IDEA Center evaluations (I summed the number of students responding to each category across 10 sections, 2 sections each fall).
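The ordinal summaries above (category percentages, the median response, and the mode) can be sketched in a few lines of Python. The counts here are hypothetical stand-ins, not the actual IDEA Center data:

```python
# Minimal sketch of the ordinal summaries described above: category
# percentages, the median response, and the mode.
# NOTE: the counts below are hypothetical, not real evaluation data.
from collections import Counter

LEVELS = ["No Apparent", "Slight", "Moderate", "Substantial", "Exceptional"]
responses = (["No Apparent"] * 3 + ["Slight"] * 5 + ["Moderate"] * 24
             + ["Substantial"] * 60 + ["Exceptional"] * 57)

freq = Counter(responses)
pct = {level: 100 * freq[level] / len(responses) for level in LEVELS}

# Median of ordinal data: order responses by category rank, take the middle one.
ranked = sorted(responses, key=LEVELS.index)
median_response = ranked[(len(ranked) - 1) // 2]

# Mode: the most frequently chosen category.
mode_response = freq.most_common(1)[0][0]

print({level: f"{p:.1f}%" for level, p in pct.items()})
print("median:", median_response, "| mode:", mode_response)
```

Note that the median and mode are category labels, not numbers: no averaging of the ordinal codes is required.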

I tricked you, because, without any specialized training, you can see at a glance that they think the course is hard and they don’t want to take it. Now, I would not want my job to rest on gut impressions from a distribution. And in reality, what matters is what students are learning (or, at least, what they self-report that they are learning). Here is the distribution of responses to the prompt, “Describe the amount of progress you have made on each [of the following learning objectives]” for the category of learning, “Gaining factual knowledge”.

8 of the 149 students responding said that they made “No Apparent Progress” or only “Slight Progress” on this learning goal. 117 of 149 said that they made “Substantial Progress” or “Exceptional Progress”. The cool thing about this is that I can take this (the best learning goal that I had in these data) and ask myself how my other learning goals stack up. How about “Learning Fundamental Principles”?

It looks like the distributions might not be the same, because I have a lower percentage of students responding “Exceptional Progress”. Now, if we knew how far “Exceptional Progress” was from “Substantial Progress”, we could analyze this difference by calculating means. But we don’t know that, so we can’t. What we can do is ask the question, “Is the distribution of responses on this learning goal different from the distribution for ‘Gaining Factual Knowledge’?”

I did this with a Chi-square test, a simple test you can calculate on the back of an envelope, or build a simple Excel spreadsheet to compute automatically if you will be running lots of the same kind of analysis. I used the “Gaining Factual Knowledge” distribution to construct expected values for the “Learning Fundamental Principles” distribution; the Chi-square test simply examines the differences between observed and expected frequencies across categories to determine whether the cumulative differences are big enough to say that the two distributions are different. For this test, the Chi-square statistic was 5.13, with a p-value of 0.27, meaning that differences at least this large would arise by chance about 27% of the time even if the two distributions were really the same. Generally, the convention for statistical significance is p < 0.05. Therefore, I would conclude that the learning gains on these two objectives were not different. There is another learning gain that did not go so well: “Learning to Analyze and Critically Evaluate Ideas, etc.”. The distribution is below, and you can see that it “looks different”.
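The back-of-envelope calculation described above can be sketched as follows. The category counts are hypothetical stand-ins for two Likert distributions (the actual IDEA Center counts from the post are not reproduced here), and only the p-value lookup uses a library:

```python
# Chi-square goodness-of-fit comparison of two Likert distributions.
# NOTE: the counts below are hypothetical, not the post's real data.
from scipy.stats import chi2 as chi2_dist

# Reference distribution, e.g. "Gaining Factual Knowledge" (n = 149)
reference = [3, 5, 24, 60, 57]
# Distribution to test, e.g. "Learning Fundamental Principles" (n = 149)
observed = [4, 8, 30, 62, 45]

# Expected counts: scale the reference distribution to the observed total,
# i.e. "what would this goal look like if it matched the reference?"
scale = sum(observed) / sum(reference)
expected = [c * scale for c in reference]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value from the chi-square distribution with (categories - 1) df.
df = len(observed) - 1
p_value = chi2_dist.sf(chi_sq, df)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.3f}")
```

The statistic itself really is envelope-sized arithmetic; only the conversion to a p-value needs a table or a function like `scipy.stats.chi2.sf`.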

It seems that there are just more people in the “Slight Progress” and “No Apparent Progress” categories, and when you calculate the Chi-square (61.8) you find that the distributions are significantly different (p < 0.01), with the difference driven by the increase in “Slight Progress”. If we try to “average” these things, the averages for the three learning goals are 4.1, 3.97 and 3.69. Now, I don’t know 4.1 *whats* or 3.69 *whats*, so it is difficult for me to know, based on these “averages” of integer values assigned to a categorical variable, what, if anything, I should do. But I can see by looking at the distributions that something is up, which is then verified by a statistical test, and I can then begin to ask myself some questions about this particular learning goal, the structure of the course, and why this learning goal is being assessed differently. **It is hard for me to look at a difference of “0.28 whatever the scale is here” and know whether this is something I should pay attention to or not.** And statistical practice would say that **the 0.28 difference is meaningless because it is based on an average of something that cannot be averaged**. McCullough and Radson (2011) would say that these averages can be misleading in both directions, and they have a nice analysis based on real SET data from a university to illustrate their point. It is interesting reading and I recommend it.

**So, how can this be used at broader scales?** A college could use the data from all classes to set up a large distribution of responses to these Likert items and make it available to their faculty; faculty could then test their own data against this large “expected distribution” to see whether their results fall outside the norm, whether better or worse than the college distribution. Colleges could also look at the categories and, informed by prior data from their college, set acceptable floors for the percentage of students making “Substantial Progress” or better, or ceilings for the percentage of students making “Slight Progress” or worse (see figure below).
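A floor/ceiling check of the kind described above is a one-liner per threshold. The counts and the threshold values here are hypothetical, purely for illustration; a real college would set them from its own prior data:

```python
# Sketch of a floor/ceiling check on a class's Likert distribution.
# NOTE: counts and thresholds are hypothetical, for illustration only.
counts = {"No Apparent": 5, "Slight": 12, "Moderate": 30,
          "Substantial": 70, "Exceptional": 32}
n = sum(counts.values())

pct_substantial_or_better = 100 * (counts["Substantial"] + counts["Exceptional"]) / n
pct_slight_or_worse = 100 * (counts["No Apparent"] + counts["Slight"]) / n

FLOOR = 60.0    # hypothetical: at least 60% should report Substantial+ progress
CEILING = 20.0  # hypothetical: at most 20% may report Slight or worse

meets_floor = pct_substantial_or_better >= FLOOR
under_ceiling = pct_slight_or_worse <= CEILING

print(f"Substantial or better: {pct_substantial_or_better:.1f}% "
      f"(floor {FLOOR}%, met: {meets_floor})")
print(f"Slight or worse: {pct_slight_or_worse:.1f}% "
      f"(ceiling {CEILING}%, met: {under_ceiling})")
```

Because the check works directly on category percentages, it never requires averaging the ordinal codes.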

**One advantage of such a system is that it would be clearer to faculty what bar had to be crossed, or not fallen below. It would also allow promotion and tenure decisions to be based on quantitative analyses of the categorical data that come from Likert items, grounded in a statistically appropriate method for treating these ordinal data.**

This is not earth-shattering: these ideas can be found in the literature, and the statistical principles are so basic that they are not even found in advanced texts but are always found at the beginning of beginner texts. In addition, the argument that “everyone does it this way [averaging categorical data]” does not really hold up, because in about 5 minutes of web surfing I found that the University of Michigan and Oregon State University treat SET data as ordinal scale data.