When Copilot is the Data Analyst
Civics of Technology Announcements
Next Tech Talk: Please join us for our next Tech Talk where we meet to discuss whatever critical tech issues are on people’s minds. It’s a great way to connect, learn from colleagues, and get energized. Our next Tech Talk will be held on Wednesday, February 4th at 8:00 PM Eastern Time. Register here or visit our Events page.
Privacy Week Webinars: On January 22 (note the updated date!) and 29. See our Privacy page to learn more!
Latest Reviews:
Mood Machine: The Rise of Spotify and the Costs of the Perfect Playlist, by Liz Pelly, 2025.
Shell Game, podcast hosted by Evan Ratliff, 2024-Present
By Jacob Pleasants
At the beginning of December, I listened to an episode of WBUR’s On Point with a provocative title: “Is education research actually helping teachers?” The guest on the show was David Marshall, Associate Professor of Education at Auburn University. Marshall had written an opinion piece for The Hill (published October 28) in which he argued that “much of today’s research doesn’t reflect what teachers say they need” (par. 2). His claim was based on a research study that he and his collaborators had recently conducted, which is currently available as a preprint. In their study, they compared the main research themes at recent AERA conferences (the largest education research conference in the United States) with what a sample of U.S. teachers said that education researchers ought to be investigating. They found little overlap between the two:
While conference research overwhelmingly focused on equity, social justice, and identity, teachers prioritized issues such as student behavior, mental health, technology use, parent support, and retention. (Marshall et al., 2025, Abstract)

Marshall’s opinion article has generated some pushback from scholars (some of whose voices were featured on the On Point episode), who question the implication that research on equity, social justice, and identity is “out of touch” with teachers’ needs. It’s a fair concern, and the broader conversation about priorities in education research is an important one. But that’s not what has motivated me to write this piece. I want to highlight an aspect of Marshall et al.’s study that very much caught my attention, but went completely unscrutinized during the On Point conversation or in any of the responses that I have encountered. Specifically, how exactly did they analyze those AERA conference programs? According to the preprint:
To identify the major themes of the conference, we utilized Microsoft Copilot, an AI-powered language model, to analyze the full text of the conference programs from 2021 to 2025… Copilot systematically reviewed this content using natural language processing techniques to identify recurring keywords, concepts, and disciplinary focuses. (p. 4)

Well, that’s interesting.
Marshall actually mentioned his use of AI during the On Point interview, but it received no further discussion. I find it rather extraordinary that this choice would receive no comment at all, let alone criticism.
Of course, there’s nothing especially novel about using machine learning or other AI techniques for education research. Researchers have been doing this for years (and here’s a recent example). But there’s a big difference between using built-for-purpose analytical software and sending a dataset over to a general-purpose commercial LLM, as I detail below. Yet as extraordinary as this methodological choice might be, it did not even register as particularly interesting to those discussing Marshall et al.’s research.
Why this is, in fact, extraordinary
To better understand what was going on, I decided to investigate the analysis that Copilot carried out. I downloaded the most recent AERA program, uploaded it to Copilot (my university has an enterprise license), gave it the same prompt described in Marshall et al., and let it run. The main output I received more or less matched what was reported in the paper. But what was Copilot actually doing? Replicating the output of a black box doesn’t make it any more transparent (or valid). So, I investigated further. I will spare you all the details, but here are some key things I found:
You can ask Copilot what it did to generate its results. When I asked how it identified the top themes of the conference sessions, I expected that it would have written some custom code. But it reported that it actually used an “internal file-summarization tool to semantically extract themes,” which apparently includes “semantic clustering and ranking.” [Note: Should I trust what Copilot is reporting here? I don’t really have much of a choice, and that makes me uneasy.]
I am not an expert on semantic clustering, but I do know that it is not an “objective” or “judgment-free” process. For the conference program, you have to decide how to segment the data (what’s the unit of analysis? Is a paper presentation weighted the same as a roundtable or symposium?). You have to decide what you are excluding from analysis (should you stick to paper titles or use the session title?). You have to decide which computational techniques to use (there are many options). I asked Copilot to list the potential “biases” that can emerge from these decisions, and they are myriad (many are highly technical choices outside of my expertise).
So, how does the “file-summarization tool” make those decisions? Mostly, they are hard coded into the program and draw upon the very LLM architecture itself. The program was designed to use certain document-parsing heuristics, filter certain kinds of words, use certain computational methods, and so on. It also relies on its internal language model to link certain terms to one another (e.g., “equity” and “justice”). You can ask Copilot to report those design decisions.
An interesting decision: The summarizer can parse the program structure and separate out symposia, roundtables, posters, etc. But when it comes time to extract themes, all of those presentations are weighted equally. That is, for the purposes of identifying the prominent themes among the presentations, a poster on equity is weighted the same as a symposium on equity. I would argue that this “equal weight” decision is highly questionable! But you have to do some serious digging to find out that this was even a decision at all.
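To make concrete how much that one hidden choice matters, here is a small illustration. The themes and counts below are invented, not drawn from the actual AERA programs; the point is only that counting every session equally versus weighting symposia more heavily than posters can flip which theme looks most prominent.

```python
# Invented numbers, purely for illustration: how the "equal weight" decision
# can change which theme appears most prominent.
from collections import Counter

# (theme, session_type) pairs for an imaginary conference program
sessions = (
    [("equity", "poster")] * 30
    + [("equity", "paper")] * 10
    + [("student behavior", "symposium")] * 15
    + [("student behavior", "paper")] * 20
)

# Decision A: every session counts the same (what the summarizer appears to do)
unweighted = Counter(theme for theme, _ in sessions)

# Decision B: weight by session type (the weights here are arbitrary)
weights = {"symposium": 3.0, "paper": 2.0, "poster": 1.0}
weighted = Counter()
for theme, kind in sessions:
    weighted[theme] += weights[kind]

print(unweighted.most_common())  # "equity" ranks first (40 vs. 35 sessions)
print(weighted.most_common())    # "student behavior" ranks first (85 vs. 50)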
The upshot is that when you use Copilot to extract themes from a dataset, there is a (potentially) legitimate analytical approach being used – but you have offloaded all the key decisions to the software. Which means you don’t even need to know what those decisions are. Just push the “analyze” button and receive the results.
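For contrast, here is a rough sketch of what it looks like when the researcher makes those decisions explicitly rather than offloading them. To be clear, this is not Copilot’s internal pipeline (that remains opaque to me); it is just one plausible, hand-rolled version of semantic clustering, with invented session titles, where every choice is visible in the code and could be reported in a methods section.

```python
# A hypothetical, hand-rolled version of "semantic clustering and ranking."
# This is NOT what Copilot does internally; it only makes visible the decisions
# a researcher would otherwise be offloading to the tool.
from collections import Counter

from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.cluster import KMeans

# Decision 1: unit of analysis -- individual session titles (invented examples).
sessions = [
    "Equity and Justice in Urban Classrooms",
    "Centering Identity in Teacher Preparation",
    "Managing Student Behavior After the Pandemic",
    "Supporting Student Mental Health in Schools",
    "Teacher Retention in Rural Districts",
    "AI Chatbots and Classroom Technology Use",
]

# Decision 2: which embedding model represents the text (many alternatives exist).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sessions)

# Decision 3: how many themes to extract and which clustering algorithm to use.
n_themes = 3
labels = KMeans(n_clusters=n_themes, random_state=0, n_init=10).fit_predict(embeddings)

# Decision 4: how to rank themes -- here by raw session counts, i.e., equal weighting.
for cluster_id, count in Counter(labels).most_common():
    members = [s for s, lab in zip(sessions, labels) if lab == cluster_id]
    print(f"Theme {cluster_id} ({count} sessions): {members}")
```

Every line marked “Decision” is a judgment call that shapes the results, and every one of them could be made differently and defended (or criticized) in writing.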
Is this any different from using something like SPSS to run statistical tests? I argue that it absolutely is. When I use statistical software, I am not grinding through the calculations, but I am making the decisions about what statistical tests I am going to perform, what models I’m going to use, and so on. If SPSS ever adds an AI feature that lets you offload those choices, then we’d have something similar. Let’s hope that day does not come.
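To illustrate the contrast, here is roughly where the decisions live when you use conventional statistical software (the data below are made up; this is only a sketch): the software grinds the numbers, but the researcher explicitly chooses the test, its assumptions, and the hypothesis.

```python
# Invented data, purely to illustrate the division of labor with conventional
# statistical software: the tool computes, the researcher chooses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=10, size=30)  # hypothetical outcome scores
group_b = rng.normal(loc=55, scale=12, size=30)

# The researcher explicitly chooses the test (Welch's t-test), the variance
# assumption (equal_var=False), and the alternative hypothesis -- none of
# these choices are delegated to the software.
result = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="two-sided")
print(result.statistic, result.pvalue)
```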
Offloading the decision-making process is a problem, but there’s an even larger issue at play. When a researcher reports in a manuscript that “we analyzed our data using Copilot,” there is minimal transparency about what was done. To be clear, I do not object to the use of analytical techniques like semantic clustering, provided that the researchers are upfront about what decisions they made and why. There may even be a world where using Copilot (or some other chatbot) to do the semantic clustering is reasonable, provided that the researchers investigate and report the details of its analytical process as I have done above (recognizing, though, that these commercial systems are subject to change at any time without notice). But if the methods section essentially boils down to “the LLM did it,” that is a serious problem.
Does the Marshall et al. paper provide sufficient detail about their analysis? You can be the judge. Here is their full description of the analysis:
To identify the major themes of the conference, we utilized Microsoft Copilot, an AI-powered language model, to analyze the full text of the conference programs from 2021 to 2025. Our university-supported Copilot subscription does not use uploaded data or conversations for training. The conference programs, provided in PDF format, included detailed descriptions of all scheduled sessions, including paper presentations, symposia, roundtables, and workshops. Copilot systematically reviewed this content using natural language processing techniques to identify recurring keywords, concepts, and disciplinary focuses. Sessions were grouped into preliminary thematic categories based on semantic similarity and topical overlap. These categories were then iteratively refined through clustering and frequency analysis to ensure conceptual coherence and distinctiveness. The final set of themes was determined by ranking categories according to the number of sessions associated with each, resulting in five dominant themes. We utilized 3 main prompts. The first was applied to each individual year’s program, after uploading the pdf. "Could you give me the main themes of this conference based on the descriptions of conference presentations?" Then, after each program was addressed separately, we sought to generate themes across the full set of programs with the prompt “What were the most prominent themes across each of the annual programs?” Then relative prominence of each theme was visualized using a frequency-based bar chart to illustrate their distribution across the program, using the prompt “Would you be able to create a bar chart of the frequency of the top 5 themes, to show their relative importance?” (p. 3-4)
The Future
I suspect that we are going to start seeing a lot more research that lacks transparency, research that does little more than say “AI did the analysis.” There is a growing number of research studies that compare LLM-based analysis to human coding (here are three examples: 1, 2, 3), presumably to legitimize the approach. It is incumbent on all of us not to normalize this practice. We should ask questions. Lots of questions.
I don’t necessarily fault Meghna Chakrabarti, the host of On Point, for missing this, though I wish she hadn’t. And the paper by Marshall et al. is just a preprint; it hasn’t yet gone through peer review. I sincerely hope that any peer reviewer would raise the kinds of questions that I have laid out here. Maybe this is a one-off case of authors trying to generate buzz around not-yet-published work. Maybe. But I suspect it is indicative of things to come. Hopefully my investigations here can provide you with some tools to push back if and when you encounter an “AI analyst.”