A/B Testing for Developing Learning Programs
For some years now, technology developers from industries other than education have been releasing products to users as soon as possible and then collecting and using data from the users to determine consumer preferences. Technology developers amass a large user base so they can collect and learn from data about how users respond to their product. In the commercial world, this approach can lead to faster development of better products at a lower cost.
This approach is now being extended to learning systems, with networks of teachers and/or curriculum experts providing ongoing reviews and analyses as learning system development progresses.
Educational Data Mining
One advantage of digital learning systems is that they can collect very large amounts of data (big data) from many users quickly. As a result, they permit the use of multivariate analytic approaches (analyses of more than one statistical variable at a time) early in the life cycle of an innovation. But big data requires new forms of modeling for data that are highly interdependent (Dai 2011). Accordingly, the emerging field of educational data mining is being combined with learning analytics to apply sophisticated statistical models and machine learning techniques from such fields as finance and marketing (U.S. Department of Education 2012a).
The need for new techniques for mining data also is giving rise to a new type of professional: the learning data scientist. The field of data science emerged in the last few years, in parallel with the growth of big data. Data scientists, whose formal training may draw on computer science, modeling, statistics, analytics, and math, were first employed in marketing and finance but now have a place in education. Good learning data scientists are capable of both structuring data to answer questions and applying learning principles to select the right questions to study.
One of the key challenges of educational data mining is determining how best to parse learning interactions into right-sized components for analysis (Siemens and Baker 2012). Once the components are defined and identified, analysts can explore the records of learning interactions to find interesting patterns and relationships.
Educational data mining includes both bottom-up techniques, in which analysts look for interesting patterns in the data and then try to interpret them, and top-down approaches, with data collection and analysis shaped by a driving question or hypothesis. Some practitioners advocate the former approach because of its ability to yield unexpected insights, but others stress the increased efficiency and interpretability of planned data collection and analyses. Most practitioners are coming to see the value of combining the two approaches.
Top-down approaches can be found in the work of both technology developers in industry and education researchers, but the two groups differ in that education researchers are more likely to be guided by concepts drawn from basic learning theory and research. In developing and studying learning technologies, education researchers often have the dual goals of creating an effective learning product and testing the applicability of a basic learning principle. Moreover, in the absence of existing empirical evidence about the effectiveness of different instructional design options, learning theory provides guidance that can increase the likelihood of making good design choices.
Learning theory is also important in the initial design of a learning technology. Without a basis in learning theory design principles, observing what students do as they move through an online curriculum is unlikely to reveal much about how to optimize learning for all students. The goal is not to find optimal pathways through bad content, but rather to design better content. The best way to achieve that initially is to draw on the extensive body of findings from learning science. Once content is improved, new technology-enabled data collection and analysis can be used both to improve the online curriculum and to test hypotheses about learning system design that extend existing research.
Uses of Evidence from Educational Data Mining
Educational data mining can address the question of how to refine a learning system or other type of learning resource and can provide the practitioner or researcher with information about learner behavior, achievement, and progression. It is less well suited to investigating the causal case for the effectiveness of a resource or intervention as a whole. However, even resources with causal evidence of effectiveness in particular settings often fail to have the same impact when applied elsewhere (Cronbach and Snow 1977). This is because education is a complex system, and any new intervention is likely to interact with different system components in a new setting in unforeseen and sometimes less effective ways. The ideal would be to have experimental tests of an intervention’s impact in all the settings where it would be expected to be used.
There are two possible responses to this challenge. One is to try to create an intervention that works everywhere because all possible constraints of setting have been foreseen and accommodated. The other is to expect that an intervention will be used in somewhat different ways in different settings, possibly with different outcomes.
Rapid Random-Assignment Experiments
Another advantage of digital learning systems is that they provide an opportunity to conduct controlled random-assignment experiments (Shadish and Cook 2009) much more rapidly than was previously possible.
The purpose of randomly assigning study participants is to create two or more equivalent groups whose results can be compared. In randomized controlled trials (RCTs) in education, learners are randomly assigned to very different treatments or to an experimental treatment and a business-as-usual condition. For example, an RCT might involve one group of students taking an online algebra course and another group of students receiving face-to-face algebra instruction at school.
In the software industry, a random-assignment experiment known as A/B testing is used to isolate variables by comparing two different versions of the same product or system (version A and version B) by randomly assigning users to one or the other version. One version of an online algebra course might have design feature A, for example, and the other would have design feature B, but the versions would otherwise be identical.
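The random assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product's method; the function name `assign_groups` and the `seed` parameter are assumptions introduced here.

```python
import random


def assign_groups(user_ids, seed=0):
    """Randomly split users into two groups of (nearly) equal size.

    Hypothetical helper for illustration. Shuffling before splitting means
    that group membership does not depend on any user characteristic.
    """
    rng = random.Random(seed)   # seeded so the split can be reproduced
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


# Example: 1,000 users, half of whom see design feature A and half feature B.
groups = assign_groups(range(1000), seed=42)
```

Because chance alone decides who sees which version, later differences in outcomes between the groups can be attributed to the design feature rather than to differences among users.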
Historically, A/B testing has been used for market research, such as for comparing the sales or click-through results of two user interface designs or two versions of an advertisement. But increasingly it is being applied to digital learning research and development. The emergence of online learning resources that attract many users is making possible rapid collection of input on a scale that produces statistically significant results and comparison of relative outcomes from multiple versions during a short period.
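One common way to judge whether a difference between two versions is statistically significant, when the outcome is a yes/no result per user (such as a click-through), is a two-proportion z-test. The sketch below assumes that kind of outcome; the function name and the counts in the example are made up for illustration and do not come from any study.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that the versions do not differ.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts only: 500 of 1,000 users succeed with version A,
# 560 of 1,000 succeed with version B.
z, p_value = two_proportion_z_test(500, 1000, 560, 1000)
```

A small p-value suggests that the observed difference is unlikely to be due to chance alone. Other outcome types, such as counts or times, would call for a different test.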
Sometimes A/B tests are conducted with a well-defined population of interest and sample participants who represent that population, as in the Geometry Cognitive Tutor example. For example, a study might assign all the eighth-grade algebra students in five school districts to take one of two forms of online eighth-grade algebra instruction with the goal of generalizing to eighth-graders in districts like the participating five.
In contrast, in an A/B test of two versions of a free online game for high-schoolers, researchers may make the game available to anyone who finds it online, with the result that they do not know anything about the characteristics of the players (their age, previous gaming experience, math concept knowledge, and so on). Developers of digital learning systems may not ask their users to provide any information about themselves because they do not want to discourage potential users with a sign-in process. In addition, they argue that the larger the pool of users, the less important specific users’ characteristics become. The Khan Academy, for example, reports that it attracts enough users to run an adequately powered A/B test in a matter of hours, and typically it does so without collecting user information.
A/B Testing and Rapid Improvement Cycles at the Khan Academy
The Khan Academy has grown from a collection of a few hundred YouTube videos on a range of math problem types created by Sal Khan himself to a digital learning system incorporating more than 3,000 videos and 300 problem sets geared to K–12 mathematics topics.
For each problem set, the Khan Academy system logs the number of attempts a user makes for each problem, the content of each answer, whether the answer was correct or not, and whether the system judged that the user had mastered the skill the problem set addressed. For each video, the system keeps track of the segment being used, the time when the user’s viewing started and ended, and any pauses or rewinding.
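The kinds of records described above might be represented roughly as follows. This is a sketch only: the class names, field names, and the `attempts_per_problem` helper are assumptions introduced for illustration and are not Khan Academy's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemAttempt:
    """One answer submitted to one problem (hypothetical record layout)."""
    user_id: str
    problem_id: str
    attempt_number: int      # 1 for the first try, 2 for the second, ...
    answer: str              # the content of the answer as entered
    correct: bool
    mastered: bool = False   # system's mastery judgment at this point


@dataclass
class VideoView:
    """One viewing of a video segment (hypothetical record layout)."""
    user_id: str
    video_id: str
    segment: str
    started_at: float        # seconds since the epoch
    ended_at: float
    pauses: List[float] = field(default_factory=list)   # video positions
    rewinds: List[float] = field(default_factory=list)  # video positions


def attempts_per_problem(attempts):
    """Count how many attempts each user made on each problem."""
    return dict(Counter((a.user_id, a.problem_id) for a in attempts))
```

Records of this shape are what analysts would later parse into components for data mining or compare across the groups in an A/B test.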
As an organization, Khan Academy combines technology research and development approaches with Wall Street-style financial analysis. Its Dean of Analytics, Jace Kohlmeier, was previously a trading systems developer at a hedge fund.
Khan Academy’s open-source A/B testing framework enables the organization to randomly assign users to one of two or more versions of the software with one line of code. Developers can determine what percentage of their users they want to receive the experimental version, and a dashboard charts user statistics from the two treatment groups in real time.
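One way such an assignment could work is to hash each user's identifier into a bucket and send a chosen percentage of buckets to the experimental version. The sketch below illustrates that general idea only; it is not Khan Academy's framework, and the function name `variant_for` and its parameters are hypothetical.

```python
import hashlib


def variant_for(user_id, experiment, percent_experimental=50):
    """Assign a user to 'experimental' or 'control' (hypothetical helper).

    Hashing the experiment name together with the user ID gives each user
    a stable bucket from 0 to 99, so the same user always sees the same
    version and different experiments split users independently.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "experimental" if bucket < percent_experimental else "control"


# With such a helper, the assignment at the point of use is a single line:
version = variant_for("user-123", "new-mastery-model", percent_experimental=10)
```

Setting `percent_experimental` to a small value exposes only a small share of users to an untested change, which limits the cost if the change turns out to be worse.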
Because Khan Academy has about 50,000 active exercise users doing several million problems each day, developers can accrue enough data to reach statistically significant results very quickly. For something with a large impact, Kohlmeier reported they can collect results in an hour (because large effects can be detected with small samples). But many of the Khan Academy’s experiments involve changes with smaller effects and hence take longer. In addition, the organization likes to run experiments for a week or so because of user flow cycles; more adult and self-driven learners use Khan Academy in evenings and on weekends.
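The relationship between effect size and sample size can be made concrete with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a yes/no outcome per user, a two-sided test at the 5 percent level, and 80 percent power; the function name and the example rates are illustrative assumptions, not figures from Khan Academy.

```python
import math
from statistics import NormalDist


def users_per_group(rate_a, rate_b, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect rate_a vs. rate_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (rate_a - rate_b) ** 2)


# A large effect (50% vs. 60%) needs far fewer users than a small one
# (50% vs. 51%).
large_effect = users_per_group(0.50, 0.60)
small_effect = users_per_group(0.50, 0.51)
```

Because the required sample grows with the inverse square of the difference, a change one-tenth the size needs roughly 100 times as many users, which is why small-effect experiments take longer to run.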
One of Kohlmeier’s first projects with Khan Academy was to look at how the system determined that a learner had reached proficiency on a problem set topic. The system was using a simple but arbitrary heuristic: If the user got 10 problems in a row correct, the system decided the user had mastered the topic. Kohlmeier examined the proficiency data and found that the pattern of correct/incorrect answers was important. Learners who got the first 10 problems in an exercise set correct subsequently performed differently from users who needed 30–40 problems to reach a streak of 10.
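The streak heuristic is simple enough to express directly. The sketch below is an illustration of the rule as described, with a hypothetical function name; it shows how two learners can both reach "mastery" under the rule while having very different answer histories.

```python
def problems_until_streak(results, streak_needed=10):
    """Return how many problems it took to reach the streak, or None.

    `results` is a sequence of booleans, one per problem, in the order
    the problems were answered (True means correct).
    """
    streak = 0
    for count, correct in enumerate(results, start=1):
        streak = streak + 1 if correct else 0   # any error resets the streak
        if streak >= streak_needed:
            return count
    return None


# Both learners satisfy the rule, but after very different histories.
fast_learner = problems_until_streak([True] * 10)
slow_learner = problems_until_streak([True] * 9 + [False] + [True] * 10)
```

The rule treats both learners as equally proficient once the streak is reached, which is the limitation the proficiency data exposed.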
Kohlmeier built a predictive model based on estimating the likelihood at any point during an exercise set that the next response would be correct. (Similar predictive models have been used in intelligent tutoring systems for some time.) The system was then changed to define mastery of a problem set as the point where a user has a 94 percent likelihood of getting the next problem correct.
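The details of Kohlmeier's model are not given here, so the sketch below substitutes a much simpler technique, an exponentially weighted running estimate of correctness, purely to illustrate the idea of declaring mastery once the estimated likelihood of a correct next response crosses 94 percent. The starting estimate of 0.5, the weight of 0.35, and the function names are arbitrary assumptions, not Khan Academy's values.

```python
def next_correct_estimates(results, prior=0.5, weight=0.35):
    """Yield a running estimate that the next response will be correct.

    Illustrative stand-in for a predictive model: each new answer pulls
    the estimate toward 1 (correct) or 0 (incorrect), so recent answers
    count more than older ones. `prior` and `weight` are assumed values.
    """
    estimate = prior
    for correct in results:
        target = 1.0 if correct else 0.0
        estimate += weight * (target - estimate)
        yield estimate


def mastery_point(results, threshold=0.94):
    """Return the number of problems done when mastery is reached, or None."""
    for count, estimate in enumerate(next_correct_estimates(results), start=1):
        if estimate >= threshold:
            return count
    return None
```

Under these assumed settings, a learner who answers every problem correctly crosses the threshold sooner than one who makes an early error, which mirrors the behavior of a likelihood-based criterion: the bar depends on the whole pattern of answers rather than on a fixed streak length.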
This change in the system set a higher bar for mastery and meant that some users had to spend more time on an exercise set. By monitoring user data after making the change, Khan Academy analysts were able to see that users were willing to devote the extra effort. At the same time, the new criterion allowed fast learners to gain credit for mastering material after doing as few as five problems, enabling them to cover more material in a given time. The Khan Academy team used A/B testing to compare the old and the new models for determining mastery. They found that the new mastery model was superior in terms of number of proficiencies earned per user, number of problems required to earn those proficiencies, and number of exercise sets attempted.
Although a great proponent of A/B testing and data mining, Kohlmeier is also aware of the limitations of those approaches. It is difficult to use A/B testing to guide big changes, such as a major user interface redesign; too many interdependent changes are involved to test each possible combination in a separate experiment. In addition, system data mining is extremely helpful in system improvement, but to make sure the system is really effective, analysts need an external measure of learning.
U.S. Department of Education, Office of Educational Technology, Expanding Evidence Approaches for Learning in a Digital World, Washington, D.C., 2013.