Empirical testing of the SM-2 algorithm's performance on scheduling overdue cards


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
df <- read.csv("overdues.csv")
df <- df[df$afterOverdueRating != 0,] # I'm not sure how these got in here, probably a bug
df$initialOverdueRatingNumeric <- df$initialOverdueRating
df$initialOverdueRating <- as.factor(df$initialOverdueRating)
df$afterOverdueRatingNumeric <- df$afterOverdueRating
df$afterOverdueRating <- as.factor(df$afterOverdueRating)
library(ggplot2)
library(cowplot)
library(dplyr)
library(tidyr)
```
## Background
Many users worry about the default behavior of the SM-2 algorithm when passing an overdue card, i.e., one that is being reviewed later than originally scheduled. The algorithm “inflates” next review times to take account of the fact that you appear to have remembered the card much better than it originally predicted: if it thought you were 90% likely to remember the card after an $X$-day delay, but you actually remember it after a $2X$-day delay, that suggests your memory is, with high probability, stronger than the current scheduling data accounts for. The inflation is large enough after long delays that many new users initially believe it's a bug.
To summarize the current behavior here mathematically, some of the difference (“delay”) between the scheduled interval (the number of days the algorithm originally said you should wait to review the card) and the actual interval (the number of days you actually waited, since you reviewed it late) is applied in the next interval calculation. If you select Easy, the full delay is added to the current scheduled interval prior to calculating the next interval. If you select Good, half the delay is added. If you select Hard, Anki currently does not add any bonus, while RemNote uses one-quarter of the delay (if I recall correctly, Anki formerly used this figure and that's where it comes from; this means that my review history has used a mix of these approaches). No delay adjustment is applied if you select Again/Forgot.
To make this concrete, suppose you have a card with an ease of 250% and an interval of 10 days, with SM-2 scheduling parameters at their defaults. If you review it on time and give it a pass rating, you'll get the following next intervals:
* **Hard**: $10~\text{days} \times 1.2 = 12~\text{days}$
* **Good**: $10~\text{days} \times 250\% = 25~\text{days}$
* **Easy**: $10~\text{days} \times 250\% \times 1.3 = 32.5 \approx 32~\text{days}$
Now what happens if you review it 10 days late?
* **Hard**:
    * In Anki: $(10~\text{days} + 0~\text{days}) \times 1.2 = 12~\text{days}$
    * In RemNote: $(10~\text{days} + \frac{10~\text{days}}{4}) \times 1.2 = 15~\text{days}$
* **Good**: $(10~\text{days} + \frac{10~\text{days}}{2}) \times 250\% = 37.5 \approx 37~\text{days}$
* **Easy**: $(10~\text{days} + 10~\text{days}) \times 250\% \times 1.3 = 65~\text{days}$
As you can see, the intervals get quite a bit longer.
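The arithmetic above can be condensed into a small helper. This is a minimal sketch, not Anki's or RemNote's actual scheduler: the function name `next_interval`, its parameter names, and the choice to round down are assumptions made for illustration, and real implementations also apply fuzz and minimum-interval rules that are ignored here.

```r
# Hypothetical helper: simplified SM-2 next-interval calculation with the overdue bonus.
# `interval` is the scheduled interval in days, `delay` the number of days the review was late.
# `hard_delay_share` is the fraction of the delay credited on Hard
# (0 for current Anki, 0.25 for RemNote, per the description above).
next_interval <- function(interval, delay = 0,
                          rating = c("hard", "good", "easy"),
                          ease = 2.5, hard_factor = 1.2, easy_bonus = 1.3,
                          hard_delay_share = 0) {
  rating <- match.arg(rating)
  days <- switch(rating,
    hard = (interval + delay * hard_delay_share) * hard_factor,
    good = (interval + delay / 2) * ease,
    easy = (interval + delay) * ease * easy_bonus
  )
  # Round down (an assumption); the small epsilon guards against floating-point error.
  floor(days + 1e-9)
}

next_interval(10, delay = 10, rating = "good")                           # (10 + 5) * 2.5, rounded down
next_interval(10, delay = 10, rating = "hard", hard_delay_share = 0.25)  # RemNote-style Hard
```

Calling it with `delay = 0` reproduces the on-time figures, so the bonus from lateness can be read off as the difference between the two calls.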
The argument for this approach is straightforward and theoretically sound and is presented above. However, there's a fairly strong argument against it as well. In reality, we don't get reviews *only within our flashcard apps* (indeed, if we did, flashcard apps would be pointless). If you're only a little bit late to review, chances are pretty good that you actually did remember it a touch better than the algorithm suggested. But if you remember something when it's, say, 400% overdue, the chance that this retention was actually caused by a lucky pattern of reviews *outside the app*, compared to the chance that you actually remembered something for 5 times longer than the algorithm predicted, rises dramatically. In this case, using the 10-day example above, perhaps we actually got equally spaced extra-app reviews every 5 days during the 20-day period. Then our next interval should arguably be calculated from a *5-day interval*, since we never demonstrated our ability to remember for longer than 5 days. This is likely overstating the case somewhat, because SM-2 has proven well-calibrated with some flashcards being “reviewed” outside the app periodically. But the case for using a full 20-day interval, rather than a 10-day interval or something in between, seems significantly weakened.
## Data
The data consists of 105,393 observations of “after-overdue” reviews conducted in Anki over 13 years by one person (me). I have used the SM-2 algorithm built into whatever Anki version was current at the time throughout, and have used the default scheduling settings for review cards throughout (I have used different settings for lapses and learning cards, but no sequence of reviews used in this dataset includes a lapse or learning-mode review). The cards were on a wide variety of topics, some of which were part of my everyday life outside the spaced-repetition tool, and some of which were not.
An *after-overdue review* is one performed immediately subsequent in a card's scheduling history to a successful review which was overdue. For an after-overdue review to be logged, there must be a sequence of three reviews meeting the following conditions:
1. Any review made in review mode. Reviews that started in new, learning, or review-early mode are not eligible. The card received a non-Forgot rating, and a next interval of $X$ days was calculated at this review.
2. A review where the true interval (time between review 1 and review 2) was $X + Y$, where $Y$ is some positive integer number of extra days (if $0 < Y < 1$, the conditions are also not met). The sequence is ineligible if the card receives a Forgot rating or this review was performed early. (I have reviewed cards early only a handful of times in my 13 years of using Anki.) A new interval of $Z$ days was calculated at this review.
3. An after-overdue review carried out at least $Z$ days after review 2. This is our primary metric of interest: how are we going to rate these cards, after the algorithm scheduled our overdue card?
These sequences of three reviews may overlap, and an after-overdue review may itself be overdue.
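The three conditions above can be sketched on a toy review log. This is an illustration only, not the extraction actually used (that is the SQL query in the appendix): the `revlog` data frame, its column names, and the values in it are made up, and the review-mode check from condition 1 is left out for brevity.

```r
library(dplyr)

# Hypothetical toy log: one row per review.
# `day` = day the review happened, `ease` = rating (1 = Forgot),
# `ivl` = next interval in days scheduled at that review.
revlog <- data.frame(
  cid  = c(1, 1, 1, 2, 2, 2),
  day  = c(0, 15, 60, 0, 20, 21),
  ease = c(3, 3, 3, 3, 1, 3),
  ivl  = c(10, 37, 90, 10, 1, 3)
)

flagged <- revlog %>%
  group_by(cid) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(
    actual_ivl = day - lag(day),          # true interval since the previous review
    overdue_by = actual_ivl - lag(ivl),   # Y: whole days beyond the scheduled interval X
    # Review 2: at least one day late, passed, and the review before it was also passed.
    overdue_pass = !is.na(overdue_by) & overdue_by >= 1 & ease != 1 & lag(ease) != 1,
    # Review 3: follows a passed overdue review and is not early.
    after_overdue = lag(overdue_pass, default = FALSE) &
      !is.na(actual_ivl) & actual_ivl >= lag(ivl)
  ) %>%
  ungroup()

flagged[, c("cid", "day", "overdue_by", "overdue_pass", "after_overdue")]
```

Card 1 was reviewed 5 days late and passed, so its third review counts as an after-overdue review; card 2 was 10 days late but forgotten, so its sequence is ineligible.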
See the appendix for a SQL query you can use to extract this data from your own collection.
## Results
(Results disclaimer: I haven't used `ggplot` in quite some time and relied quite heavily on ChatGPT to write my R code. The graphs all look right, but there's a higher-than-average chance I could have made some weird error.)
### Overall performance
```{r, include=FALSE}
forgotten <- paste(round(sum(df$afterOverdueRating == 1) * 100 /nrow(df), 1), "%", sep="")
```
`r forgotten` of the after-overdue reviews in the dataset received a Forgot rating. This is indeed worse than the standard forgetting-index target of ~10% of items forgotten at each review. However, note that my whole-collection history also misses the target by a fairly similar amount, with a total of 86.4% retention at review time (13.6% forgotten), a difference of approximately 1.2 percentage points. This is likely because I am not, over the long run, particularly good at reviewing daily, and perhaps also because, at the beginning of my Anki use, I wasn't particularly good at creating good cards and/or I used Anki more for memorizing intrinsically difficult items than I do today.
Since about a third of my reviews are after-overdue reviews, a naïve weighted-average calculation, plus the perhaps-unwarranted assumption that gaps in the review sequence don't cause any memory penalty on top of the interval change, would suggest that the overall retention penalty on after-overdue reviews, starting from my baseline non-overdue-review performance, is about 1.8 percentage points. (We could get a more accurate figure here by extracting all non-after-overdue reviews and calculating their retention percentage, but I've had more than enough SQL for today.)
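The weighted-average reasoning can be written out explicitly. This is a sketch under stated assumptions: the after-overdue forgetting rate of 14.8% is inferred from the "13.6% plus roughly 1.2 points" figures above rather than read from the data, and the one-third share is the rough estimate from the text. All variable names are illustrative.

```r
overall_forgotten       <- 13.6   # % forgotten across the whole collection (from the text)
after_overdue_forgotten <- 14.8   # % forgotten on after-overdue reviews (ASSUMED: 13.6 + ~1.2)
share_after_overdue     <- 1 / 3  # rough share of reviews that are after-overdue (from the text)

# overall = share * after_overdue + (1 - share) * other, solved for `other`:
other_forgotten <- (overall_forgotten - share_after_overdue * after_overdue_forgotten) /
  (1 - share_after_overdue)

penalty <- after_overdue_forgotten - other_forgotten
round(c(other_forgotten = other_forgotten, penalty = penalty), 1)
```

Under these inputs the implied forgetting rate on all other reviews is about 13%, which is where the penalty of about 1.8 percentage points comes from.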
The difference in retention on after-overdue and all reviews, taken as a whole-collection aggregate, cannot help but be statistically significant given the size of the dataset (though I didn't import all reviews, so cannot get the exact numbers here), but the effect size seems fairly unconcerning (indeed, an 85% retention target is generally understood to be more efficient than one of 90%).
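To illustrate why a gap of this size is bound to be statistically significant at this sample size, here is a two-proportion test on hypothetical counts. Only the 105,393 after-overdue reviews come from the dataset; the other counts are assumptions reconstructed from the rounded percentages above (about 14.8% forgotten on after-overdue reviews, about 13% on roughly twice as many other reviews), so the output is illustrative rather than a result.

```r
n_after <- 105393                        # after-overdue reviews (from the dataset)
n_other <- 2 * n_after                   # ASSUMED: after-overdue reviews are about a third of all reviews
forgot_after <- round(0.148 * n_after)   # ASSUMED forgetting rate of ~14.8%
forgot_other <- round(0.130 * n_other)   # ASSUMED forgetting rate of ~13.0%

# Two-sample test for equality of proportions (base R, stats package).
test <- prop.test(x = c(forgot_after, forgot_other), n = c(n_after, n_other))
test$estimate   # the two sample proportions
test$p.value    # far below any conventional threshold for these counts
```

With samples this large, even a difference of under two percentage points is many standard errors wide, which is why the effect size is the more informative number here.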
Here's a plot of all ratings. Interestingly, they appear to show Hard and Easy used somewhat more than I typically would; it may be that the variance is higher on overdue reviews. I haven't tried to validate this by looking at all reviews though, so maybe these are just typical figures (I have been using Anki for 13 years, so maybe I used different rating methodology in 2010 and have forgotten).
```{r}
ggplot(df, aes(x = afterOverdueRating)) +
  geom_bar() +
  labs(x = "Rating", y = "Uses") +
  scale_x_discrete(labels = c("1", "2", "3", "4", "5")) +
  geom_text(stat = "count",
            aes(label = paste0(round(after_stat(count)/sum(after_stat(count))*100,2), "%"),
                y = after_stat(count)/sum(after_stat(count)), group = afterOverdueRating),
            color="white",
            vjust = -1,
            size = 3.5)
```
It might also be interesting to look at how the `initialOverdueRating` (rows; the rating we gave the card when it was presented to us late) and the `afterOverdueRating` (columns; the following time we saw it) are related. Note that the `initialOverdueRating` cannot be 1 because we excluded cards that were forgotten at overdue time from the analysis.
```{r}
table(df$initialOverdueRating, df$afterOverdueRating)
```
Relative frequency version (these are percentages):
```{r}
round(prop.table(table(df$initialOverdueRating, df$afterOverdueRating)), 4) * 100
```
### Over time
Both the way Anki handles this and my scheduling settings have changed a little bit over time (in particular, the share of the delay credited when pressing Hard on an overdue review), so I wanted to make sure that didn't have a large effect here. I binned these by month, and I've included a graph of the monthly average number of days overdue cards were, so we can see where changes in average after-overdue rating might be caused by a simultaneous increase in overdueness.
```{r}
df$reviewDate <- as.Date(as.POSIXct(df$afterOverdueId/1000, origin = "1970-01-01"))
df$reviewYear <- format(df$reviewDate, "%Y")
aor <- ggplot(df, aes(x = reviewDate, y = afterOverdueRatingNumeric)) +
  geom_line(aes(group=1), stat = "summary", fun = "mean") +
  labs(x = "Year", y = "Average afterOverdueRating") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  stat_smooth(formula=y~x, method = "lm", se = FALSE, color = "red")
apo <- ggplot(df, aes(x = reviewDate, y = overdueByDays)) +
  geom_line(aes(group=1), stat = "summary", fun = "mean") +
  labs(x = "Year", y = "Average days overdue") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
plot_grid(aor, apo, align = "v", axis = "tb", ncol=1)
```
(The lighter area between 2017 and 2020 corresponds to a period after I graduated from college where I used spaced repetition much less actively. Overdueness remains high afterwards because most of my new content now goes into RemNote rather than Anki; I'm thus much more likely to temporarily set content aside in Anki and then want to relearn it, resulting in a bunch of overdue reviews.)
As you can see, there is an almost imperceptible trend downwards; the correlation turns out to be (drumroll)...
```{r}
cor(df$afterOverdueId, df$afterOverdueRatingNumeric)
```
I am not concerned about this.
### By days overdue
Considering the ratings of all cards together obscures a very real dropoff in effectiveness of the current algorithm when cards become substantially more overdue.
The following graph shows the fraction of rating buttons selected for after-overdue reviews on cards that were some number of days overdue, binned into 30-day chunks. All items that were 720 days (about 2 years) or more overdue at review time are grouped together into the last bar, and the data gets a little noisy beyond a year or so, as there aren't a lot of reviews in each bucket at that point. The number on each bar is the number of observations in that group.
```{r}
df$overdue_bucket <- cut(df$overdueByDays, breaks = c(seq(0, 24*30*1, by = 30), Inf), include.lowest = TRUE, right = FALSE)

# Group the data by overdue_bucket and afterOverdueRating, then count the number of observations in each group
counts <- df %>%
  group_by(overdue_bucket, afterOverdueRating) %>%
  summarise(n = n()) %>%
  ungroup()

# Calculate the total number of observations in each bucket
total_counts <- counts %>%
  group_by(overdue_bucket) %>%
  summarise(total_n = sum(n))

# Calculate the proportion of each afterOverdueRating within each bucket
proportions <- counts %>%
  left_join(total_counts, by = "overdue_bucket") %>%
  mutate(prop = n / total_n)

# Create bar plot with text labels for counts
ggplot(proportions, aes(x = overdue_bucket, y = prop, fill = afterOverdueRating)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5), size=2) +
  labs(x = "Days overdue", y = "Frequency") +
  scale_fill_discrete(name = "After-overdue rating") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  ggtitle("Answer buttons selected on after-overdue reviews, by days overdue")
```
I'm not sure exactly why this effect exists. Perhaps cards that are more overdue are more likely to have been reinforced or relearned due to other, similar cards, or in real life, recently? Nevertheless, the results do suggest a meaningful opportunity for improvement by scaling the interval bonus on overdue reviews based on how overdue the card is (more overdue cards have less of the delay accounted for in scheduling their next interval).
The amount of miscalibration here should not be overstated. Mildly overdue cards are more common in most use cases than massively overdue cards; you can see the number of observations dropping off quickly in my dataset as you move to more overdue cards, and I have spent plenty of time not reviewing things for extended periods of time – although we should note that as far as very overdue and very high-interval cards go, I simply may not have ever seen them again, which means they wouldn't be included in this dataset. (It's hard to draw any conclusions about the results beyond a year or so of overdueness, as there simply isn't much data to work with yet.) And many users believe the new intervals are usually wildly high and represent intervals they could never expect to retain the card for, while the actual impact observed here is a dropoff from about 85% to about 75% retention at the after-overdue review over the course of a year of overdueness. (I presume the sudden increase in success between 300 and 390 days is an artifact; there are only about 100 reviews in each of those bins.)
However, it's worth noting that a significant drop below one's desired retention target here is more problematic for one's learning goals with long-interval cards whose next interval was further inflated than with typical cards. When a card's interval is, say, 10 years, you don't have a chance to give the algorithm feedback on the fact that you don't actually know it for a very long time, so you will simply not know the information for an extended period of time, and such cards will be so spread out that, unless you actually analyze the data like I've done here, you likely will never realize that you knew them less well than you thought.
*Side note:* I wasn't sure whether percentage overdue or number of days overdue would be more meaningful for this analysis, so I tried both. Percentage overdue demonstrated a much less clear trend, so I have assumed that the measure showing a larger effect is more relevant here and omitted the percentage-overdue experiments from the report.
## Impacts of ease factor and previous rating
An interesting side question is how much impact the ease factor of the card and our previous rating have on our rating at after-overdue review time. We could consider whatever variation is explained by these factors imperfect behavior of the algorithm, as it would ideally adjust so that the probability of forgetting is similar and the rating buttons all make sense at each review, regardless of the card's absolute difficulty. If the algorithm were able to perfectly calibrate its ease calculations after a single review, we would find this model would have no explanatory power.
```{r}
mod <- lm(afterOverdueRatingNumeric ~ initialEaseFactor + initialOverdueRatingNumeric, data=df)
summary(mod)
```
The $R^2$ value here is about 0.11, which, *a priori*, seems quite good: only 11% of the variation in rating was explainable by the card's ease factor and the previous rating. (Note that we're treating ratings as continuous and linear here, which seems defensible but may not be ideal.) It would be instructive to compare with the statistics for all reviews to see if this is any worse, and it seems plausible that the optimal amount of overdue delay could be affected by these figures as well.
## Conclusions and next steps
Based on these results, users are correct in believing that the SM-2 algorithm undershoots its retention target on significantly overdue cards, but they overestimate the amount by which it does so; the spaced-repetition algorithm is still functioning, and it nowhere makes wildly absurd predictions. Many users would prefer to achieve 85% or 90% retention at review time, rather than moderately lower figures, at the cost of somewhat more reviews after catching up. But the difference is small enough that it would not be unreasonable to claim that the current algorithm is “good enough” for most use cases, as SM-2 is in general.
The results suggest that any adjustment to the algorithm's method of calculating extra delay should be applied on a curve that takes the amount of overdueness into account, rather than using a fixed scaling factor as it currently does. The parameters here would have to be determined experimentally; RemNote could achieve this by creating a platform that pushes parameters that vary on a continuous, randomized scale to different users and collects retention statistics.
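As one illustration of what "applied on a curve" could look like, the sketch below credits a shrinking share of the delay as overdueness grows. The logarithmic form, the function name `credited_delay`, and the `scale_days = 30` parameter are all hypothetical choices made for illustration; nothing in the data above selects this particular curve or these values.

```r
# Hypothetical curve: for small delays it credits roughly `base_share` of the delay
# (like the fixed factor for Good), but the credited share falls as the delay grows.
credited_delay <- function(delay, base_share = 0.5, scale_days = 30) {
  base_share * scale_days * log1p(delay / scale_days)
}

delays <- c(1, 10, 100, 400)
data.frame(
  delay           = delays,
  fixed_credit    = 0.5 * delays,                    # fixed scaling, as described for Good
  curved_credit   = credited_delay(delays),          # days of delay credited under the curve
  effective_share = credited_delay(delays) / delays  # shrinks as the card gets more overdue
)
```

The credited delay still grows with overdueness, so a later successful review is never scheduled sooner than an earlier one would be, but it grows ever more slowly. The credited value would take the place of the fixed fraction of the delay in the next-interval calculation.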
It should be noted that this study involved a large amount of review data, but only a single user, so before making any changes to the algorithm, similar data should be collected from more users to confirm that the way I use spaced repetition is not wildly different from the average user. I intuitively doubt the results will be substantially different, but this shouldn't be difficult to verify, and it would be rather embarrassing to make algorithm changes that later proved to be ill-advised based on an $n=1$ study.
Another possible concern is that the data does not differentiate between after-overdue reviews that were conducted on time and ones that were themselves overdue. It's possible that this created systematic bias, for instance because cards that I review less often overall, and thus were more likely to become significantly overdue in the first place, are also more likely to be reviewed overdue subsequently, and obviously when a review is conducted late, the card is more likely to have been forgotten. A continued investigation of the topic should double-check this possibility.
It would also be quite helpful to directly compare reviews which are *not* after-overdue to the ones analyzed in this study. My comparison with whole-collection summary statistics shows the performance on after-overdue reviews is definitely worse than average, but a more precise understanding of the difference would be helpful.
## Appendix: data collection
A CSV file was generated by executing the following annotated SQL query against my `collection.anki2` file:
```sql
-- First, we create an in-memory table of times when we carried out a review which was overdue.
WITH overdueReviews (id, cid, ease, ivl, factor, type, lastIvl, lastId, actualIvl, currentReviewType, prevReviewType) AS (
    SELECT
        currentReview.id,      -- The unique identifier of this review, and the millisecond timestamp at which it was carried out.
        currentReview.cid,     -- The unique identifier of the card that was being reviewed.
        currentReview.ease,    -- The ease button (rating) that was selected, 1 = Again, 4 = Easy.
        currentReview.ivl,     -- New interval in days, i.e., the number of days from the current moment that the card should next be reviewed on.
        currentReview.factor,  -- The new ease factor of the card after this review, in permille.
        currentReview.type,    -- 0 = Learn, 1 = Review, 2 = Relearn, 3 = Early Review
        prevReview.ivl,        -- Interval from the *previous* review of the card. Ideally today would equal the time of prevReview.id + prevReview.ivl; the query is finding cases where it's greater.
        prevReview.id,         -- Millisecond timestamp of the previous review.
        (currentReview.id - prevReview.id) / 86400000,  -- Actual interval between the previous and current reviews of the card, in days (truncated).
        currentReview.type,    -- DEBUG
        prevReview.type        -- DEBUG
    FROM revlog currentReview
    INNER JOIN revlog prevReview                 -- Self-join the revlog table to find the review immediately prior to this one...
        ON prevReview.cid = currentReview.cid    -- ...for the same card.
        AND prevReview.id = (
            SELECT MAX(id)                       -- The previous review is the one with the highest ID...
            FROM revlog
            WHERE cid = currentReview.cid
              AND id < currentReview.id          -- ...that's less than the current ID.
        )
    WHERE currentReview.id - prevReview.id > (prevReview.ivl * 86400000)
      AND ((currentReview.id - prevReview.id) / 86400000.0) - prevReview.ivl >= 1
      AND currentReview.ease != 1  -- Don't include cases where the card was overdue and we failed it; the question is what happens when we pass it.
      AND prevReview.type = 1      -- Only consider cards that were in review mode (not learning mode) at all stages.
      AND currentReview.type = 1   -- ...ditto.
      AND prevReview.ivl > 0       -- Negative values refer to learning-mode intervals; not entirely sure why the type isn't set to 0 or 2 in these cases.
                                   -- Anki 1 appears to have written 0 here for relearn cards (this was before the introduction of learning mode).
)
-- Now using that table, we look for each review that we carried out immediately after an overdue review.
SELECT
    afterOverdueReview.id AS afterOverdueId,                 -- ID/timestamp of this "after-overdue" review.
    afterOverdueReview.cid AS cid,                           -- Card that's being reviewed.
    overdueReviews.ease AS initialOverdueRating,             -- The rating we gave the card on the review that was overdue.
    afterOverdueReview.ease AS afterOverdueRating,           -- The rating we gave the card on the subsequent after-overdue review (this is our outcome variable).
    overdueReviews.factor AS initialEaseFactor,              -- Ease factor the card had after the first overdue review.
    overdueReviews.lastIvl AS initialOverdueScheduledIvl,    -- Interval the overdue review was supposed to have been carried out at.
    overdueReviews.actualIvl AS initialOverdueActualIvl,     -- Interval it was actually carried out at.
    (overdueReviews.actualIvl - overdueReviews.lastIvl) AS overdueByDays,  -- Number of days delay between the optimal and actual review times.
    ROUND((overdueReviews.actualIvl * 100.0 / overdueReviews.lastIvl) - 100.0, 1) AS percentOverdue,  -- Percentage excess that the card was overdue (e.g., if reviewing after 15 days when 10 days was optimal, 50%).
    currentReviewType,  -- DEBUG
    prevReviewType      -- DEBUG
FROM revlog afterOverdueReview
-- We carry out the same self-join here, but now against the in-memory table we defined earlier, so that we now are considering three reviews in sequence:
-- (1) An initial review
-- (2) A second review which was overdue with respect to the interval coming out of review (1)
-- (3) A "follow-up" review which occurred after the overdue review (2).
INNER JOIN overdueReviews
    ON overdueReviews.cid = afterOverdueReview.cid
    AND overdueReviews.id = (
        SELECT MAX(id)
        FROM revlog
        WHERE cid = afterOverdueReview.cid
          AND id < afterOverdueReview.id
    );
Save these results to `overdues.csv`, then load it into R:
```{r ref.label=c('setup')}
```