Monte Carlo Simulation: Total number of reviews to burn everything


I was curious as to how many WK reviews I should expect to do in total to burn every single item. I know that I could probably try to look it up somewhere, and I could definitely apply some of what I learned in my statistics class in uni to figure this out mathematically, but instead I decided to do a Monte Carlo simulation.

How many reviews you need of course depends on your accuracy, so what I did was step through a range of accuracy levels and run simulations at each one to see how many reviews you would need, on average, before every item reaches the last SRS level. I start the simulations at 61% because they get slow below that (asymptotically so as you approach 50%), and I doubt many people have accuracies that low.

The reason I do this is that I want to find out how far I have come, in terms of the percentage of the reviews I will need to do to burn everything.


The interval 61%-100% is evenly sampled at 40 points. For each of these 40 accuracies, 100 simulations are run in which all items are reviewed, with a probability of success equal to that accuracy, until they are burned. In each simulation the total number of reviews needed to burn all items is tracked, and the average over the 100 simulations is recorded.
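Since the step used in the code below is one percentage point, those 40 sample points are just the whole percentages from 100 down to 61; a quick check:

```python
# The 40 accuracy levels sampled by the simulation, from 100% down to 61%
accuracies = list(range(100, 60, -1))
print(len(accuracies), accuracies[0], accuracies[-1])  # 40 100 61
```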

Source Code
import random
import time

class Item:
    """Class for items to be reviewed. Only attributes are SRS level and how many cards the item has."""
    def __init__(self, multiplicity):
        """Create a new item."""
        self._SRS_level = 1                     # Don't count lessons
        self._multiplicity = multiplicity       # Indicates how many "cards" you have per "note", to use Anki terminology

    def review_item(self, p):
        """Evaluates based on probability whether the item passes or fails a review."""
        # p**self._multiplicity is used here because a user has to pass both the meaning and reading review when multiplicity is 2.
        p_observed = random.random()
        review_count = self._multiplicity       # If it's a radical we did one review, else two
        if p_observed < p**self._multiplicity:
            self._SRS_level += 1                # If the review is successful the item goes up one SRS level
        else:
            if p_observed < p:                  # Failed one card but passed the other: two reviews plus one retry
                review_count = 3
            else:                               # Failed both cards: two reviews plus two retries
                review_count = 2*self._multiplicity
            if self._SRS_level != 1:            # Don't drop below the lowest SRS level
                if self._SRS_level <= 4:        # Apprentice items go down one SRS level
                    self._SRS_level -= 1
                else:                           # All other items go down two
                    self._SRS_level -= 2
        return self._SRS_level, review_count

def create_items(count, double):
    """Creates a hash of items to review."""
    items = {}
    for i in range(1, double + 1):
        items[i] = Item(2)
    for i in range(double + 1, count + 1):
        items[i] = Item(1)
    return items

def review_items(count, items, max_srs, p):
    """Reviews all items until they reach the final SRS level."""
    reviews = 0
    while count > 0:
        keys = list(items)                      # Snapshot the keys so items can be deleted while iterating
        # Review all items once
        for i in keys:
            srs_level, review = items[i].review_item(p)
            # If item reaches last SRS level remove it from the queue
            if srs_level == max_srs:
                del items[i]
                count -= 1
            reviews += review
    return reviews

def repeat_run(runs, single, double, max_srs, p):
    """Repeats the same simulation a number of times and returns the average."""
    total_reviews = 0
    for i in range(1, runs+1):
        count = single + double
        # Create new items
        items = create_items(count, double)
        # Review items until all reach the last SRS level
        reviews = review_items(count, items, max_srs, p)
        total_reviews += reviews
    return total_reviews

def add_to_table(accuracy, total_reviews, runs):
    """Adds a cell to the discourse table we're making."""
    data = "| " + str(accuracy) + " | " + str(round(total_reviews / runs)) + " "
    if (accuracy - 1) % 5 == 0:                 # Every fifth accuracy ends a table row
        data += "|\n"
    else:
        data += "| \| "                         # Separator column between groups
    return data

def parse_time(seconds):
    """Converts a number of seconds into hours, minutes and seconds."""
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds

def simulate(highest_accuracy, interval_length, lowest_accuracy, runs, number_of_single_items, number_of_double_items, total_estimate, max_srs, estimate=False):
    """Starts the whole simulation."""
    table_data = "| %    | Reviews | \| | %    | Reviews | \| | %    | Reviews | \| | %    | Reviews | \| | %    | Reviews |\n|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-\n"
    reviews_done = 0                            # Will contain the number of reviews done
    accuracy = highest_accuracy                 # Current accuracy level
    t0 = time.time()                            # Start time
    while accuracy >= lowest_accuracy:
        p = accuracy / 100                          # Accuracy as a probability
        # Redo the simulation a number of times for the current accuracy level
        total_reviews = repeat_run(runs, number_of_single_items, number_of_double_items, max_srs, p)
        reviews_done += total_reviews
        # Add to the table
        table_data += add_to_table(accuracy, total_reviews, runs)
        # Print stuff
        if not estimate:
            progress = ""
            time_left = ""
            if total_estimate:                  # Only show progress when an estimate exists
                time_elapsed = time.time() - t0
                seconds = round(total_estimate / reviews_done * time_elapsed - time_elapsed)
                hours, minutes, seconds = parse_time(seconds)
                progress = str(round(reviews_done / total_estimate * 100)) + '%'
                time_left = str(hours) + "h " + str(minutes) + "m and " + str(seconds) + "s remaining."
            text = '- ' + progress + ' - ' + time_left
            print(accuracy, '-', 'Average Reviews:', total_reviews//runs, text)
        # Go to new accuracy level
        accuracy -= interval_length
    return table_data, reviews_done

def main():
    # Settings
    highest_accuracy = 100                          # Percent
    interval_length = 1                             # Percent
    lowest_accuracy = 61                            # Percent
    runs = 100                                      # Per level of accuracy
    max_srs = 9                                     # Number of SRS levels
    number_of_double_items = 2027 + 6300            # Number of items with two cards
    number_of_single_items = 477                    # number of items with one card

    # Estimate how many total reviews we can expect by doing one run
    # The estimate is used to calculate time left
    if runs >= 10:
        table_data, reviews_done = simulate(highest_accuracy, interval_length, lowest_accuracy, 1, number_of_single_items, number_of_double_items, 0, max_srs, estimate=True)
        total_estimate = reviews_done*runs
    else:
        total_estimate = False

    # Simulate
    table_data, reviews_done = simulate(highest_accuracy, interval_length, lowest_accuracy, runs, number_of_single_items, number_of_double_items, total_estimate, max_srs)
    print('Total reviews:', reviews_done)

if __name__ == "__main__":
    main()



2027 kanji,
6300 vocab words,
477 radicals.
Kanji and vocabulary reviews each have a meaning and a reading, and count as two.
There are 17131 “items” to review in total.
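Those counts can be sanity-checked in a couple of lines, using the item numbers above:

```python
# Kanji and vocab items have two cards each (meaning + reading); radicals have one
kanji, vocab, radicals = 2027, 6300, 477
total_cards = (kanji + vocab) * 2 + radicals
print(total_cards)  # 17131
```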

4 apprentice levels,
2 guru levels,
1 master level,
1 enlighten level,
1 burn level.
There are 9 SRS levels in total.

A user does not fail a review item more than once per session.
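Given that assumption, a two-card item costs 2 reviews in a session when both cards pass, 3 when one card fails (one retry), and 4 when both fail. Under the single-uniform-draw model used in `review_item` (a simplification; drawing the two cards independently would give slightly different probabilities), the expected cost works out to 4 - p - p², which a quick Monte Carlo check agrees with:

```python
import random

def session_reviews(p, draw):
    """Reviews spent on one two-card item in a session, single-draw model."""
    if draw < p * p:    # Passed both cards
        return 2
    if draw < p:        # Failed one card, passed the other: one retry
        return 3
    return 4            # Failed both cards: two retries

# Expected value: 2*p**2 + 3*(p - p**2) + 4*(1 - p) = 4 - p - p**2
p = 0.9
random.seed(0)
n = 200_000
avg = sum(session_reviews(p, random.random()) for _ in range(n)) / n
print(round(avg, 2), round(4 - p - p * p, 2))
```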


These are the results after a total of 20,407,018,564 simulated reviews.

The percentage columns indicate the accuracy level. The review columns indicate the average number of reviews needed to burn everything, given the adjacent average accuracy.

| % | Reviews | % | Reviews | % | Reviews | % | Reviews | % | Reviews |
|-|-|-|-|-|-|-|-|-|-|
| 100 | 137,048 | 99 | 145,906 | 98 | 155,534 | 97 | 166,106 | 96 | 177,775 |
| 95 | 190,611 | 94 | 204,773 | 93 | 220,687 | 92 | 238,860 | 91 | 258,023 |
| 90 | 281,032 | 89 | 306,655 | 88 | 336,041 | 87 | 369,310 | 86 | 408,331 |
| 85 | 453,812 | 84 | 505,797 | 83 | 566,916 | 82 | 641,099 | 81 | 729,555 |
| 80 | 834,471 | 79 | 959,757 | 78 | 1,114,089 | 77 | 1,302,871 | 76 | 1,533,677 |
| 75 | 1,822,811 | 74 | 2,182,566 | 73 | 2,626,532 | 72 | 3,193,191 | 71 | 3,916,460 |
| 70 | 4,828,658 | 69 | 5,993,856 | 68 | 7,517,466 | 67 | 9,468,785 | 66 | 12,034,521 |
| 65 | 15,398,389 | 64 | 19,811,452 | 63 | 25,668,246 | 62 | 33,436,116 | 61 | 43,932,400 |


My average accuracy is 95.61%, and I have done 112,663 reviews so far. Looking at the 95% cell I see that I will need to do 190,611 reviews to burn everything. As such I can calculate that I am 59.1% (112,663/190,611) of the way to burning everything.
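In other words (using my numbers and the 95% table cell):

```python
reviews_done = 112_663
reviews_needed = 190_611   # Table value for 95% accuracy
progress = reviews_done / reviews_needed * 100
print(round(progress, 1))  # 59.1
```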

You can find your own (current) total number of reviews, and your total accuracy, on the stats site (see highlighted parts of the screenshot), and calculate how far you’ve come in your journey to burn everything.



Two questions:

  1. Did you look at meaning and reading together, so if you get one wrong you still have to review both the next time?
  2. Wouldn’t it be better to look at your stats for an item being marked right or wrong during a review, rather than the number of answers that are right or wrong (as the stats site shows)? That statistic would more accurately reflect how many reviews you have to do.

Those are good points. The stats site calculates accuracy differently than WK. For kanji/vocab, if you enter the right reading but wrong meaning, WK counts that as 0% but stats site says it’s 50%.

It’s not so much that the stats site does it differently. The stats site shows the percentage that you see during a review session, which unfortunately was all you could get from API v1. API v2 will allow you to get the more useful percentage from the review summary page as well, and hopefully the new version of the stats site will add that information.

@Kumirei By the way, what program did you use to run the Monte Carlo simulation?


FYI, since items start out at Apprentice 1, it’s actually only 8 successful reviews to Burn, not 9. Unless you’re including the lesson quiz, but in that case, keep in mind that the stats site doesn’t include lesson quizzes in your accuracy.

So how can it be that I’ve done 183,871 already but my 94%+ accuracy tells me that I need 175,227 reviews? :thinking:


If I’m correct that Kumirei is using the wrong percentages, that would be the most likely reason. But regardless, keep in mind that this is a statistical analysis. It will never be completely accurate.


Sure, but I still have 5500+ items to burn. If I was to get everything 100% correct from now on, I would still easily need to do around 20,000 reviews. That’s a 10%+ difference :slight_smile:

I know, that’s why I said it was likely that there was something wrong with the calculation. :slight_smile:


A perfect track record on Wanikani would be 137048 reviews:

[(6300 vocab + 2027 kanji) * (1 reading + 1 meaning) * (8 srs level-ups)] + (477 radicals * 1 meaning * 8 srs level-ups)
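That arithmetic checks out:

```python
# 8 successful reviews take an item from Apprentice 1 to Burned
perfect = (6300 + 2027) * (1 + 1) * 8 + 477 * 1 * 8
print(perfect)  # 137048
```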


And how many reviews did you do again? :smile:


Just a small thing, but did you consider that getting a wrong answer at SRS level 1 and 2 will decrease the SRS level by 0 and 1 respectively instead of 2?

I’d say the calculations themselves are basically correct. It’s just the accuracy you see on the statistics page is wrong. Or rather it’s based on a different set of information.
The accuracy as shown on wkstats is based on how many meanings OR readings you got right.
The simulation is based on how many items (meanings AND readings at the same time) you get right.
So basically your actual accuracy is a tad below what is shown on wkstats. Depending on how often you get just one of either meaning or reading wrong it can be considerably lower than what is shown.

I.e. if your accuracy is 95% just subtract another 4-5% and look it up in the chart.
Assuming you did 100,000 reviews and got 95,000 correct and 5,000 wrong.
Then your accuracy would be shown as 95,000 / 100,000 = 95%
In most cases you just get either meaning or reading wrong and not both (if you’re above 90% accuracy). If you get either wrong you might as well get the other wrong as well since the item itself will be wrong anyways. In that case you can just double the amount of wrong answers you gave.
100,000 total - 95,000 correct - 5,000 wrong -> 95,000/100,000 = 95% accuracy
105,000 total - 95,000 correct - 10,000 wrong -> 95,000 / 105,000 = ~90.5% accuracy
It’s based on the assumption that you only get one of them wrong. The more often you get both meaning and reading wrong at the same time the closer the wkstats accuracy is to your actual accuracy.
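The worked example above can be written out directly (a sketch of the described approximation, assuming each wrong answer implies the whole item was wrong):

```python
# Card-level accuracy, as shown on the stats site
correct, wrong = 95_000, 5_000
card_accuracy = correct / (correct + wrong) * 100       # 95.0
# Approximate item-level accuracy: count each wrong card's partner as wrong too
item_accuracy = correct / (correct + 2 * wrong) * 100   # ~90.5
print(round(card_accuracy, 1), round(item_accuracy, 1))
```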

TL;DR: Just deduct your error rate (whatever percentage you’re missing from 100%) from your accuracy.


That’s why it’s a simulation and not a calculation. The simulation will ideally randomise which items get answered wrong and when.

I’m not really sure what you’re responding to. If the simulation doesn’t account for when to drop items by different SRS amounts it’s simply wrong. It would be a simulation of a version of WaniKani that doesn’t exist.

@shiza: Just to note, all Apprentice items drop by one SRS level instead of two (except Apprentice 1 which as you noted doesn’t drop at all).
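The drop rule described here can be sketched as a small helper (levels numbered 1-9 as in the simulation, with 1 = Apprentice 1):

```python
def srs_after_failure(level):
    """New SRS level after failing an item, per the rule above."""
    if level == 1:       # Apprentice 1 cannot drop further
        return 1
    if level <= 4:       # Apprentice 2-4 drop one level
        return level - 1
    return level - 2     # Guru and above drop two levels

print([srs_after_failure(l) for l in range(1, 9)])  # [1, 1, 2, 3, 3, 4, 5, 6]
```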


That’s why I said the simulation will ideally randomise which items get answered wrong and when.

Okay, I didn’t know this. I thought it’s always -2 when you get something wrong.

I always forget to check but does getting the same item wrong multiple times during the same review session decrease the % shown during reviews?

From WaniKani FAQ:

The FAQ is inaccurate then. I don’t think it has been updated in years.

I’m sure it does randomize the when, but I’m not talking about that. I’m talking about making sure that if the simulation randomly marks an Apprentice 4 item wrong it drops by one level to Apprentice 3, but if it marks a Master item wrong it drops by two levels to Guru 1.


That’s what I meant by “when”.

Apparently we have different definitions of the word “when” then…
