[iOS] Tsurukame - native app with offline lessons and reviews

Continuing the discussion from Level 60 :confetti_ball: and my study habits:

Sorry for the long-winded post, but I’m curious why Tsurukame might affect accuracy stats on wkstats. This post is mostly to write my understanding down somewhere.

I suspect @wiersm might be right that Tsurukame somehow causes artificially high “accuracy” reporting from wkstats.

(Hopefully I’m posting this in the Tsurukame thread as intended; I’m unfamiliar with Tsurukame and wondering whether this is a known issue.)

Here are my stats for comparison:

@wiersm: I’ve never used Tsurukame. Since you said you used it often, would you mind posting your wkstats accuracy for comparison?

@rfindley: please confirm my assumption that wkstats calculates this table solely from the review_statistics object for each subject.

Gory background re: question vs. review accuracy

Note that “Total Reviews” in the wkstats output is a slight misnomer.

Technically, wkstats appears to sum Correct and Incorrect answers and report it as “Total Reviews”.

To be pedantic, I think this would be better labeled “Total Answers” (or Total Quizzes/Questions). It’s a subtle distinction, but you might answer incorrectly more than once for a given subject within a single review session.

To my mind, at least, only items/subjects are reviewed. Meanings and readings are individually quizzed and answered for each subject, but, ignoring radicals, you must provide two correct answers to complete a review for a given subject.

WKstats appears to report what I call “question accuracy”:

total_correct = reading_correct + meaning_correct;
total_incorrect = reading_incorrect + meaning_incorrect;
total_answers = total_correct + total_incorrect;

question_accuracy = (total_correct / total_answers) * 100;

The API tracks correct/incorrect counts for an individual subject in the Review Statistics data structure. I believe this structure is updated whenever an individual Review record is created for a subject.
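
For reference, here’s a minimal sketch (in TypeScript) of the fields this post cares about, plus the question-accuracy sum above expressed against it. The field names are my reading of the v2 API docs, so treat them as an assumption rather than gospel:

interface ReviewStatistic {
  subject_id: number;
  subject_type: 'radical' | 'kanji' | 'vocabulary';
  meaning_correct: number;
  meaning_incorrect: number;
  reading_correct: number;   // presumably always 0 for radicals
  reading_incorrect: number; // presumably always 0 for radicals
}

// wkstats-style "question accuracy" across a user's review statistics
function questionAccuracy(stats: ReviewStatistic[]): number {
  const correct = stats.reduce((n, s) => n + s.meaning_correct + s.reading_correct, 0);
  const incorrect = stats.reduce((n, s) => n + s.meaning_incorrect + s.reading_incorrect, 0);
  return (correct / (correct + incorrect)) * 100;
}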

Ignoring scripts and Tsurukame, the official web app only creates a Review record once both the reading and meaning components for a subject are eventually answered correctly.

So whenever all components of a subject are answered correctly, WaniKani adds exactly one to both reading_correct and meaning_correct for that subject, and zero or more to the incorrect counts (depending on how many times an incorrect answer was given).
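
To make that concrete, here’s a made-up example: a kanji is reviewed once, the meaning is missed twice before being answered correctly, and the reading is answered correctly on the first try. The counts would move like this (hypothetical numbers):

// before the session
const before = { meaning_correct: 7, meaning_incorrect: 3, reading_correct: 7, reading_incorrect: 1 };
// after the session: +1 to each correct count, +2 to meaning_incorrect
const after  = { meaning_correct: 8, meaning_incorrect: 5, reading_correct: 8, reading_incorrect: 1 };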

Thus, a pedantically precise count of “Total Reviews” isn’t simply the sum of all correct and incorrect answers; it’s:

total_reviews =
    correct_radical_meaning
  + (correct_kanji_meaning + correct_kanji_reading) / 2
  + (correct_vocab_meaning + correct_vocab_reading) / 2;
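
Or, as a sketch against the hypothetical ReviewStatistic shape above (radicals contribute one review per correct meaning answer; kanji and vocabulary contribute one review per pair of correct answers):

function totalReviews(stats: ReviewStatistic[]): number {
  return stats.reduce((sum, s) =>
    s.subject_type === 'radical'
      ? sum + s.meaning_correct                             // meaning-only subjects
      : sum + (s.meaning_correct + s.reading_correct) / 2,  // two correct answers per review
    0);
}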

Item accuracy vs. question accuracy

The summary pages (that are soon going away) report item accuracy, the percentage of subjects that were reviewed with no incorrect answers for either reading or meaning. This is usually a smaller number than question accuracy.

For a given radical:

item_accuracy = (meaning_correct - meaning_incorrect)
        / meaning_correct
        * 100;

For a kanji or vocabulary subject, the only absolutely accurate way to calculate item_accuracy is to walk through every individual Review record (which is extremely expensive, since a user might have hundreds of thousands of reviews).
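
For what it’s worth, that exact walk would look something like this. I’m assuming each record from the /reviews endpoint carries incorrect_meaning_answers and incorrect_reading_answers counts for that single review:

interface ReviewRecord {
  subject_id: number;
  incorrect_meaning_answers: number;
  incorrect_reading_answers: number;
}

// Exact item accuracy: the fraction of reviews completed with zero wrong answers.
function exactItemAccuracy(reviews: ReviewRecord[]): number {
  const flawless = reviews.filter(
    r => r.incorrect_meaning_answers === 0 && r.incorrect_reading_answers === 0
  ).length;
  return reviews.length ? (flawless / reviews.length) * 100 : 0;
}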

If you ignore repeated incorrect answers within a single session (which are hopefully fairly infrequent), it can be approximated for kanji/vocab items as:

item_reviews = (meaning_correct + reading_correct) / 2;

item_accuracy = (item_reviews
  - meaning_incorrect - reading_incorrect)
  / item_reviews
  * 100;

(This slightly _under_states the item accuracy, though, as it dings you for every incorrect answer within a session, not just the first.)
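
Putting the radical and kanji/vocab approximations together, again against the hypothetical ReviewStatistic shape from earlier:

function approxItemAccuracy(stats: ReviewStatistic[]): number {
  let reviews = 0;
  let flawless = 0;
  for (const s of stats) {
    // Radicals complete with one correct answer; kanji/vocab need two.
    const itemReviews = s.subject_type === 'radical'
      ? s.meaning_correct
      : (s.meaning_correct + s.reading_correct) / 2;
    reviews += itemReviews;
    // Pretend every incorrect answer spoiled a different review; this is
    // the assumption that makes the estimate understate the true accuracy.
    flawless += Math.max(0, itemReviews - s.meaning_incorrect - s.reading_incorrect);
  }
  return reviews > 0 ? (flawless / reviews) * 100 : 0;
}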


Questions

I know that Tsurukame “batches” data and uploads it to the server periodically (I ran into this after discovering that review records for some users weren’t always in chronological order). I’ve no idea how or why that would affect accuracy stats, though.

For wkstats to be overstating question accuracy, Tsurukame must somehow be under-reporting incorrect answers or over-reporting correct answers. The former seems more likely.

One hypothesis: perhaps it never records more than one incorrect answer within a “session” (however it defines a session). Maybe it just marks a subject review as “incorrect” and doesn’t count all the incorrect meaning/reading replies?

Another hypothesis: how does it handle multiple reviews of the same subject (multiple sessions) between uploads via the API? Is it possible that affects things somehow?

Final hypothesis: @wiersm has the memory of an elephant and really does have 95%+ question accuracy.
