Continuing the discussion from Level 60 and my study habits:
Sorry for the long-winded post, but I'm curious why tsurukame might affect accuracy stats on wkstats. This post is mostly to write my understanding down somewhere.
I suspect @wiersm might be right that tsurukame somehow causes artificially high "accuracy" reporting from wkstats.
(Hopefully, I'm posting this in the Tsurukame thread as intended. I'm unfamiliar with tsurukame and wondering if this is a known issue.)
Here are my stats for comparison:
@wiersm: I never used tsurukame. Since you said you used it often, would you mind posting your wkstats accuracy for comparison?
@rfindley: please confirm my assumption that wkstats calculates this table solely from the review_statistics object for each subject.
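For reference, here's roughly how I picture that data being pulled (just a sketch against the v2 review_statistics endpoint; the pagination handling is from memory, so treat the details as assumptions):
// Sketch: fetch every review_statistic record from the WaniKani v2 API.
// Assumes a personal access token; follows pages.next_url until the last page.
async function fetchReviewStatistics(apiToken) {
  const records = [];
  let url = 'https://api.wanikani.com/v2/review_statistics';
  while (url) {
    const response = await fetch(url, {
      headers: { Authorization: 'Bearer ' + apiToken },
    });
    const page = await response.json();
    records.push(...page.data);              // one entry per subject
    url = page.pages && page.pages.next_url; // null on the last page
  }
  return records;
}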
Gory background re: question vs. review accuracy
Note that Total "Reviews" in the wkstats output is a slight misnomer.
Technically, wkstats appears to sum Correct and Incorrect answers and report it as "Total Reviews".
To be pedantic, I think this would be better labeled "Total Answers" (or Total Quizzes/Questions). It's a subtle distinction, but you might answer incorrectly more than once for a given subject within a single review session.
To my mind, at least, only items/subjects are reviewed. Meanings and readings are individually quizzed and answered for each subject, but, ignoring radicals, you must provide two correct answers to complete a review for a given subject.
WKstats appears to report what I call "question accuracy":
total_correct = reading_correct + meaning_correct;
total_incorrect = reading_incorrect + meaning_incorrect;
total_answers = total_correct + total_incorrect;
question_accuracy = (total_correct / total_answers) * 100;
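Plugged into the data above, the number wkstats displays would then (I assume) come from an aggregation like this sketch, where the reading fields are treated as zero for radicals:
// Sketch: aggregate "question accuracy" over every review_statistic record,
// mirroring the formula above.
function questionAccuracy(records) {
  let correct = 0;
  let incorrect = 0;
  for (const record of records) {
    const d = record.data;
    correct += d.meaning_correct + (d.reading_correct || 0);
    incorrect += d.meaning_incorrect + (d.reading_incorrect || 0);
  }
  return (correct / (correct + incorrect)) * 100;
}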
The API tracks correct/incorrect counts for an individual subject in the Review Statistics data structure. I believe this structure is updated whenever an individual Review record is created for a subject.
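For concreteness, the relevant slice of one such record looks roughly like this (field names as I remember them from the v2 API docs; streaks and timestamps omitted, numbers invented):
// Abridged shape of a single review_statistic record.
const exampleReviewStatistic = {
  object: 'review_statistic',
  data: {
    subject_id: 440,
    subject_type: 'kanji',    // 'radical', 'kanji', or 'vocabulary'
    meaning_correct: 7,
    meaning_incorrect: 1,
    reading_correct: 7,
    reading_incorrect: 2,
    percentage_correct: 79,   // the API's own per-subject question accuracy
  },
};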
Ignoring scripts and tsurukame, the official web app only creates a Review record once both the reading and meaning components for a subject have eventually been answered correctly.
So whenever all components of a subject are answered correctly, WaniKani adds exactly one to reading_correct and meaning_correct for that subject. It will also add zero or more to the incorrect counts for that subject (depending on how many times an incorrect answer was provided).
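In other words, my mental model of what one completed review session does to those counters is roughly the following sketch (not WaniKani's actual code; the per-session miss counts are hypothetical parameters):
// Sketch: counter updates for one completed review session.
// The session only completes once every component has been answered correctly,
// so each completed session adds exactly one "correct" per component.
function applyCompletedReview(stats, meaningMisses, readingMisses) {
  stats.meaning_correct += 1;
  stats.meaning_incorrect += meaningMisses;   // zero or more
  if (stats.subject_type !== 'radical') {
    stats.reading_correct += 1;
    stats.reading_incorrect += readingMisses; // zero or more
  }
  return stats;
}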
Thus, a pedantically precise count of Total Reviews isn't simply the sum of all correct and incorrect answers; it's:
total_reviews =
correct_radical_meaning
+ (correct_kanji_meaning + correct_kanji_reading) / 2
+ (correct_vocab_meaning + correct_vocab_reading) / 2
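Against the review_statistics data, that pedantic count could be computed with something like this sketch (it assumes meaning_correct and reading_correct stay equal for kanji and vocabulary, since each completed session bumps both by one):
// Sketch: count completed review sessions rather than individual answers.
function totalReviews(records) {
  let reviews = 0;
  for (const { data: d } of records) {
    if (d.subject_type === 'radical') {
      reviews += d.meaning_correct;                           // one answer per session
    } else {
      reviews += (d.meaning_correct + d.reading_correct) / 2; // two answers per session
    }
  }
  return reviews;
}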
Item accuracy vs. question accuracy
The summary pages (which are soon going away) report item accuracy: the percentage of subjects that were reviewed with no incorrect answers for either reading or meaning. This is usually a smaller number than question accuracy.
For a given radical:
item_accuracy = (meaning_correct - meaning_incorrect)
/ (meaning_correct + meaning_incorrect)
* 100;
For a kanji or vocabulary subject, the only absolutely accurate way to calculate item_accuracy is to walk through every individual Review record (which is extremely expensive, since a user might have hundreds of thousands of reviews).
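For what it's worth, that expensive exact calculation would look something like this, assuming I'm remembering the Review record's incorrect_meaning_answers / incorrect_reading_answers fields correctly:
// Sketch: exact item accuracy by walking every Review record.
// A review is "clean" only if neither component was missed in that session.
function exactItemAccuracy(reviewRecords) {
  let clean = 0;
  for (const { data: d } of reviewRecords) {
    if (d.incorrect_meaning_answers === 0 && d.incorrect_reading_answers === 0) {
      clean += 1;
    }
  }
  return (clean / reviewRecords.length) * 100;
}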
If you ignore repeated incorrect answers within a single session (which are hopefully fairly infrequent), it can be approximated for kanji/vocab items as:
item_accuracy = (meaning_correct + reading_correct
- meaning_incorrect - reading_incorrect)
/ total_reviews
* 100;
(This slightly _under_states the item accuracy, though, as it dings you for every incorrect answer within a session, not just the first.)
Questions
I know that tsurukame "batches" data and uploads it to the server periodically (I ran into this after discovering that review records for some users weren't always in chronological order). I've no idea how or why that would affect accuracy stats, though.
For wkstats to be overstating question accuracy, the underlying data must somehow be under-reporting incorrect answers or over-reporting correct answers. The former seems more likely.
One hypothesis: perhaps it never records more than one incorrect answer within a "session" (however it defines a session). Maybe it just marks a subject review as "incorrect" and doesn't count all the incorrect meaning/reading replies?
Another hypothesis: how does it handle multiple reviews of the same subject (multiple sessions) between uploads via the API? Is it possible that affects things somehow?
Final hypothesis: @wiersm has the memory of an elephant and really does have 95%+ question accuracy.