In my understanding, an index maps a key to specific rows of a table, not just blocks of binary data, which is why I can’t really follow. Admittedly, there may be I/O overhead I’m not aware of, especially on slow storage media without random access, but at least you save the time of comparing every row in the table.
If an index on a user ID isn’t selective when there are probably at least thousands of users, I wonder what kind of index is.
But admittedly, this is not my area of expertise, so I must have fundamentally misunderstood something.
An index maps to a rowid (or similar), which tells the DB where to look for the row; it then needs to load the specific data page from disk if it is not already in memory, and even on an SSD this I/O takes a long time. This might help: How Postgres Stores Rows
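A tiny sketch of that two-step lookup (using SQLite for portability; the principle is the same in Postgres, and all table/column names here are made up): the index search finds row locations, but the engine still has to fetch the actual rows from the table’s data pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)"
)
conn.executemany(
    "INSERT INTO reviews (user_id, item) VALUES (?, ?)",
    [(i % 50, f"item-{i}") for i in range(1000)],
)

# Without an index the engine has no choice but a full table scan.
plan_no_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM reviews WHERE user_id = 7"
).fetchall()
print(plan_no_index)  # detail column reads e.g. "SCAN reviews"

# With an index it searches the index first, then fetches matching rows.
conn.execute("CREATE INDEX idx_reviews_user ON reviews (user_id)")
plan_with_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM reviews WHERE user_id = 7"
).fetchall()
print(plan_with_index)  # e.g. "SEARCH reviews USING INDEX idx_reviews_user"
```

The plan changes from a scan to an index search, but note that the index only locates rows; every matching row still costs a page fetch.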
Imagine you need to obtain 5 books from 5 different libraries: knowing which book is in which library does not help you if you still need to go to all of them.
You are correct that the user ID will reduce the amount of data fetched, but that does not help in this case, where the rows are so scattered across pages. I admit, though, that I might be completely wrong (I have not seen the execution plan for the query).
Fair enough, but the way I would have understood it:
A full table scan needs to load all blocks into memory too, and so even if the indexed query had to load most or even all blocks, the only additional I/O it needs to perform is loading the index which… probably is not a lot in comparison?
The indexed query avoids more CPU work by not having to compare each row to the value.
So I could see that in a specific case the index might not make the query much faster, but could it actually make it slower?
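To put some (entirely made-up) numbers on that question: if one user’s rows are scattered uniformly across the table, even a moderately selective index scan can touch a large fraction of the data pages, and it does so with random reads rather than one sequential pass. A back-of-envelope sketch, all figures assumed:

```python
# Invented numbers: how many data pages does an index scan touch if one
# user's matching rows are scattered uniformly at random over the table?
table_pages = 10_000      # data pages in the table (assumed)
matching_rows = 5_000     # rows matching the indexed user ID (assumed)

# Probability that a given page holds none of the matching rows
# (simple with-replacement model):
p_untouched = (1 - 1 / table_pages) ** matching_rows
pages_touched = table_pages * (1 - p_untouched)
print(round(pages_touched))  # ~3935 of 10000 pages, each via a random read

# If a random page read costs roughly 4x a sequential one (a common
# planner-style assumption), the index scan already loses to a full scan:
seq_cost = table_pages * 1.0
index_cost = pages_touched * 4.0
print(index_cost > seq_cost)  # True
```

So yes, under these assumptions an index scan fetching only 0.05% of distinct key values can be slower than reading the whole table sequentially, which is why planners sometimes ignore an unselective index.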
That’s just focusing on the read side, though. Adding an index likely makes the write side slower, and for a table that’s written to every time a review is completed, that could have a big impact.
Yes, but it’s still something the WaniKani devs would have to be mindful of. Making the reads faster doesn’t help if it takes down the app when writing. Not that I have any idea if that would be relevant to them in practice.
It seems to me like partitioning the database by user is the way to go then. As a bonus, you could find inactive users and put them in a separate partition to reduce the load on the main partitions.
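The routing idea could be sketched like this (purely hypothetical; Postgres can do this declaratively with `PARTITION BY HASH`, and all names here are invented for illustration):

```python
# Minimal sketch of routing a user's rows to one of N hash partitions,
# with a separate catch-all partition for inactive users.
NUM_PARTITIONS = 16  # assumed partition count


def partition_for(user_id: int, inactive: bool = False) -> str:
    """Return the (hypothetical) partition table name for a user's reviews."""
    if inactive:
        return "reviews_inactive"
    return f"reviews_p{user_id % NUM_PARTITIONS}"


print(partition_for(12345))                 # reviews_p9
print(partition_for(12345, inactive=True))  # reviews_inactive
```

The point of the scheme is that any per-user query only ever hits one small partition instead of the whole table.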
Reorganising databases is not that simple. It would probably take hours just for the update process itself, and even then it would need to be thoroughly tested to ensure nothing went awry during the process.
This is the first time I’ve seen https://nihongostats.com/, pretty cool site! I wanted to develop something roughly along those lines myself. But I assume the site is affected by the /v2/reviews API change and there is no workaround, right?
Databases are not my forte, but this sounds like the typical use case for MapReduce to me: basically what Google, who coined the technique, did with their search results, though of course on a much smaller scale. However, this would require a NoSQL database (like MongoDB), which seems to me the better choice for this type of data, given the usual problem that a relational user table eventually doesn’t fit on one server and doesn’t lend itself to sharding. But of course, hindsight is always 20/20…
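For anyone unfamiliar with the shape of MapReduce, here is a toy single-process sketch (all data invented): map each review record to a `(user, 1)` pair, shuffle the pairs by key, then reduce each group by summing — the same structure as the classic word-count example, just over review records.

```python
from collections import defaultdict
from itertools import chain

# Toy input: three review records (made up).
reviews = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]


def mapper(review):
    # Map phase: emit one (key, value) pair per record.
    yield (review["user"], 1)


def shuffle(pairs):
    # Shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Reduce phase: combine each key's values into one result.
    return key, sum(values)


mapped = chain.from_iterable(mapper(r) for r in reviews)
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'alice': 2, 'bob': 1}
```

In a real deployment the map and reduce steps run in parallel across shards, which is exactly why the data layout matters so much.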
@tofugu-scott
I’ve read a bit in [Userscript] Wanikani Heatmap and it seems most people, myself included, are missing the streak counter above all. So I would suggest implementing this as two new endpoints, for example GET /user/streak/review and GET /user/streak/lesson to soften the blow from removing GET /reviews.
From what I understand about WaniKani’s architecture and data model, the backend looks to me like the better place to do this than the client, because querying the number of reviews for a user for every day since account creation is probably not cheap.
It might even be preferable to add this as a field on the user model, which (after initialization) gets incremented or reset by an hourly cronjob (hourly to avoid timezone trouble) that only checks the past day’s reviews/lessons relative to the user-configured “New day starts at” time, avoiding the JIT calculation in the first place. Changing the “New day starts at” time would probably require reinitializing the two streak fields, but that happens much more rarely than a page refresh from someone with the heatmap userscript installed.
Let me second that, dear sir.
Same here, missing the streak feature for the Heatmap script. @tofugu-scott I should admit that watching the daily tiles get painted was pleasant as well, but most of all I miss the streak counter itself.
I’m sorry for bringing up another service as an example, but the Bunpro folks seem to handle it in pretty much the way I’d be happy with.
I agree with this. My thinking is that the reviews table is very big and write-heavy, which is what caused the negative impact when the new index was added.
Partitioning the table on user id would almost certainly fix the problem but the challenge would be getting there if the reviews table is essential for the main application to work.
(I had MySQL / MariaDB in mind when I wrote this but I could be wrong…)
Heatmap has a fix because it can get by with just pulling recent reviews and keeps the rest of the data cached. It breaks if you move between devices, though.
Other scripts (e.g. ganbarometer) aren’t working, so it doesn’t look like it’s fixed.