New version of API suggestion


#1

As I have moved into higher levels, I have increasingly noticed the problems with the current API. I will explain one common problem and the solutions I’m considering to work around it.
Say you want to use this nice script that shows which reviews are coming up next. It sounds easy: send a request to the API (well, three, since it’s for radicals, kanji and vocab), then use the mighty JS to go through the results and sort them by date. Leaving aside that JS (or any interpreted language, for that matter) is terrible performance-wise when you want to process 6000 items (it still gets done fast enough on a home computer), the main problem the script now faces is that WK simply times out when you request all the levels at once.

Current workarounds:
1. Ask for levels in groups of 10 or so, more or less dodging the issue by sending more requests (getting everything can then take up to a minute or so, but there are no more timeouts).
2. Rewrite the script so that the lowest (burned) levels aren’t requested from the API (ugly, but it saves time and you can get your upcoming reviews in ~30 sec).
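
A minimal sketch of both workarounds, in Python rather than the script’s JS. The URL shape is assumed from the example requests in this post (a comma-separated level list at the end), and `level_batches` and `batch_urls` are hypothetical helper names, not part of any existing script.

```python
from typing import List

# URL shape assumed from the examples in this thread, not taken from the API docs.
URL_TEMPLATE = "https://www.wanikani.com/api/user/{key}/{kind}/{levels}"


def level_batches(first: int, last: int, size: int = 10) -> List[List[int]]:
    """Split levels first..last into consecutive groups of at most `size`."""
    levels = list(range(first, last + 1))
    return [levels[i:i + size] for i in range(0, len(levels), size)]


def batch_urls(key: str, kind: str, first: int, last: int, size: int = 10) -> List[str]:
    """Build one request URL per group of levels.

    Starting at a `first` above 1 skips the lowest (mostly burned) levels,
    which is workaround 2.
    """
    return [
        URL_TEMPLATE.format(key=key, kind=kind, levels=",".join(map(str, batch)))
        for batch in level_batches(first, last, size)
    ]
```

For example, `batch_urls("yourapikeyhere", "vocabulary", 11, 25)` gives two URLs, one for levels 11-20 and one for levels 21-25.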

So I think it is safe to say that neither of these solutions is great on the client side (people don’t like to wait). Server side it isn’t great either. A request for half my vocab (levels 1-25) is 1.1MB of text, so I extrapolate to basically 3MB of text that WK has to generate and send you once you reach higher levels (which means bandwidth costs and some processing power).


So there are two “easy” fixes to implement. The first is to add a new request to the API that returns just this information (the reviews coming up). It might save some processing power on the server (I assume sorting and selecting part of a database is faster than what is basically a full dump) and a lot of bandwidth (probably around 500 items in the worst case, so about a quarter of the bandwidth).
Example request:

https://www.wanikani.com/api/user/yourapikeyhere/study-items/3
^The trailing 3 would refer to 3 days, for example.
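
A rough sketch of the selection such a request could do, written client-side for illustration. The field names follow the item JSON the API returns today; `upcoming_reviews` is a hypothetical name, and skipping burned and locked items is my assumption about what “coming up” should mean.

```python
from typing import Any, Dict, List

SECONDS_PER_DAY = 86400


def upcoming_reviews(items: List[Dict[str, Any]], days: int, now: int) -> List[Dict[str, Any]]:
    """Return items due within `days` days of `now` (Unix seconds), soonest first."""
    cutoff = now + days * SECONDS_PER_DAY
    due = []
    for item in items:
        # Locked items are assumed to carry no user-specific data at all.
        user = item.get("user_specific") or {}
        date = user.get("available_date")
        if date is None or user.get("burned"):
            continue
        if date <= cutoff:
            due.append(item)
    return sorted(due, key=lambda it: it["user_specific"]["available_date"])
```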

The second would be a way to request only part of the information. The current API gives a lot of information on each item. I believe very few scripts care about synonyms, for example. So maybe you could append the fields you want at the end of the URL. This would probably reduce the size of the response to a quarter of its current size in most cases (plus it would be good for every script out there). Bonus points for putting the two together.
Example request:
https://www.wanikani.com/api/user/yourapikeyhere/vocabulary/available_date,srs
^This would return every vocabulary item, but only the available date and SRS level. Following the same syntax as for levels, nothing means everything and a comma-separated list specifies which information to keep.
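
To make the syntax concrete, here is a small sketch of how the field list could be parsed and applied to one flat record. `parse_fields` and `select_fields` are hypothetical names, and ignoring unknown field names is just one possible choice.

```python
from typing import Any, Dict, List


def parse_fields(spec: str) -> List[str]:
    """Turn 'available_date,srs' into ['available_date', 'srs']; '' means no filter."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def select_fields(record: Dict[str, Any], spec: str) -> Dict[str, Any]:
    """Keep only the requested fields; an empty spec keeps everything."""
    wanted = parse_fields(spec)
    if not wanted:
        return dict(record)
    # Unknown names are silently skipped here; a real API might reject them instead.
    return {name: record[name] for name in wanted if name in record}
```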

The second kind of solution, to get around timeouts caused by bandwidth, is to use compression. ZIP compression nets ~90%, and I assume even a naive search and replace (a crude dictionary compression) would net 50% for basically no cost (the server would use short one-character names for every field, plus one table as a reference). This also requires some more work client side, but with a common compression format there are libraries that can take care of it, and with search and replace it is either trivial or not even needed (people just look for part of the information).
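
A sketch of the search-and-replace idea, assuming flat records and at most 26 distinct field names. `build_table`, `shorten` and `expand` are hypothetical names; the actual saving depends on the data, so no percentage is claimed here.

```python
import json
import string
from typing import Any, Dict, List

Record = Dict[str, Any]


def build_table(records: List[Record]) -> Dict[str, str]:
    """Map every field name to a one-character name (the reference table)."""
    keys = sorted({key for record in records for key in record})
    if len(keys) > len(string.ascii_lowercase):
        raise ValueError("this sketch only supports up to 26 field names")
    return dict(zip(keys, string.ascii_lowercase))


def shorten(records: List[Record], table: Dict[str, str]) -> List[Record]:
    """Server side: replace long field names with their short names."""
    return [{table[key]: value for key, value in record.items()} for record in records]


def expand(records: List[Record], table: Dict[str, str]) -> List[Record]:
    """Client side: restore the original field names from the table."""
    reverse = {short: long for long, short in table.items()}
    return [{reverse[key]: value for key, value in record.items()} for record in records]
```

Comparing `len(json.dumps(shorten(records, table)))` with `len(json.dumps(records))` shows the saving for a given data set.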

One last option is to move to binary and output the data in the most efficient way. This requires the most work from developers (both the WK team and third parties), so I don’t expect it to get much consideration. The easier solutions would probably reduce the load enough to make this one look like overkill.

If you have other ideas I haven’t thought of, please suggest them. The last suggestions are in my opinion not the best, since they would bring the most changes to the API, but I decided to present everything I thought of. Please bring your (constructive) comments, and I hope some people from the WK team will see this.

Save our bandwidth, sign the petition! (and as a bonus you save WK’s)


#2

This would be nice.


#3

@meneldal,
I’m working on a centralized data framework to intelligently gather and cache WK and user data.  It will eliminate the vast majority of API calls for any scripts that make use of it.

Upon first sync, it queries data in small batches, showing progress as data is received from WK.
Afterward, it uses the Study Queue API call, combined with various timestamps in the user data, to know what minimal incremental data to query.

I’m hoping other script writers will find this useful, and start to support it.  It should make scripting a lot easier, and will also help reduce redundant data gathering.


#4

There’s no reason to use something like ZIP or custom compression; gzip is sufficient, and indeed already enabled for the API.


#5
rfindley said... @meneldal,
I'm working on a centralized data framework to intelligently gather and cache WK and user data.  It will eliminate the vast majority of API calls for any scripts that make use of it.

Upon first sync, it queries data in small batches, showing progress as data is received from WK.
Afterward, it uses the Study Queue API call, combined with various timestamps in the user data, to know what minimal incremental data to query.

I'm hoping other script writers will find this useful, and start to support it.  It should make scripting a lot easier, and will also help reduce redundant data gathering.

I see what you're doing and I think it's great, but I see one major case where it won't work well: when a user has been away for a while and has reviews from many different levels at the same time (20 different levels or so). In that case your solution won't be much better than the current one; you'll still have to request many levels. In other cases, though, it would be great.
One hard thing to support (for people who dare to use the reorder script) might be handling new items that show up from previous levels, since the current API doesn't tell you which items are in the lesson queue or recent lessons. I guess adding a "last lesson session" to the API could fix this (the information is already there, just not in the API).

I just feel that's a kind of "hack" to get around the main issue: people have to request a full list of items when they care about maybe 10% of the information (or even less). That's why I think the API could be improved to reduce the server-side load, and it wouldn't change anything for people writing scripts (after all, you just get tables with many fields you don't care about).


Oishikatta said... There's no reason to use something like ZIP or custom compression; gzip is sufficient, and indeed already enabled for the API.
 Source? I found no information on the API about any compression at all. I tried some requests and I got raw text. I only took ZIP as an example, when you already get 90% I don't believe you care so much about the exact savings you can get. Plus this test was on vocabulary which is trivially harder to compress (much more actual information/filling ratio compared to kanji or radicals). I don't really see it as a great solution anyway because it doesn't reduce the load on WK's servers.

#6
meneldal said..
Oishikatta said... There's no reason to use something like ZIP or custom compression; gzip is sufficient, and indeed already enabled for the API.
 Source? I found no information on the API about any compression at all. I tried some requests and I got raw text. I only took ZIP as an example, when you already get 90% I don't believe you care so much about the exact savings you can get. Plus this test was on vocabulary which is trivially harder to compress (much more actual information/filling ratio compared to kanji or radicals). I don't really see it as a great solution anyway because it doesn't reduce the load on WK's servers.

I haven't looked into whether WaniKani has it enabled, but if I remember rightly gzip compression is all part of the HTTPS wire protocol - it's not something you would see in the API. It just magically happens if both ends support it and it's enabled on the server.

The basic idea you are suggesting is great - improve response times.
This could probably be done by caching the stuff. (i.e. trading disk space for time).
It would depend on the economic model of the server infrastructure to see whether it would be worth investing the programmers time to save that money.
And since we are still in beta, new features probably trump this work.

I know I've been privately emailing Koichi to consider opening up WaniKani a bit, to let those with some programming skills help out with improvements.
But working out which bits are the 金玉 and which are boring plumbing would also require programming resources that are being used elsewhere.



#7

Well I have to say I didn’t hear about this feature from HTTPS. I guess it at least helps for the bandwidth issue indeed. Though using HTTPS for public data is a bit of a waste of resources in my opinion (but a now very common and imo overused protocol).

The main rewrite of the API would not be too complicated in my opinion. It depends mostly on how the information is stored. But assuming a pretty standard database, it’s mostly a matter of using a select a,b,c instead of a select * (well, maybe not ALL fields are exported, but you get the idea). There will probably be some debate and some checks to do to avoid sending unsafe information, but I don’t believe the implementation is very complicated.
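
A minimal sketch of that idea using Python’s built-in sqlite3. The table layout, the `api_secret` column and the `ALLOWED` whitelist are all made up for illustration; I have no idea what WK’s actual schema looks like.

```python
import sqlite3
from typing import Any, Dict, List

# Hypothetical whitelist: only these columns may ever leave the server.
ALLOWED = {"meaning", "level", "srs", "available_date"}


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table standing in for the real item data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (meaning TEXT, level INTEGER, srs TEXT, "
        "available_date INTEGER, api_secret TEXT)"
    )
    conn.execute("INSERT INTO items VALUES ('gun', 1, 'burned', 1407325500, 'hidden')")
    return conn


def fetch_fields(conn: sqlite3.Connection, fields: List[str]) -> List[Dict[str, Any]]:
    """SELECT only the requested columns; an empty list means every allowed column."""
    unknown = [name for name in fields if name not in ALLOWED]
    if unknown:
        raise ValueError(f"fields not allowed: {unknown}")
    # Column names are interpolated only after the whitelist check above.
    columns = ", ".join(fields) if fields else ", ".join(sorted(ALLOWED))
    cursor = conn.execute(f"SELECT {columns} FROM items")
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor]
```

The whitelist is where the “unsafe information” check would live: anything not listed can never be selected, whatever the client asks for.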

By the way, from checking the API docs, it seems that unless you need the new things from the latest version, you’re better off using the old versions, since they return less information.


#8
baerrach said...

I haven’t looked into whether WaniKani has it enabled, but if I remember rightly gzip compression is all part of the HTTPS wire protocol - it’s not something you would see in the API. It just magically happens if both ends support it and it’s enabled on the server.

meneldal said…
Well I have to say I didn’t hear about this feature from HTTPS. I guess it at least helps for the bandwidth issue indeed.

It’s actually not specific to HTTPS. It’s part of HTTP content negotiation in general. The browser’s request header typically contains:
    Accept-Encoding: gzip, deflate, sdch
And if the server supports it, you get a Response Header like:
    Content-Encoding: gzip

So the content is usually gzipped for both HTTP and HTTPS. But it’s usually transparent to the user, thanks to the browser.
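
Outside a browser you may have to do this step yourself. A small sketch with Python’s standard library, which does not decompress responses automatically; `decode_body` and `fetch` are hypothetical helper names, and whether WK actually answers with gzip is something to check against a real response.

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip if the server's Content-Encoding header says it was applied."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url: str) -> bytes:
    """Request `url`, advertising gzip support, and return the decoded body."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```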

#9
meneldal said...
Well I have to say I didn't hear about this feature from HTTPS. I guess it at least helps for the bandwidth issue indeed. Though using HTTPS for public data is a bit of a waste of resources in my opinion (but a now very common and imo overused protocol).

It's not specific to HTTPS. Assuming you are using Apache, see mod_deflate.

You use HTTPS when you don't want anything on the wire to be snooped. It's much simpler than having to do the encryption in the application layer.
But as you point out, there is nothing of value in the API protocol, so HTTP would avoid the encryption overhead.
It's probably easier from a configuration perspective to redirect HTTP to HTTPS anyway :)

meneldal said... The main rewrite of the API would be not too complicated in my opinion

Which bits do you want to cull?
It looks reasonably lean now; about the only thing you could junk would be some fields of "user_specific". And assuming compression is on, gzip will be replacing duplicate strings with references to an earlier occurrence (read up on the gzip algorithm), so it should be pretty tiny, at least over the wire.
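
One way to get a rough number for this is to gzip a synthetic payload that repeats the same field names many times. The items below are made up (`sample_payload` is a hypothetical helper), so the ratio is only indicative of how well repeated keys compress, not a measurement of the real API.

```python
import gzip
import json


def sample_payload(count: int) -> bytes:
    """Build `count` made-up items that share field names, like the API output does."""
    items = [
        {
            "meaning": f"item {i}",
            "level": i % 60 + 1,
            "user_specific": {
                "srs": "burned",
                "available_date": 1407325500 + i * 3600,
                "meaning_note": None,
                "user_synonyms": None,
            },
        }
        for i in range(count)
    ]
    return json.dumps(items).encode("utf-8")


def gzip_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(payload)) / len(payload)
```

Calling `gzip_ratio(sample_payload(1000))` gives the fraction of the original size that would go over the wire for this synthetic data.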

I'm assuming the biggest cost is pulling it all out of the database, hence the suggestion to cache. But not having access to all that stuff, and to avoid premature optimization, I'd rather have some metrics to work from.



#10

So let’s take an example here:

    {
        "character": null,
        "meaning": "gun",
        "image": "https://s3.amazonaws.com/s3.wanikani.com/images/radicals/80fff71b321c8cee57db7224f5fe1daa331128b5.png",
        "level": 1,
        "user_specific": {
            "srs": "burned",
            "unlocked_date": 1392333657,
            "available_date": 1407325500,
            "burned": true,
            "burned_date": 1407329361,
            "meaning_correct": 8,
            "meaning_incorrect": 0,
            "meaning_max_streak": 8,
            "meaning_current_streak": 8,
            "reading_correct": null,
            "reading_incorrect": null,
            "reading_max_streak": null,
            "reading_current_streak": null,
            "meaning_note": null,
            "user_synonyms": null
        }
    }

A lot of that information is mostly useless. I agree that compression would basically remove most of it, but it still takes time to generate it in the first place (there’s a routine that wrote all the text at some point). The dates can be compressed, but there’s still a limit. Not considering the image link, which adds a lot of entropy, a random entry has maybe ~10-20 bytes of pure entropy (name/meaning/reading) and maybe ~5 from the dates. But even good compression most likely won’t give you a result that good.

For the review queue there’s basically just one thing you’re trying to get: “available date”. Maybe two if you want to highlight apprentice items (so get that srs value, with <3 bits of entropy). That way even thousands of items would net a very small output, faster to process and to generate. But as you said, without knowing what the server’s architecture is, it’s hard to predict how much of a rewrite would be needed.
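
The arithmetic behind that, as a quick sketch. The five stage names and the 32-bit timestamp are my assumptions; the 6000 items figure is the one from the first post.

```python
import math

# Assumed list of srs values; with five of them, log2(5) is about 2.32 bits.
SRS_STAGES = ["apprentice", "guru", "master", "enlighten", "burned"]


def bits_for(n_values: int) -> int:
    """Whole bits needed to tell `n_values` distinct values apart."""
    return math.ceil(math.log2(n_values))


def packed_size_bytes(count: int, timestamp_bits: int = 32) -> int:
    """Size of `count` (available date, srs) pairs with no text overhead at all."""
    bits_per_item = timestamp_bits + bits_for(len(SRS_STAGES))
    return math.ceil(count * bits_per_item / 8)
```

With these assumptions, `packed_size_bytes(6000)` is 26250 bytes, so about 26 KB for every item, compared with the megabytes of text mentioned at the start of the thread.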

What people want to junk really depends on the application, but basically the API is currently in “I give you everything!” mode. And as far as I know, nobody has made an app using the API that needs everything. The second thing a new version of the API would allow is giving users control over their API key (for example allowing only some information to be requested, if you want to keep your synonyms secret).

I think the community server operating some kind of cache would be great but in the end that’s just writing over the API because the API isn’t doing what you want.


#11

@Meneldal,
I totally agree, server side is the best place to improve.  I don’t know how likely that is, so my solution is hopefully interim.

I would love to get a look at the WK database structure and some load statistics to see where the bottleneck is.  But I suspect they could simply index the “available date”, and add an API query to return the currently-dispensed info, but only for items newer than a client-specified timestamp.  I’ve done something similar in a recent consulting project.

Surely the static radical/kanji/vocab table can remain in server cache at all times, since it’s really not that big, and is common to all users.  So, the database load would mostly come from a user’s “available date” data, which should be indexed by timestamp and user.  Then the rid link between that table and the item data table would make for fast fetching.
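
Roughly what I have in mind, sketched with Python’s sqlite3. The two tables, their columns and the index name are guesses made up for illustration, not WK’s real schema.

```python
import sqlite3
from typing import List, Tuple


def make_db() -> sqlite3.Connection:
    """Create a static item table and a per-user table indexed by user and date."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items (id INTEGER PRIMARY KEY, meaning TEXT, level INTEGER);
        CREATE TABLE reviews (
            user_id INTEGER,
            item_id INTEGER REFERENCES items(id),
            available_date INTEGER
        );
        CREATE INDEX reviews_user_date ON reviews (user_id, available_date);
        """
    )
    return conn


def items_newer_than(conn: sqlite3.Connection, user_id: int, since: int) -> List[Tuple[str, int]]:
    """Return (meaning, available_date) for one user's items newer than `since`."""
    rows = conn.execute(
        "SELECT items.meaning, reviews.available_date "
        "FROM reviews JOIN items ON items.id = reviews.item_id "
        "WHERE reviews.user_id = ? AND reviews.available_date > ? "
        "ORDER BY reviews.available_date",
        (user_id, since),
    )
    return rows.fetchall()
```

The client would pass the timestamp of its last sync as `since`, so only rows that changed afterwards are read and sent.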


#12

HTTPS is not really much overhead at all, especially if you deploy it correctly and also use SPDY or something. And it’s not only useful for keeping third parties from seeing the content of the transmission (that’s just a side effect); the main advantage is authentication, so you can’t be MITM-ed. This is always important, whether you send sensitive information or not. See also: https://www.youtube.com/watch?v=cBhZ6S0PFCY (not a really interesting talk, but at least they have knowledge).

Edit: just wanted to add this https://docs.google.com/presentation/d/1BH9DI1XlmukCzU2i8OvxLIfgQf_aGlZgZyvWDSyYyzs/present
some good tips for employing tls