[Userscript] Community Mnemonics (v0.9.7.8)

In order to make sure your code for transforming the markup into html is secure and reliable, you should write separate code that parses the data into a tree and separate code that serializes the tree. (Don’t just do regex replacement or try to convert the markup into HTML in-place. That is too easy to do in a way that is insecure.) You should also make sure your code detects and throws an error on invalid markup like [k]kanji[k][/k].

Again, writing parsing and serialization code can be fairly error prone. I just want to stress to avoid any temptations to do something quick and dirty or take shortcuts in the code, because doing so often leads to security problems.

Edit: Still, just using an iframe with the sandbox attribute probably shouldn’t be the only protection because apparently, there have been security bugs allowing one to bypass the sandbox attribute as recently as a year ago, maybe less.

1 Like

I thought @oldbonsai said in his post that @Samuel-H made his sheet read only too, or did I misunderstand?

It uses the same Spreadsheet. Of course it would be best to copy everything to a new one. Originally I thought it wouldn’t be that easy, because the insert uses a macro, that I wasn’t able to access. But as it turns out I just had to copy the sheet to my own drive to access the macro. :man_facepalming:.
Anyway, that’s the next thing I will be doing as well, change it to a seperate Spreadsheet.

But unfortunately, I have noticed, it is only read only for direct edits on the spreadsheet. Inserting via this macro and in turn the submission of new Mnemonics still works.

    url: "https://script.google.com/macros/s/AKfycbznhpL43Ix-qqO3sNcJmQeQk5dsdW6u0uaZ9to4_8TQho0qcm0/exec",
    type: "POST",
    data: serializedData, ...

But that made me wonder. If the only way for data into the spreadsheet is this macro (Also just Javascript code), could it be enough to sanitize everything, that goes in, there?
But I guess, the other solutions should be implemented as well, for best protection.

Anyway, thanks for helping me find a proper solution. I would be completely lost if I had to do it myself and would probably just settle for a mediocre solution.

Edit: It looks like I misunderstood your post. It does appear that macros can modify read only sheets as long as the sheet is writable by the person who created the macro. So it looks like if you have an App Script which runs as an Installable Trigger and the document is read only to everyone except you, that might be secure as long as the app script does sanitization on all uploads.

I tried installing this but the mnemonics don’t seem to load on any page. I’ve already tried clearing my cache. Could it be a conflict with one of the other scripts I’m using? Does anyone know?

– NVM, I just scrolled up and noticed the update, after which it started working. Thank you. :slight_smile:

Ok, so here is what I have done so far. @est_fills_cando @oldbonsai

I added sandboxed iframes, for user Mnemonics, to be displayed in.
I have copied the original Spreadsheet and have set it to read only as well. Like this the only way for data to enter the Spreadsheet is the Google Apps Script … script. That gets called from the plugin with the data to be inserted.
In this Google Apps Script I added a regex replace, to delete all HTML tags, except the ones used for highlighting. (I know regex isn’t great to sanitize for XSS, but I couldn’t get any proper library to run there. )
It is also hosted on GitHub now.

The Sheet is read only, but everybody can still take a look at it, to make sure there is no malicious code already in there:

I still want to change the highlighting method to something else, more secure and user friendly and also revise Escaping and sanitization in the other parts of the script.

I’m glad it works for you. Make sure to update to the new version though. :durtle:

1 Like

I think the script is still vulnerable even with the changes. I would recommend disabling the submission macro until this is fully fixed. (I assume the document is already read only so the submission macro is the only way for users to change it?) I’m happy to elaborate but it might make sense to wait until the macro is disabled.

Alright, I have disabled the Apps Script (submission macro). So now the Spreadsheet is truly read only.
I think the regex /(<(?!(?:\/?(span|b|i|u|s))\b)[^>]+>)/gmi that I used in there should already filter out any incoming malicious tags. At least I have not found any constellation, that would make it’s way through.
Where do you think might still be a problem?

It appears that you include the text of the mnemonic in the sandbox iframe, but not some of the other content like the username of a person making a request for a mnemonic. (Look at the source of getCMReqList() and the places that function is called.) This appears to create an XSS vulnerability if someone were to submit a request with a fake username that is something like " onmouseover="alert();" which would not be filtered out by your regex.

You might check out the OWASP XSS Prevention Cheat Sheet and OWASP XSS Filter Evasion Cheat Sheet. The first one is how to prevent XSS attacks, and the second one is a list of a very large number of ways of bypassing XSS filters, which is illustrative of why so much care needs to be taken in order to actually prevent XSS. (Some of the attacks in the second link don’t work in modern browsers, but a lot do. The larger point is that unless it is 100% obvious that code outputs safe HTML, then the code is probably vulnerable to some attack.)

Some other possible issues:

  • in case you are skimming this post, make sure to read the very first paragraph
  • unicode flag is not set on javascript regex in the macro, not sure if this is actually a problem or not
  • rowinfo is not sanitized in the macro
  • as mentioned, regex in the macro doesn’t actually prevent XSS. For example, <i onmouseover="alert();">asdf</i> but there are probably lots of other XSS things it lets through even if you fix this example
  • the name of the kanji/vocab includes a fairly large range of unicode characters and to be on the safe side, should probably be checked to make sure the range isn’t larger than it needs to be
  • usernames can and probably should have more aggressive sanitization than mnemonics (I think the length is always >=1 and <=16 chars and I’ve never seen a WK username that doesn’t match /^[a-zA-Z0-9\-_]+$/
  • It would probably be good to use unicode character category matching to verify that each type of data doesn’t contain characters of an unexpected type. For example, kanji/vocab names should probably only include categories L, Nd, and the handful of specific symbols that WK uses in vocab names like the japanese tilde and dash.

More fundamentally, I am concerned that both the macro and client side script access unsanitized data in so many places that it will be difficult to make secure and essentially impossible to verify the security. (This is a problem you only inherited of course.) For example, it looks like there are over 200 places in the client side code where CMData is referenced, and CMData contains unsanitized data. I certainly wouldn’t be confident in my own ability to check every one of them to make sure that the script isn’t doing anything unsafe without making any mistakes. And it looks like some other global variables may also depend on unsanitized data too. The server-side macro similarly has a bunch of places where it references unsanitized data, not nearly as many, but enough to make me concerned that the bullet points above might not include every potential issue.

So how to fix this? Here is what I would do if I were writing it. For the macro, have a separate variable for each parameter instead of an array (making the code less general since it will only work for your spreadsheet, which is fine). Set those variables once at the very beginning of the macro to sanitized versions of each corresponding parameter and then only reference those sanitized variables for the rest of the macro, not the original unsanitized parameters. To do this of course, you would need to have a robust and comprehensive sanitization functions specific to each parameter the macro takes. (A different sanitization function for each different type of data and also taking into account how the data is used. See below) For the mnemoic parameter specifically, I would recommend parsing the data into a json tree, and for each text node in the tree, converting the text to htmlentities using the following code (in addition to checkin the unicode character category as mentioned in the last bullet above):

    // converts the given string to htmlentities
    function htmlEntities(str) {
        let codePoints = Array.from(str).map(unicodeChar => unicodeChar.codePointAt(0));
        if (!codePoints.every(codePoint => Number.isInteger(codePoint) && codePoint>=0)) { // also prevents null and undefined
            throw RangeError('Invalid code point');
        return codePoints.map(codePoint => '&#' + codePoint.toString() + ';').join('');

Here is an example of what the format for the tree could look like:


The client side code in the browser could then download the json and convert it to safe html when it needs to display it (still making sure to display it in a sandbox iframe for extra protection).

For the client side script, after taking a closer look, I think the amount of refactoring needed may be so large that it may be easier to basically gut all the logic in the script, maybe just keeping the CSS and mimicking some of the html structure. (As mentioned, the problem is in the huge number of places data is used. And even if you sanitize all the data server side, the type of sanitization you need to do has to also depend on how the data is used. (See the OWASP guide on preventing XSS.) If I were doing this, I would have a clear separation between parts of the script that handle sanitized vs. unsanitized data (with code handling the latter being very short) with clear comments on where it is safe to use each type of sanitized data and reducing the number of loc to where you reference variables that contain even sanitized data to less than 10.

By the way, if you want to make sure that people can only submit mnemonics under their own real Wanikani username, there is relatively straightforward way to check this: in the client create a sha256 hash of the submission and temporarily set the user’s profile text to this hash, then have the macro compute its own hash of the submission and fetch a copy of the user’s profile page and verify that the hash it retrieves matches the hash it computed itself. In order for this to be secure, you do need to make sure that processing the same request more than once has the same effect as just processing it once. Otherwise, there would be a short window in which someone could do a replay attack.

1 Like

Wow, thank you for this comprehensive analysis. It’s really awkward that I didn’t notice something so obvious, like the request usernames. But all the other stuff, I never would have thought of.
I will definitely give this XSS Prevention Cheat Sheet a thorough read before continuing. It is a lot of stuff to keep in mind.
I also already have thought about re implementing the whole thing, since it’s really hard to comprehend all of this undocumented code and retrace where unsanitized data is referenced.
At first I thought it would be better to fix the problems in this script, before making a new version of it from scratch, to at least have a working version in the meantime. But with all the issues, you just brought up, I begin to think, it might actually be better to ditch this one and just start over.

Either way, I want to do it properly this time and not rush any solution, that still has security issues. So the Spreadsheet will remain read only, until then.

I will post here again, once I have decided, what to do next.

That sounds like a good plan. It’s very hard to avoid making mistakes in a large codebase, so I wouldn’t beat yourself up too much. That’s why it’s a good idea to take precautions like minimizing the amount of code where making a mistaking could cause a security problem (for example, by sanitizing data very early and minimizing the number of lines that deal with even sanitized user data) and things like adding multiple layers of defenses so that if one layer has a bug, users are still protected by the other layers.

Not sure how helpful it is but I was working on a rewrite

I haven’t touched it in a while, but here’s the half-done rewrite code Tellow Krinkle / WK Community Mnemonics · GitLab

IIRC the rewrite successfully loads the existing database but is completely missing functionality to submit new mnemonics

1 Like

Thanks mate,
I will take a look at it soon. But currently I am super busy with exams in university and will be for at least two more weeks.

I took a quick look. On line 1162 and subsequent lines, is the quantity CMItem.u[CMSortMap[mnemType][CMPage]] already escaped? If not, that’s an XSS vulnerability. I didn’t look through very much of the code, but in what I read, I also saw a couple other things that worried me even though I don’t immediately see how to exploit them:

  1. The regex doesn’t set the unicode awareness flag.
  2. Constructing HTML by string concatenation greatly increases the risk of XSS. For example, if somewhere else there is even a single unmatched quote, the generated HTML becomes vulnerable to XSS. Example:
let sanitized = sanitizeMnemonic('<span class="onmouseover=alert();"></span>');
let html =  '<a href="https://www.wanikani.com/>WaniKani HomePage</a>' + sanitized; // causes XSS

If you insert the contents of the above html variable into the document then it results in an <a> tag with the onmouseover attribute set to "alert();" which is an XSS.
The right way to fix this is to switch to OOP and write a class for an html tree node (sort of like a stripped down version of javascript’s own HTMLElement class) and give it its own equivalent of an outerHTML method which converts the tag and its subtree into a string. Then you only need to make sure your quotes are properly matched in just one place (the part of the outerHTML method where you turn the attribute list into a string).
3. You don’t whitelist the contents of the class attribute for span. (The attack in the previous comment would be harder if this were done.)
4. You use encodeURIComponent() for escaping things that go in URL’s but this does not escape single quotes which can allow you to add empty unauthorized attributes to tags if you use single quotes instead of double quotes to enclose the url. I don’t know how to exploit this to do XSS, but anything that allows creating malformed HTML is worrying.
5. It would really improve the security if, in addition to sanitizing, you also put everything generated by users including usernames, posts, etc only in an iframe with the sandbox attribute.
6. The mnemonic text is sanitized via regex replacement. A more robust method of sanitization which would require relatively little change to your existing parsing code is to modify your parsing code to output an array that looks like

[{type:'plaintext', text: 'This kanji is '},
 {type:'reading', text: 'Ko'},
 {type:'plaintext', text: 'ichi'}]

Then, you could write a function which goes through the array and converts each text attribute into htmlentities, thereby sanitizing the data. Finally, you can convert the sanitized data to html by looping through the sanitized list. The advantage of doing it this way is that even if there is a major error in your parsing code, it won’t result in an XSS vulnerability
7. In excapeHTML, why not convert all characters to html entities?

    // converts the given string to htmlentities
    function htmlEntities(str) {
        let codePoints = Array.from(str).map(unicodeChar => unicodeChar.codePointAt(0));
        if (!codePoints.every(codePoint => Number.isInteger(codePoint))) { // also prevents null, undefined, and NaN
            throw RangeError('Invalid code point');
        return codePoints.map(codePoint => '&#' + codePoint.toString() + ';').join('');
  1. There are around 200 references to CMData in the script. If the content of CMData is not sanitized, it is very difficult to check all uses of CMData to make sure none of them has an XSS vulnerability. It would be better to sanitize your data immediately after it is received from the server and before storing it anywhere and to never even store an unsanitized copy. See also my earlier post to @Dakes suggesting completely scrapping the original code for the UserScript.
  2. If you accidentally swap uses of encodeURIComponent and escapeHTML it results in an XSS vulnerability. You can avoid even the possibility of accidentally doing this if you implement the fix suggested item 2.

Edit: added items 4 and 9 above.

Sorry, I left the old code in as reference as I was writing new code, I just went through and marked it as deprecated, they shouldn’t be called. CMData is forever empty


What’s the problem with a non-unicode-aware regex in this case?

I switched the HTML generation code to use HTMLElements, which seem to automatically escape things for you

This scrip doesn’t work on Safari? I have Safari v14 and I can’t make it work!

id like to see this one back up and running too =D

This is exactly what I’m looking for, thank you so much, TK and Dakes!

Does this script still work? First it would load for ages and then display an error message, after deleting all WK cookies it doesn’t even show up anymore.

Hello people :durtle_hello:

Sorry it took so long, but I finally got around to remaking the whole Community Mnemonics Script. But safe and properly this time. :high_touch:
Head over to the new post for more infos and installation instructions: [Userscript] Community Mnemonics 2 (WKCM2) 0.2 :curry:

@Liras @neversleep @junemofu

1 Like