It appears that you include the text of the mnemonic in the sandbox iframe, but not some of the other content, such as the username of the person making a request for a mnemonic. (Look at the source of getCMReqList()
and the places that function is called.) This creates an XSS vulnerability: someone could submit a request with a fake username along the lines of " onmouseover="alert();"
which would not be filtered out by your regex.
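To make the attribute-breakout concrete, here is a minimal sketch. The variable names and the HTML shape are invented for illustration, not taken from your script; the point is just what happens when a username is concatenated into an attribute context without escaping:

```javascript
// Hypothetical illustration: building HTML by string concatenation
// places the attacker-controlled username inside an attribute.
const username = '" onmouseover="alert();';
const unsafe = '<span title="' + username + '">request</span>';
// → <span title="" onmouseover="alert();">request</span>
// The quote in the username closes the title attribute, and the rest
// of the string becomes a live event-handler attribute.
```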
You might check out the OWASP XSS Prevention Cheat Sheet and OWASP XSS Filter Evasion Cheat Sheet. The first explains how to prevent XSS attacks, and the second lists a very large number of ways of bypassing XSS filters, which illustrates why so much care needs to be taken to actually prevent XSS. (Some of the attacks in the second link don’t work in modern browsers, but a lot do. The larger point is that unless it is 100% obvious that code outputs safe HTML, the code is probably vulnerable to some attack.)
Some other possible issues:
- in case you are skimming this post, make sure to read the very first paragraph
- the unicode flag is not set on the JavaScript regex in the macro; I’m not sure whether this is actually a problem
- rowinfo is not sanitized in the macro
- as mentioned, the regex in the macro doesn’t actually prevent XSS. For example, it lets through
<i onmouseover="alert();">asdf</i>
and there are probably lots of other XSS payloads it allows even if you fix this particular example
- the kanji/vocab name field accepts a fairly large range of Unicode characters; to be on the safe side, it should probably be checked to make sure the range isn’t larger than it needs to be
- usernames can and probably should have more aggressive sanitization than mnemonics (I think the length is always >=1 and <=16 chars, and I’ve never seen a WK username that doesn’t match /^[a-zA-Z0-9\-_]+$/)
- It would probably be good to use Unicode character category matching to verify that each type of data doesn’t contain characters of an unexpected type. For example, kanji/vocab names should probably only include categories L and Nd, plus the handful of specific symbols that WK uses in vocab names, like the Japanese wave dash and prolonged sound mark.
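As a sketch of what that per-field validation could look like, JavaScript regexes with the u flag support Unicode property escapes. The exact character set WaniKani actually uses is an assumption here and would need to be verified before relying on it:

```javascript
// Username: ASCII letters, digits, hyphen, underscore, 1-16 chars
// (assumed from observed WK usernames).
const usernamePattern = /^[a-zA-Z0-9\-_]{1,16}$/;

// Kanji/vocab names: any letter (\p{L}) or decimal digit (\p{Nd}),
// plus the specific extra symbols WK uses — the wave dash 〜 and the
// prolonged sound mark ー are listed here as examples; adjust as needed.
const vocabPattern = /^[\p{L}\p{Nd}〜ー]+$/u;

console.log(usernamePattern.test('some_user-42')); // true
console.log(vocabPattern.test('食べ物'));           // true
console.log(vocabPattern.test('<script>'));         // false
```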
More fundamentally, I am concerned that both the macro and client side script access unsanitized data in so many places that it will be difficult to make secure and essentially impossible to verify the security. (This is a problem you only inherited of course.) For example, it looks like there are over 200 places in the client side code where CMData
is referenced, and CMData contains unsanitized data. I certainly wouldn’t be confident in my own ability to check every one of them to make sure that the script isn’t doing anything unsafe without making any mistakes. And it looks like some other global variables may also depend on unsanitized data too. The server-side macro similarly has a bunch of places where it references unsanitized data, not nearly as many, but enough to make me concerned that the bullet points above might not include every potential issue.
So how to fix this? Here is what I would do if I were writing it. For the macro, have a separate variable for each parameter instead of an array (making the code less general since it will only work for your spreadsheet, which is fine). Set those variables once at the very beginning of the macro to sanitized versions of each corresponding parameter, and then only reference those sanitized variables for the rest of the macro, never the original unsanitized parameters. To do this, of course, you would need robust and comprehensive sanitization functions specific to each parameter the macro takes (a different sanitization function for each type of data, also taking into account how the data is used; see below). For the mnemonic parameter specifically, I would recommend parsing the data into a JSON tree, and for each text node in the tree, converting the text to HTML entities using the following code (in addition to checking the Unicode character category as mentioned in the last bullet above):
```javascript
// Converts the given string to HTML entities
function htmlEntities(str) {
    let codePoints = Array.from(str).map(unicodeChar => unicodeChar.codePointAt(0));
    if (!codePoints.every(codePoint => Number.isInteger(codePoint) && codePoint >= 0)) { // also rejects null and undefined entries
        throw new RangeError('Invalid code point');
    }
    return codePoints.map(codePoint => '&#' + codePoint.toString() + ';').join('');
}
```
Here is an example of what the format for the tree could look like:
```javascript
{
    type: 'root',
    children: [
        {
            type: 'plaintext',
            text: 'Here is some "vocabulary":',
        },
        {
            type: 'vocabulary',
            text: '𝌆.𝌆𝌆😀',
        },
        {
            type: 'plaintext',
            text: '. The end.',
        },
    ],
}
```
The client side code in the browser could then download the json and convert it to safe html when it needs to display it (still making sure to display it in a sandbox iframe for extra protection).
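One way that conversion could look is below. The node types mirror the example tree above, and the span wrapper is an invented detail, not something from your script; the key property is that every text node is entity-escaped before it touches the HTML string:

```javascript
// Escape every character of a text node as a numeric HTML entity
// (same idea as the htmlEntities function above).
function escapeText(str) {
    return Array.from(str)
        .map(ch => '&#' + ch.codePointAt(0) + ';')
        .join('');
}

// Walk the tree and build HTML; only literal markup written here is
// emitted unescaped, never data from the tree.
function renderNode(node) {
    switch (node.type) {
        case 'root':
            return node.children.map(renderNode).join('');
        case 'plaintext':
            return escapeText(node.text);
        case 'vocabulary':
            // e.g. wrap vocabulary in a span the CSS can style
            return '<span class="vocab">' + escapeText(node.text) + '</span>';
        default:
            throw new Error('Unknown node type: ' + node.type);
    }
}
```

With this structure, even a malicious text value like <i onmouseover="alert();"> comes out as inert entities rather than markup.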
For the client side script, after taking a closer look, I think the amount of refactoring needed may be so large that it would be easier to basically gut all the logic in the script, maybe keeping just the CSS and mimicking some of the HTML structure. (As mentioned, the problem is the huge number of places where data is used. And even if you sanitize all the data server side, the type of sanitization you need also depends on how the data is used; see the OWASP guide on preventing XSS.) If I were doing this, I would keep a clear separation between the parts of the script that handle sanitized vs. unsanitized data (with the code handling the latter being very short), with clear comments on where it is safe to use each type of sanitized data, and I would reduce the number of lines of code that reference variables containing even sanitized data to fewer than 10.
By the way, if you want to make sure that people can only submit mnemonics under their own real Wanikani username, there is a relatively straightforward way to check this: in the client, create a SHA-256 hash of the submission and temporarily set the user’s profile text to this hash; then have the macro compute its own hash of the submission, fetch a copy of the user’s profile page, and verify that the hash it retrieves matches the hash it computed itself. For this to be secure, you do need to make sure that processing the same request more than once has the same effect as processing it once; otherwise, there would be a short window in which someone could perform a replay attack.