Manga Kotoba: Manga Frequency Lists and Stats

mitrac · April 21, 2024, 4:56am

Makes sense, I am on dark mode! I forgot that breaks stuff. I’ll try that out today and let you know!

Ooooh isn’t that fun! Amazing! I just gave it a whirl and that is perfect timing for とんがり帽子 starting, thank you!! I like how the table splits at the page level, that makes it visually way easier to leave and come back, or glance away with a split screen, fantastically useful!

I’ll let you know how I get on with it through the volume and see if it’s useful enough to warrant improvements. So far I’m just super happy it’s there as it is, thanks!

That is an interesting idea. I agree it’s not immediately obvious how to use it, but maybe just having that in the back of your mind means that, at some point in the future, there will be a tangible need and that will be the solution you pull out.

ChristopherFritz · April 25, 2024, 5:30am

I recently added per-volume reading status (no interaction with per-series status yet). I’ve now further added per-volume reading status on series pages:

(They still have to be changed on volume pages, though.)

Still pending: Creating a “recommended words” list page based on volumes reading, owned, and wish-listed.

So, it seems this is possible entirely with a SQL query.

I haven’t tested it yet, but PostgreSQL has a function called stddev_pop that returns the “population standard deviation of the input values”. These results can then be stored in a materialized view, so they can be queried without having to re-calculate on every request (but I’d need to re-run the query every time I add a new volume).

Something like this (adding here for my own benefit, but also for any SQL enthusiasts who are curious):

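-- Untested sketch: one row per word, with the spread of the pages it appears on.
CREATE MATERIALIZED VIEW word_deviation AS
SELECT
    word,
    STDDEV_POP(page_number) AS deviation
FROM
    volume_words
GROUP BY
    word;
-- After adding a new volume: REFRESH MATERIALIZED VIEW word_deviation;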

I haven’t tested it yet, but I’m excited to try it out when I get a chance.

I wasn’t planning to implement this solely because I thought it would be too difficult.

But I do feel this information is absolutely necessary at a series level, where one word can appear 50 times spread across 10 volumes, and another can appear 50 times spread across a five-chapter story arc.

This is true for the volume level as well, but to a much smaller degree.

Although I haven’t implemented anything for it yet, it’s time to start thinking about how to present the information.

For those who, like me, are unfortunately not statistically inclined and learned, here is an example of what the standard deviation is (alongside the mean) for a few number sets:

Pages	Deviation	Mean	Comments
3, 7, 12, 44, 67, 77, 89, 105, 124, 175	55.859	70.3	Words spread throughout volume.
3, 17, 22, 54, 67, 87, 99, 115, 224, 275	89.347	96.3	Same, but more spread out.
44, 45, 47, 47, 47, 48, 49, 49, 51, 52	2.470	47.9	Words localized to several pages.
1, 2, 2, 3, 3, 107, 108, 108, 108, 109	55.766	55.1	Words localized to two separate scenes.
1, 2, 2, 3, 3, 207, 208, 208, 208, 209	108.469	105.1	Same, but spread farther apart.
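
For anyone who wants to reproduce the table, here is a minimal Python sketch. It uses only the page lists above and says nothing about how Manga Kotoba computes anything. One thing worth flagging: the deviation figures in the table appear to match the sample standard deviation (dividing by n - 1, which is STDDEV_SAMP in PostgreSQL) rather than the population one (STDDEV_POP), which comes out slightly smaller. The row labels are just shorthand for the Comments column.

```python
from statistics import mean, pstdev, stdev

# Page lists copied from the table above; labels are shorthand for the comments.
rows = {
    "spread throughout": [3, 7, 12, 44, 67, 77, 89, 105, 124, 175],
    "more spread out": [3, 17, 22, 54, 67, 87, 99, 115, 224, 275],
    "several pages": [44, 45, 47, 47, 47, 48, 49, 49, 51, 52],
    "two scenes": [1, 2, 2, 3, 3, 107, 108, 108, 108, 109],
    "two scenes, farther apart": [1, 2, 2, 3, 3, 207, 208, 208, 208, 209],
}


def summarize(pages):
    """Return (sample deviation, population deviation, mean) for a page list."""
    return stdev(pages), pstdev(pages), mean(pages)


for label, pages in rows.items():
    sample, population, average = summarize(pages)
    print(f"{label}: sample={sample:.3f} population={population:.3f} mean={average:.1f}")
```

The "two scenes" row and the "spread throughout" row land on nearly the same deviation, which is one reason the number is hard to use on its own.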

And…I have absolutely no idea how to make use of these numbers in a meaningful way.

mitrac · April 25, 2024, 7:38am

I’ve seen that, but I wasn’t sure how it would affect anything - I like the idea of the volume based recommended words (rather than the series based one), that would be very useful!

On my log I had mentioned some benefit for prelearning words for Frieren (has furigana, slower club). But after week 1 of using this for Tongari, the preparation was well worth it. The week 1 reading was still challenging enough I wasn’t sure I would be able to keep up, but (since I’m away next week), I did the week 2 reading. Perhaps now with the story moving and the intro out of the way, the vocabulary was a lot more predictable and there were loads of kanji where I thought, this is so awesome that I pre-learned you! It definitely made it easier to half remember/guess other kanji and have many fewer lookups. So whereas the Week 1 reading I did over the course of 5 days, the Week 2 reading I did in 2 sittings on the same day! I’ll write more eventually on my log, but it’ll be a while so I thought I’d let you know first!

Your SQL adventures look really fun.

only with effort!

My first thought was… for vocab and kanji with a lower deviation, a lower incidence, and a lower perceived usefulness (to me), I would choose not to learn those words. Instead I’d wait to see them in context and look them up once; that will be enough to get me through the next pages, and I’m not worried if I forget them.

Then for a higher deviation in a series or volume, if there are enough incidences - I’m going to consider learning that more, because otherwise I’ll look it up every. single. time.

But that falls down in the last two examples where you showed they have a high enough frequency and deviation to consider learning, but are concentrated at 2 distinct scenes, so… maybe not so much.

I wonder whether, if you ever wanted to make use of this, a visual representation would be more useful. But then the scales are going to be funny: data (where you want to see whether the words are grouped within ~5 pages) … tumbleweeds for 200 pages … data.

So my last thought was that, actually, the simple list of page numbers you made above might be the most useful, as an optional pop-up.
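
One way to make that page list easier to read at a glance is to group it into "scenes", which also sidesteps the case where two tight clusters produce a large deviation. This is a minimal sketch, not anything the site does: the function names and the `max_gap` of 5 pages (borrowed from the "~5 pages" remark above) are assumptions.

```python
def group_into_scenes(pages, max_gap=5):
    """Split page numbers into runs whose neighbours are at most max_gap apart."""
    scenes = []
    for page in sorted(pages):
        if scenes and page - scenes[-1][-1] <= max_gap:
            scenes[-1].append(page)  # close enough: same scene
        else:
            scenes.append([page])  # gap too large: start a new scene
    return scenes


def describe(pages, max_gap=5):
    """Summarise each scene as (first page, last page, occurrences)."""
    return [(s[0], s[-1], len(s)) for s in group_into_scenes(pages, max_gap)]


print(describe([1, 2, 2, 3, 3, 107, 108, 108, 108, 109]))
# [(1, 3, 5), (107, 109, 5)]
```

A word with many occurrences but only one or two scenes might be a "look it up once" word, while a word spread over many scenes might be worth pre-learning.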

ChristopherFritz · May 5, 2024, 6:17pm

Recommended Words

This has now been added to the dashboard:

Similar Difficulty Series

I’m experimenting with spotlighting a random series of similar difficulty to the one being viewed:

This currently picks a series where the user’s known words and known sentences are both within 3% higher or lower than the series being viewed. (I plan to lower these percentages as I add more series to the site.)

This does not currently factor in whether the series have furigana or not, nor does it yet factor in word density.
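
The selection rule described above can be sketched roughly as follows. This is an illustration only: the `SeriesStats` type, the function names, and the reading of "within 3%" as three percentage points are all assumptions, and the titles in the usage note are placeholders.

```python
import random
from dataclasses import dataclass


@dataclass
class SeriesStats:
    title: str
    known_words_pct: float  # percent of words the user has marked known
    known_sentences_pct: float  # percent of sentences made up of known words


def similar_series(current, candidates, tolerance=3.0):
    """Keep candidates whose two percentages are both within the tolerance."""
    return [
        s
        for s in candidates
        if s.title != current.title
        and abs(s.known_words_pct - current.known_words_pct) <= tolerance
        and abs(s.known_sentences_pct - current.known_sentences_pct) <= tolerance
    ]


def pick_similar(current, candidates, tolerance=3.0):
    """Return one random similar series, or None when nothing qualifies."""
    matches = similar_series(current, candidates, tolerance)
    return random.choice(matches) if matches else None
```

For example, with a current series at 80% words and 60% sentences, a candidate at 82.5% and 58% would qualify, while one at 79% and 64% would not, because the sentence figure is four points away.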

Once I refine this a bit, I may replace this with two entries, one for a random series a little harder and another for a random series a little easier. I’m undecided at the moment.

Current Series and Volume Counts

Manga Kotoba now has frequency lists for 1,909 volumes across 594 different series.

If the site is to one day find success as a tool for Japanese learners who plan to read manga, the first step is to have enough coverage so anyone trying out the site can find a frequency list for a manga they want to read, even if it’s just the first one or two volumes.

Of course, adding a ton of unknown series based on time-limited free previews doesn’t help with that. But once a learner has invested time in marking their known words for series they are planning to read, the site becomes useful for finding random series that should be easier to tackle first.

I still haven’t found a way to utilize this properly.

I feel like I’m 75% of the way there, but the other 25% needs to be worked out before I’d feel good about including it.

mitrac · May 5, 2024, 6:44pm

Very fun! I just gave it a test ride and the volume recommendations still behave the same as my series recommendations. I have many series set as owned, reading, or wishlist, but only a few volumes categorised (otherwise they are blank). Is it possible that the series selection is overriding the volume? So for example, I have “The Chef” series set as owned but no volume marked as reading/owned/wishlist. This puts all vocab for all volumes in my series and volumes word recommendation lists, whereas I would have expected to have to mark a volume purposefully to such a category for words to appear in the volume-based recommendation list.

I can see some logic in the series selection overriding the volume selection, but that would be really fiddly. Currently marking a series in a category is the fastest way to get it on the dashboard and set it into the dashboard categories. So it’s unavoidable to do the volume based marking for organisational purposes. At least the way I would use it, it would make sense to drive vocab based decisions with the volume selection as the master.

I hope that makes sense…

Neat! I didn’t notice it until I specifically went looking for it, but I like it!

So impressive. In addition to a lot of random volumes, you have a good coverage of book club picks, so I think you’re on a good way.

ChristopherFritz · May 5, 2024, 6:52pm

The good news is that this is literally impossible to happen.

The bad news is…that makes it difficult to figure out what’s happening…

Can you provide a screenshot of the first few words, including the header and page navigation?

Something like this.

Note: Clicking on the “Usage” icon currently only returns series-level results.

I’m always glad to see when a nomination has a time-limited free volume 1.

mitrac · May 5, 2024, 7:19pm

screenshots

Looking closer, the lists are very different, and what is puzzling is that there is a selection of series in both that are on wishlist and owned that have the same categorisation, but they might appear in one and not the other.

You’ve made good use of them!

Sisthra · May 18, 2024, 6:04am

I’ve been using the site for a bit now and wanted to pop in with some feedback.

First of all, thank you so much for making this! I took a bit of time to mark the majority of words I know in the system and clicking on all the available series so it would track my progress with them, then decided to pick up and read the series that according to the website I knew the absolute most words in: it turned out to be Orange with around 85% words known (even more than Yotsuba, which is sitting at 60% despite being recommended for absolute beginners… But I probably know more words inside it than I happened to mark as known), so I picked up the first volume.
I then started to read it and set aside some ten minutes a day where I would add the remaining words in Volume 1 of Orange to my SRS little by little… This was my first time studying words that would come up in something I read in advance and it worked pretty well!

I was a bit worried that having the vocabulary list would spoil some plot points, but having 0 context for them actually made encountering them in the volume exciting instead, so that was nice.

Most of all, I was astonished at how accurate the system ended up being, because I really didn’t have many problems reading the first volume at all! It was a massive confidence boost to be able to fly through pages when in the meantime I have been struggling through reading the first volume of the Scarlet and Violet Pokémon manga (not exceedingly difficult and there is furigana, but it has a lot more slang and just a completely different vocabulary I guess. Having to parse “what is random string of katakana supposed to mean? Oh nvm it’s the name of this Pokémon” also breaks the flow of reading considerably.
Manga Kotoba itself says that my knowledge of the volumes of Pokémon Special recorded is only 60%, which looks quite low to me even accounting for some words actually being names of characters/loanwords and stuff).

Orange meanwhile is mostly about high school things so it just has a much narrower set of words it uses for now. (Which doesn’t mean a simpler plot, on the contrary!)

So I’m excited to use the website to keep studying the words I’m missing from the Orange series while I keep reading it, and then see where that left me compared to all the other series I’ve marked as being interested in!

I’ve been using JPDB as a mix of dictionary/srs system for vocabulary, but I’ve never quite found that its premade decks worked. They are useful to get a feel for the most common words of something but, apart from the fact that it’s missing manga completely, I prefer how Manga Kotoba shows the same info.

Watching the progress bars fill up as you mark more and more words as known is incredibly motivating and really effective in immediately telling you at a glance how much of a certain series you could expect to be able to read.

So really, thank you so much!!

ChristopherFritz · May 18, 2024, 4:41pm

There are a couple of reasons for this!

First, Japanese sentence parsers are adept at parsing sentences that contain kanji, but not quite as good when it’s all kana. This means Yotsuba’s dialogue has a higher error rate when parsing for words.

If you find your percent of known words for the series is highest in volume one, that’d likely be because Yotsuba uses kanji in the first volume.

The second reason is that Manga Kotoba tracks known words separately when they appear with kanji, as hiragana, and as katakana. There are probably a bunch of words you know that appear as hiragana in Yotsuba, but you only have them marked as known with kanji. (At least, that’s been my own experience.)
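
To illustrate the second reason, here is a toy sketch of how per-form tracking can pull a percentage down. The function name and word lists are made up for the example, and the real site's data model is not shown in this thread.

```python
def percent_known(words_in_volume, known_forms):
    """Percent of word occurrences whose exact written form is marked known."""
    if not words_in_volume:
        return 0.0
    hits = sum(1 for word in words_in_volume if word in known_forms)
    return 100.0 * hits / len(words_in_volume)


# Hypothetical user who marked only the kanji forms as known.
known = {"今日", "食べる", "大丈夫"}

kanji_volume = ["今日", "食べる", "大丈夫", "花火"]
kana_volume = ["きょう", "たべる", "だいじょうぶ", "はなび"]

print(percent_known(kanji_volume, known))  # 75.0
print(percent_known(kana_volume, known))   # 0.0
```

The reader knows the same three words in both cases, but the all-kana volume reports nothing as known until the hiragana forms are marked too.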

It is a bit misleading, but I don’t think there’s any good way around it, unfortunately!

One of my goals for Manga Kotoba is to create the site I wish I had when I started out reading.

The best part is even after you learn many words, and encounter many series where you know a large percentage of words, you can always discover new genres where you’re back at square one, and have a vast number of high frequency words to tackle.

This can be an issue sometimes. I learned to accept it as the price of learning from a frequency list. “Oh, this word means amusement park. I guess I know what’s coming up in this volume.”

But as a motivating factor, the more words you learn, the less potential for spoilers you’ll encounter in other series’ frequency lists.

I also agree about the lack of context, which softens the spoiler blow a bit as you have no idea how they’ll apply until you encounter them (and hopefully recognize them without having to stop and look them up!)

The best part is there’s still room for improvement.

For example, there’s a big difference between “I will need to look up one word per sentence” and “I will need to look up two or more words per sentence”. Not just the time spent looking up words, but also because the more words one has to look up, the easier it is to lose track of the meaning of the sentence. So once I add in percentage information for “+1 sentences” that require looking up only one word, it will make it clearer whether those unknown words will be a low or high hindrance when reading.
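
A rough sketch of what a "+1 sentences" percentage could look like. It assumes each sentence has already been split into a list of words (the parser is out of scope here), and the function and key names are hypothetical.

```python
def sentence_breakdown(sentences, known):
    """Percent of sentences with 0, exactly 1, and 2+ distinct unknown words."""
    counts = {"all_known": 0, "plus_one": 0, "two_or_more": 0}
    for words in sentences:
        unknown = {word for word in words if word not in known}
        if not unknown:
            counts["all_known"] += 1
        elif len(unknown) == 1:
            counts["plus_one"] += 1
        else:
            counts["two_or_more"] += 1
    total = len(sentences)
    return {key: (100.0 * n / total if total else 0.0) for key, n in counts.items()}


known = {"これ", "は", "本", "です"}
sentences = [
    ["これ", "は", "本", "です"],      # nothing to look up
    ["これ", "は", "辞書", "です"],    # one lookup
    ["あれ", "は", "本", "です"],      # one lookup
    ["あれ", "は", "辞書", "です"],    # two lookups
]
print(sentence_breakdown(sentences, known))
# {'all_known': 25.0, 'plus_one': 50.0, 'two_or_more': 25.0}
```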

I struggled through the “gen one” Pokemon manga, and I can’t even put my finger on why. I think part of it was just an amazing amount of word lookups I had to do along the way. When I eventually pick it back up with the “gen two” volumes, I hope to have a better time of it as this time I’ll be able to pre-learn words.

It’s possible there are some character, Pokemon, and location names I haven’t added to the block list yet. It’s one of the disadvantages to frequency lists, and has a greater impact in series where I haven’t blocked any character names yet.

Every time I try to go the pre-made deck route (whether to use myself, or to create for others to use) I do end up with more negatives than positives.

Sites like JPDB also have the disadvantage of misparsings that end up in decks.

Manga Kotoba falls into this category, but has it even worse as the manga OCR process can misread a kanji, and that’s what ends up on the site (or else gets excluded for producing a nonsense sentence). This likely has the greatest impact for anyone learning words that appear only one time, at which point it’ll be better to utilize frequency lists based on a combination of series/volumes one is reading.

Orange is a great series, and any high-frequency vocabulary you learn along the way will translate to so many other series (with high school being a common setting in manga).

ChristopherFritz · May 18, 2024, 5:26pm

Manga Titles and URLs

One of the earliest decisions I made for Manga Kotoba is to have series/volume names in the URL, such as:

https://manga-kotoba.com/series/日々蝶々
https://manga-kotoba.com/volume/日々蝶々-1

This differs from other sites, such as Bookwalker:

https://bookwalker.jp/series/12731/list/
https://bookwalker.jp/de7eba953e-7cd2-4286-a120-1252c46f10fd/

And Natively:

https://learnnatively.com/series/da361268a8/
https://learnnatively.com/book/963232b446/

This was a decision I spent quite some time on, and in the end decided that while I’d use a hash as an internal ID (such as 3de2b9200d and 364ed252e5 for the series/volume above), I would utilize the title in the URL.

I knew this decision might come back to hurt me later, but I didn’t want to halt the initial development of the site over trying to make a decision.

Fast forward to this past week, and I ended up with the following two series on the site:

https://manga-kotoba.com/series/magico
https://manga-kotoba.com/series/magico

And that…caused some issues.

How on earth are there two series with this same name???

For now, I’ve removed both from the site while I think of a solution. My initial thought is to add either the mangaka name or the year the first volume was published to the URL. Or maybe the label that published it?
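
As a sketch of the mangaka-name option, the slug could stay as the bare title unless another series shares it. Everything here is hypothetical: the function name, the case-insensitive comparison, and the placeholder author names are assumptions, not the site's code.

```python
from collections import Counter


def build_slugs(series_list):
    """Map (title, author) pairs to URL slugs, adding the author only on a clash."""
    # Compare titles case-insensitively so "Magico" and "magico" count as a clash.
    title_counts = Counter(title.casefold() for title, _author in series_list)
    slugs = {}
    for title, author in series_list:
        if title_counts[title.casefold()] > 1:
            slugs[(title, author)] = f"{title}-{author}"
        else:
            slugs[(title, author)] = title
    return slugs


catalog = [
    ("magico", "author-a"),  # placeholder author names
    ("magico", "author-b"),
    ("日々蝶々", "author-c"),
]
for key, slug in build_slugs(catalog).items():
    print(key, "->", slug)
```

One trade-off with this approach is that a series' URL changes the moment a namesake is added, so existing links to the bare title would need a redirect or a disambiguation page.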

Recommended Kanji

I’m playing around with different kinds of “recommended kanji to learn” lists to see what works and what doesn’t, accessible via the dashboard.

The first list I’ve created lists all unknown kanji by meaning for words marked as known.

What I’m hoping to get out of this:

More easily recognize kanji whose “meanings” I know so I can mark them as known. This means series/volume kanji meaning lists will be more targeted to what I should focus on learning without being as cluttered with items I know and haven’t marked as known yet.
See at a glance which vocabulary words I know but I’m not as comfortable with the kanji in isolation. Sometimes I know a kanji in one word, but don’t recognize it in other words.
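
The core of that first list can be sketched as below. It leaves out the kanji meanings, and it assumes kanji can be detected with the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), which misses rarer characters. The function names and sample words are illustrative only.

```python
def is_kanji(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_kanji_in_known_words(known_words, known_kanji):
    """Map each kanji not yet marked known to the known words that contain it."""
    result = {}
    for word in known_words:
        for ch in word:
            if is_kanji(ch) and ch not in known_kanji:
                result.setdefault(ch, []).append(word)
    return result


known_words = ["学校", "学生", "食べる"]
known_kanji = {"学"}
print(unknown_kanji_in_known_words(known_words, known_kanji))
# {'校': ['学校'], '生': ['学生'], '食': ['食べる']}
```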

Manga Kotoba is up to 2,057 volumes across 645 series.

Sisthra · May 18, 2024, 6:27pm

I think having the author and title in the url is going to be fine, it’s highly improbable that two mangaka with the same name publish something with the same title

sweetbeems · May 19, 2024, 6:45am

I agree that it looks cleaner and might have some small SEO benefits. I will just throw out though that if anyone copies the url from the url bar, it does create a mess due to the Japanese characters which is why I ultimately decided on simply hashes for Natively
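
For anyone curious what that "mess" looks like, Python's standard library reproduces the percent-encoding that browsers apply when copying. The URL is the series link quoted earlier in the thread.

```python
from urllib.parse import quote, unquote

readable = "https://manga-kotoba.com/series/日々蝶々"

# Keep ":" and "/" as-is; every non-ASCII character becomes %XX UTF-8 bytes.
encoded = quote(readable, safe=":/")
print(encoded)
# https://manga-kotoba.com/series/%E6%97%A5%E3%80%85%E8%9D%B6%E3%80%85

# Decoding restores the readable form.
print(unquote(encoded) == readable)  # True
```

Four characters of title turn into 36 characters of encoded text, since each one takes three UTF-8 bytes.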

You probably already considered that but if you hadn’t I wanted to note.

ChristopherFritz · May 19, 2024, 6:57am

This is the sad life for those of us who run book clubs, where we need to remove the title portion of the forum URL just to have clean-appearing links:

If I was unaware of it, I’d certainly be grateful to be learning about it now!

My earliest prototype of Manga Kotoba used hashes so I could see how I felt about them, but in the end I wanted something that looked nice in the URL bar, regardless of how it copies and pastes.

I had also considered the Discourse/Amazon route of both an ID and human readable text, but for Japanese in the URLs, I felt this combined the worst of both options.

Overall, the experience gave me an extra bit of respect for everyone who’s had to make this decision!

sweetbeems · May 19, 2024, 7:07am

Totally understandable. The hashes do always make me sad. At some point these browsers need to come into the 21st century and not urlencode unicode characters for copying! Seriously you all… cmon now.

ChristopherFritz · May 19, 2024, 7:28am

Word Density

One metric I’ve had in mind for a while to add is word density.

Regardless of the percentage of words one knows in a volume, there’s a fairly clear difference in word density between these two series:

To reflect this, I’ve added a “Density” column to series pages to include this information:

This is also factored into the “random similar difficulty series” shown on a series page. This has the benefit of showing better suggestions, but may cause some series to not show a recommendation. I’ll tweak the parameters over time to try and balance accuracy versus quantity of series that may show here.

For the curious, the lowest word density series on the site are:

And the highest are:

mitrac · May 19, 2024, 8:40am

Lots of exciting updates!

That is a great idea! It took me a while to get my head around it, but that is super useful.

Nice, I like that. Is that a word per page or what does it mean?

I noticed this problem… wonder if you are creating some strength of similarity score to choose these. Maybe give them a tangible rating like “very similar” , “a bit harder” “a bit easier” “next harder” “next easier”

Then you could prioritise what is shown based on that. So if there’s nothing similar, there could be a message that says - we haven’t found anything similar but here is the next harder manga (or one of the other categories)

I also wondered if this recommendation relies on having opened (clicked on each series) because anything I haven’t clicked on shows 0%

So maybe a prompt is needed to tell people to browse and click on them, or calculate some…

ChristopherFritz · May 19, 2024, 8:52am

Yes, average words per page across a volume/series.

It’s not perfect, as a volume can have things like:

many pages without dialogue and few pages with dense dialogue
short chapters with two “blank” pages between them

…which throw off the numbers a bit.

But I think it’ll be much more useful than not.

The main thing to know is “bigger number = more dense” and “smaller number = less dense”. If you know the word density of various manga you’ve read, you can get a better idea of how series you’re interested in reading will go.
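
As a small illustration of the metric and of how wordless pages skew it, here is a sketch that averages words per page. The `skip_empty` option is purely hypothetical, added to show the effect, and is not something the site is described as doing.

```python
def word_density(words_per_page, skip_empty=False):
    """Average words per page; optionally ignore pages with no words at all."""
    pages = [n for n in words_per_page if n > 0] if skip_empty else list(words_per_page)
    return sum(pages) / len(pages) if pages else 0.0


# Made-up volume: two blank pages, a silent page, and three dense pages.
counts = [0, 0, 40, 42, 0, 38]
print(word_density(counts))                   # 20.0
print(word_density(counts, skip_empty=True))  # 40.0
```

The same volume reads as half as dense once blank and silent pages are counted, which is the kind of distortion described above.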

My initial idea was to show a similar series, a slightly harder series, and a slightly easier series, but there typically isn’t room for three series.

But I do like the idea of showing a similar series, but if one can’t be found, trying for a slightly easier series, and if one cannot be found, showing a slightly harder series.
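
That fallback order could look something like the sketch below. It uses a single "percent known" figure per series and picks the nearest candidate rather than a random one so the result is predictable; both simplifications, along with the names and the 3-point tolerance, are assumptions.

```python
def pick_with_fallback(current_pct, candidates, tolerance=3.0):
    """Try similar first, then slightly easier, then slightly harder.

    candidates is a list of (title, known_pct) pairs. A higher known
    percentage is treated as "easier" for this user.
    """
    similar = [c for c in candidates if abs(c[1] - current_pct) <= tolerance]
    if similar:
        return "similar", min(similar, key=lambda c: abs(c[1] - current_pct))[0]
    easier = [c for c in candidates if c[1] > current_pct + tolerance]
    if easier:
        return "slightly easier", min(easier, key=lambda c: c[1])[0]
    harder = [c for c in candidates if c[1] < current_pct - tolerance]
    if harder:
        return "slightly harder", max(harder, key=lambda c: c[1])[0]
    return None


print(pick_with_fallback(70.0, [("B", 80.0), ("C", 90.0), ("D", 50.0)]))
# ('slightly easier', 'B')
```

Returning the label alongside the title would let the page say which kind of suggestion it is showing.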

I’ll ponder on this one a bit.

It does rely on a series having been “seen”, which can most easily be done by clicking through each page of the browse page.

It’s not the best method to implement this, and I’m looking into ways I can improve it.

mitrac · May 19, 2024, 8:57am

Yep that’s what I meant, create a hierarchy so something is shown. Showing all would be too overwhelming anyway and miss the point. It’s a really nice discovery feature even if imperfect

I wonder if there is some way to make a global ranking based on your density and vocab data, in which case, another discovery feature would be to suggest the next recommended title to click on (browse). So somewhere would be a little doodad that says, why not browse other titles? This title might be near your level… again, can be very imperfect but would be fun!

ChristopherFritz · May 19, 2024, 9:52am

I haven’t yet come up with something like this, but I’m still hopeful.

Although not quite the same, I do have something kind of similar in the works:

I haven’t tried anything more globally yet, as that can be tricky as different people can tolerate different amounts of dictionary lookups while reading.

ChristopherFritz · May 22, 2024, 2:35am

From my to-do list:

I’ve created the initial database and code portions for a very basic collection section.

I haven’t yet gotten to my to-do list items, but here are the current collection pages so far:

カードキャプターさくら Collection
- additionally, the Clear Card series is now split onto its own series page
WaniKani Absolute Beginner Book Club Collection
WaniKani Beginner Book Club Collection
WaniKani Intermediate Manga Club Collection
- also includes the one manga read in IBC

With this, I can start adding in various series I’ve been holding off because they involve sequel series, spin-offs, or manga adaptations (such as the Detective Conan movie manga).

I also plan to locate any other series that should be split into a collection, such as Love So Life and Life So Happy, and split them.

I’m still thinking about how I want to integrate collection discovery into the user interface.

And things like collection-wide vocabulary frequency lists will come a bit later.

Note: Currently, series a user marks as ignored will still show up when viewing a collection.
