Manga Kotoba: Manga Frequency Lists and Stats

I’ve written in my study log about various tools I’ve developed to track my progress in learning vocabulary appearing in the manga I read.

The culmination is:

Manga Kotoba

Purpose

Sites such as jpdb and Koohi do a good job at providing frequency information and word lists for novels and anime, but I’m unaware of anything for manga.

Manga Kotoba is designed for finding how readable a manga will be and which words you should learn next based on your own personal vocabulary.

Features

  • View series-level and volume-level vocabulary by frequency.
  • Track known vocabulary.
  • See your progress in learning the total vocabulary from a series and its volumes.
  • See the number of sentences you can read without looking up vocabulary.
  • Export any frequency list, useful for importing into JPDB.
  • Frequency lists for over 2,250 volumes across over 725 manga series.

Screenshots

Browse Series List

Designed for discovering material to read, the “Browse” page lists all available manga sorted by the percentage of words you know in each series.

Series Page

The series page lets you select your reading status for a series, as well as showing your vocabulary progression for the series and all (available) volumes.

Note: This screenshot shows an in-development view of the volumes. Currently, the only available volume view is a table-based view, which will remain as an option once this layout goes live. I just need to work the series vocabulary stats and link into it.

Vocabulary Frequency List

Desktop:

Mobile:

Each series and volume has a vocabulary list sorted by frequency.

User Series List

From the dashboard, you can access a list of all the series you have assigned a status to, and your progress stats. (This page is currently called “My Series”, but I’m still debating whether to go the “Your Series” route.)

Recommended Words List

From the dashboard, you can access a frequency list for all series you’ve marked as reading or a few related statuses. This lets you decide which words to learn from a single frequency list combined from everything you’re reading.

Known Words List

You can view your known words list from the dashboard. If you accidentally mark something as known, you can unmark it from here.

Import Known Words

If you have a list of known words, you can import them here.

Note: Until I add pagination to the import review page, it’s best not to add more than 200 words at a time. Also, it’s best to avoid hiragana-only words as they may match on many different kanji words.
Note: At the moment, marking words as known from this page doesn’t work. Getting it working is on my short-term to-do list.

Limitations

  • Small (but growing) selection of manga.
    • It’s mostly sourced from time-limited full-volume free previews and manga I’ve purchased.
  • The site is still in early development. Things may occasionally break or be shuffled around, and test pages/sections will come and go from time to time.
  • Limited mobile device support.
    • The most important parts are usable, but I still have more work to do.
  • No password reset feature yet.

Requesting a Series

I can add a series by request, provided you can complete these steps:

  1. Purchase a digital copy of the volumes from the series.

  2. Remove DRM and unzip contents.

  3. Install Mokuro and run it on the volume folder that contains the manga page images.

  4. Optional: Include a text file listing character names. This is used to remove character names.

  5. Optional: Include a text file listing file names of pages that do not include content (such as cover image, table of contents, あとがき, copyright page). This is used to remove unnecessary vocabulary.

  6. Compress Mokuro’s output _ocr folder and any optional files into a zip file.

    • Do not include the images folder!
  7. Share the zip file with me on Google Drive, or else send me a message on Discord at ChristopherFritz#5813 with a link to the zip file.

Note: I don’t know Discord very well, and I’m only on it a few times a month. I’m not certain if I can actually receive messages from random people. But leave a comment in this thread, and we can coordinate.

Background

Ever since the release of manga-ocr (and later Mokuro), I’ve been creating various tools for tracking my known vocabulary and comparing it with the text in manga.

An early iteration used Google Sheets, but it required too many manual steps to be very accessible.

The next iteration used simple web pages, and tracked progress in your browser’s local storage. (See the [details]-hidden section below for the original posting on that site.)

This latest (and final!) iteration combined ease of adding content with the benefit of server-side storage for user progress.


Note: Previously, this thread was specific to my old site where frequency lists and word tracking were tacked on and had limited usage. You can see parts of my original posting here.

The frequency lists will eventually be removed from my older site. In the meantime, here is how the old site differs from the new:

  • No user account is required.
  • Supports tracking known kanji.
  • Supports viewing kanji frequency information.
  • Shows stats for how many unique highest-frequency words you need to learn to reach certain percentages of overall known words.
  • Integrated SRS (extremely basic).
    • Note: If you’re using another SRS application such as Anki, Kitsun, or Migaku, I recommend using that.
  • Displays unique words known for a volume.
  • Volume-level sentence stats include:
    • Number of sentences where you will need to look up only one word.
    • Number of sentences where you will need to look up multiple words.

Screenshots

List of series marked as “Currently Reading”, with known words percentage based on 200 high-frequency vocabulary words marked as known.

Stats for a single volume, and the next highest-frequency vocabulary words to learn.


SRS page, showing available reviews.

Limitations

  • Progress is stored in your browser’s local storage. This means your progress cannot easily be transferred between browsers or computers.
  • This site has not been tested on mobile devices.
  • Limited manga selection. (Nothing new is being added.)

How to Use

  1. Open the list of manga series.

  2. Find a series you are reading or are interested in reading.

    • You can select the :open_book: icon to move the series into a “Currently Reading” table.
  3. Select the series title to open the stats page.

  4. On the stats page, under Vocabulary Stats, select the “Vocabulary List” link for the series or a specific volume. (Optionally, there are kanji links below the table.)

  5. On the frequency list page:

    • Select “Mark as Known” to mark a vocabulary (or kanji) as known. Stats at the top of the table increase for each item you mark as known. (Low-frequency items may not change percentages due to rounding.)
    • Select “Add to SRS” to add a vocabulary or kanji card for review.
  6. If using the built-in SRS, do reviews on the SRS page.

Once you have viewed a series’ vocabulary page, its percentage of known vocabulary shows on the series list page. This number updates only when you view the series frequency list. Select the percentage number to open this page and update the number.

Known Issues

  • The kanji frequency pages currently display empty known word charts.
  • When an SRS card is marked as “unknown” during review, it shows as immediately available to review, even though there is a ten-minute wait before it can be reviewed again.
    • Cards marked “Unknown” should become immediately available for review again once the number of pending cards reaches zero. This is not fully implemented yet.
26 Likes

That website of yours is truly great. I sometimes use it to check the vocabulary before reading a manga, but as I use multiple devices I haven’t really tried the SRS. Allowing users to sync their progress between devices would be great. What I do instead is export your lists to JPDB when I want to SRS them. It works quite well, although there are still some quirks. Keeping the frequency of the vocab is great, compared with exporting to Anki, where I would have no way to SRS the words by frequency. You could probably provide an export yourself fairly easily if you wanted to, as you already have all the information you need.

Thanks for your work. If you add a “buy me a coffee” link, I will gladly oblige.

4 Likes

This certainly is a weak point of the site.

The reason is that the site is entirely static HTML pages and JavaScript. There’s no server-side programming, login management, database, etc.

How are you doing the import? Are you copying from the web page, or perhaps using the ODS spreadsheet file, and pasting it into the “New deck from text” option on the “Learn” page?

Maybe I can add a Javascript button that gives you the text with the necessary fields formatted for importing into JPDB. (Maybe low priority, though.)

Or, do you prefer Anki, and you’re using JPDB because it’s easier to get the data into? I’d have to look into ways to generate a deck for Anki, but that’s always an option. (Again, low priority.)

Since I don’t drink coffee, I’ll have to wait for donation sites to develop a “buy me a manga” link :wink:

9 Likes

Yeah, you already mentioned that somewhere in this forum, I think. Adding it when you don’t have a use case for yourself isn’t really a priority, and hosting a service with server-side processing opens the door to a lot of possible issues. On my side, I could probably reap the website and add a way to sync the progress with a server-side service hosted on my own server, but that’s not something someone without technical knowledge can do easily.

One other way to do it, would be to sync the progress on something like Google Drive but using its API from Javascript is painful.

For the export, I’m generating text from the vocabulary list (one word per line), repeating each word as many times as its frequency, and pasting it into JPDB’s “New deck from text” field. It lets me preserve the frequency and works well enough to be usable. Sometimes I do it from the CSV, and sometimes directly from your HTML. I’m doing it in Python, but it could easily be done in JavaScript. I should have thought of developing a user script instead.
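A minimal sketch of that repeat-by-frequency trick, assuming the frequency list has already been read into a word-to-count mapping. The function name and the sample counts are made up for illustration; this is not the script actually used.

```python
def repeat_by_frequency(frequencies):
    """Return text with one word per line, each word repeated `count` times."""
    lines = []
    # Highest-frequency words first; ties keep their original order.
    for word, count in sorted(frequencies.items(), key=lambda item: -item[1]):
        lines.extend([word] * count)
    return "\n".join(lines)


# Hypothetical sample counts, not taken from a real volume.
sample = {"専門店": 2, "元気": 3, "無職": 1}
deck_text = repeat_by_frequency(sample)
print(deck_text)
```

The resulting text is what would be pasted into the deck field, so a word that occurs three times in the source simply appears on three lines.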

I’m using JPDB instead of Anki because in JPDB deck there is the frequency of the vocabulary and you can order your new lessons by frequency and also as the knowledge is shared between decks, I know the coverage of each deck beforehand. I mentioned Anki because some people would probably love to have Anki decks available directly, but I’m not one of them.

“buy me a coffee” is just a nice way of accepting donations, you don’t need to drink coffee to use it but you probably already know that.

3 Likes

On the off chance that anyone’s been giving my site’s built-in SRS a try, I did something that should be a one-time thing:

I’ve replaced the spacing algorithm, which, unfortunately, wipes out any SRS progress.

The old algorithm was my implementation of WaniKani’s algorithm.

The new algorithm is the FSRS algorithm recently implemented in Anki to bring it to the 21st century.

I still recommend using any other SRS than the one built into my site, but if someone uses my site’s barebones SRS, it should be a little bit better now.

4 Likes

I’ve recently added an experimental feature that puts a “Track Volume” option on volume pages.

Tracking a volume serves two purposes:

  1. The tracked volumes page lists all tracked volumes with the percent of total known words and the percentages for sentences (all words known, one unknown word, multiple unknown words), sorted from easiest to most difficult.

I use this to understand better how the first volume of a series I’m considering reading compares with the current volumes I’m reading from other series.

  2. Viewing the tracked volume page updates an internal (browser’s local storage) list of recommended words based on the highest frequency words across tracked volumes. When viewing a frequency list, words from this internal recommended list include the word “recommended” in the SRS column.

As I learn the highest-frequency words from a volume, I find myself down to words that appear only once or twice in that volume. It becomes difficult to decide what to learn next. The “recommended” label lets me know which words appear more frequently across everything I’m tracking, so I know I will likely see them again sooner.
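The idea can be sketched roughly as follows: add up each word’s count across all tracked volumes, drop the words already known, and take the top of what remains. The site itself does this in browser JavaScript, so the Python below is only an illustration, and every name and number in it is hypothetical.

```python
from collections import Counter


def recommended_words(tracked_volumes, known_words, limit=10):
    """Return the most frequent unknown words across all tracked volumes."""
    totals = Counter()
    for volume_counts in tracked_volumes.values():
        totals.update(volume_counts)  # adds this volume's counts to the totals
    for word in known_words:
        totals.pop(word, None)        # known words are never recommended
    return [word for word, _ in totals.most_common(limit)]


# Hypothetical per-volume frequency lists.
tracked = {
    "volume-a": {"元気": 1, "無職": 2},
    "volume-b": {"元気": 2, "子供": 1},
}
recommended = recommended_words(tracked, known_words={"子供"})
print(recommended)
```

Here 元気 appears only once in the first volume, but it ranks first overall because it shows up in both tracked volumes.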

7 Likes

I’ve enjoyed seeing this tool develop; it’s so useful! Congrats on graduating it to having its own thread!

So far it behaves the same for me on mobile (Android Samsung).

3 Likes

I’ve started a side project to branch off the frequency lists/stats from my main Japanese learning website (where the static file generator I use crumbles under the massive number of files), moving it to a separate website that will allow for user login and will store vocabulary tracking on the server.

This will solve the problem of using it across devices.

I don’t know when I’ll have it ready to launch, as I’m learning many new things to make it work, but when it’s available, I’ll post it here.


For the curious, what I’ve been working on this weekend is creating a site built on NodeJS using AdonisJS, with a PostgreSQL backend. (All things I started out completely unfamiliar with.)

So far, this setup delivers frequency lists much faster than my current “load the whole list, then use Javascript to hide known words” approach on the static site.

One of my early decisions was to base handling words on JMDict’s word list and ID numbers. This does result in an issue where a manga may have the word 「りんご」, but storing only the ID doesn’t retain that it’s 「りんご」 and not 「林檎」, nor that marking it as known should apply to 「りんご」 and not 「林檎」. I don’t yet have a decent solution for this, although I have some vague ideas on what to try.

This design also makes it difficult to transfer a known-words list from the current site to the new one. I haven’t yet started thinking that one through.

Early development screenshots:

Series list page.

Series the user is currently reading, has finished reading, etc. would show on a different page.

Series page.

I haven’t even started working out how to handle the stats display.

I’ll probably store the stats data in JSON format in the database since these stats generally don’t need to be recalculated unless I add new words to the block list. And I can always have an admin function to recalculate them if needed, or recalculate them locally to sync back to the server.

Volume page.

Pending: Adding admin icons to add a word to the block list. This is useful for character names, misparsings, and results such as 「あああああ」 that probably shouldn’t count toward a volume’s word total.

Now I just need to force myself not to work on this on weekdays, since I’ll need those days to catch up on the daily reading I haven’t been getting done over the weekend.

2 Likes

Work toward eventually moving the frequency lists from my main site onto their own site has progressed a little over the weekend, as I have an alpha release online:

It has basic support for creating an account, logging in, and marking words as known.

I probably won’t be able to carry over a user’s known words list from the current site due to a change in how I track known words, although I will check to see if I can work something out.

There’s a lot more to do, but that’s for another weekend.

3 Likes

wow, that is a big update, very nice! Really looking forward to it!

2 Likes

Looks exciting : ) お疲れさまでした!

1 Like

Pre-weekend update for the site that will become my primary location for frequency lists:

Searching

A basic search bar has been added to the header:

I plan to add pagination to the series list later, which requires a method to search.

Progress Bars

I’m starting to add progress bars to series and volume pages. These will appear in more locations once I implement caching for their values.


Series Vocabulary Page


My focus for this weekend includes:

  • Clean up some code. :white_check_mark:
  • Improve SQL queries. :white_check_mark:
  • Adding more progress bars.
  • Get the “Mark Known” button working on the series vocabulary page. :white_check_mark:
  • See if I can get the progress bars to auto-update when words are marked as known.
  • Add caching for calculated values.
  • Add pagination. :white_check_mark:
5 Likes

While I still have some to-do items for last weekend pending, a few other recent updates:

  1. I have added an experimental dashboard. There’s a lot of work pending on it.

  2. Series pages allow setting a series-level reading status.

    • I plan to add an option to hide series from the browse section based on status later.
    • The dashboard lists all series with a status.
      • The results aren’t yet sorted, so the order within each status is currently random.
  3. I’ve added a page for importing known words (accessed via the dashboard).

    • This works best for importing known words that contain kanji or are katakana. It’s not very useful for putting in hiragana words.
    • I don’t have a “mark all known” button implemented, so items currently have to be confirmed one by one that you want them marked as known.
    • This allows bringing one’s known words list over from my older tracking site, one of the necessities before I can retire the frequency list pages on the old site.
    • This will also be useful for me personally when I mark words as known in Migaku so that I can take my known words list from Migaku and import it into here.
      • That is, if I can figure out how to access my known words list in Migaku…
        • And I still need to make a user’s known words visible and list downloadable on Manga Kotoba…
  4. Further under-the-hood code cleanup as I learn more about how to do things in AdonisJS, Lucid, and PostgreSQL.

2 Likes

Further weekend updates:

Export Wordlist

Each series page now has a (barebones) “Export Wordlist” link.


This allows exporting all words from a series (based on available volumes) that you have not marked as known:

Grouping words gives you a list of words and their frequency counts:


Note: At this time, there is no option for including English definitions in the export.

Leaving words ungrouped gives a list of all words in order of appearance (volume, page number):


This is useful for creating a deck in JPDB.

Tracking

A clipboard/chart icon has been added to represent words being “tracked”.

This doesn’t do anything on the site, aside from showing the icon as green when tracking and yellow when not tracking.

But it can be useful to mark a word as being utilized outside of the site, such as marking words you have in Anki or another SRS application.

That way when looking for words to add to your SRS application, you’ll know which ones you’ve already added.

(At least, that’s my own use I wrote it for. So I know which words I’ve created cards for in Migaku.)

2 Likes

Midweek additions to the new site:

Known Words List

Added a “Known Words” list to the still-unorganized dashboard:


Until I figure out a good way to support an “undo” function for accidentally marking a word as known, this gives a place to unmark words as known. (Most recently marked words appear first.)

Full Descriptions

Added an option (when logged in) to show full descriptions rather than just the first description in the English column:

Mobile Layout

Mobile has upgraded from a third-class citizen to a third-class citizen that gets to hang out with its second-class friend, as I’ve made frequency tables easier to use on small screen devices:

There’s still a ton of work to do for the mobile layout, which is near the bottom of my to-do list.

But this does get me one step closer to being able to supplement my bus commute manga reading with (eventual) smartphone look-ups that I can do right on the site, check and see if the word has a high enough frequency across what I’m reading, then copy the word and paste it into Migaku to make an SRS card for me.

6 Likes

Whoa, this is some fascinating stuff! Did you end up switching to Ichiran or are you using Juman++/MeCab? Is there a GitHub repo I could have a poke around in? I am really curious how you are managing all this data with all this cross-referencing! :grin:

1 Like

I did indeed switch to Ichiran after you linked me to the docker installer. I’d still be on Juman++ if it wasn’t for that!

I don’t have a GitHub repo for this, but the process is essentially this:

  1. Run the volume through Mokuro.
  2. Extract the sentences from Mokuro’s JSON files into a single text file for convenience.
    • I use this code to create a text file with image file names and extracted sentences.
  3. Run that text file through Ichiran:
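import json
import requests

# inputFile: tab-separated "image file name<TAB>sentence" lines from the Mokuro extract.
with open(inputFile, 'r', encoding='utf-8') as inputStream:
    output = []
    for fileLine in inputStream:
        # Split on the first tab only, in case the sentence itself contains a tab.
        page, line = fileLine.split('\t', 1)
        if not line.strip():
            continue
        # Ask the local Ichiran web service to segment the sentence.
        result = requests.post('http://localhost:3005/segmentation', json={"text": line.strip()})
        output.append(f"{page}\t{json.dumps(result.json(), ensure_ascii=False)}\n")
# outputFile: one line per sentence, "image file name<TAB>Ichiran JSON".
with open(outputFile, 'w', encoding='utf-8') as outputStream:
    outputStream.writelines(output)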
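For step 2, a rough sketch of turning one Mokuro page file into those tab-separated lines might look like this. It assumes each page’s JSON has a "blocks" list whose entries carry a "lines" list of strings, which is how I understand Mokuro’s _ocr output; check it against your own files. The function name is hypothetical, and this is not the code linked in step 2.

```python
def page_to_lines(image_name, page):
    """Return one 'image name<TAB>text' row per text block on a page."""
    rows = []
    for block in page.get("blocks", []):
        # A block's text is split across several short lines; join them.
        text = "".join(block.get("lines", []))
        if text:
            rows.append(f"{image_name}\t{text}")
    return rows


# Hypothetical parsed page, shaped like the assumed Mokuro format.
sample_page = {"blocks": [{"lines": ["でも専門店", "かぁ:"]}, {"lines": []}]}
rows = page_to_lines("i-126", sample_page)
print(rows)
```

In practice you would load each page with json.load and write the collected rows for the whole volume to a single text file.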

This gives me a file for the volume that contains the original image file name (which I have separate code to extract a page number out of) and the Ichiran output per Mokuro-identified block of text, such as:

Mokuro extract:

i-126 でも専門店かぁ:

Ichiran extract:

i-126 [[[[["demo", {"reading": "でも", "text": "でも", "kana": "でも", "score": 48, "seq": 1008460, "gloss": [{"pos": "[conj]", "gloss": "but; however; though; nevertheless; still; yet; even so; also; as well"}, {"pos": "[prt]", "gloss": "even"}, {"pos": "[prt]", "gloss": "however; no matter how; even if; even though"}, {"pos": "[prt]", "gloss": "… or something"}, {"pos": "[prt]", "gloss": "either … or …; neither … nor …", "info": "as 〜でも〜でも"}, {"pos": "[pref]", "gloss": "pseudo-; quack; in-name-only", "info": "before an occupation, etc."}, {"pos": "[pref]", "gloss": "for lack of anything better to do", "info": "before an occupation, etc."}]}, []], ["semmonten", {"reading": "専門店 【せんもんてん】", "text": "専門店", "kana": "せんもんてん", "score": 782, "seq": 1389930, "gloss": [{"pos": "[n]", "gloss": "specialist shop; shop specializing in a few types of product"}]}, []], ["ka", {"reading": "かぁ", "text": "かぁ", "kana": "かぁ", "score": 0}, []]], -170]], ": "]
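To pull just the word ID and surface text out of a result like this, one option is to walk the nested lists and pick up every dictionary that has both a "seq" and a "text". This is only a sketch: the sample is a trimmed-down version of the output above, the function name is hypothetical, and it has not been checked against compound words or entries with alternative readings.

```python
import json


def iter_words(node):
    """Yield (seq, text) for each word entry found in a nested Ichiran result."""
    if isinstance(node, dict):
        if "seq" in node and "text" in node:
            yield node["seq"], node["text"]
        else:
            # Entries without a "seq" may still contain nested word entries.
            for value in node.values():
                yield from iter_words(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_words(item)


# Trimmed sample: glosses and most other fields removed.
sample = (
    '[[[[["demo", {"text": "でも", "seq": 1008460}], '
    '["semmonten", {"text": "専門店", "seq": 1389930}], '
    '["ka", {"text": "かぁ", "score": 0}]], -170]], ": "]'
)
words = list(iter_words(json.loads(sample)))
print(words)
```

The かぁ entry has no "seq", so it is skipped, as is the plain string ": " at the end.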

For my original site (Japanese by Example), I then extract the first English translation for a word and put that onto the site, such as seen in this random example not from the Ichiran text above:

For the newer site (Manga Kotoba), logged in users can switch between the first definition and the entire definition:

image


Vocabulary and sentence tracking is mostly the same between the old site and the new with two exceptions:

  1. The old site tracks known words based on the reading of the word, whereas the new site tracks based on the JMDict ID number + the reading of the word.
  2. The old site uses client-side storage and the new site uses server-side storage.

The old site stores this in the user’s browser’s local storage, which worked well but with limitations. Your progress cannot carry over between your desktop computer and your smartphone, for example.

This means for the old site, you can see my entire word tracking for known words and known sentences by poking around the client-side JavaScript that runs on each page and the JSON files it calls (for known sentence tracking).

The new site stores user data in a server-side database (PostgreSQL), so you can access your progress from any device. Here, the heavy lifting is done on the server, so it’s not visible to anyone aside from me. (Considering I’m still learning JavaScript, and am completely new to NodeJS, AdonisJS, Lucid, EdgeJS, and PostgreSQL, it’s probably for the best no one can see what sort of abomination I’ve cobbled together.)

The old site did it rather poorly, as the user had to re-visit every page to update their stats for the series/volume.

The newer site uses PostgreSQL to store data. With data from 976 volumes across 238 series, the database sits at a little under 220 MB in size (the majority of which is likely JMDict).

Using a database means it’s fairly trivial to track when the user last marked a word as known/unknown, and use this to update the cache for their progress across series.

I’m still working out how to handle some things, such as tracking how many +1 and +n sentences a volume and series has (these being the sentences you can read with only one or with multiple word lookups).

Once I get around to working that out, and a few other minor bits, I’ll be on a quest to think up anything else useful that I can do with the existing data or by extracting more.

2 Likes

How are you finding the speed? I ran a novel through it the other day and it took 50 minutes :flushed:

Cue epic ChristopherFritz-style write-up :grin: Thanks for taking the time for that! I’ve been learning about web dev for the last couple of months (because I desperately need a career change) and I find projects the best way to learn new things. Next up is PostgreSQL, so I figured I might try playing around with Ichiran, Mokuro and frequency lists, but there are so many cross-references going on that it’s turning out to be somewhat complex :sweat_smile: Oh well, all the more educational :muscle:t2:

Ahhh, that’s clever! I was wondering how you were managing to do page-by-page and even line-by-line analysis! So when you do a frequency list from p. 50 onwards, you just ‘calculate’ it the same way as you would for the entire volume, but you leave out all entries where the page < 50. For some reason I was imagining I’d run the whole thing through Ichiran in one go (like I do with novels), but this makes a lot more sense, because you retain a lot of valuable information this way! :grin:

How did you set up that JMDict database? Did you use JMDictDB for that? I downloaded the raw JMDict data and was a bit overwhelmed with how cryptic the format was lol

I reckoned I might not need a whole JMDict database if Ichiran gives me the info I am looking for (a unique identifier, a reading and an English translation), but then again it is a bit silly to use JMDict IDs all over the place without them having a ‘home’ of their very own :grin: And I guess you never know when you might want to access the full power of JMDict!

Oooo, that’s interesting! I hadn’t even thought about caching yet!

Thanks so much for this write-up, it’s given me a lot to think about! It’s really inspiring to see all the information you manage to wring out of some Mokuro JSON files :star_struck:

You are an insatiable data fiend :grin:

1 Like

What’s that? My last post was too short, you say? I should write longer replies, you say?

Reply compacted for length.

I posted somewhere an example of the time it takes to run compared with Juman++, although I can’t find it now.

Basically, it was like that old fable about the tortoise and the hare, except the tortoise (Ichiran) is taking a nap while the hare (Juman++) speeds through the course.

But, the advantage of getting vastly better parsing is worth it for me. With Juman++, my top issue was Juman++ misparsing text. With Ichiran, my top issue is Mokuro misreading text.

All that said, I feel a novel would take much longer than 50 minutes for me, and I have fairly okay hardware. I do wish there was a way I could query Ichiran without it being via an HTTP request, if a more direct request would be faster. But maybe that HTTP interface isn’t much of a bottleneck.

I’ve met many people who try to learn new things, struggle to retain the knowledge, and have zero interest in creating a project that requires/utilizes that knowledge. Or, feel they don’t have time to, which could very well be the case.

I’m grateful that I’m the type of person who absolutely wants to do projects to apply what I learn, and that I can make the time for it (a luxury not everyone has).

The only reason I actually started developing the new site to move my frequency lists to is because of the opportunity to learn.

I wanted to learn NodeJS and MongoDB.

As it turns out, MongoDB would be wholly inappropriate for this site. (But I have other personal projects where I may be able to utilize it.)

Thankfully, PostgreSQL was also on my “I want to learn” list, alongside getting better at developing tables and queries. (I have a limited background in MySQL and SQLite.)

You may know this already, but Ichiran uses data from JMDict, so you get a word ID for each entry Ichiran outputs. This is the “seq” field.

I’ve also loaded in data from JMDict into a separate table.

From here, the way I manage data is essentially:

  1. Extract text with Mokuro.
  2. Parse text with Ichiran.
  3. For each entry from Ichiran, take the “seq” and the “text”.

The result is:

 volume_id  | dictionary_id | reading | page_number | line_number 
------------+---------------+---------+-------------+-------------
 ed704a9178 |       2029110 | な      |          58 |           2
 ed704a9178 |       1307850 | 子供    |          58 |           2
 ed704a9178 |       2133450 | かよ    |          58 |           2
 ed704a9178 |       1007660 | ちゃん  |          58 |           3
 ed704a9178 |       2028920 | は      |          58 |           3
 ed704a9178 |       1260720 | 元気    |          58 |           3
 ed704a9178 |       1530190 | 無職    |          58 |           3
 ed704a9178 |       2028990 | に      |          58 |           3
 ed704a9178 |       2086960 | って    |          58 |           3
 ed704a9178 |       1004200 | けど    |          58 |           3

Each row in this table includes:

  • a hashed ID to uniquely refer to the volume the words are from
  • the JMDict ID number of the word
    • (I can join this to the JMDict table to extract the same kanji, kana, and meaning information as Ichiran provides.)
  • reading
    • I separate entries for different readings. You can track based on the JMDict ID if a word is known, but then marking ねこ as known also means 猫 and ネコ are considered known. If you don’t know the kanji and don’t know katakana, this masks things not yet learned. Thus my treating them separately by reading as it appears in the source material.
  • page number
  • line number
    • The page number and line number are used to track if a sentence contains any unknown words, used to derive the “you know all the words in this sentence” stats.
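As a rough illustration of how rows like these can produce the sentence stats, the sketch below groups words by (page, line) and counts how many of each sentence’s words are missing from a known-words set keyed by (JMDict ID, reading). The site does this in PostgreSQL; the Python here, including the function name and the choice of known words, is only an assumption-laden stand-in.

```python
from collections import defaultdict


def sentence_stats(rows, known):
    """Count sentences with zero, one, or several unknown words."""
    unknown_per_sentence = defaultdict(set)
    for dictionary_id, reading, page, line in rows:
        sentence = unknown_per_sentence[(page, line)]  # registers the sentence
        if (dictionary_id, reading) not in known:
            sentence.add((dictionary_id, reading))
    stats = {"all_known": 0, "one_unknown": 0, "multiple_unknown": 0}
    for unknown in unknown_per_sentence.values():
        if not unknown:
            stats["all_known"] += 1
        elif len(unknown) == 1:
            stats["one_unknown"] += 1
        else:
            stats["multiple_unknown"] += 1
    return stats


# A few rows from the table above: (dictionary_id, reading, page, line).
rows = [
    (2029110, "な", 58, 2), (1307850, "子供", 58, 2), (2133450, "かよ", 58, 2),
    (1007660, "ちゃん", 58, 3), (1260720, "元気", 58, 3), (1530190, "無職", 58, 3),
]
# Hypothetical known words; 無職 is deliberately left out.
known = {(2029110, "な"), (1307850, "子供"), (2133450, "かよ"),
         (1007660, "ちゃん"), (1260720, "元気")}
stats = sentence_stats(rows, known)
print(stats)
```

Because the known set is keyed on both the ID and the reading, knowing ねこ would not count 猫 as known in this sketch.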

From there, most everything is based on the JMDict ID and sometimes the reading.

For example, I block character names from the frequency list by adding the series ID and JMDict ID to a blocked words table.

Tracking known words places the JMDict ID and reading of the word into a table, alongside the user ID.

And so on.

Correct.

This is a feature on the old site that I don’t currently plan to carry over to the new site. (It only partially works on the old site anyway, and the implementation has issues…)

While I think it’s useful to be able to say “I’m on page 50, so show me frequency list information for only page 50 to the end”, the implementation requires the overhead of translating the numbers in image file names to the actual page numbers printed on the page for each series. On top of that, many manga these days don’t even include page numbers in the digital release, so it becomes a matter of the reader app showing a page number that may be inconsistent from Kindle to Kobo to BookWalker…

With the speed of Ichiran, you should only ever run a novel (or in my case, manga) through it once.

You want to capture enough information that you can do what you want without ever running it a second time.

For me, this means saving not only the output from Ichiran, but which file name it came from. (Then I count out line numbers when I prepare it to load into the database.)

The XML file download was a non-starter for me, so I downloaded it in JSON format from jmdict-simplified:

I then reformatted that into a tab-delimited file that I could load into PostgreSQL using the database’s \copy command:
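The script itself isn't shown here, but a minimal sketch of that kind of conversion might look like this. It assumes the jmdict-simplified layout of a top-level `words` array whose entries carry `id`, `kanji`, `kana`, and `sense` (with `gloss` texts); the four-column output, the function names, and the file names in the comments are assumptions for illustration, not the layout actually used on the site.

```python
import json

def clean(value):
    # PostgreSQL's text COPY format treats backslash, tab and newline specially,
    # so escape the first and flatten the other two.
    return value.replace("\\", "\\\\").replace("\t", " ").replace("\n", " ")

def jmdict_to_rows(data):
    """Yield one tab-delimited line per entry: id, kanji, kana, meanings."""
    for word in data["words"]:
        kanji = "; ".join(k["text"] for k in word.get("kanji", []))
        kana = "; ".join(k["text"] for k in word.get("kana", []))
        meaning = "; ".join(
            g["text"] for s in word.get("sense", []) for g in s.get("gloss", [])
        )
        yield "\t".join(clean(v) for v in (str(word["id"]), kanji, kana, meaning))

# A tiny hand-written sample in the assumed shape.
sample = json.loads("""
{"words": [{"id": "1467640",
            "kanji": [{"text": "猫"}],
            "kana": [{"text": "ねこ"}, {"text": "ネコ"}],
            "sense": [{"gloss": [{"text": "cat"}]}]}]}
""")
print(list(jmdict_to_rows(sample)))

# For the real file (hypothetical file names):
# with open("jmdict-eng.json", encoding="utf-8") as src, \
#      open("jmdict.tsv", "w", encoding="utf-8") as out:
#     out.writelines(row + "\n" for row in jmdict_to_rows(json.load(src)))
```

The resulting file could then be loaded from psql with something like `\copy jmdict FROM 'jmdict.tsv'`, where `jmdict` is a hypothetical table with four matching columns.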

You could build a JMDict table as you go based on the Ichiran results. For example, load your Ichiran output into a temp table, join with a JMDict table on the Ichiran “seq” and JMDict ID, and for rows where the JMDict ID is null, insert rows from the temp table. Then drop the temp table, and import only the “seq” and “reading” from Ichiran (and your novel ID) into your novel table.
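Those steps can be sketched roughly as follows, with Python's built-in `sqlite3` standing in for PostgreSQL so the example is runnable. Table names, column names, and the `load_ichiran_rows` helper are all hypothetical, and the novel rows are copied over before the temp table is dropped.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE jmdict (id INTEGER PRIMARY KEY, kanji TEXT, kana TEXT, meaning TEXT);
CREATE TABLE novel_word (novel_id INT, jmdict_id INT, reading TEXT);
""")

def load_ichiran_rows(db, novel_id, rows):
    """rows: (seq, reading, kanji, kana, meaning) tuples parsed from Ichiran output."""
    db.execute("""CREATE TEMP TABLE ichiran_tmp
                  (seq INT, reading TEXT, kanji TEXT, kana TEXT, meaning TEXT)""")
    db.executemany("INSERT INTO ichiran_tmp VALUES (?, ?, ?, ?, ?)", rows)
    # Add dictionary rows only for seq values the jmdict table doesn't have yet.
    # MIN() just collapses repeated occurrences of the same seq into one row.
    db.execute("""
        INSERT INTO jmdict (id, kanji, kana, meaning)
        SELECT t.seq, MIN(t.kanji), MIN(t.kana), MIN(t.meaning)
        FROM ichiran_tmp AS t
        LEFT JOIN jmdict AS j ON j.id = t.seq
        WHERE j.id IS NULL
        GROUP BY t.seq
    """)
    # The novel table keeps only the ID and the reading as it appeared.
    db.execute("""INSERT INTO novel_word (novel_id, jmdict_id, reading)
                  SELECT ?, seq, reading FROM ichiran_tmp""", (novel_id,))
    db.execute("DROP TABLE ichiran_tmp")

load_ichiran_rows(db, 1, [
    (100, "猫", "猫", "ねこ", "cat"),
    (100, "ねこ", "猫", "ねこ", "cat"),
    (200, "行く", "行く", "いく", "to go"),
])
load_ichiran_rows(db, 2, [(100, "ネコ", "猫", "ねこ", "cat")])

print(db.execute("SELECT COUNT(*) FROM jmdict").fetchone()[0])      # 2
print(db.execute("SELECT COUNT(*) FROM novel_word").fetchone()[0])  # 4
```

The second load adds a word occurrence but no new dictionary row, because seq 100 is already in the table.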

I planned to do this originally, but I decided not to bother and just load in the whole JMDict. While it might seem absurd to load in over 200,000 rows when you may never use over two thirds of it, since the ID column is indexed, queries joining with the table are faster than I feel should even be possible.

Be careful that you’re not storing the English definition every time Ichiran gives it to you or else you’ll end up with the same definition in multiple places, and that will cause your database size to inflate more than you need to. (You may already know this, but I’m mentioning it just in case.)

Caching is great because some calculations are “expensive” in that it takes time to run them.

For Manga Kotoba, it might take five seconds (made-up number) to calculate all the stats on a series or volume. This might not sound like much until you’re looking through multiple pages, and every 12 pages you’ve been made to wait a combined minute. Caching the results takes this wait time down to zero seconds.

But then when should the cache be updated?

Obviously when it’s invalidated because the user has marked a word as known.

However, we don’t want to recalculate stats for 960 manga volumes when the user is only interested in stats for three series.

Yet, they may use those stats on other series to find something new to read.

I like what I have in place now, for performing updates, but it’s something I’m still working on improving.
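One possible way to balance these trade-offs, shown purely as an illustration and not as what Manga Kotoba actually does, is to invalidate lazily: marking a word only bumps a per-user counter, and a series' stats are recomputed the next time someone asks for them. The `StatsCache` class and `slow_stats` function below are hypothetical.

```python
from collections import defaultdict

class StatsCache:
    """Recompute a series' stats only when they are requested after a change."""

    def __init__(self, compute):
        self._compute = compute            # expensive: (user_id, series_id) -> stats
        self._version = defaultdict(int)   # user_id -> number of vocabulary changes
        self._entries = {}                 # (user_id, series_id) -> (version, stats)

    def word_marked_known(self, user_id):
        # Cheap: nothing is recalculated here, older entries just become stale.
        self._version[user_id] += 1

    def get(self, user_id, series_id):
        current = self._version[user_id]
        entry = self._entries.get((user_id, series_id))
        if entry is None or entry[0] != current:
            entry = (current, self._compute(user_id, series_id))
            self._entries[(user_id, series_id)] = entry
        return entry[1]

calls = []

def slow_stats(user_id, series_id):
    calls.append(series_id)            # record each real computation
    return {"series_id": series_id}    # placeholder for the real stats

cache = StatsCache(slow_stats)
cache.get(7, 1)
cache.get(7, 1)             # served from the cache
cache.word_marked_known(7)
cache.get(7, 1)             # stale, so computed again
print(calls)                # [1, 1]
```

Series the user never opens are never recalculated. The cost is that a page listing every series would still trigger a recomputation for each stale one, so a finer-grained scheme might only mark series that actually contain the newly known word.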

As I see my personal stats for manga volumes, it makes me want to add novels and anime subtitles. I actually specifically named the new site Manga Kotoba to ensure I focus on doing one thing (more or less) well, without trying to out-compete sites like JPDB.

Speaking of other sites, I wish more sites with known-word tracking had decent (or even any) export features.

It’s something I’m still working on (I have it on my old site, but I’m still figuring out how I want it on the new site). It’d be nice if it were easier to take one’s known words from one site and sync them up with another site. Granted, not every site tracks words the same way (even my Japanese By Example site and Manga Kotoba have incompatible ways of tracking known words).

I’m taking a lot of inspiration from Natively’s way of doing things (hopefully @sweetbeems doesn’t mind; there’s zero crossover between what Natively and my site do!), and I’m really impressed with the look of his “Data Download” page. (Well, with his whole entire settings page in general!)

Might as well compact mine too!

Yeah, it has a fair bit of trouble with シメジシミュレーション, for example, while I am feeding it the official (legal!) Amazon version. So I’d imagine that to be pretty high quality!

On the original Ichiran Github there is also a Docker option, which gives you a container without the server. You could run docker exec -it ichiran-main-1 ichiran-cli -f "一覧は最高だぞ" on it to run the command line tool directly.

Ahhh that’s the direction I am coming from! I started with a microblog project, which I found MongoDB pretty good for! It’s very beginner-friendly! But I found that local job ads are quite overwhelmingly in favour of Postgres (and SQL in general), so now I want to learn that too!

Very illuminating! I was getting stuck on the bit where I have Ichiran output (and a lot of it too) and how to turn it into a frequency list. My first instinct was to loop over it in Node and tabulate totals, but with the page and line information it was getting ever more complicated. But of course you could just dump all that info into Postgres too!

One thing I am wondering: would it make sense to have some sort of sequential numbering so it is possible to maintain incidence order of the vocab? Then it could also be used to generate (book club) vocab lists, right? Not super relevant for me personally (yay Yomitan!), but it might be neat!
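A quick sketch of what that could look like, assuming each stored row also carried a position within its line (a hypothetical extra column; page and line alone would only give a coarse order):

```python
def first_appearance_order(rows):
    """rows: (jmdict_id, reading, page, line, position) tuples, in any order.

    Returns each (jmdict_id, reading) once, in order of first appearance.
    """
    seen = set()
    ordered = []
    for jmdict_id, reading, page, line, position in sorted(rows, key=lambda r: r[2:5]):
        word = (jmdict_id, reading)
        if word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered

# Made-up rows, deliberately out of order.
rows = [
    (200, "行く", 1, 1, 2),
    (100, "猫", 1, 1, 1),
    (100, "猫", 2, 1, 1),
    (300, "食べる", 1, 2, 1),
]
print(first_appearance_order(rows))
# [(100, '猫'), (200, '行く'), (300, '食べる')]
```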

Very clever! I bet there is some other cruft you want to exclude as well, like exclamations and the like. Cause that’s always been my experience with frequency lists in the past, that there is a LOT of cruft in the top 10% frequently used ‘words’. Though I bet Ichiran does a bit better at that out of the box already cause it doesn’t split up compound verbs and the like.

I imagine it’s a fun idea in theory (and the fact that it is possible is very cool in a Palpatine UNLIMITED POWEEERRR sort of way), but in practice you’d probably go through manga too quickly to need it very often :grin:

I wonder if there is some way to run multiple instances of the Ichiran logic and use them concurrently :thinking: Give each of them a tenth of a novel to work with and then stitch the output back together again. I might look into that in the future. I believe that is what something like Kubernetes is designed to deal with.
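The split-and-stitch part of that idea can be sketched with Python's standard library. Whether several Ichiran instances can actually be run side by side is untested here, so `segment_chunk` below is only a placeholder where a call to one instance would go, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(lines, parts):
    """Split lines into at most `parts` consecutive chunks."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def segment_chunk(chunk):
    # Placeholder for the real work, e.g. sending each line to one Ichiran instance.
    return [line.split() for line in chunk]

def segment_concurrently(lines, parts=10, segment=segment_chunk):
    chunks = split_into_chunks(lines, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # map() yields results in the order the chunks were submitted,
        # so flattening them stitches the output back together in order.
        results = pool.map(segment, chunks)
        return [item for chunk_result in results for item in chunk_result]

lines = [f"line {i}" for i in range(25)]
print(segment_concurrently(lines)[:2])  # [['line', '0'], ['line', '1']]
```

Because the order is preserved, line numbers can still be counted out afterwards exactly as they would be for a single sequential run.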

:bowing_woman: This looks a lot more manageable lol

Good to know! There is no point in trying to optimise something that isn’t even a problem!

I am keen to do something with novels too! I de-DRM my Kindle novels with Calibre and noticed that those ebooks are basically just HTML files… so I’ve been using that to make a little novel reading applet with a bookmark and everything (I showed it in the Mokuro thread last year). But I have to construct each page individually and there are some manual steps involved, which bothers me every time I make such a page (‘automate all the things!!!’). So I want to see if I could JSONify those books and then render them programmatically with React :thinking: And somewhere within that process I’d have it run Ichiran on the text too. Ideally I’d click a + button on the website, upload the HTML files and the rest happens automatically from there!

There’s nothing like project-based learning huh? :grin:

Tell me about it! I originally started looking into this stuff a year ago when I got fed up waiting for Japanese.io to get better export options. They offer some export functionality, which is tantalisingly close to being useful :sweat_smile:

Man, I love geeking out about Japanese and programming! :nerd_face:
