Programming
Part of me feels like such jobs would require a vast background in SQL.
Then I realize that most people who apply for jobs requiring SQL are going to take a college course so they can show a piece of paper stating they know SQL, without actually using it enough to know it well.
So once you’ve worked with it on a personal project, you’re probably going to be more qualified at SQL than some competing applicants who would otherwise get these jobs!
Yes.
JPDB does this, so you can sort the frequency list in order of appearance:
The only reason I don’t is because the frequency lists I generate are for manga.
I have considered adding an option to sort frequency lists by page, but then I’d want to add an option to only show words with a frequency higher than n (“show me high frequency words in page order of appearance”).
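If I did add that option, the logic itself is simple. A quick Python sketch (the data shape and names here are invented for illustration, not my actual code):

```python
# Hypothetical sketch: "show me high frequency words in page order of appearance".
# Each entry is (word, page_of_first_appearance, total_count_in_volume).
def high_freq_in_page_order(entries, n):
    # Keep only words whose volume-wide count is at least n,
    # then sort by the page where they first appear.
    return sorted(
        (e for e in entries if e[2] >= n),
        key=lambda e: e[1],
    )

entries = [
    ("猫", 5, 12),   # appears 12 times, first on page 5
    ("犬", 2, 3),    # appears 3 times, first on page 2
    ("魔法", 1, 20),  # appears 20 times, first on page 1
]
print(high_freq_in_page_order(entries, 10))
# [('魔法', 1, 20), ('猫', 5, 12)]
```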
For book club vocabulary sheets, I rapidly go through each page of the volume and assign the order of the dialogue in Mokuro’s output.
I end up with tab-delimited output like this, with a sequential word number for the whole volume, and the page it’s from:
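Rows in that shape are also trivial to parse back out. A Python sketch (the example rows and their column order are invented for illustration):

```python
# Sketch: parse tab-delimited rows of (word_number, page, word).
# The rows below are invented examples, not real output.
rows = "1\t3\t魔法\n2\t3\t世界\n3\t4\t勇者\n"

parsed = []
for line in rows.splitlines():
    number, page, word = line.split("\t")
    parsed.append((int(number), int(page), word))

print(parsed)
# [(1, 3, '魔法'), (2, 3, '世界'), (3, 4, '勇者')]
```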
I can’t do this for all of my manga frequency lists, because it would be a huge waste of time for no benefit across a thousand volumes and counting, but it’s worth it for ABBC.
Particles, too. I feel a frequency list isn’t the place to learn particles from (although I do leave some conjunctions in).
My admin status gives me buttons to easily add the JMDict ID and series IDs to the blocked words table, but I also have a button that adds a JMDict ID without a series ID to apply that block to all series (site-wide).
Then in PostgreSQL, every time I pull a frequency list, I need to exclude all the blocked words. If I ever added a new page for viewing frequency lists and forgot to filter it through the blocked words table, blocked words would slip through and ruin the experience.
Or at least, that would be the case if views didn’t exist in PostgreSQL.
I created a view that queries the vocabulary lists and filters out the blocked words for me. Any time I want to pull data from a frequency list, rather than querying the original table, I query the view, which already has blocked items excluded:
=> \d+ unblocked_volume_dictionary
View definition:
 SELECT volumes.series_id,
    volumes.id AS volume_id,
    volume_dictionary.dictionary_id,
    dictionary.data,
    volume_dictionary.reading,
    volume_dictionary.page_number,
    volume_dictionary.line_number
   FROM volumes
     JOIN volume_dictionary ON volumes.id = volume_dictionary.volume_id
     LEFT JOIN blocked_words all_blocks ON volume_dictionary.dictionary_id = all_blocks.dictionary_id AND all_blocks.series_id IS NULL
     LEFT JOIN blocked_words series_blocks ON volume_dictionary.dictionary_id = series_blocks.dictionary_id AND volumes.series_id = series_blocks.series_id
     JOIN dictionary ON dictionary.id = volume_dictionary.dictionary_id
  WHERE all_blocks.dictionary_id IS NULL AND series_blocks.dictionary_id IS NULL;
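To see the mechanics in miniature, here’s the same blocked-words idea rebuilt as a toy SQLite example (table and column names are simplified stand-ins, not my production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE volume_dictionary (series_id INT, dictionary_id INT, word TEXT);
-- series_id NULL means a site-wide block; otherwise the block only applies to that series.
CREATE TABLE blocked_words (dictionary_id INT, series_id INT);

-- Series 1 contains three words; word 20 is blocked site-wide,
-- word 30 is blocked only for series 2.
INSERT INTO volume_dictionary VALUES (1, 10, 'cat'), (1, 20, 'name'), (1, 30, 'dog');
INSERT INTO blocked_words VALUES (20, NULL), (30, 2);

-- View that filters out both site-wide and per-series blocks.
CREATE VIEW unblocked AS
SELECT vd.*
FROM volume_dictionary vd
LEFT JOIN blocked_words sw ON sw.dictionary_id = vd.dictionary_id AND sw.series_id IS NULL
LEFT JOIN blocked_words sb ON sb.dictionary_id = vd.dictionary_id AND sb.series_id = vd.series_id
WHERE sw.dictionary_id IS NULL AND sb.dictionary_id IS NULL;
""")

print(conn.execute("SELECT word FROM unblocked ORDER BY dictionary_id").fetchall())
# [('cat',), ('dog',)] -- 'name' is blocked site-wide; 'dog' is only blocked for series 2
```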
After writing the SQL for a view, I like to run it by ChatGPT (I’m using 3.5) to ask if there are any obvious optimizations that can be made to the SQL. Sometimes it gives nonsense, but I’d say at least nine times out of ten it makes small changes that either improve performance, or make no difference I can detect while at least not breaking the result.
What’s most important is that when ChatGPT gives a supposed improvement, I make sure I’m aware of what was changed and look into why. That way I slowly learn over time as the same things come up over and over, so I can start writing better queries from the start.
If I could sync my manga reading progress in Mokuro to the site, such as via an API that likely no one other than me would ever use, and have the frequency lists adapt to showing me only for pages I haven’t read yet, I’d implement it right away. (Well, that’s an うそ, because I could literally implement that today. But I won’t, as it’s not a priority.)
I’ve thought about this as well, although I haven’t looked into it. (I am interested in at least trying out the non-HTTP-request Docker container you mentioned, but I don’t want to risk the new Docker install not working while also breaking the existing one I’m using.)
I imagine the bottleneck is the design of the code for processing lines of text, and not available CPU. (I haven’t looked at the source code to confirm this.) In that case, I would very much be interested in running multiple volumes through at the same time if I had a way.
But since I normally let it run overnight when I’m running many volumes through, it’s not that big of an issue for me.
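If I did want to fan volumes out, the simplest shape would be worker threads that each wait on one external process. A sketch (the actual per-volume command is stubbed out so this runs standalone; the names are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: run one external parser process per volume.
def process_volume(path):
    # In real use this would launch the single-volume tool, e.g.:
    #   subprocess.run(["mokuro", path])
    # It's stubbed out here so the sketch is runnable on its own.
    return path

def simultaneous(paths, workers=4):
    # Threads (not processes) are fine here: each worker would mostly be
    # blocked waiting on an external process, not burning CPU in Python.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_volume, paths))

print(simultaneous(["vol01", "vol02", "vol03"]))
# ['vol01', 'vol02', 'vol03']
```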
You know how up above I showed a screenshot of my block buttons?
For my old site, I blocked words by putting them in a JSON file to have them be excluded when I generated a frequency list from the cached Ichiran (and before that Juman++) output.
This meant that if I found a frequency list item that was a single kanji misparsed from a character name, I’d add the word to the JSON file, rerun the process that converts the cached Ichiran/Juman++ output into Markdown pages, rerun Jekyll to convert the Markdown pages into static HTML pages, then rerun rsync to copy all the updated pages to the web server.
But for the new site, I just click a button.
Well, I still need to manually generate/run a few queries to update the site/volume word count caches.
Eventually, I’ll automate that as well.
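When I do automate it, it’ll probably amount to a single cache-refresh statement. A toy SQLite sketch (invented table names, not my real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE volume_dictionary (volume_id INT, dictionary_id INT);
CREATE TABLE volume_word_counts (volume_id INT PRIMARY KEY, word_count INT);
INSERT INTO volume_dictionary VALUES (1, 10), (1, 20), (2, 10);
""")

def refresh_word_counts(conn):
    # Recompute the per-volume word count cache in one shot,
    # instead of hand-running a few queries after each change.
    conn.executescript("""
    DELETE FROM volume_word_counts;
    INSERT INTO volume_word_counts
    SELECT volume_id, COUNT(*) FROM volume_dictionary GROUP BY volume_id;
    """)

refresh_word_counts(conn)
print(conn.execute("SELECT * FROM volume_word_counts ORDER BY volume_id").fetchall())
# [(1, 2), (2, 1)]
```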
Everything in my posts about first-time reading of native material, deciphering and learning as you go, certainly applies to learning new skills by using them in projects. I’ve spent hours upon hours reading documentation and watching YouTube videos that build a site from scratch with AdonisJS. The latter is like joining ABBC: you’re hand-held each step of the way, so you get guided learning.