It’s definitely a small portion of words, but since they’re almost always just the pronunciation of an English word, it feels like “cheating” to say I’ve made progress when I can already understand what most katakana words mean without needing to study them. Example: puroguramu → (computer) program
As a guideline: Kanji → what most words are made of. Hiragana → grammatical pieces and some common words of Japanese origin. Katakana → imported words, onomatopoeia and similar.
The easiest way to “patch” around this problem might be to only count words that contain kanji for the purposes of the progress chart.
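A quick sketch of what that filter could look like (the helper name and word list are mine; checking the main CJK block is good enough for this purpose, though a complete check would also cover the extension blocks):

```python
# Sketch of the proposed kanji-only filter. The Unicode range below is the
# CJK Unified Ideographs block, which covers the vast majority of kanji.

def has_kanji(word: str) -> bool:
    """True if the word contains at least one kanji character."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)

words = ["プログラム", "たべる", "食べる", "勉強"]
countable = [w for w in words if has_kanji(w)]  # kana-only words drop out
```

With this, katakana loanwords like プログラム simply never enter the progress count.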
It’s a slightly different flavour, but it’s the same idea. The intervals are 5m, 25m, ~2.5h, ~10h, ~1 day, and after that they double with every correct answer. Getting a word wrong sets the interval to 30% of its current value.
There is no “easiness” modifier that spaces intervals more tightly for words you keep getting wrong.
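In code, the schedule looks roughly like this (a simplified sketch; the names and the edge-case handling, like what a miss does to a brand-new word, are illustrative):

```python
# Rough sketch of the interval schedule described above. Units are minutes.

STEPS = [5, 25, 150, 600, 1440]  # 5m, 25m, ~2.5h, ~10h, ~1 day

def next_interval(current: float, correct: bool) -> float:
    if not correct:
        return current * 0.3      # wrong answer: drop to 30% of current
    for step in STEPS:
        if current < step:
            return step           # climb the fixed ladder first
    return current * 2            # past ~1 day: double on every success
```

So a word answered correctly every time goes 5m → 25m → ~2.5h → ~10h → ~1 day → ~2 days → ~4 days, and so on.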
Here’s the biggest difference IMO: the correct answer is shown after x milliseconds (currently 4000, but adjustable, and it could reasonably go lower). It gets you into a flow in a way that having to decide whether you’ve thought about the answer long enough just doesn’t. Hundreds of due reviews just seem to disappear without the process being annoying or mentally fatiguing.
Right now it’s weighted by the log of the frequency. The most common words are weighted 3-4 times as much as the least common words.
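A minimal sketch of that weighting (the counts and the exact log base/offset are made up; only the shape matters: a word seen ~150× more often only weighs a few times more):

```python
import math

def weight(count: int) -> float:
    # log damping: huge frequency ratios shrink to small weight ratios
    return math.log(count + 1)

# word -> corpus count (invented numbers)
universe = {"食べる": 500, "勉強": 120, "逡巡": 3}
known = {"食べる", "勉強"}

progress = (sum(weight(c) for w, c in universe.items() if w in known)
            / sum(weight(c) for c in universe.values()))
```

Knowing the two common words already puts the score near 0.9 here, which is the intended effect: common words dominate the progress number.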
As of a few minutes ago, word frequency is calculated from a small corpus: the first volumes of 5 different light novels. A word is considered and added to the “universe” of words only if it appears in at least two different volumes. This is probably not quite right, but I feel it’s a step in the right direction: it filters out character names, fantasy words specific to a single novel, and similar noise. It brings the total # of words used to calculate the score down to about 6000 from about 12000. It would probably work better with a bigger corpus, but we’ll go with this for now…
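The filter itself is just a document-frequency count (volume names and word lists below are made up; the point is counting how many distinct volumes each word appears in):

```python
from collections import defaultdict

# word lists per volume (invented data)
volumes = {
    "novel_a_v1": ["勇者", "食べる", "勉強"],
    "novel_b_v1": ["食べる", "魔王", "勉強"],
    "novel_c_v1": ["食べる", "固有名詞"],
}

doc_freq = defaultdict(set)
for vol, words in volumes.items():
    for w in set(words):
        doc_freq[w].add(vol)

# keep only words seen in >= 2 different volumes
universe = {w for w, vols in doc_freq.items() if len(vols) >= 2}
```

Words that appear in only one volume (names, setting-specific terms) never make it into the universe.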
It looks suspiciously good, which I suspect is due to:
- the weighting by “knowledge” (log of the interval), which is a bit wonky right now
- the outsized impact of the most common words and word-looking tokens
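Putting the two weightings together, the score has roughly this shape (all numbers, and the cap that counts a 1-year interval as “fully known”, are invented for illustration; this is not the app’s exact formula):

```python
import math

def freq_weight(count: int) -> float:
    return math.log(count + 1)

def knowledge(interval_minutes: float) -> float:
    return math.log(interval_minutes + 1)

# word -> (corpus count, current SRS interval in minutes); invented data
words = {"食べる": (500, 2880), "勉強": (120, 25), "逡巡": (3, 0)}

MAX_KNOWLEDGE = knowledge(365 * 1440)  # treat a ~1-year interval as fully known

score = sum(freq_weight(c) * min(1.0, knowledge(iv) / MAX_KNOWLEDGE)
            for c, iv in words.values())
total = sum(freq_weight(c) for c, iv in words.values())
progress = score / total
```

This makes the two failure modes visible: the log-of-interval term moves a lot in the first few reviews, and a handful of very common words carries most of the score.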