I am not an expert in this subject, and my only connection with the Japanese language is my desire to learn it.
This document describes some peculiarities of Japanese pronunciation and writing. These peculiarities caused the difficulties I ran into while writing the kana-transformer mini-library.
toKana()
- Dependence of syllable reading on position
- Different notation of vowel length
- Separation of row na and syllabic n
- Definition of a devoiced vowel and a long consonant
- Devoiced vowel: i or u?
- Extended kana. Two consonants
fromKana()
convertKana()
In addition to hieroglyphs, Japanese writing uses two syllabic alphabets (kana): hiragana and katakana. Each alphabet has 46 characters, each of which stands for a syllable (except for ん/ン, which stands for the consonant n). As a result, each syllable has two characters to represent it in writing, one from hiragana and one from katakana.
| w | r | y | m | h | n | t | s | k | | consonant / vowel |
|---|---|---|---|---|---|---|---|---|---|---|
| わ/ワ | ら/ラ | や/ヤ | ま/マ | は/ハ | な/ナ | た/タ | さ/サ | か/カ | あ/ア | a |
| | り/リ | | み/ミ | ひ/ヒ | に/ニ | ち/チ (chi) | し/シ (shi) | き/キ | い/イ | i |
| | る/ル | ゆ/ユ | む/ム | ふ/フ | ぬ/ヌ | つ/ツ (tsu) | す/ス | く/ク | う/ウ | u |
| | れ/レ | | め/メ | へ/ヘ | ね/ネ | て/テ | せ/セ | け/ケ | え/エ | e |
| を | ろ/ロ | よ/ヨ | も/モ | ほ/ホ | の/ノ | と/ト | そ/ソ | こ/コ | お/オ | o |
| ん/ン (n) | | | | | | | | | | |
Japanese has more than 46 syllables. Some of the missing ones are indicated in writing by adding special characters to the basic symbols: nigori ゛ or hannigori ゜.
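As an illustration of how such marked characters relate to the basic ones, here is a minimal Python sketch (not the library's code). It assumes the combining forms of the two marks, U+3099 and U+309A, rather than the spacing forms shown above, and relies on Unicode NFC normalization to compose a basic kana with a mark. The name `add_mark` is hypothetical.

```python
import unicodedata

# Combining forms of the marks (assumption: the spacing forms ゛ and ゜
# are only for display; composition needs U+3099 and U+309A).
NIGORI = "\u3099"
HANNIGORI = "\u309A"


def add_mark(kana: str, mark: str) -> str:
    """Attach a combining mark to a basic kana and compose the pair via NFC."""
    return unicodedata.normalize("NFC", kana + mark)


voiced = add_mark("か", NIGORI)          # ka -> ga
half_voiced = add_mark("は", HANNIGORI)  # ha -> pa
```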
Extended kana are used to convey sounds that have no counterparts in Japanese. These kana, for example, can convey syllables with two consonants in a row.
Hiragana often completes a word written in hieroglyphs by forming part of it (for example, a suffix). Sometimes hieroglyphs are not used at all, and the word is written in hiragana only.
Hiragana is also used to write some auxiliary parts of speech, such as particles, conjunctions and adverbs.
Katakana is generally used for two purposes: to write foreign words and names, or to make a particular concept stand out (katakana is characterized by its sharper features).
Spaces and other symbols are not used to separate words, as is typical of hieroglyphic writing. Usually, to separate one word from another, at least one of them must be known.
Some kana can act as particles. In this case, their pronunciation changes:
- は is pronounced [wa].
- へ is pronounced [e].
- を is pronounced [o].
In hiragana, the length of a vowel is denoted in different ways depending on the word:
- Doubling the vowel is a method common to all vowels.
- [i] after [e] often means a long [e].
- [u] after [o] often means a long [o].
In katakana it is simpler: the character ー is always used for this.
The vowels [i] and [u], when they stand between or after voiceless consonants, are often pronounced more weakly, even to the point of being inaudible. Syllables with [i] more often convey soft consonants (ました - [mashta]), and syllables with [u] convey hard consonants (です - [des]).
It is not uncommon to find words in which a consonant is pronounced longer or with a slight delay. In writing, such a consonant is marked with the small kana っ (in hiragana) or ッ (in katakana) placed before the syllable.
ん / ン is the only kana that does not end in a vowel sound. It is pronounced differently before different sounds.
In the sa and ta rows, adding nigori produces two pairs of characters that are pronounced the same:
Reading/row | sa | ta |
---|---|---|
ji | じ/ジ | ぢ/ヂ |
zu | ず/ズ | づ/ヅ |
The sa row characters are almost always used, but in some cases the ta row characters are used instead, for example when the same kana without nigori comes directly before the character (as in つづく, [tsuzuku]).
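The choice between the two rows can be sketched as follows. This is a simplified Python illustration, not the library's implementation: it covers hiragana only and only the "same kana comes before" case mentioned above. The name `pick_voiced` is hypothetical.

```python
# Default (sa row) spelling for each ambiguous reading.
SA_ROW = {"ji": "じ", "zu": "ず"}
# ta row: (unvoiced kana that triggers the rule, voiced kana to use).
TA_ROW = {"ji": ("ち", "ぢ"), "zu": ("つ", "づ")}


def pick_voiced(reading: str, previous_kana: str) -> str:
    """Choose the kana for 'ji' or 'zu' based on the preceding kana."""
    trigger, voiced = TA_ROW[reading]
    if previous_kana == trigger:
        return voiced          # e.g. つ + zu -> づ
    return SA_ROW[reading]     # the usual case
```

For instance, `pick_voiced("zu", "つ")` picks the ta row character, while any other preceding kana falls back to the sa row.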
Above I have tried to give the information needed to understand the problems presented next. I write about things that either have not been solved or have been solved only partially.
A "solution" here more often means not a way out of the situation but a compromise accepted because I could not find a better way.
The motivation for creating the library was to let people search for characters using not only kana but also their own alphabets (Russian and English are currently available), for example when Japanese input is not installed. This is the responsibility of toKana(). Ideally, the function should reproduce the exact spelling from the transcription the user provides.
Position matters when several words or a whole sentence, rather than a single word, are being transformed to kana: particles, the syllabic n and devoiced vowels can all be handled correctly (or almost correctly) if the position of the symbol in the word is known.
The position of a character in a word can be determined only if something separates the words. Classically, a space is used for this purpose.
With words separated, it is also possible to look at the symbols before or after a given syllable (otherwise it would be impossible to tell whether a symbol is the first or the last in the word). This knowledge is used, among other things, to choose between the sa and ta row characters.
The long [o] can be rendered in kana equally plausibly as おお or as おう. Which spelling a given word uses has to be memorized, which is a dead end for an algorithm. The same applies to ええ and えい.
We have to choose the lesser evil and use the first option (doubling). As a result, some words will be written incorrectly, but at least the reading will remain unambiguous.
The above applies to hiragana; with katakana the situation is more complicated. The lengthening of a vowel is written with the character ー, but from the transcription you cannot tell whether a vowel continues the previous vowel or should be read as a separate element.
As a result, the symbol ー is not used. As with hiragana, some (even many) words will be written incorrectly, but the correct reading will not be ruled out.
The syllabic n can be confused with the na row, since the syllabic n can also be followed by a vowel. The result is an incorrect spelling.
To distinguish the syllabic n, it is customary to put a special sign after it:
- in English: an apostrophe '
- in Russian: a hard sign ъ
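The effect of the separator can be shown with a small Python sketch. It is not the library's code: the syllable table is deliberately tiny (vowels plus the ka and na rows), only the English apostrophe is handled, and the name `to_hiragana` is hypothetical.

```python
VOWELS = {"a": "あ", "i": "い", "u": "う", "e": "え", "o": "お"}
PAIRS = {
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
}


def to_hiragana(word: str) -> str:
    """Convert a romanized word, treating n' as an explicit syllabic n."""
    out = []
    i = 0
    while i < len(word):
        if word[i] == "n" and word[i + 1:i + 2] == "'":
            out.append("ん")            # separator forces the syllabic n
            i += 2
        elif word[i:i + 2] in PAIRS:
            out.append(PAIRS[word[i:i + 2]])
            i += 2
        elif word[i] in VOWELS:
            out.append(VOWELS[word[i]])
            i += 1
        elif word[i] == "n":
            out.append("ん")            # n not followed by a vowel
            i += 1
        else:
            raise ValueError(f"unsupported input at position {i}: {word[i:]!r}")
    return "".join(out)
```

With this sketch, `kani` and `kan'i` produce different spellings: without the apostrophe, the n is absorbed into the na row syllable.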
Since a devoiced vowel is inaudible, the user may leave it out of the transcription. So when two consonants appear in a row in a word (and the first is not a syllabic n), they are interpreted as either a devoiced vowel or a long consonant. A long consonant requires the letter to match the next letter, but this check is unreliable: a devoiced vowel may be followed by a syllable of the same row, which also makes the consonants match.
There is no real answer; I chose the strategy that gives the correct spelling more often: if the consonant matches the next consonant, it is a long consonant; otherwise it is a devoiced vowel.
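The chosen strategy amounts to a single comparison. The Python sketch below only illustrates the decision itself, assuming the syllabic n has already been handled; the name `classify_cluster` is hypothetical and this is not the library's code.

```python
def classify_cluster(first: str, second: str) -> str:
    """Guess what two adjacent consonants in a transcription mean."""
    if first == second:
        # Matching consonants: written with the small っ / ッ.
        return "long consonant"
    # Different consonants: a vowel was dropped after `first`.
    return "devoiced vowel"
```

The known weak spot is visible here: a devoiced vowel between two identical consonants is classified as a long consonant.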
The accuracy varies from language to language.
For English, some consonants are not used with the vowel u: 'sh', 'ch', 'j'. So the devoiced vowel is taken to be i only after one of these consonants; otherwise it is taken to be u.
For Russian, everything is simpler: since the devoiced i usually follows soft consonants, it remains to distinguish soft consonants from hard ones. To do this, a soft sign ь must be placed after such a consonant (unless the consonant is always soft, like [щ] or [ч]).
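The English rule can be sketched in Python as follows. This is an illustration under the assumptions stated in the text, not the library's code; the names `I_ONLY`, `devoiced_vowel` and `restore_syllable` are hypothetical.

```python
# Consonants that, per the rule above, take i rather than u.
I_ONLY = ("sh", "ch", "j")


def devoiced_vowel(consonant: str) -> str:
    """Return the vowel assumed to be dropped after `consonant`."""
    return "i" if consonant in I_ONLY else "u"


def restore_syllable(consonant: str) -> str:
    """Rebuild the full syllable by re-inserting the guessed vowel."""
    return consonant + devoiced_vowel(consonant)
```

So the `sh` in [mashta] is restored to `shi`, and the `s` in [des] to `su`.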
Extended kana include a number of symbols that are interpreted as syllables with two consonants in a row. Since the function inserts a vowel between two consonants (treating them as two different syllables, the first of which has a devoiced vowel), it cannot process such syllables correctly.
Starting from version 2.0.0, which added extended kana support, the function no longer recognizes devoiced vowels by default. To enable this mode, set the guess: true option. The choice is either to guess devoiced vowels or to support more extended kana.
There are fewer problems here, and they have essentially been described above. The function is fromKana().
Without knowing the position, it is unclear how some kana are read:
- は - [ha] or [wa] ?
- へ - [he] or [e] ?
- を - [wo] or [o] ?
The solution is to separate the words with spaces.
However, this makes the function inconvenient to use with copied text, which normally contains no spaces.
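A minimal Python sketch of the idea, not the library's implementation: it assumes that a particle is recognized only when it stands alone between spaces, and it uses a tiny reading table that covers just the examples. The names `from_hiragana`, `PARTICLE_READINGS` and `READINGS` are hypothetical.

```python
# Readings used when the kana stands alone as a separate word (assumption).
PARTICLE_READINGS = {"は": "wa", "へ": "e", "を": "o"}
# Ordinary readings; a deliberately tiny table for the examples.
READINGS = {"は": "ha", "へ": "he", "を": "wo", "わ": "wa", "た": "ta", "し": "shi"}


def from_hiragana(text: str) -> str:
    """Transliterate space-separated hiragana, reading lone は/へ/を as particles."""
    words = []
    for word in text.split(" "):
        if word in PARTICLE_READINGS:
            words.append(PARTICLE_READINGS[word])
        else:
            words.append("".join(READINGS[kana] for kana in word))
    return " ".join(words)
```

Without the space, the same は is read as part of the word, which is exactly the inconvenience with copied text.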
Since the pronunciation of the syllabic n changes depending on its position, its transcription should change as well.
Under the revised Hepburn system:
- n' - before vowels
- n - in other cases
Under Polivanov's system:
- м - before the sounds [b], [p], [m]
- нъ - before vowels
- н - in other cases
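The two sets of rules above can be written down directly. The Python sketch below is an illustration, not the library's code; it assumes the next sound is passed as a Latin transcription (an empty string at the end of a word), and the names `hepburn_n` and `polivanov_n` are hypothetical.

```python
VOWELS = set("aiueo")
LABIALS = set("bpm")


def hepburn_n(next_sound: str) -> str:
    """Transcribe the syllabic n by the Hepburn rules listed above."""
    return "n'" if next_sound[:1] in VOWELS else "n"


def polivanov_n(next_sound: str) -> str:
    """Transcribe the syllabic n by the Polivanov rules listed above."""
    first = next_sound[:1]
    if first in LABIALS:
        return "м"
    if first in VOWELS:
        return "нъ"
    return "н"
```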
It can be difficult to tell where a sequence of two vowels is pronounced as a long first vowel and where it stands for two different sounds: おう can mean both [ou] and [oo]; the same goes for えい.
To give a person who knows about this limitation of the algorithm at least a chance to read the word close to the original, I decided to transliterate letter by letter. So おう becomes ou.
This function, convertKana(), has only one problem, but it is a serious one.
In hiragana, two identical consecutive vowels can be read in two ways: as a long vowel or as two separate vowels. Katakana, on the other hand, marks a long vowel with the special symbol ー. So, when deriving katakana from hiragana, the function would have to distinguish a long vowel from two separate vowels.
The ambiguity of the readings おう and えい poses the same problem.
In order not to exclude the correct reading, the character ー is not used. Unfortunately, this often leads to incorrect spelling.
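A character-by-character conversion of this kind can be sketched in Python using the fixed distance between the hiragana and katakana Unicode blocks. This is an illustration of the approach, not the library's code; the name `to_katakana` is hypothetical.

```python
HIRAGANA_FIRST = 0x3041  # ぁ
HIRAGANA_LAST = 0x3096   # ゖ
OFFSET = 0x60            # distance from a hiragana code point to its katakana twin


def to_katakana(text: str) -> str:
    """Map each hiragana character to katakana; never emit the ー mark."""
    return "".join(
        chr(ord(ch) + OFFSET) if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST else ch
        for ch in text
    )
```

Because every character is mapped on its own, おう becomes オウ rather than オー, which keeps both readings possible at the cost of a spelling that is often not the conventional one.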