Writing difficulties

Switch the language to: Russian

I am not an expert in this subject, and my only connection with the Japanese language is my desire to learn it.

Some peculiarities of Japanese pronunciation and writing are described here. These peculiarities are the reasons for the difficulties I encountered in writing the kana-transformer mini-library.

toKana()

  1. Dependence of syllable reading on position
  2. Different notations of vowel length
  3. Separation of row na and syllabic n
  4. Definition of a devoiced vowel and a long consonant
  5. Devoiced vowel: i or u?
  6. Extended kana. Two consonants

fromKana()

  1. Particle recognition
  2. Understanding the syllable n
  3. Recognizing long vowels

convertKana()

  1. Recognizing long vowels in hiragana

Introduction

Alphabet

In addition to kanji (hieroglyphs), Japanese writing uses two syllabaries (kana): hiragana and katakana. Each has 46 basic characters, each of which stands for a syllable (except for ん/ン, which stands for the consonant n). As a result, each syllable can be written with two characters, one from hiragana and one from katakana.

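| w | r | y | m | h | n | t | s | k | | consonant / vowel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| わ/ワ | ら/ラ | や/ヤ | ま/マ | は/ハ | な/ナ | た/タ | さ/サ | か/カ | あ/ア | a |
| | り/リ | | み/ミ | ひ/ヒ | に/ニ | ち/チ (chi) | し/シ (shi) | き/キ | い/イ | i |
| | る/ル | ゆ/ユ | む/ム | ふ/フ | ぬ/ヌ | つ/ツ (tsu) | す/ス | く/ク | う/ウ | u |
| | れ/レ | | め/メ | へ/ヘ | ね/ネ | て/テ | せ/セ | け/ケ | え/エ | e |
| | ろ/ロ | よ/ヨ | も/モ | ほ/ホ | の/ノ | と/ト | そ/ソ | こ/コ | お/オ | o |
| ん/ン (n) | | | | | | | | | | |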

Japanese has more than 46 syllables. Some of the missing ones are written by adding special marks to the basic characters: nigori (dakuten) ゛ or hannigori (handakuten) ゜.

Extended kana are used to convey sounds that have no counterparts in Japanese. These kana, for example, can convey syllables with two consonants in a row.

Hiragana

Hiragana often completes a word written in kanji, forming part of it (for example, suffixes). Sometimes kanji are not used at all, and the word is written in hiragana only.

It is also used to write some auxiliary parts of speech like particles, conjunctions and adverbs.

Katakana

Katakana is generally used in two cases: to write foreign words or names, or to make a particular concept stand out (katakana is characterized by its sharper features).

Speech and Writing Features

Word Separation

Words are not separated by spaces or other symbols, which is typical of writing systems based on Chinese characters. To tell where one word ends and the next begins, a reader usually has to know at least one of them.

Particles

Some kana can act as particles. In this case, their pronunciation changes:

  • は is pronounced [wa].
  • へ is pronounced [e].
  • を is pronounced [o].

Long vowel

Hiragana

The length of a vowel is marked in different ways depending on the word:

  • Doubling the vowel is a method common to all vowels.
  • [i] after [e] often means long [e].
  • [u] after [o] often means long [o].

Katakana

It is simpler here: a long vowel is always marked with the character ー.

Devoiced vowel

The vowels [i] and [u], when they stand between or after voiceless consonants, are often pronounced weakly, sometimes to the point of being inaudible. Syllables with a devoiced [i] more often leave a soft consonant (ました - [mashta]), and those with a devoiced [u] a hard consonant (です - [des]).

Double, or long consonant

It is not uncommon to find words in which a consonant is pronounced longer or with a slight delay. In writing, such a consonant is marked by the small kana っ (in hiragana) or ッ (in katakana) placed before it.

Syllable n

ん / ン is the only kana that does not end in a vowel sound. It is pronounced differently before different sounds.

Characters with the same pronunciation

In the sa and ta rows, adding nigori produces two pairs of characters with identical pronunciation:

| Reading / row | sa | ta |
| --- | --- | --- |
| ji | じ/ジ | ぢ/ヂ |
| zu | ず/ズ | づ/ヅ |

The sa-row characters are used almost always, but in some cases the ta-row characters are used instead, for example when the preceding kana is the same character without nigori (as in つづく).

Rationale for difficulties

Above I have tried to give the information needed to understand the problems presented next. I write about things that either have not been solved, or have been partially solved.

A "solution" here more often means not a real way out of the situation, but a compromise accepted because I could not find a better way.

Transform to kana

The motivation for creating the library was to let people search for characters using not only kana but also their own alphabets (currently Russian and English), for example when no Japanese keyboard layout is installed. This is the job of toKana(). Ideally, the function should reproduce the exact kana spelling from the transcription the user provides.

Dependence of syllable reading on position

This matters when several words or a whole sentence, rather than a single word, are transformed to kana: particles, syllabic n, devoiced vowels - all of these can be handled correctly (or almost correctly) if the position of the symbol within the word is known.

Solution

The position of a character within a word can be determined only if something separates the words. Traditionally, a space is used for this purpose.

Once word boundaries are known, it is also possible to look at the symbols before and after a given syllable (otherwise there would be no way to tell whether a symbol is the first or the last in a word). This knowledge is used, among other things, to choose between the sa-row and ta-row characters.
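A minimal sketch of how knowing the previous kana could help choose between the sa-row and ta-row spellings. The helper name `voiced_kana` and the rule it applies (use the ta-row character only when the preceding kana is the unvoiced ち or つ) are assumptions made for illustration, not the library's actual logic.

```python
from typing import Optional

# Hypothetical lookup tables for the two ambiguous readings.
SA_ROW = {"ji": "じ", "zu": "ず"}
TA_ROW = {"ji": "ぢ", "zu": "づ"}
UNVOICED_TA = {"ji": "ち", "zu": "つ"}


def voiced_kana(reading: str, previous: Optional[str]) -> str:
    """Pick the kana for "ji" or "zu" given the kana written just before it.

    `previous` is None when the syllable is the first one in the word.
    """
    if previous == UNVOICED_TA[reading]:
        # Same character without nigori precedes it: use the ta row.
        return TA_ROW[reading]
    # Default: the sa row, which is used almost always.
    return SA_ROW[reading]
```

For example, when writing "tsuzuku" the syllable "zu" follows つ, so this sketch picks づ; at the start of a word it always picks the sa-row character.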

Different notations of vowel length

A long [o] can be written in kana equally correctly as おお or as おう. Learners are expected to memorize which spelling each word uses, but for an algorithm this is a dead end: it has no way of choosing between them. The same applies to ええ and えい.

Solution (hiragana)

The lesser evil is to always use the first option (doubling). As a result, some words will be spelled incorrectly, but at least the reading will remain unambiguous.

Katakana

The above applies to hiragana; with katakana the situation is more complicated. The lengthening of a vowel is written with the character ー. But from a transcription it is impossible to tell whether a vowel continues the previous vowel or should be read as a separate element.

Solution (katakana)

As a result, the symbol ー is not used. As with hiragana, some (and even many) words will be spelled incorrectly, but the correct reading will not be ruled out as an option.
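The compromise described in both solutions can be sketched as writing every vowel letter as its own kana. This is only an illustration under the assumption that the input is a run of Latin vowel letters; the name `write_vowel_run` and the tables are hypothetical and do not come from the library.

```python
# Hypothetical vowel tables for both scripts.
VOWELS = {
    "hiragana": {"a": "あ", "i": "い", "u": "う", "e": "え", "o": "お"},
    "katakana": {"a": "ア", "i": "イ", "u": "ウ", "e": "エ", "o": "オ"},
}


def write_vowel_run(vowels: str, script: str = "hiragana") -> str:
    """Write each vowel letter as a separate kana.

    A doubled letter such as "oo" therefore becomes おお (or オオ);
    the function never guesses おう and never emits ー.
    """
    table = VOWELS[script]
    return "".join(table[v] for v in vowels)
```

The spelling may differ from the dictionary form of a word, but each kana maps back to exactly one letter of the input, so the reading stays recoverable.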

Separation of row na and syllabic n

The syllabic n can be confused with the na row, since the former can also be followed by vowels. For example, "kani" could stand for かに or for かんい. The result is an incorrect spelling.

Solution

To distinguish syllabic n, the convention is to put a special mark after it:

  • (in English) - apostrophe '
  • (in Russian) - hard sign ъ
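As a rough sketch of the English convention, the function below converts only the n-related parts of a romanized word and leaves every other letter untouched, to keep the example short. The name `convert_n` and the regular expression are assumptions for illustration; the library's own tokenizer may work differently.

```python
import re

# Hypothetical table for the na row.
NA_ROW = {"na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の"}

# Order matters: "n'" is tried first, then "n" + vowel, then a bare "n".
# A bare "n" before "y" is skipped so that syllables like "nya" are not broken up.
N_PATTERN = re.compile(r"n'|n[aiueo]|n(?!y)")


def convert_n(text: str) -> str:
    """Replace na-row syllables and syllabic n with kana; keep the rest as is."""

    def replace(match: "re.Match[str]") -> str:
        token = match.group(0)
        # Anything that is not a na-row syllable is a syllabic n.
        return NA_ROW.get(token, "ん")

    return N_PATTERN.sub(replace, text)
```

With the apostrophe, "kan'i" yields ん followed by the separate vowel i, while "kani" yields に.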

Definition of a devoiced vowel and a long consonant

Since a devoiced vowel is inaudible, it is often left out of the transcription. So when two consonants appear in a row within a word (and the first is not a syllabic n), they can mean either a devoiced vowel between them or a long consonant. A long consonant requires the letter to match the next letter, but this is an unreliable test: a devoiced vowel may be followed by a syllable of the same row, which also makes the consonants match.

Solution

There is no answer per se; I chose the strategy that gives the correct spelling more often: if the consonant matches the next consonant, it is a long consonant, otherwise it is a devoiced vowel.
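The strategy fits in a few lines. This is a sketch only: `classify_cluster` is a hypothetical name, and the inputs are assumed to be the two adjacent consonant tokens already extracted from the word.

```python
def classify_cluster(first: str, second: str) -> str:
    """Classify two consecutive consonants (the first is not a syllabic n).

    Matching consonants are treated as a long consonant (written with っ / ッ);
    anything else is treated as a devoiced vowel after `first`.
    """
    if first == second:
        return "long consonant"
    return "devoiced vowel"
```

The known weak spot remains: a devoiced vowel followed by a syllable of the same row is misclassified as a long consonant.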

Devoiced vowel: i or u?

Once a devoiced vowel has been detected, it still has to be decided which vowel was dropped. The accuracy varies from language to language.

Solution (English)

Some consonants are not used with the vowel u: 'sh', 'ch', 'j'. So the devoiced vowel is taken to be i after one of these consonants, and u otherwise.
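A minimal sketch of this rule for English input, assuming the consonant before the missing vowel has already been isolated. The name `restore_devoiced_vowel` is hypothetical.

```python
# Consonants that, by the rule above, are only followed by a devoiced i.
I_ONLY = ("sh", "ch", "j")


def restore_devoiced_vowel(consonant: str) -> str:
    """Return the syllable with the guessed devoiced vowel put back."""
    vowel = "i" if consonant in I_ONLY else "u"
    return consonant + vowel
```

Applied to the earlier examples, the "sh" in "mashta" becomes "shi" and the final "s" in "des" becomes "su".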

Solution (Russian)

For Russian, everything is simpler: since a devoiced i usually follows a soft consonant, it is enough to distinguish soft consonants from hard ones. To do this, a soft sign ь must be placed after such a consonant (unless the consonant is always soft, like [щ], [ч]).

Extended kana. Two consonants

In extended kana there are a number of symbols that are interpreted as syllables with two consonants in a row. Since the function inserts a vowel between two consonants (treating them as two different syllables, the first of which has a devoiced vowel), it cannot process such syllables correctly.

Solution

Starting from version 2.0.0, which added extended kana support, the function no longer recognizes devoiced vowels by default. To enable this mode, set the guess: true option. The choice is between guessing devoiced vowels and supporting more extended kana.

Transcribe kana

There are fewer problems here, and most of them have already been described above. The function responsible is fromKana().

Particle recognition

Without knowing the position, it is unknown how some kana are read:

  • は - [ha] or [wa] ?
  • へ - [he] or [e] ?
  • を - [wo] or [o] ?

Solution

Separate the words using a space.

However, this makes the function inconvenient to use with copy-pasted text, which normally contains no spaces.
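A sketch of the space-based approach: a token that consists of a single は, へ or を is read as a particle, and everything else is passed through unchanged. The name `mark_particles` is hypothetical, and the real function presumably transcribes the other words as well.

```python
from typing import List

# Particle readings from the list above.
PARTICLE_READINGS = {"は": "wa", "へ": "e", "を": "o"}


def mark_particles(text: str) -> List[str]:
    """Split the text on spaces and replace standalone particles with their reading."""
    return [PARTICLE_READINGS.get(word, word) for word in text.split()]
```

The は inside a longer word such as はな is left alone, because the lookup only matches whole tokens.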

Understanding the syllable n

Since its pronunciation changes depending on its position, its transcription should change accordingly.

Solution (English)

Following the revised Hepburn system:

  • n' - before vowels
  • n - in other cases

Solution (Russian)

Following the Polivanov system:

  • м - before the sounds [b], [p], [m]
  • нъ - before vowels
  • н - in other cases
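Both sets of rules can be sketched as one function that looks at the sound following ん. The function name, the `system` parameter and the convention of passing the next sound as a Latin letter (or an empty string at the end of a word) are assumptions for illustration.

```python
# A set is used on purpose: "" in a set is False, unlike "" in a string.
VOWELS = set("aiueo")


def transcribe_n(next_sound: str, system: str = "hepburn") -> str:
    """Transcribe ん given the first sound of the following syllable.

    `next_sound` is a Latin letter such as "a" or "p", or "" at the end of a word.
    """
    if system == "hepburn":
        return "n'" if next_sound in VOWELS else "n"
    if system == "polivanov":
        if next_sound in {"b", "p", "m"}:
            return "м"
        return "нъ" if next_sound in VOWELS else "н"
    raise ValueError(f"unknown system: {system}")
```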

Recognizing long vowels

It can be difficult to tell whether a sequence of two vowels is pronounced as one long vowel or as two different sounds: おう can mean both [ou] and [oo]; the same goes for えい.

Solution

To give a person who knows the limitations of the algorithm at least a chance to read the word close to the original, I decided to transliterate letter by letter. So, from おう we get ou.
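A letter-by-letter transliteration can be sketched with a plain lookup table. The table below is a tiny hypothetical subset chosen for the example, not the library's data.

```python
# Hypothetical subset of a kana-to-romaji table.
ROMAJI = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "こ": "ko", "せ": "se", "ん": "n",
}


def transliterate(kana: str) -> str:
    """Transliterate each kana on its own, without merging long vowels."""
    return "".join(ROMAJI.get(ch, ch) for ch in kana)
```

Because no vowels are merged, こうこう comes out as "koukou" and せんせい as "sensei", leaving the decision about vowel length to the reader.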

Converting one kana to the other

This function - convertKana() - has only one problem, but it's a serious one.

Recognizing long vowels in hiragana

In hiragana, two identical, consecutive vowels can be read in two ways: both as a long vowel and as two separate vowels. Katakana, on the other hand, marks the long vowel with the special symbol ー. So, in the course of deriving katakana from hiragana, the function must distinguish between a long vowel and two separate vowels.

The same problem arises with the ambiguous readings of おう and えい.

Solution

In order not to exclude the correct reading, the character ー is not used. This, unfortunately, often leads to incorrect spelling.
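One way to sketch such a conversion relies on the Unicode layout: the hiragana characters U+3041 to U+3096 each have a katakana counterpart exactly 0x60 code points higher. The function name is hypothetical, and this is not necessarily how convertKana() is implemented.

```python
HIRAGANA_START = 0x3041  # ぁ
HIRAGANA_END = 0x3096    # ゖ
KATAKANA_OFFSET = 0x60   # distance between the hiragana and katakana blocks


def hiragana_to_katakana(text: str) -> str:
    """Convert hiragana to katakana character by character, never emitting ー."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )
```

With this sketch とうきょう becomes トウキョウ: the reading is preserved, but a spelling with ー is never produced.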