CJK characters: Difference between revisions


Latest revision as of 17:27, 3 November 2024

In internationalization, CJK characters is a collective term for graphemes used in the Chinese, Japanese, and Korean writing systems, which each include Chinese characters. The term is extended to CJKV to include Chữ Nôm, the Chinese-origin logographic script formerly used for the Vietnamese language, or to CJKVZ to also include Sawndip, used to write the Zhuang languages.

Character repertoire

Standard Mandarin Chinese and Standard Cantonese are written almost exclusively in Chinese characters. Over 3,000 characters are required for general literacy, with up to 40,000 characters for reasonably complete coverage. Japanese uses fewer characters—general literacy in Japanese can be expected with 2,136 characters. The use of Chinese characters in Korea is increasingly rare, although idiosyncratic use of Chinese characters in proper names requires knowledge (and therefore availability) of many more characters. Even today, however, South Korean students are taught 1,800 characters.

Other scripts used for these languages, such as bopomofo and the Latin-based pinyin for Chinese, hiragana and katakana for Japanese, and hangul for Korean, are not strictly "CJK characters", although CJK character sets almost invariably include them as necessary for full coverage of the target languages.
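
As a rough illustration of this distinction, the sketch below uses Unicode character names from Python's standard `unicodedata` module to separate Han characters from the phonetic scripts that accompany them. The helper `script_of` and the name-prefix heuristic are assumptions made for this example. A more careful implementation would consult the Unicode Script property, which the standard library does not expose directly.

```python
import unicodedata

# Name prefixes used as a heuristic for script membership (an assumption
# of this sketch, not an official classification).
PREFIXES = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "BOPOMOFO": "Bopomofo",
    "LATIN": "Latin",
}

def script_of(ch: str) -> str:
    """Return a coarse script label for a single character."""
    name = unicodedata.name(ch, "")
    for prefix, script in PREFIXES.items():
        if name.startswith(prefix):
            return script
    return "Other"

# One Han character, then hiragana, katakana, hangul, bopomofo,
# and a pinyin letter with a tone mark.
for ch in "中あア한ㄅā":
    print(ch, script_of(ch))
```

Only the first sample is a "CJK character" in the strict sense. The rest are the companion scripts that CJK character sets usually carry alongside the Han repertoire.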

The sinologist Carl Leban (1971) produced an early survey of CJK encoding systems.

Until the early 20th century, Classical Chinese was the written language of government and scholarship in Vietnam. Popular literature in Vietnamese was written in the chữ Nôm script, consisting of Chinese characters with many characters created locally. Since the 1920s, literature has been recorded in the Latin-based Vietnamese alphabet.^[1]^[2]

Encoding

The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit character encodings, so at least a 16-bit fixed-width encoding or a multi-byte variable-length encoding is required. The 16-bit fixed-width encodings, such as those from Unicode up to and including version 2.0, are now deprecated for two reasons: more characters must be encoded than a 16-bit encoding can accommodate (Unicode 5.0 has some 70,000 Han characters), and the Chinese government requires that software in China support the GB 18030 character set.
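
The size constraints can be checked with a short sketch that measures how many bytes a few sample characters occupy in variable-length encodings. It uses only codecs shipped with Python. The helper name `widths` and the choice of sample characters are assumptions for illustration.

```python
def widths(ch: str) -> dict:
    """Byte length of one character in several encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte-order mark
        "gb18030": len(ch.encode("gb18030")),
    }

samples = [
    "A",            # ASCII letter
    "中",           # Han character inside the 16-bit range (U+4E2D)
    "\U00020000",   # Han character beyond the 16-bit range (U+20000)
]

for ch in samples:
    print(widths(ch))
```

The third sample has a code point above U+FFFF, so a single 16-bit unit cannot hold it. UTF-16 spends two 16-bit units (a surrogate pair) on it, and GB 18030 uses its four-byte form.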

Although CJK encodings have common character sets, the encodings often used to represent them have been developed separately by different East Asian governments and software companies, and are mutually incompatible. Unicode has attempted, with some controversy, to unify the character sets in a process known as Han unification.
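
A small sketch can make the incompatibility concrete: the same Han character is encoded with several regional codecs from Python's standard library, and the resulting bytes are then decoded with the wrong codec. The sample character and the variable names are assumptions chosen for the example.

```python
CODECS = ["big5", "gb2312", "shift_jis", "euc_jp", "euc_kr"]

char = "中"  # present in all five character sets

# Same character, different byte sequence under each encoding.
encoded = {codec: char.encode(codec) for codec in CODECS}
for codec, raw in encoded.items():
    print(f"{codec:10} {raw.hex()}")

# Bytes written as GB 2312 but read as Big5 do not give the character back.
raw = encoded["gb2312"]
right = raw.decode("gb2312")
wrong = raw.decode("big5", errors="replace")
print(right, wrong)
```

Because nothing in the byte stream identifies the encoding, software has to be told which one is in use or has to guess it. Replacing these encodings with one shared encoding was part of the motivation for Unicode.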

CJK character encodings should consist minimally of Han characters plus language-specific phonetic scripts such as pinyin, bopomofo, hiragana, katakana and hangul.^[3]
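
This minimal-coverage requirement can be checked against the legacy codecs bundled with Python, as in the sketch below. The helper `can_encode` and the sample strings are assumptions for illustration. The codec names are those of Python's standard encodings.

```python
def can_encode(text: str, codec: str) -> bool:
    """True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Each sample pairs a Han character with a phonetic script of the
# language the encoding was designed for.
SAMPLES = {
    "gb2312": "中ā",        # Han + pinyin letter with tone mark
    "big5": "中ㄅ",         # Han + bopomofo
    "shift_jis": "中あア",  # Han + hiragana + katakana
    "euc_kr": "中한",       # Han + hangul
}

for codec, text in SAMPLES.items():
    print(codec, text, can_encode(text, codec))

# A script from another language is typically missing.
print("shift_jis", "한", can_encode("한", "shift_jis"))
```

Each encoding covers the phonetic script of its own target language, which is why text mixing several of these languages was hard to represent before Unicode.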

CJK character encodings include:

Big5 (the most prevalent encoding before Unicode was implemented)
CCCII
CNS 11643 (official standard of Republic of China)
EUC-JP
EUC-KR
GB 2312 (subset and predecessor of GB 18030)
GB 18030 (mandated standard in the People's Republic of China)
Giga Character Set (GCS)
ISO 2022-JP
KS C 5861
Shift-JIS
TRON
Unicode
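
Several entries in this list encode the same Japanese character set in structurally different ways. The sketch below, which assumes the sample string `"aあ"` purely for illustration, compares three of them using Python's built-in codecs: ISO 2022-JP switches character sets with escape sequences and stays within 7-bit bytes, while EUC-JP and Shift-JIS mark double-byte characters with high-bit bytes.

```python
text = "aあ"  # one ASCII letter followed by one hiragana letter

iso = text.encode("iso2022_jp")
euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")

for label, raw in [("ISO 2022-JP", iso), ("EUC-JP", euc), ("Shift-JIS", sjis)]:
    # Flag whether the encoded form is safe for 7-bit channels.
    seven_bit = all(byte < 0x80 for byte in raw)
    print(f"{label:12} {raw!r:24} 7-bit only: {seven_bit}")
```

The escape sequences make ISO 2022-JP stateful: a decoder must track which character set is currently selected, whereas the other two encodings can classify a byte from its value and immediate neighbours.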

The CJK character sets take up the bulk of the assigned Unicode code space. There is much controversy among Japanese experts on Chinese characters about the desirability and technical merit of the Han unification process used to map multiple Chinese and Japanese character sets into a single set of unified characters.^[citation needed]
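
The claim about code space can be estimated with Python's `unicodedata` module, as in the sketch below. The counts depend on the Unicode version bundled with the interpreter, and the decision to exclude private-use and surrogate code points from the "assigned" total is an assumption of this example.

```python
import sys
import unicodedata

han = hangul = assigned = 0
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    # Skip unassigned (Cn), private-use (Co) and surrogate (Cs) code points.
    if unicodedata.category(ch) in ("Cn", "Co", "Cs"):
        continue
    assigned += 1
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        han += 1
    elif name.startswith("HANGUL SYLLABLE"):
        hangul += 1

print("Unicode version:", unicodedata.unidata_version)
print("assigned characters:", assigned)
print(f"unified Han ideographs: {han} ({han / assigned:.0%})")
print(f"precomposed hangul syllables: {hangul} ({hangul / assigned:.0%})")
```

The loop visits every code point, so it takes a moment to run. Compatibility ideographs, radicals and kana are not counted here, so the figure understates the total share of CJK-related characters.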

All three languages can be written both left-to-right and top-to-bottom (right-to-left and top-to-bottom in ancient documents), but are usually considered left-to-right scripts when discussing encoding issues.

Legal status

Libraries cooperated on encoding standards for JACKPHY characters in the early 1980s. According to Ken Lunde, the abbreviation "CJK" was a registered trademark of Research Libraries Group^[4] (which merged with OCLC in 2006). The trademark owned by OCLC between 1987 and 2009 has now expired.^[5]

References

^ Coulmas (1991), pp. 113–115.
^ DeFrancis (1977).
^ This article is based on material taken from CJK at the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
^ Ken Lunde, 1996
^ Justia listing

Works cited

Coulmas, Florian (1991). The writing systems of the world. Blackwell. ISBN 978-0-631-18028-9.
DeFrancis, John (1977). Colonialism and language policy in Viet Nam. The Hague: Mouton. ISBN 978-90-279-7643-7.

Sources

DeFrancis, John. The Chinese Language: Fact and Fantasy. Honolulu: University of Hawaii Press, 1990. ISBN 0-8248-1068-6.
Hannas, William C. Asia's Orthographic Dilemma. Honolulu: University of Hawaii Press, 1997. ISBN 0-8248-1892-X (paperback); ISBN 0-8248-1842-3 (hardcover).
Lemberg, Werner. The CJK package for LaTeX2ε: Multilingual support beyond babel. TUGboat, Volume 18 (1997), No. 3, Proceedings of the 1997 Annual Meeting.
Leban, Carl. Automated Orthographic Systems for East Asian Languages (Chinese, Japanese, Korean), State-of-the-art Report, Prepared for the Board of Directors, Association for Asian Studies. 1971.
Lunde, Ken. CJKV Information Processing. Sebastopol, Calif.: O'Reilly & Associates, 1998. ISBN 1-56592-224-7.

External links


@@ Line 1: / Line 1: @@
+{{Short description|Logographs in shared East Asian written tradition}}
-{{Refimprove|date=March 2008}}{{citation style}}'''CJK''' is a collective term for [[Chinese language|Chinese]], [[Japanese language|Japanese]], and [[Korean language|Korean]], which constitute the main [[East Asian languages]]. The term is used in the field of [[software]] and communications [[internationalization]].
+{{About||help with CJK character display|Help:Multilingual support (East Asian)|selfref=true}}
+[[File:The old man is 72 years old final.png|thumb|342x342px|Translation of "That old man is 72 years old" in [[Vietnamese language|Vietnamese]], [[Cantonese]], [[Mandarin Chinese|Mandarin]] (in [[Simplified Chinese characters|simplified]] and [[Traditional Chinese characters|traditional characters]]), [[Japanese language|Japanese]], and [[Korean language|Korean]].]]
+In [[internationalization and localization|internationalization]], '''CJK characters''' is a collective term for [[graphemes]] used in the [[Written Chinese|Chinese]], [[Japanese writing system|Japanese]], and [[Korean writing system]]s, which each include [[Chinese characters]]. It can also go by '''CJKV''' to include [[Chữ Nôm]], the Chinese-origin [[logogram|logographic]] script formerly used for the [[Vietnamese language]], or '''CJKVZ''' to also include [[Sawndip]], used to write the [[Zhuang languages]].
+== Character repertoire ==
-The term '''CJKV''' means CJK plus [[Vietnamese language|Vietnamese]], which in the past used [[Hán tự]]/[[Chinese characters]] and [[Chu Nom|Chữ Nôm]] prior to adopting [[Vietnamese alphabet|Quốc Ngữ]].
+Standard Mandarin Chinese and Standard Cantonese are written almost exclusively in Chinese characters. Over 3,000 characters are required for general [[literacy]], with up to 40,000 characters for reasonably complete coverage. Japanese uses fewer characters—general literacy in Japanese can be expected with 2,136 characters. The use of Chinese characters in Korea is increasingly rare, although idiosyncratic use of Chinese characters in proper names requires knowledge (and therefore availability) of many more characters. Even today, however, South Korean students are taught [[Basic Hanja for educational use|1,800 characters]].
+Other scripts used for these languages, such as [[bopomofo]] and the [[Latin script|Latin]]-based [[pinyin]] for Chinese, [[hiragana]] and [[katakana]] for Japanese, and [[hangul]] for Korean, are not strictly "CJK characters", although CJK character sets almost invariably include them as necessary for full coverage of the target languages.
-These languages all have a shared characteristic: Their [[writing system]]s all completely or partly use [[Chinese character]]s — [[Chinese character|hànzì]] in Chinese,  [[kanji]] in Japanese, and [[hanja]] in Korean.  Chinese is written in Chinese characters only and requires c. 4,000 characters for general literacy although there are up to 40,000 characters for reasonably complete coverage.  Japanese uses fewer characters — general literacy in Japan can be expected with about 2,000 characters — together with two [[Syllabary|syllabaries]]. The use of Chinese characters in Korea is becoming increasingly rare altogether, although idiosyncratic use of Chinese characters in proper names requires knowledge (and therefore availability) of many more characters. The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit [[character encoding]]s, requiring at least a 16-bit fixed width encoding or multi-byte variable-length encodings. The 16-bit fixed width encodings, such as [[Unicode]] up to and including version 2.0, are now deprecated due to the requirement to encode more characters than a 16-bit encoding can accommodate — Unicode 5.0 has some 90,000 Han characters — and the requirement by the Chinese government that software in China support the [[GB18030]] character set.
+The [[Sinology|sinologist]] Carl Leban (1971) produced an early survey of CJK encoding systems.
+Until the early 20th century, [[Classical Chinese]] was the written language of government and scholarship in Vietnam. Popular literature in [[Vietnamese language|Vietnamese]] was written in the {{lang|vi|[[chữ Nôm]]}} script, consisting of Chinese characters with many characters created locally. Since the 1920s, the script since then used for recording literature has been the Latin-based [[Vietnamese alphabet]].{{sfnp|Coulmas|1991|pp=113–115}}{{sfnp|DeFrancis|1977}}
+== Encoding ==
+The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit [[character encoding]]s, requiring at least a 16-bit fixed width encoding or multi-byte variable-length encodings. The 16-bit fixed width encodings, such as those from [[Unicode]] up to and including version 2.0, are now deprecated due to the requirement to encode more characters than a 16-bit encoding can accommodate—Unicode 5.0 has some 70,000 Han characters—and the requirement by the Chinese government that software in China support the [[GB 18030]] character set.
 Although CJK encodings have common character sets, the encodings often used to represent them have been developed separately by different East Asian governments and software companies, and are mutually incompatible. [[Unicode]] has attempted, with some controversy, to unify the character sets in a process known as [[Han unification]].
-CJK character encodings should consist minimally of Han characters plus language-specific phonetic scripts such as [[pinyin]], [[bopomofo]], [[hiragana]], [[katakana]], and [[hangul]].
+CJK character encodings should consist minimally of Han characters plus language-specific phonetic scripts such as [[pinyin]], [[bopomofo]], hiragana, katakana and hangul.<ref>{{FOLDOC|CJK}}</ref>
 CJK character encodings include:
+{{div col|colwidth=40em}}
-*[[Big5]]
+* [[Big5]] (the most prevalent encoding before Unicode was implemented)
-*[[EUC-JP]]
+* [[Chinese Character Code for Information Interchange|CCCII]]
-*[[EUC-KR]]
-*[[GB18030]] (the mandated standard in the [[People's Republic of China]])
+* [[CNS 11643]] (official standard of [[Republic of China (Taiwan)|Republic of China]])
-*[[GB2312]]
+* [[EUC-JP]]
-*[[ISO 2022|ISO 2022-JP]]
+* [[EUC-KR]]
+* [[GB 2312]] (subset and predecessor of GB 18030)
-*[[KS C 5861]]
+* [[GB 18030]] (mandated standard in the [[People's Republic of China]])
-*[[Shift-JIS]]
+* Giga Character Set (GCS)
-*[[Unicode]]
+* [[ISO 2022|ISO 2022-JP]]
+* KS C 5861
+* [[Shift-JIS]]
+* [[TRON (encoding)|TRON]]
+* [[Unicode]]
+{{div col end}}
+The CJK character sets take up the bulk of the assigned [[Unicode]] code space. There is much controversy among Japanese experts of Chinese characters about the desirability and technical merit of the [[Han unification]] process used to map multiple Chinese and Japanese character sets into a single set of unified characters.{{Citation needed|date=March 2011}}
+All three languages can be written both [[Horizontal and vertical writing in East Asian scripts|left-to-right and top-to-bottom]] (right-to-left and top-to-bottom in ancient documents), but are usually considered left-to-right scripts when discussing encoding issues.
+== Legal status ==
+Libraries cooperated on encoding standards for [[JACKPHY]] characters in the early 1980s. According to [[Ken Lunde]], the abbreviation "CJK" was a registered [[trademark]] of [[Research Libraries Group]]<ref name=":0">[http://www.csse.monash.edu.au/~jwb/cjk.inf Ken Lunde, 1996]</ref> (which merged with [[OCLC]] in 2006). The trademark owned by OCLC between 1987 and 2009 has now expired.<ref>[http://trademarks.justia.com/736/38/cjk-73638777.html Justia listing]</ref>
+== See also ==
-The CJK character sets take up the bulk of the [[Unicode]] code space. There is much controversy among Japanese experts of Chinese characters about the desirability and technical merit of the Han unification process used to map multiple Chinese and Japanese characters sets into a single set of unified characters.
+* [[Chinese character description languages]]
+* [[Chinese character encoding]]
+* [[Chinese input methods for computers]]
+* [[CJK Compatibility Ideographs]]
+* [[Chinese character strokes]]
+* [[CJK Unified Ideographs]]
+* [[Complex Text Layout languages]] (CTL)
+* [[Input method editor]]
+* [[Japanese language and computers]]
+* [[Korean language and computers]]
+* [[List of CJK fonts]]
+* [[Sinoxenic]]
+* [[Variable-width encoding]]
+* [[Vietnamese language and computers]]
+== References ==
-Chinese and Japanese can be written both [[Horizontal and vertical writing in East Asian scripts|left-to-right and top-to-bottom]], but is usually considered a left-to-right script when discussing encoding issues.
+{{Reflist}}
-==See also==
+===Works cited===
+* {{cite book |last=Coulmas |first=Florian |title=The writing systems of the world |url=https://archive.org/details/writingsystemsof0000coul |publisher=Blackwell |year=1991 |isbn=978-0-631-18028-9 |url-access=registration}}
-*[[Chinese character encoding]]
+* {{cite book |last1=DeFrancis |first1=John |title=Colonialism and language policy in Viet Nam |date=1977 |publisher=Mouton |location=The Hague |isbn=978-90-279-7643-7}}
-*[[Han unification]]
-*[[Chinese input methods for computers]]
-*[[Japanese language and computers]]
-*[[Korean language and computers]]
-*[[Input method editor]]
-*[[Variable-width encoding]]
-*[[Complex Text Layout languages]] (CTL)
-* [[CJK strokes]]
-*[[Horizontal and vertical writing in East Asian scripts]]
-*[[Graphics tablet]]
-==References==
+==Sources==
-{{FOLDOC}}
+{{refbegin}}
-*[[John DeFrancis|DeFrancis, John]]. ''[[The Chinese Language: Fact and Fantasy]]''. Honolulu: University of Hawaii Press, 1990. ISBN 0-8248-1068-6.
+* [[John DeFrancis|DeFrancis, John]]. ''[[The Chinese Language: Fact and Fantasy]]''. Honolulu: University of Hawaii Press, 1990. {{ISBN|0-8248-1068-6}}.
-*Hannas, William C. ''Asia's Orthographic Dilemma''. Honolulu: University of Hawaii Press, 1997. ISBN 0-8248-1892-X (paperback); ISBN 0-8248-1842-3 (hardcover).
+* Hannas, William C. ''Asia's Orthographic Dilemma''. Honolulu: University of Hawaii Press, 1997. {{ISBN|0-8248-1892-X}} (paperback); {{ISBN|0-8248-1842-3}} (hardcover).
-*[[Werner Lemberg|Lemberg, Werner]]: The CJK package for LATEX2ε—Multilingual support beyond babel. TUGboat, Volume 18 (1997), No. 3—Proceedings of the 1997 Annual Meeting
+* Lemberg, Werner: The CJK package for LATEX2ε—Multilingual support beyond babel. TUGboat, Volume 18 (1997), No. 3—Proceedings of the 1997 Annual Meeting.
+* Leban, Carl. ''[https://books.google.com/books?id=ePLMGwAACAAJ Automated Orthographic Systems for East Asian Languages (Chinese, Japanese, Korean)]'', State-of-the-art Report, Prepared for the Board of Directors, Association for Asian Studies. 1971.
-*[[Ken Lunde|Lunde, Ken]]. ''CJKV Information Processing''.  Sebastopol, Calif.: O'Reilly & Associates, 1998.  ISBN 1-56592-224-7.
+* [[Ken Lunde|Lunde, Ken]]. ''CJKV Information Processing''. Sebastopol, Calif.: O'Reilly & Associates, 1998. {{ISBN|1-56592-224-7}}.
+{{refend}}
-==External links==
+== External links ==
-*[http://www.linfo.org/cjkv.html CJKV: A Brief Introduction]
+* [http://www.linfo.org/cjkv.html CJKV: A Brief Introduction]
-*[http://tug.org/TUGboat/Articles/tb18-3/cjkintro600.pdf: Lemberg CJK article from above, TUGboat18-3]
+* [http://tug.org/TUGboat/Articles/tb18-3/cjkintro600.pdf Lemberg CJK article from above, TUGboat18-3]
+* [http://www.wenlin.com/cdl/#jarg On "CJK Unified Ideograph"], from Wenlin.com
+* [https://web.archive.org/web/20130624130411/http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/unicode-cjkv-character-set-rationalization.html FGA: Unicode CJKV character set rationalization]
+{{CJK ideographs in Unicode}}
-[[Category:Chinese language]]
-[[Category:Japanese language]]
-[[Category:Korean language]]
-[[Category:Encodings of Asian Languages]]
+[[Category:Encodings of Asian languages]]
-[[de:CJK]]
+[[Category:Languages of East Asia]]
-[[fr:Chinois, japonais et coréen]]
+[[Category:Natural language and computing]]
-[[ja:CJK]]
+[[Category:Chinese-language computing]]
-[[ko:CJK]]
+[[Category:Japanese-language computing]]
-[[pl:CJK]]
+[[Category:Korean-language computing]]
-[[sv:CJK]]
+[[Category:Writing systems using Chinese characters]]
-[[vi:CJV]]
-[[zh:CJKV]]
+[[ja:CJKV]]