Lankan Web: What is Unicode? (මොනවද මේ යුනික් කේත කියන්නෙ?)

Call it "unique codes" or "unique characters" and the term sounds a little unfamiliar, but say "Unicode" and almost everyone has heard of it.

So what exactly is Unicode? What does it do? Have you ever stopped to think about it?

Maybe you already know, and maybe you don't. Right?

Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium. As of March 2020, the latest version, Unicode 13.0, contains 143,924 characters (143,696 graphic characters, 163 format characters, and 65 control characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and the two are code-for-code identical.
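To make the idea of a code point and the graphic/format/control split a little more concrete, here is a minimal Python sketch using the standard `unicodedata` module. The four sample characters are my own choice for illustration, and the results depend on the Unicode version bundled with your Python build.

```python
import unicodedata

# Sample characters chosen for illustration (not taken from the article).
samples = {
    "A": "graphic (uppercase Latin letter)",
    "\u0d9a": "graphic (a Sinhala letter)",
    "\u200d": "format (ZERO WIDTH JOINER)",
    "\n": "control (LINE FEED)",
}

for ch, note in samples.items():
    code_point = ord(ch)                     # character -> code point number
    category = unicodedata.category(ch)      # two-letter general category
    print(f"U+{code_point:04X}  category={category}  {note}")

# Lu/Lo are letters (graphic), Cf is a format character, Cc is a control character.
categories = [unicodedata.category(ch) for ch in samples]
```

`ord()` and `chr()` convert between a character and its code point, so `chr(0x0D9A)` gives back the Sinhala letter used above.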

The Unicode Standard consists of a set of code charts for visual reference, an encoding method and a set of standard character encodings, a set of reference data files, and a number of related items, such as character properties and rules for normalization, decomposition, collation, rendering, and bidirectional text display order (for correctly displaying text that contains both right-to-left and left-to-right scripts).
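Normalization and decomposition are easier to see with an example. The sketch below, which assumes Python's standard `unicodedata` module, shows two different code point sequences for the same visible letter and how the NFD and NFC normalization forms convert between them. It also looks up the bidirectional class of a Hebrew and a Latin letter. The sample characters are illustrative choices only.

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both render as the same letter, but the code point sequences differ.
same_before = composed == decomposed

nfd = unicodedata.normalize("NFD", composed)    # canonical decomposition
nfc = unicodedata.normalize("NFC", decomposed)  # canonical composition

print(same_before)                   # False
print(nfd == decomposed)             # True
print(nfc == composed)               # True

# Bidirectional class: 'R' for a right-to-left letter, 'L' for left-to-right.
hebrew_class = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
latin_class = unicodedata.bidirectional("a")
print(hebrew_class, latin_class)
```

Comparing strings only after normalizing both to the same form is a common way to avoid treating visually identical text as different.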

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework.

Unicode can be implemented by different character encodings. The Unicode Standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used are UTF-8, UTF-16, and UCS-2 (a predecessor of UTF-16 that does not fully support Unicode). GB18030 is standardized in China; it is not an official Unicode standard, but it implements Unicode fully.

UTF-8, the dominant encoding on the World Wide Web (used by more than 94% of websites as of November 2019), uses one byte for the first 128 code points and up to 4 bytes for other characters. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also valid UTF-8 text.
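The variable length of UTF-8 can be checked directly. This is a small Python sketch with sample characters of my own choosing (one from each byte-length class); it also confirms that a pure ASCII string produces the same bytes under both encodings.

```python
# One illustrative character per UTF-8 length class.
samples = [
    "A",            # U+0041, ASCII
    "\u00e9",       # U+00E9, Latin small letter e with acute
    "\u0d9a",       # U+0D9A, a Sinhala letter
    "\U0001F600",   # U+1F600, an emoji outside the BMP
]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Any ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)   # True
```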

UCS-2 uses two bytes (16 bits) for each character, but it can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). With 1,112,064 possible Unicode code points for characters across 17 planes, and more than 143,000 code points defined as of version 13.0, UCS-2 can represent less than half of the encoded Unicode characters.
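The numbers in that paragraph follow from simple arithmetic, sketched here in Python. The constant names are mine; the subtraction removes the 2,048 surrogate code points (U+D800 to U+DFFF), which are reserved and never stand for characters.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16                 # 65,536 (what UCS-2 can address)
SURROGATES = 0xDFFF - 0xD800 + 1                # 2,048 reserved code points

total_code_points = PLANES * CODE_POINTS_PER_PLANE   # 1,114,112
usable_code_points = total_code_points - SURROGATES  # 1,112,064

# Upper bound on the share of Unicode 13.0's 143,924 characters
# that could fit in a single 65,536-slot plane.
bmp_share_upper_bound = CODE_POINTS_PER_PLANE / 143_924

print(total_code_points, usable_code_points)
print(f"{bmp_share_upper_bound:.1%}")
```

The highest code point, U+10FFFF, is simply `total_code_points - 1`.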

UCS-2 is therefore outdated, even though it is still widely used in software. UTF-16 extends UCS-2: it uses the same 16-bit encoding as UCS-2 for the Basic Multilingual Plane and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800 to U+DFFF, UCS-2 text is valid UTF-16 text.
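The 4-byte form that UTF-16 uses outside the BMP is a pair of 16-bit units taken from that reserved range, known as a surrogate pair. Below is a minimal sketch of the calculation; `to_surrogates` is a hypothetical helper written for this example, and the result is compared against Python's built-in `utf-16-be` encoder.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")          # D83D DE00

# Python's own encoder yields the same two 16-bit units (big-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
bmp_encoded = "A".encode("utf-16-be")   # a BMP character stays at 2 bytes
print(encoded.hex(), bmp_encoded.hex())
```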

UTF-32 (also known as UCS-4) uses four bytes for each character. Like UCS-2, the number of bytes per character is fixed, which makes character indexing easy; unlike UCS-2, however, UTF-32 can encode all Unicode code points.

However, because each character uses four bytes, UTF-32 takes considerably more space than other encodings and is not widely used.
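The space trade-off can be measured with a short Python sketch. `encoded_sizes` is a hypothetical helper for this example, and the little-endian codec names are used only so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # -le: no byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }

for sample in ["Hello", "\u0d9a", "\U0001F600"]:
    print(repr(sample), encoded_sizes(sample))

# Fixed width makes indexing trivial: code point i sits at bytes 4*i .. 4*i+3.
text = "A\U0001F600B"
raw = text.encode("utf-32-le")
third = raw[4 * 2 : 4 * 2 + 4].decode("utf-32-le")
print(third)   # B
```

For plain ASCII text, UTF-32 is four times the size of UTF-8, which is a large part of why it is rarely used for storage or transmission.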
