Tutorial about character handling in Ada

This is unfinished!!!

Character sets

A "character set" is a collection of characters, not their representation.

Latin-1 (character set)
This is the character set used for Western European languages and in the Americas.
It is also identical to the first 16#100# (256) code-points of Unicode.
See http://en.wikipedia.org/wiki/ISO/IEC_8859-1.
UCS-2 (character set)
This is the subset consisting of the first 16#10000# (65536) code-points of full Unicode.
It differs from UTF-16.
UCS-2 does not use 16#D800# .. 16#DFFF#, the values reserved for "surrogate pairs".
UCS-4
This is full Unicode.
The Unicode standard defines it as 21-bit.
ISO/IEC 10646 also defined it as 31-bit.
See http://en.wikipedia.org/wiki/Universal_Character_Set and http://unicode.org/.
JIS code
This is the national character set used for Japanese.
I don't know why, but GNAT supports it (although the support is useless in practice).

Encodings

An "encoding" is a representation of a character set.
A variable in a programming language holds an encoded string.

ASCII (encoding)
In this document, it means ISO/IEC 646.
The (encoded?) 7-bit string. Each element is in 16#00# .. 16#7F#.
(Strictly speaking, ASCII is not the same as ISO/IEC 646. Please excuse the simplification.)
See http://en.wikipedia.org/wiki/ISO/IEC_646.
Latin-1 (encoding)
The (encoded?) 8-bit string. Each element represents one character of Latin-1 (character set).
UCS-2 (encoding)
The (encoded?) 16-bit string. Each element represents one character of UCS-2 (character set).
This format is not commonly used, but Wide_String of Ada is defined this way.
UTF-8
The encoded 8-bit string.
One code-point of Unicode is encoded as one or more bytes in UTF-8.
The maximum length of one code-point is 4 bytes (21-bit Unicode) or 6 bytes (31-bit Unicode).
UTF-16
The encoded 16-bit string.
It uses 16#D800# .. 16#DFFF# for surrogate pairs.
16#D800# .. 16#DBFF# is the first half.
16#DC00# .. 16#DFFF# is the second half.
A Unicode code-point that does not fit in 16 bits is split into a first half and a second half in UTF-16.
UTF-32
The (encoded?) 32-bit string. Each element represents one character of UCS-4.
Shift-JIS
This is an encoding of JIS code into an 8-bit string.
It's used in the Japanese version of Windows.
EUC-JP
This is another encoding of JIS code into an 8-bit string.
It was used in older Japanese versions of UNIX. (Recent systems use UTF-8.)
local encoding
This is not the name of any specific encoding.
In this document, it's an alias for whichever encoding is selected by the user's operating system settings.
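
The difference between UTF-8 and UTF-16 can be seen with a small sketch. It assumes an Ada 2012 compiler, and it uses U+1D11E (a code-point that does not fit in 16 bits) purely as an illustrative sample. The names Show_Encodings and Conv are made up for this example.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Integer_Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Encodings is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- One code-point above 16#FFFF# (illustrative sample: U+1D11E).
   Text : constant Wide_Wide_String :=
      (1 => Wide_Wide_Character'Val (16#1D11E#));

   -- The result type selects which overloaded Encode is called.
   As_UTF_8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
      Conv.Encode (Text);
   As_UTF_16 : constant Ada.Strings.UTF_Encoding.UTF_16_Wide_String :=
      Conv.Encode (Text);
begin
   -- UTF-8 needs 4 elements (bytes) for this code-point.
   Put ("UTF-8 :");
   for C of As_UTF_8 loop
      Ada.Integer_Text_IO.Put (Character'Pos (C), Width => 10, Base => 16);
   end loop;
   New_Line;

   -- UTF-16 needs 2 elements: a first half and a second half.
   Put ("UTF-16:");
   for C of As_UTF_16 loop
      Ada.Integer_Text_IO.Put (Wide_Character'Pos (C), Width => 10, Base => 16);
   end loop;
   New_Line;
end Show_Encodings;
```

The UTF-8 line shows 16#F0# 16#9D# 16#84# 16#9E#, and the UTF-16 line shows 16#D834# (first half) and 16#DD1E# (second half).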

Ada types

Character/String
8-bit character/string types.
These types are defined to hold Latin-1 (encoding).
See http://www.adaic.org/resources/add_content/standards/05rm/html/RM-3-5-2.html.
In addition, the package Ada.Strings.UTF_Encoding in Ada 2012 stores UTF-8 in the String type.
See http://ada-auth.org/standards/12rm/html/RM-A-4-11.html.
Wide_Character/Wide_String
16-bit character/string types.
These types are defined to hold UCS-2 (encoding).
In addition, the package Ada.Strings.UTF_Encoding in Ada 2012 stores UTF-16 in the Wide_String type.
Wide_Wide_Character/Wide_Wide_String
32-bit character/string types.
These types are defined to hold UTF-32.
Interfaces.C.char
This is the character type of the C language.
It holds neither Latin-1 (encoding) nor UTF-8, but local encoding (implementation-defined).
Interfaces.C.wchar_t
This is the wide character type of the C language.
It holds neither UCS-2 (encoding), UTF-16, nor UTF-32, but the wide version of local encoding (implementation-defined).
Interfaces.C.char16_t
This is the 16-bit character type of the C language.
It's just UTF-16.
Interfaces.C.char32_t
This is the 32-bit character type of the C language.
It's just UTF-32.
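
As a quick check of the three Ada character types, the following sketch prints the position of the last value of each type. It assumes that Integer is at least 32 bits wide, as it is in GNAT. The procedure name is made up for this example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Character_Ranges is
begin
   -- Character covers Latin-1: 0 .. 255.
   Put_Line ("Character           :"
      & Integer'Image (Character'Pos (Character'Last)));
   -- Wide_Character covers 0 .. 65535.
   Put_Line ("Wide_Character      :"
      & Integer'Image (Wide_Character'Pos (Wide_Character'Last)));
   -- Wide_Wide_Character covers the 31-bit range 0 .. 16#7FFFFFFF#.
   Put_Line ("Wide_Wide_Character :"
      & Integer'Image (Wide_Wide_Character'Pos (Wide_Wide_Character'Last)));
end Show_Character_Ranges;
```

The last line prints 2147483647, so Wide_Wide_Character follows the 31-bit definition of ISO/IEC 10646, although it is stored as 32 bits.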

Writing literals in Ada

ASCII character

   ASCII_Character_1 : constant Character := 'A';
   ASCII_Character_2 : constant Character := ASCII.HT;

Latin-1 character

   Latin_1_Character_1 : constant Character := 'A';
   Latin_1_Character_2 : constant Character := Ada.Characters.Latin_1.UC_A_Grave;

Using GNAT

Save your source code as Latin-1.

   pragma Wide_Character_Encoding (UPPER);
   Latin_1_Character_3 : constant Character := 'À';

Save your source code as UTF-8.

   pragma Wide_Character_Encoding (UTF8);
   Latin_1_Character_3 : constant Character := 'À';

UCS-2 (encoding) or UTF-16 character

   UTF_16_Character_1 : constant Wide_Character := 'A';

Using GNAT

Save your source code as UTF-8.

   pragma Wide_Character_Encoding (UTF8);
   UTF_16_Character_1 : constant Wide_Character := 'Ǡ';

UTF-32 character

   UTF_32_Character_1 : constant Wide_Wide_Character := 'A';

Using GNAT

Save your source code as UTF-8.

   pragma Wide_Character_Encoding (UTF8);
   UTF_32_Character_1 : constant Wide_Wide_Character := 'Ǡ';

Shift-JIS/EUC-JP character

The Ada standard does not support these encodings.

Note about JIS code literals in GNAT

Do NOT use pragma Wide_Character_Encoding (SHIFT_JIS); or pragma Wide_Character_Encoding (EUC);.
These options are useless, because with these pragmas string literals are converted to a Wide_String containing raw JIS code values, and raw JIS code values are not actually used anywhere. For example, the A-version of the Windows API requires Shift-JIS, not raw JIS code values. Mail applications require ISO-2022-JP (which is yet another encoding), not raw JIS code values.
So you have to save the source code as UTF-8 and convert strings at run time. (Or save the source code as Shift-JIS or EUC-JP and write the encoded literals directly into a String without the pragma.)

Save your source code as UTF-8.

   pragma Wide_Character_Encoding (UTF8);
   UTF_16_Character_1 : constant Wide_Character := 'あ';

Then convert it at run time with libiconv, wcstombs, or the WideCharToMultiByte API.

(Or save your source code as Shift-JIS.

   Shift_JIS_Character_1 : constant String := "あ";

Please note that the standard library (and the GNAT runtime) cannot handle a multi-byte encoded String such as Shift-JIS, EUC-JP, or UTF-8. You have to write almost all string operations yourself.)

Handling real letters of Unicode

This section may be confusing.
You should know that one code-point of Unicode is not necessarily one real letter on the display.
In some cases, several code-points together represent one real letter.
(In this section, "one real letter" means one glyph as you see it on the screen.)

Composed character

TBD...

Variation selector

TBD...

One real letter

You should stop thinking of one Wide_Wide_Character as "one character".
One real letter is composed of the following parts, in this order:

   1. one code-point having combining class 0
   2. (optional) one or more code-points having combining class >= 1
   3. (optional) one variation selector

Example: TBD...

You might think of normalizing a Unicode string to NFC to get rid of combining sequences. But NFC does not help here, because a precomposed character is not defined for every combining sequence. (One reason: code-points having combining class >= 1 can be chained without limit.) So some combining sequences may remain after NFC.
In addition, NFC replaces some compatibility characters (for example, CJK compatibility ideographs). This is undesirable for some languages. You should not apply NFC without careful thought.
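
The composition rule above can be sketched as follows. This is a rough simplification, not a complete implementation: it assumes that only U+0300 .. U+036F (combining diacritical marks) and U+FE00 .. U+FE0F (variation selectors) continue a real letter, whereas a real program needs the combining classes from the Unicode Character Database. The names Continues_Letter and Real_Letter_Count are made up for this example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Count_Real_Letters is

   -- ASSUMPTION: only these two ranges are treated as continuations of
   -- the preceding code-point. The real rule needs Unicode data tables.
   function Continues_Letter (C : Wide_Wide_Character) return Boolean is
      Code : constant Natural := Wide_Wide_Character'Pos (C);
   begin
      return Code in 16#0300# .. 16#036F#
         or else Code in 16#FE00# .. 16#FE0F#;
   end Continues_Letter;

   -- Counts the code-points that start a new real letter.
   function Real_Letter_Count (Text : Wide_Wide_String) return Natural is
      Count : Natural := 0;
   begin
      for I in Text'Range loop
         if I = Text'First or else not Continues_Letter (Text (I)) then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Real_Letter_Count;

   -- 'e', U+0301 (combining acute accent), 'a'
   Sample : constant Wide_Wide_String :=
      ('e', Wide_Wide_Character'Val (16#0301#), 'a');
begin
   Put_Line ("code-points :" & Natural'Image (Sample'Length));
   Put_Line ("real letters:" & Natural'Image (Real_Letter_Count (Sample)));
end Count_Real_Letters;
```

For this sample, the string has 3 code-points but the sketch counts 2 real letters.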

Iterate each real letter of Wide_Wide_String

TBD...

Iterate each real letter of String containing UTF-8

TBD...

Handling local encoding

On UNIX

In this document, "local encoding" means the encoding selected by the user's operating system settings.
The C runtime uses it too.
Therefore local encoding is handled in the same way as in the C language.
First, set the locale of the C runtime to the user's setting, so that the other functions of the C runtime work with it.

declare
   -- Note: the value of LC_ALL is platform-dependent.
   -- It is 0 on BSD and Mac OS X, but 6 on Linux (glibc).
   LC_ALL : constant Interfaces.C.int := 0;
   function setlocale (
      category : Interfaces.C.int;
      locale : access constant Interfaces.C.char)
      return access constant Interfaces.C.char;
   pragma Import (C, setlocale);
   -- An empty C string ("") selects the user's setting.
   Empty : aliased Interfaces.C.char_array := (0 => Interfaces.C.nul);
   Previous_Locale : access constant Interfaces.C.char;
begin
   Previous_Locale := setlocale (LC_ALL, Empty (0)'Access);
end;

This is the same as the following C code:

    char const *previous_locale = setlocale (LC_ALL, "");

Iterate over each multi-byte character of an Interfaces.C.char_array containing local encoding

Import mblen to get the length of one multi-byte character.

   function mblen (
      s : access constant Interfaces.C.char;
      n : Interfaces.C.size_t)
      return Interfaces.C.int;
   pragma Import (C, mblen);

Then, iterate.

declare
   Text : Interfaces.C.char_array := ...string containing local encoding...;
   I : Interfaces.C.size_t := Text'First;
begin
   while I <= Text'Last loop
      declare
         Result : constant Interfaces.C.int :=
            mblen (Text (I)'Access, Text'Last - I + 1);
         -- mblen returns -1 for an invalid sequence and 0 for nul.
         -- Treat both as one byte, so that the loop always advances.
         Length : constant Interfaces.C.size_t :=
            Interfaces.C.size_t (Interfaces.C.int'Max (Result, 1));
         One_Multi_Byte_Character : Interfaces.C.char_array
            renames Text (I .. I + Length - 1);
      begin
         ...
         ... -- use One_Multi_Byte_Character
         ...
         I := I + Length;
      end;
   end loop;
end;

Convert between local encoding and string types of Ada

TBD...

On Windows

Iterate over each multi-byte character of an Interfaces.C.char_array containing local encoding

Import IsDBCSLeadByte to get the length of one multi-byte character.

   function IsDBCSLeadByte (
      TestChar : Interfaces.C.char) -- BYTE
      return Interfaces.C.int; -- BOOL
   pragma Import (Stdcall, IsDBCSLeadByte); -- Windows API functions use the stdcall convention

Then, iterate.

declare
   Text : Interfaces.C.char_array := ...string containing local encoding...;
   I : Interfaces.C.size_t := Text'First;
begin
   while I <= Text'Last loop
      declare
         Length : Interfaces.C.size_t :=
            Boolean'Pos (IsDBCSLeadByte (Text (I)) /= 0) + 1; -- 1 or 2
         One_Multi_Byte_Character : Interfaces.C.char_array
            renames Text (I .. I + Length - 1);
      begin
         ...
         ... -- use One_Multi_Byte_Character
         ...
         I := I + Length;
      end;
   end loop;
end;

Convert between local encoding and string types of Ada

TBD...

Ada standard libraries

Ada.Command_Line/Ada.Environment_Variables

Ada.Command_Line.Argument and Ada.Environment_Variables.Value are defined to return String.
This means that, according to the standard, we cannot get characters outside Latin-1 (character set) from the command line and environment variables.

However, using GNAT

The GNAT runtime does not convert the command line and environment variables from local encoding to Latin-1.
(This behavior looks like a bug, but it is useful.)
So we can get the raw command line and environment variables with Ada.Command_Line/Ada.Environment_Variables.
It is then necessary to convert them from local encoding to UTF-32 (Wide_Wide_String) or whatever encoding you need.
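
For the common case where the local encoding happens to be UTF-8, the conversion can be sketched with the Ada 2012 standard library alone. This is only a sketch under that assumption, and it also relies on the GNAT behavior described above. For any other local encoding you still need libiconv or the C runtime. The names Show_Arguments and Conv are made up for this example.

```ada
with Ada.Command_Line;
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Arguments is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
begin
   for I in 1 .. Ada.Command_Line.Argument_Count loop
      declare
         -- Raw bytes as passed by the operating system.
         -- ASSUMPTION: the local encoding is UTF-8.
         Raw : constant String := Ada.Command_Line.Argument (I);
      begin
         declare
            Decoded : constant Wide_Wide_String := Conv.Decode (Raw);
         begin
            Put_Line ("argument" & Integer'Image (I) & ":"
               & Natural'Image (Raw'Length) & " bytes,"
               & Natural'Image (Decoded'Length) & " code-points");
         end;
      exception
         -- Raised by Decode when Raw is not valid UTF-8.
         when Ada.Strings.UTF_Encoding.Encoding_Error =>
            Put_Line ("argument" & Integer'Image (I) & " is not valid UTF-8");
      end;
   end loop;
end Show_Arguments;
```

When an argument contains non-ASCII characters, the byte count and the code-point count differ.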

Ada.Text_IO

The Name parameter is a String that holds Latin-1 (encoding).
But you can pass a UTF-8 file name by using the Form parameter.

   Open (
      File,
      Mode => In_File,
      Name => Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Wide_Wide_File_Name),
      Form => implementation-defined);

See http://groups.google.com/group/comp.lang.ada/msg/038c559fd843a19f?hl=en.

Using GNAT

Use "ENCODING=UTF8".

   Open (
      File,
      Mode => In_File,
      Name => Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Wide_Wide_File_Name),
      Form => "ENCODING=UTF8");

See http://gcc.gnu.org/onlinedocs/gnat_rm/FORM-Strings.html.

Ada.Directories

It's the same as Text_IO. If a function has a Form parameter, you can pass a UTF-8 file name.
Unfortunately, almost none of the functions have a Form parameter.
It's unclear whether Form parameters will be added in the future. For now, we have to use functions of the C runtime. Do not forget that the encoding used by the C runtime differs from that of Ada.

Name_Case_Equivalence

Ada.Directories.Name_Case_Equivalence is defined in Ada 2012.
See http://ada-auth.org/standards/12rm/html/RM-A-16.html.

But this function does not actually help you.
It may return Case_Preserving on NTFS (the file system of Windows), which is case-insensitive when accessing an existing file but keeps the case of each letter. It may also return Case_Preserving on HFS+ (the file system of Mac OS X), which behaves likewise. Then, do these two file systems behave the same?
No. The case-insensitivity rules of these file systems are different. (See below.)
So you have to write a function that compares two file names for each file system.

Ada.Characters/Ada.Strings

Almost all subprograms of Ada.Characters/Ada.Strings are defined to work with Latin-1 (encoding).
Do not use them with UTF-8 or local encoding.
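
The following sketch shows what goes wrong. It applies Ada.Characters.Handling.To_Lower to the two bytes that form the UTF-8 encoding of U+00E9 (a character that is already lower-case). The procedure and object names are made up for this example.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Latin_1_Pitfall is
   -- UTF-8 encoding of U+00E9: the two bytes 16#C3# 16#A9#.
   UTF_8_Text : constant String :=
      (Character'Val (16#C3#), Character'Val (16#A9#));

   -- To_Lower treats each byte as one Latin-1 character.
   Lowered : constant String :=
      Ada.Characters.Handling.To_Lower (UTF_8_Text);
begin
   -- 'Length counts bytes, not characters: this prints 2.
   Put_Line ("Length:" & Natural'Image (UTF_8_Text'Length));

   -- Byte 16#C3# is read as a Latin-1 upper-case letter and is mapped
   -- to 16#E3# (227), so the result is no longer valid UTF-8.
   Put_Line ("First byte after To_Lower:"
      & Natural'Image (Character'Pos (Lowered (Lowered'First))));
end Show_Latin_1_Pitfall;
```

A safer route is to decode to Wide_Wide_String first, process the string there, and encode back to UTF-8 at the end.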

Interfaces.C

If you have read this far, you may have noticed that To_Ada/To_C for char/wchar_t in Interfaces.C are badly defined.
char of C holds local encoding, and wchar_t of C holds the wide version of local encoding. Moreover, wchar_t'Size can be 16 (on Windows) or 32 (on UNIX).
Character of Ada holds Latin-1 (encoding), and Wide_Character of Ada holds UCS-2 (encoding).
But in the standard, function To_Ada (Item : wchar_t) return Wide_Character; is defined as Wide_Character'Value (wchar_t'Image (Item)). These functions raise Constraint_Error instead of converting the encoding when a multi-byte encoding would be necessary.
You should use functions of the C runtime, instead of To_Ada/To_C for char/wchar_t, to convert between Latin-1/UCS-2/UTF-32 and local encoding.

The definitions for char16_t and char32_t are correct.

wchar_t of GNAT

GNAT's implementation of Interfaces.C.wchar_t is even worse.

   type wchar_t is new Wide_Character;
   for wchar_t'Size use Standard'Wchar_T_Size;

On UNIX platforms, it cannot hold Unicode characters that need more than 16 bits. Constraint_Error will be raised.

For real-world applications

BSD

BSD is a family of UNIX operating systems.
It is unusual with regard to I18N.
wchar_t of BSD is not Unicode.
Wide-character functions such as wcstombs in the C runtime of BSD work independently of the character set.
So what is wchar_t in BSD? It holds one character of the wide version of local encoding, packed into a single value.
You have to use libiconv to convert between Unicode and local encoding.

HFS+

HFS+ is the file system of Mac OS X.
It is unusual in how it normalizes file names and compares them case-insensitively.
16#2000# .. 16#2FFF#, 16#F900# .. 16#FAFF#, and 16#2F800# .. 16#2FAFF# are left unchanged by the normalization of HFS+.
This variant of normalization also avoids one problem of the normalization forms defined in the Unicode standard, because compatibility characters are excluded from normalization.
See http://developer.apple.com/library/mac/#qa/qa1173/_index.html and http://search.cpan.org/dist/Encode-UTF8Mac/lib/Unicode/Normalize/Mac.pm.
The case-insensitivity rule of HFS+ differs from the Unicode standard, too.
See http://www.opensource.apple.com/source/boot/boot-132/i386/libsaio/hfs_CaseTables.h.

NTFS

NTFS is the file system of Windows.
File names in NTFS are stored as UTF-16, as plain sequences of 16-bit values that are not validated.
So a file name containing an illegal UTF-16 sequence can exist.
If you want to handle all file names of NTFS, you should keep file names as Wide_String and not convert them to UTF-8/UTF-32.
You can use the CompareString(W) API for case-insensitive comparison of NTFS file names.
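
The following sketch illustrates why conversion is risky. It builds a Wide_String that ends with a first half (16#D800#) that has no second half, and tries to decode it with the Ada 2012 standard library. The sample name is artificial, and the procedure and object names are made up for this example.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Illegal_UTF_16 is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- 'x' followed by a first half without its second half.
   -- Such a sequence is illegal as UTF-16, but fits in a Wide_String.
   Name : constant Wide_String := ('x', Wide_Character'Val (16#D800#));
begin
   declare
      -- Decode is expected to reject the unpaired first half.
      Decoded : constant Wide_Wide_String := Conv.Decode (Name);
   begin
      Put_Line ("decoded" & Natural'Image (Decoded'Length) & " code-points");
   end;
exception
   when Ada.Strings.UTF_Encoding.Encoding_Error =>
      Put_Line ("Encoding_Error: this name cannot be converted to UTF-32");
end Show_Illegal_UTF_16;
```

The Wide_String itself holds the name without trouble. Only the conversion fails, so such a file becomes unreachable once you insist on converting.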

Windows console

The Windows console (cmd.exe) cannot display UTF-8.
You have to convert to local encoding if necessary.

Cygwin is an exception. Cygwin console can display UTF-8.

Ada.Wide_Text_IO of GNAT

GNAT's implementation of Wide_Text_IO converts the UCS-2 contents of the Wide_String given to Put/Put_Line to UTF-8 on output, even on Windows.
As a result, the output is garbled and unreadable.
I think it's a bug in GNAT.