Proposed tidy-up of the FPC Manual section on Character Types and the FPC Wiki

Tony Whyman
There has been some heated discussion on the Lazarus lists on the
subject of character encodings etc. This has exposed several issues with
the FPC Manual that I wanted to record.

1. The char type

The manual says: "A Char is exactly 1 byte in size, and contains one
ASCII character. "

This was probably true when Pascal was first defined, but char is often
now used for any one-byte character set e.g. ISO 8859-1. Replace ASCII
with ANSI.

2. WideChar

The Manual says: "A WideChar is exactly 2 bytes in size, and contains
one UNICODE character in UTF-16 encoding. "

This seems to be wrong as UTF-16 is not limited to code points defined
using a single 16-bit code unit, but also permits code points comprising
two 16-bit code units. The definition should be updated to indicate that
a WideChar was really created for the obsolescent UCS-2 and is limited
to a UTF-16 subset (Unicode characters that can be expressed as a single
16-bit code unit).

Proposed replacement text: "A WideChar is exactly 2 bytes in size, and
contains one UNICODE character in UCS-2 encoding or UTF-16 encoding
limited to the Basic Multilingual Plane. Note that Unicode characters
that require two 16-bit code units in UTF-16 cannot be contained in a
single WideChar variable."
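
The limitation described above can be illustrated with a short sketch.
It is only an illustration under stated assumptions: the program name
and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as a sample character
outside the Basic Multilingual Plane are mine, not taken from the manual.

```pascal
program WideCharDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  Clef: UnicodeString;
  CodePoint: LongInt;
begin
  // U+1D11E lies outside the BMP, so UTF-16 needs two code units for it.
  // The surrogate values are stored explicitly to avoid depending on the
  // source file encoding.
  SetLength(Clef, 2);
  Clef[1] := WideChar($D834);  // high (leading) surrogate
  Clef[2] := WideChar($DD1E);  // low (trailing) surrogate

  WriteLn('SizeOf(WideChar) = ', SizeOf(WideChar));  // 2
  WriteLn('Length(Clef)     = ', Length(Clef));      // 2 code units, 1 character

  // Recombine the pair into the code point it encodes.
  CodePoint := $10000 + ((Ord(Clef[1]) - $D800) shl 10)
                      + (Ord(Clef[2]) - $DC00);
  WriteLn('Code point       = U+', IntToHex(CodePoint, 5));  // U+1D11E
end.
```

Neither Clef[1] nor Clef[2] is a character on its own; only the pair is.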

3. UnicodeStrings

The Manual says: "For multi-byte string types, the basic character has a
size of at least 2."

Proposed improvement:

"Multi-byte string types are used to represent Unicode characters
encoded as code points requiring two or four bytes".

As with UTF8String, the following caveat should also be added:

"Since a Unicode character may require two or four bytes to be
represented in the UTF-16 encoding, there are 2 points to take care of
when using UnicodeString/WideString:

1. The character index – which retrieves a WideChar at a certain
position – must be used with care: the expression S[i] will not
necessarily be a valid character for a string S of type
UnicodeString/WideString.

2. The number of characters in the string is not necessarily equal to
the number of elements in the array. The standard function Length cannot
be used to get the character length of the string; it will always return
the array length."
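
A minimal sketch of the two caveats above, assuming a well-formed UTF-16
string. The function name CountCodePoints is hypothetical (it is not an
RTL routine), and it counts code points, which is still not the same
thing as user-perceived characters when combining marks are involved.

```pascal
program CountDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping every low
// (trailing) surrogate, since those only continue a pair.
// Assumes S is well-formed UTF-16; lone surrogates are not detected.
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  S: UnicodeString;
begin
  // 'A' followed by U+1D11E, which needs a surrogate pair.
  SetLength(S, 3);
  S[1] := WideChar($0041);
  S[2] := WideChar($D834);
  S[3] := WideChar($DD1E);

  WriteLn('Length(S)       = ', Length(S));           // 3 (array length)
  WriteLn('CountCodePoints = ', CountCodePoints(S));  // 2
  // S[2] and S[3] are each half of a pair, not valid characters alone.
end.
```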

------------------------------------------------------

Wiki Page on "Character and string Type"

1. This needs to start with a Health Warning on the use of the word
Unicode. Proposed Text (borrowing from Wikipedia):

"Free Pascal supports several character and string types. They range
from single ANSI characters to unicode strings and also include pointer
types. Differences also apply to encodings and reference counting. ANSI
is typically used to refer to single byte character encodings - although
FPC also uses AnsiStrings to hold Unicode UTF-8 encoded strings.

Unicode is a computing industry standard for the consistent encoding,
representation, and handling of text expressed in most of the world's
writing systems. Developed in conjunction with the Universal Coded
Character Set (UCS) standard and published as The Unicode Standard, the
latest version of Unicode contains a repertoire of 136,755 characters
covering 139 modern and historic scripts, as well as multiple symbol sets.

Unicode can be implemented by different character encodings. The Unicode
standard defines UTF-8, UTF-16, and UTF-32, and several other encodings
are in use. The most commonly used encodings are UTF-8, UTF-16 and
UCS-2, a precursor of UTF-16.

The original idea behind Unicode was to replace the typical
256-character encodings requiring 1 byte per character with an encoding
using 2^16 = 65,536 values requiring 2 bytes per character. The early
2-byte encoding was usually called "Unicode", but is now called "UCS-2".
UCS-2 differs from UTF-16 by being a constant length encoding that is
only capable of encoding characters of the Basic Multilingual Plane
(BMP). It is supported by many programs. However, 'UCS-2 should now be
considered obsolete. It no longer refers to an encoding form in either
10646 or the Unicode Standard.'

Unfortunately, the term Unicode, in common usage, is still often used to
refer to the UCS-2 two byte encoding and this can give rise to much
confusion e.g. when Unicode is used when referring to the UTF-8 encoding."

2. The text on WideChar is too terse and needs to be expanded. Proposed
text:

"A variable of type WideChar, also referred to as UnicodeChar (which
derives from the archaic use of Unicode to mean UCS-2), is exactly 2
bytes in size, and usually contains either:

(a) a single UCS-2 code point, or

(b) a single UTF-16 code unit.

In case (b), this is sufficient for Unicode characters that are encoded
in UTF-16 as a single 16-bit code unit, i.e. characters in the Basic
Multilingual Plane. However, all other Unicode characters are encoded in
UTF-16 as two 16-bit code units. FPC provides no specific support for
such characters, which require e.g. a WideChar pair to encode them."

Note that the byte order used to store a WideChar can vary between
platforms.

3. PChar

This should be identified as a synonym for PAnsiChar in FPC. It can also
be used as a C-style pointer to any AnsiString, including one holding
UTF-8.

It may also be useful to add a note that in later versions of Delphi,
PChar is a synonym for PWideChar.
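
As a small sketch of the PChar point, assuming a default FPC mode in
which Char is a single byte (the program name and the sample string are
my own choices for illustration):

```pascal
program PCharDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  S: AnsiString;
  P: PChar;
begin
  // "cafe" with an e-acute, written as raw UTF-8 bytes: the accented
  // letter is the two bytes $C3 $A9.
  S := 'caf' + #$C3#$A9;
  P := PChar(S);

  WriteLn('SizeOf(P^) = ', SizeOf(P^));  // 1: each element is one byte
  WriteLn('StrLen(P)  = ', StrLen(P));   // 5: bytes before the #0, not 4 characters
end.
```

The pointer sees bytes only; whether they are UTF-8 or some other
single-byte encoding is up to the code that interprets them.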


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: Proposed tidy-up of the FPC Manual section on Character Types and the FPC Wiki

Mark Morgan Lloyd-5
On 18/08/17 10:00, Tony Whyman wrote:
> There has been some heated discussion on the Lazarus lists on the
> subject of character encodings etc. This has exposed several issues with
> the FPC Manual that I wanted to record.

Could I ask one thing on behalf of people who try to maintain code so
that it still works properly with a range of compiler (and FCL/LCL)
versions.

In cases where there's been a change in default behaviour as the
compiler has matured, please could we have the breaking version numbers
noted explicitly.

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

Re: Proposed tidy-up of the FPC Manual section on Character Types and the FPC Wiki

Michael Van Canneyt
In reply to this post by Tony Whyman

Tony,

Thank you for your proposals!

I will take them to heart and improve the docs accordingly. Some parts are
20 years old, so no surprise that they are not exactly correct any more...

Michael.

On Fri, 18 Aug 2017, Tony Whyman wrote:

> There has been some heated discussion on the Lazarus lists on the
> subject of character encodings etc. This has exposed several issues with
> the FPC Manual that I wanted to record.

Re: Proposed tidy-up of the FPC Manual section on Character Types and the FPC Wiki

Graeme Geldenhuys-6
In reply to this post by Mark Morgan Lloyd-5
On 2017-08-18 11:13, Mark Morgan Lloyd wrote:
> please could we have the breaking version numbers
> noted explicitly.

The 'fpdoc' tool already has support for that feature: the <version> tag
in description files. I don't know if it has actually been used in FPC
class documentation though. I know I use it often in fpGUI Toolkit's
class documentation.

See FPDOC pdf documentation around Section 5.3.37.

Regards,
   Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp