UnicodeString and surrogate pairs

UnicodeString and surrogate pairs

Graeme Geldenhuys-6
Hi,

Does FPC's RTL (or FCL) include a function to check for UTF-16 surrogate
pairs? I'd be very surprised if there isn't, but I have yet to find it
in the documentation or source code I searched.

I need to process one "character" (loosely based on what you see on the
screen) at a time while calculating text width up to a maximum width. I
need to make sure I handle all Unicode text correctly, thus NOT just the
BMP of the Unicode standard, but all supplementary planes too. My
alternative is to convert the text from UTF-16 to UTF-8 and then process
it, but in this instance I'm hoping to stay with UTF-16 [never thought I
would ever say that out loud! ;-)]
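
To make the requirement concrete, here is a minimal sketch of walking UTF-16 text one code point at a time. It is written in Python purely for illustration and is not the FPC RTL API; the helper names utf16_units and iter_code_points are hypothetical. It only groups surrogate pairs; combining marks, which also affect what appears as one "character" on screen, are a separate matter.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield Unicode code points from a sequence of UTF-16 code units."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # A BMP code unit (or an unpaired surrogate, passed through unchanged).
            yield u
            i += 1
```

For example, list(iter_code_points(utf16_units("a\U0001F600b"))) yields three code points from four code units, so a width calculation would measure the emoji once rather than twice.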

Regards,
  Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: UnicodeString and surrogate pairs

Marco van de Voort
In our previous episode, Graeme Geldenhuys said:
> Does FPC's RTL (or FCL) include a function to check for UTF-16 surrogate
> pairs? I'd be very surprised if there isn't, but I have yet to find it
> in the documentation or source code I searched.

Same as Delphi, character.tcharacter.issurrogate() or
character.issurrogate()

(Modern Delphi units group everything as class methods in classes, for some
vague reason.)

Re: UnicodeString and surrogate pairs

Graeme Geldenhuys-6
On 2016-04-27 16:24, Marco van de Voort wrote:
> Same as Delphi, character.tcharacter.issurrogate() or
> character.issurrogate()

Ah, thank you very much.


Regards,
  Graeme


Re: UnicodeString and surrogate pairs

Michael Schnell
In reply to this post by Graeme Geldenhuys-6
On 04/27/2016 04:36 PM, Graeme Geldenhuys wrote:
> Does FPC's RTL (or FCL) include a function to check for UTF-16 surrogate
> pairs?

Would that necessarily be a UTF-8 issue?

-Michael

Re: UnicodeString and surrogate pairs

Graeme Geldenhuys-6
On 2016-04-28 09:05, Michael Schnell wrote:
> Would that necessarily be a UTF-8 issue?

No, because UTF-8 doesn't use surrogate pairs. In this instance the
string is of type UnicodeString, thus UTF-16 encoded. Now I could
internally assign that to a UTF8String type, but in this case I wanted
to use UnicodeString directly with standard RTL or FCL functions.

On a side note:
  I always use UTF-8 encoded strings with fpGUI and my personal
  projects, because I simply find it easier and more stable (by
  default supporting the whole 1.1 million available Unicode code
  points). The code I'm currently working on is for a client, so I
  didn't enforce my coding habits. ;-)

Regards,
  Graeme


Re: UnicodeString and surrogate pairs

Michael Schnell
On 04/29/2016 11:09 AM, Graeme Geldenhuys wrote:
>
> No, because UTF-8 doesn't use surrogate pairs.
Really ?

I understand that "surrogate pairs" means combining a printable character
(i.e. one of the nearly 2^32 UTF thingies) with another of those to form
a different printable thingy (e.g. "A" plus "add two dots above" to
create an "Ä").

Now a series of 32-bit UTF thingies can be compressed either to a series
of UTF-8 encoded bytes or to a series of UTF-16 encoded words. Both of
these are usually much shorter (measured in bytes) than the uncompressed
UTF-32 information.

So the UTF8 vs UTF16 issue is a lower layer of encoding.

-Michael

Re: UnicodeString and surrogate pairs

Sven Barth-2

On 30.04.2016 08:24, "Michael Schnell" <[hidden email]> wrote:
>
> On 04/29/2016 11:09 AM, Graeme Geldenhuys wrote:
>>
>>
>> No, because UTF-8 doesn't use surrogate pairs.
>
> Really ?
>
> I understand that "surrogate pairs" means combining a printable character (i.e. one of the nearly 2^32 UTF thingies) with another of those to form a different printable thingy (e.g. "A" plus "add two dots above" to create an "Ä").

No, that's a different thing. Surrogate pairs are used in UTF-16 to represent characters whose code point would be > $FFFF. What you are talking about is - I think - decomposition (I don't know the exact name), which is a much more complex topic because you need to know which characters can be combined. Surrogate pairs, on the other hand, are specific ranges of 16-bit code units ($D800..$DBFF for the first part and $DC00..$DFFF for the second part) that together represent one character.
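
As an illustration of the arithmetic behind that, here is a small sketch in Python (not FPC code; the function name to_surrogate_pair is hypothetical) that splits a code point above $FFFF into its two UTF-16 code units:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low
```

For U+1F600 this gives ($D83D, $DE00), which matches what Python's own "utf-16-be" encoder produces for the same character.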

Regards,
Sven



Re: UnicodeString and surrogate pairs

Graeme Geldenhuys-6
In reply to this post by Michael Schnell
Hello Michael,

On 2016-04-29 at 11:23 you wrote:
> > No, because UTF-8 doesn't use surrogate pairs.  
> Really ?

Yes.


> those to be combined to a different printable thingy (e.g. "A" plus
> "add two dots above" to create a "Ä").

No, that is something totally different and not what I was talking
about. You are referring to combining diacritics: two or more code-points
(think "characters") combined to make what looks like a single character
on screen or in print.


> Both of which usually is much shorter (measured in bytes) than the
> uncompressed UTF32 information.

Since you are not using the correct terminology, I think you are
referring to the composed and decomposed forms of a character.

For example:

   e (U+0065) + ̈  (U+0308) = ë  (2 code-points used)
vs
   ë (U+00EB)                   (1 code-point used)

The first example above is the decomposed version of ë. The
second example above is the composed version of ë.
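
A quick way to see the two forms side by side is the sketch below, which uses Python's standard unicodedata module (not anything from FPC); the variable names are only illustrative.

```python
import unicodedata

composed = "\u00EB"      # ë as a single precomposed code point
decomposed = "e\u0308"   # e followed by U+0308 COMBINING DIAERESIS

# NFD splits the precomposed character; NFC recombines the sequence.
as_nfd = unicodedata.normalize("NFD", composed)
as_nfc = unicodedata.normalize("NFC", decomposed)

code_point_counts = (len(composed), len(decomposed))
```

Both strings render as the same glyph, yet they compare unequal until they are normalised to the same form, and neither involves a surrogate pair.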

The decomposed versions are the preferred and recommended way according
to the Unicode Consortium. They (the Unicode Consortium) only included
the composed versions for backward compatibility with character sets
that already existed when the Unicode standard was established. No new
composed code-points will be added to the Unicode standard.


Anyway, I was referring to surrogate pairs (applies to UTF-16 only), not
composed/decomposed glyphs.

Regards,
  Graeme

Re: UnicodeString and surrogate pairs

Martin Schreiber-2
On Saturday 30 April 2016 12:12:35 Graeme Geldenhuys wrote:

>
> Anyway, I was referring to surrogate pairs (applies to UTF-16 only)
>
One could say that utf-8 has surrogate pairs, surrogate triplets and surrogate
quads.

Martin

Re: UnicodeString and surrogate pairs

Graeme Geldenhuys-6
On 2016-04-30 11:32, Martin Schreiber wrote:
> One could say that utf-8 has surrogate pairs, surrogate triplets and surrogate
> quads.

No, don't confuse the point. As per the Unicode Standard's definition of
"surrogate pairs", UTF-8 and UTF-32 don't have surrogate pairs.


Regards,
  Graeme
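
To illustrate the distinction, here is a small Python sketch (the helper name encoded_lengths is hypothetical) that counts how many code units one character occupies in each encoding form:

```python
def encoded_lengths(ch):
    """Return the number of code units ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units (bytes)
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def utf8_accepts_lone_surrogate():
    """Check whether Python's strict UTF-8 encoder accepts a surrogate code point."""
    try:
        "\ud83d".encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

For U+1F600 the counts are 4, 2 and 1. UTF-8 writes the code point directly as a four-byte sequence, and Python's strict UTF-8 encoder refuses a surrogate code point outright, so only the UTF-16 form involves a surrogate pair.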
