RTL and Unicode Strings

Mazo Winst
Hello all,

I am very confused about the way the system codepage is determined.
From what I understand, the string codepage is determined at runtime in a platform-dependent manner. Suppose that my app needs to read a file encoded as UTF-8, and that it runs on Windows, where the system codepage is most likely a Windows ANSI codepage. The RTL will use the system codepage, and Windows ANSI doesn't support the full range of Unicode characters. If I need to use the RTL to read the file, what should I do to prevent data loss?

How does the sqldb package handle this point?

Best regards


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: RTL and Unicode Strings

LacaK

>
>
> How the sqldb package handles this point?

sqlDB does not perform any character translation.
It only stores data in record buffers as they arrive.
So it expects that the programmer is aware of this and sets the correct
"connection encoding".
In the case of Lazarus this is often UTF-8, because Lazarus expects
character data to be UTF-8 encoded (at least it was so).
So the programmer must set the connection encoding to UTF-8; the data then
arrive UTF-8 encoded, and sqlDB only stores them and forwards them to, for
example, data-aware controls for display.
-Laco.

Re: RTL and Unicode Strings

Jonas Maebe-2
In reply to this post by Mazo Winst
Mazo Winst wrote:
> Suppose that my app needs to read a file encoded with UTF-8. Suppose
> that my app runs on Windows, where the system codepage is most likely to
> be Windows ANSI. As RTL will use the system codepage, Windows ANSI
> doesn't support the full range of unicode chars and need to use RTL to
> read the file, what should i do to prevent data loss?

If by reading you mean read/readln, then you can use
http://www.freepascal.org/docs-html/rtl/system/settextcodepage.html to
specify to the RTL what the encoding is of the text file you are reading.

In other cases, like LacaK said, you will have to read the data as plain
bytes into e.g. a RawByteString and next use
http://www.freepascal.org/docs-html/rtl/system/setcodepage.html (with
the last parameter set to "false") to afterwards specify the code page
this data has.
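
Both options can be sketched as follows, assuming FPC 3.0 or later. The file name `input.txt` and the helpers `CountLinesWithReadLn` and `LoadUtf8File` are hypothetical and only serve the illustration; `cwstring` is included on Unix because code page conversions there need a widestring manager.

```pascal
program ReadUtf8Demo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  Classes, SysUtils;

const
  InputFile = 'input.txt';  // hypothetical UTF-8 encoded file

// Option 1: Read/ReadLn on a text file whose encoding the RTL has been told.
procedure CountLinesWithReadLn;
var
  F: Text;
  Line: UnicodeString;
  Count: Integer;
begin
  Count := 0;
  Assign(F, InputFile);
  Reset(F);
  SetTextCodePage(F, CP_UTF8);  // declare that the file's bytes are UTF-8
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);  // the RTL converts from the file's code page
      Inc(Count);
    end;
  finally
    Close(F);
  end;
  WriteLn('lines read: ', Count);
end;

// Option 2: read plain bytes, then label them as UTF-8 without converting.
function LoadUtf8File(const FileName: string): UnicodeString;
var
  Stream: TFileStream;
  Raw: RawByteString;
begin
  Raw := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, Stream.Size);
    if Length(Raw) > 0 then
      Stream.ReadBuffer(Raw[1], Length(Raw));
  finally
    Stream.Free;
  end;
  SetCodePage(Raw, CP_UTF8, False);  // last parameter False: relabel only
  Result := Raw;  // converted using the dynamic code page set above
end;

begin
  if not FileExists(InputFile) then
  begin
    WriteLn('missing file: ', InputFile);
    Halt(1);
  end;
  CountLinesWithReadLn;
  WriteLn('UTF-16 code units: ', Length(LoadUtf8File(InputFile)));
end.
```

In both cases the text ends up in a UnicodeString, so nothing passes through the system code page on the way.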


Jonas

Re: RTL and Unicode Strings

Graeme Geldenhuys-6
On 2016-05-11 09:21, Jonas Maebe wrote:
> In other cases, like LacaK said, you will have to read the data as plain
> bytes into e.g. a RawByteString and next use
> http://www.freepascal.org/docs-html/rtl/system/setcodepage.html (with
> the last parameter set to "false") to afterwards specify the code page
> this data has.

But this is where I'm getting a bit confused too.

The RTL and FCL use the String data type predominantly.
  eg: TField.AsString: String.

The RTL and FCL use String (AnsiString) with default encoding set to Auto.

In my application I enable unicodestring mode. So I'm reading data from
a Firebird database. The data is stored as UTF-8 in a VarChar field. The
DB connection is set up as UTF-8.  Now let's assume my FreeBSD box is set
up with a default encoding of Latin-1.

So I read the UTF-8 data from the database, somewhere inside the SqlDB
code it gets assigned to a TField's String property. ie: UTF-8 ->
Latin-1 conversion.

Then I read the field value into my application. ie: Latin-1 -> UTF-16

The problem as I see it, is that I already lost data when SqlDB
converted it to Latin-1. Am I not understanding the problem?

I checked the FPC 3.x db.pas unit. It uses {$mode objfpc}{$H+} - it
doesn't use UnicodeString, nor does it use RawByteString. So a
text encoding conversion to AnsiString(latin-1) [based on my example] is
going to happen, right?

Regards,
  Graeme

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp

Re: RTL and Unicode Strings

Jonas Maebe-2

Graeme Geldenhuys wrote on Wed, 11 May 2016:

> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.

This depends on how sqlDB is implemented, and I have absolutely no  
clue about that (other than what LacaK wrote).

As mentioned at  
http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,  
conversions on assignment only happen when the *declared* code page of  
the target string is different from that of the source string (other  
than the special case for RawByteString). So if sqlDB only uses plain  
String with {$h+} and/or AnsiString, then no conversions will happen  
anywhere in the scenario you describe since it will just assign  
ansistrings with declared code page CP_ACP to each other.

> Then I read the field value into my application. ie: Latin-1 -> UTF-16

If sqlDB correctly sets the dynamic codepage of the strings it creates  
via SetCodePage(x,CP_UTF8,false), then when you assign those strings  
with declared codepage = CP_ACP and dynamic code page CP_UTF8 to your  
unicodestrings, they will be converted from UTF-8 to UTF-16 at that  
point.

If it does not set the dynamic code page of the strings it creates to  
the appropriate encoding, then you will indeed get data corruption at  
this point, because the UTF-8 encoded data will be interpreted as  
Latin-1 and then be "converted" to UTF-16.
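
The difference between the two cases can be sketched like this, assuming FPC 3.0 or later. `FetchUtf8Bytes` is a hypothetical stand-in for library code that fills a plain AnsiString from a raw buffer; it is not sqlDB code. What the untagged case prints depends on the default code page of the machine it runs on.

```pascal
program DynamicCodePageDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

// Hypothetical stand-in for code that copies raw UTF-8 bytes into an AnsiString.
function FetchUtf8Bytes: AnsiString;
begin
  Result := '';
  SetLength(Result, 2);
  Result[1] := #$C3;  // the two bytes that encode U+00E9 in UTF-8
  Result[2] := #$A9;
end;

var
  Untagged, Tagged: AnsiString;  // declared code page is CP_ACP for both
  U: UnicodeString;
begin
  // Case 1: nobody sets the dynamic code page, so it stays at the default.
  Untagged := FetchUtf8Bytes;
  WriteLn('untagged dynamic code page: ', StringCodePage(Untagged));
  U := Untagged;  // bytes are interpreted in the code page named above
  WriteLn('UTF-16 code units: ', Length(U));

  // Case 2: the bytes are relabelled as UTF-8 without touching them.
  Tagged := FetchUtf8Bytes;
  SetCodePage(RawByteString(Tagged), CP_UTF8, False);
  WriteLn('tagged dynamic code page: ', StringCodePage(Tagged));
  U := Tagged;  // now a real UTF-8 to UTF-16 conversion
  WriteLn('UTF-16 code units: ', Length(U));
end.
```

On a system whose default code page is already UTF-8 both cases behave the same; the corruption described above only shows up when the default is something else, such as Latin-1.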

For dealing with such code, which is not yet codepage-aware, by  
default the situation is no worse or no better than it was in previous  
FPC versions: exactly the same would happen there. However, in FPC 3.x  
you can generally fix it by changing the default code page for  
ansistrings using SetMultiByteConversionCodePage() to what you  
know/want to be the encoding of ansistrings, like Lazarus does.
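
A minimal sketch of that global fix, assuming FPC 3.0 or later and assuming the whole program agrees that its AnsiStrings hold UTF-8. The call is placed first because, as far as I understand, strings created earlier keep the code page they were labelled with at creation time.

```pascal
program DefaultCodePageDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

var
  S: AnsiString;
  U: UnicodeString;
begin
  // From here on, CP_ACP strings are treated as UTF-8 by the RTL.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  // Raw UTF-8 bytes for U+00E9, stored in a plain AnsiString.
  S := '';
  SetLength(S, 2);
  S[1] := #$C3;
  S[2] := #$A9;
  WriteLn('dynamic code page of S: ', StringCodePage(S));

  U := S;  // converted using the new default code page
  WriteLn('UTF-16 code units: ', Length(U));
end.
```

This changes behaviour for every unit in the program, so it only suits applications where no other code relies on a different default.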

All of this is moreover completely independent of {$modeswitch  
unicodestrings}, since that is just a shortcut to make String an alias  
for UnicodeString in the current compilation module (and Char for  
WideChar, and PChar for PWideChar).


Jonas

Re: RTL and Unicode Strings

Tony Whyman
In reply to this post by Graeme Geldenhuys-6

On 11/05/16 10:18, Graeme Geldenhuys wrote:
> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.

Now this is what interests me as well - in the context of IBX if nothing
else.

It was news to me yesterday that FPC now stores code page information
with AnsiStrings and while IBX still works OK with FPC 3.0.0, it should
work better with this new facility. The IBX code here comes from years
ago and is:

> function TIBStringField.GetValue(var Value: string): Boolean;
> var
>   Buffer: PChar;
> begin
>   Buffer := nil;
>   IBAlloc(Buffer, 0, Size + 1);
>   try
>     Result := GetData(Buffer);
>     if Result then
>     begin
>       Value := string(Buffer);
>       if Transliterate and (Value <> '') then
>         DataSet.Translate(PChar(Value), PChar(Value), False);
>     end
>   finally
>     FreeMem(Buffer);
>   end;
> end;
Note the really nasty coercion that comes after the call to
TField.GetData (which is common to all DB Drivers)  - GetData returns
untyped data into a buffer. DataSet.Translate is a no-op, and I was
never sure what purpose it has - if anything.

To make this code play properly with the new AnsiString, it looks like I
should revise this to (e.g. for utf-8 fields)

   Value := string(Buffer);
   SetCodePage(Value,cp_UTF8,false);
   ...

The outgoing side has a similar problem e.g.

> procedure TIBStringField.SetAsString(const Value: string);
> var
>   Buffer: PChar;
> begin
>   Buffer := nil;
>   IBAlloc(Buffer, 0, Size + 1);
>   try
>     StrLCopy(Buffer, PChar(Value), Size);
>     if Transliterate then
>       DataSet.Translate(Buffer, Buffer, True);
>     SetData(Buffer);
>   finally
>     FreeMem(Buffer);
>   end;
> end;

This probably needs a

SetCodePage(Value,cp_UTF8,true);

before the StrLCopy.

Anyone know if this is a correct interpretation of the AnsiString
codepage facility?
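
One way the two directions might look, as a sketch only and not IBX code. `FromUtf8Buffer` and `ToUtf8Bytes` are hypothetical helpers. On the outgoing side `Value` is a const parameter, so it cannot be passed to the var parameter of SetCodePage directly; the sketch works on a local RawByteString copy instead.

```pascal
program FieldCodePageSketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

// Incoming: the buffer already holds UTF-8, so only the label is changed.
function FromUtf8Buffer(Buffer: PAnsiChar): AnsiString;
begin
  Result := Buffer;  // copies the bytes up to the terminating #0
  SetCodePage(RawByteString(Result), CP_UTF8, False);
end;

// Outgoing: convert whatever code page Value carries into UTF-8.
function ToUtf8Bytes(const Value: AnsiString): RawByteString;
begin
  Result := Value;  // RawByteString target, so no conversion on assignment
  SetCodePage(Result, CP_UTF8, True);  // converts unless it is already UTF-8
end;

var
  Raw: array[0..2] of AnsiChar = (#$C3, #$A9, #0);  // U+00E9 as UTF-8 plus terminator
  FromDb: AnsiString;
  ToDb: RawByteString;
begin
  FromDb := FromUtf8Buffer(@Raw[0]);
  WriteLn('incoming dynamic code page: ', StringCodePage(FromDb));

  ToDb := ToUtf8Bytes(FromDb);
  WriteLn('outgoing dynamic code page: ', StringCodePage(ToDb));
  WriteLn('outgoing byte count: ', Length(ToDb));
end.
```

In a setter like SetAsString the converted copy would be kept in a local variable and that variable handed to StrLCopy, so the buffer receives UTF-8 regardless of the code page the caller's string carried.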

Re: RTL and Unicode Strings

Marco van de Voort
In reply to this post by Graeme Geldenhuys-6
In our previous episode, Graeme Geldenhuys said:

> > In other cases, like LacaK said, you will have to read the data as plain
> > bytes into e.g. a RawByteString and next use
> > http://www.freepascal.org/docs-html/rtl/system/setcodepage.html (with
> > the last parameter set to "false") to afterwards specify the code page
> > this data has.
>
> But this is where I'm getting a bit confused too.
>
> The RTL and FCL uses String data type predominantly.
>   eg: TField.AsString: String.

String is not a type but an alias; that is key. So any declaration means
whatever String was defined as when the unit was compiled
(short/ansi/unicodestring).

> The RTL and FCL uses String (AnsiString) with default encoding set to Auto.

To the default encoding, which is the only one that varies at runtime, and
which serves as the base type.  So in Orwellian speak ansistring(0) is more
equal than the other ansistring()'s.

> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.

Then it is basically equal to 2.6.x, and old Delphi. You are on your own and
must handle conversions yourself and be careful to not mutilate your utf8
content.

> Then I read the field value into my application. ie: Latin-1 -> UTF-16

Yes, you must also handle that conversion manually (either by moving the
character data to a utf8-typed string and then assigning, or by a manual
encoding routine that basically takes an address and disregards the codepage
info).

> The problem as I see it, is that I already lost data when SqlDB
> converted it to Latin-1. Am I not understanding the problem?

It depends. Sqldb assigned non ansistring data to an ansistring. In the old
(2.6.4, old delphi) logic it would simply move without conversion, and you
would obtain an ansistring with utf8 in it and be converting forever.

Nothing changed there, except your expectations :-)
 
> I checked the FPC 3.x db.pas unit. It uses {$mode objfpc}{$H+} - it
> doesn't use UnicodeString and neither does in use RawByteString. So a
> text encoding conversion to AnsiString(latin-1) [based on my example] is
> going to happen, right?

Yes. As said many times before, the parts above RTL level have been kept
working, but not changed.
 
So basically the only viable cases are the utf16 D2009+ model (for Windows,
but it works elsewhere too) and utf8 as default (which needs to be hacked
for systems that don't default to utf8 as one byte conversion).

Both have advantages and disadvantages (and the utf8 ones are not as big as
many people think. They confuse utf8 as dominant document encoding with
apis).

But in the end the choice is simple IMHO. One is delphi compatible, one not.
Period.

Re: RTL and Unicode Strings

LacaK
In reply to this post by Graeme Geldenhuys-6

>> In other cases, like LacaK said, you will have to read the data as plain
>> bytes into e.g. a RawByteString and next use
>> http://www.freepascal.org/docs-html/rtl/system/setcodepage.html (with
>> the last parameter set to "false") to afterwards specify the code page
>> this data has.
> But this is where I'm getting a bit confused too.
>
> The RTL and FCL uses String data type predominantly.
>    eg: TField.AsString: String.
>
> The RTL and FCL uses String (AnsiString) with default encoding set to Auto.
>
> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.
IMO this does not happen.
sqlDB provides only pointers to field buffers, where the "sql
connector" stores the data it receives from the server.
The DB unit only allocates memory of a given size and then provides a
pointer to that memory, where the data are stored.
(An issue may pop up somewhere; for now I still use FPC 2.6.4,
so I cannot say more about FPC 3.0.0.)
-Laco.

Re: RTL and Unicode Strings

Michael Van Canneyt
In reply to this post by Jonas Maebe-2


On Wed, 11 May 2016, Jonas Maebe wrote:

>
> Graeme Geldenhuys wrote on Wed, 11 May 2016:
>
>> In my application I enable unicodestring mode. So I'm reading data from
>> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
>> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
>> up with a default encoding of Latin-1.
>>
>> So I read the UTF-8 data from the database, somewhere inside the SqlDB
>> code it gets assigned to a TField's String property. ie: UTF-8 ->
>> Latin-1 conversion.
>
> This depends on how sqlDB is implemented, and I have absolutely no clue about
> that (other than what LacaK wrote).
>
> As mentioned at
> http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,
> conversions on assignment only happen when the *declared* code page of the
> target string is different from that of the source string (other than the
> special case for RawByteString). So if sqlDB only uses plain String with
> {$h+} and/or AnsiString, then no conversions will happen anywhere in the
> scenario you describe since it will just assign ansistrings with declared
> code page CP_ACP to each other.

This is the case.

>
>> Then I read the field value into my application. ie: Latin-1 -> UTF-16
>
> If sqlDB correctly sets the dynamic codepage of the strings it creates via
> SetCodePage(x,CP_UTF8,false), then when you assign those strings with
> declared codepage = CP_ACP and dynamic code page CP_UTF8 to your
> unicodestrings, they will be converted from UTF-8 to UTF-16 at that point.

It does not do this.

>
> If it does not set the dynamic code page of the strings it creates to the
> appropriate encoding, then you will indeed get data corruption at this point,
> because the UTF-8 encoded data will be interpreted as Latin-1 and then be
> "converted" to UTF-16.

That is what happens.

Currently, the ONLY provision that is made is that, if SQLDB somehow detects that the
server uses UTF8, it will use an ansistring and allocate 4 bytes in the buffers for each
character.

But it currently does not set the code page of the allocated string to UTF8.

> For dealing with such code, which is not yet codepage-aware, by default the
> situation is no worse or no better than it was in previous FPC versions:
> exactly the same would happen there. However, in FPC 3.x you can generally
> fix it by changing the default code page for ansistrings using
> SetMultiByteConversionCodePage() to what you know/want to be the encoding of
> ansistrings, like Lazarus does.

If Lazarus already sets SetMultiByteConversionCodePage, then it will wreak
havoc to set it to something else.

This matter must be decided at the TDataset level: it should have a property
to determine the character set of string fields (and possibly different for
each field, since this can differ in the database on a field basis).

>
> All of this is moreover completely independent of {$modeswitch
> unicodestrings}, since that is just a shortcut to make String an alias for
> UnicodeString in the current compilation module (and Char for WideChar, and
> PChar for PWideChar).

Honestly, I don't understand this preoccupation with {$modeswitch  unicodestrings}

It just means that

Var
  a : string;

is read by the compiler as

Var
  a : unicodestring;

No more, no less.

Michael.

Re: RTL and Unicode Strings

Graeme Geldenhuys-6
On 2016-05-11 10:48, Michael Van Canneyt wrote:

> Honestly, I don't understand this preoccupation with {$modeswitch  unicodestrings}
>
> It just means that
>
> Var
>   a : string;
>
> is read by the compiler as
>
> Var
>   a : unicodestring;
>
> No more, no less.


It saves you from data loss in the case where you use units that use the
String data type and assign Unicode data to it -- and you run your
program on a system where the locale is not UTF-8 or UTF-16. eg: Latin-1.


Regards,
  Graeme

Re: RTL and Unicode Strings

Graeme Geldenhuys-6
In reply to this post by Marco van de Voort
On 2016-05-11 10:43, Marco van de Voort wrote:
>> The problem as I see it, is that I already lost data when SqlDB
>> converted it to Latin-1. Am I not understanding the problem?
>
> It depends. Sqldb assigned non ansistring data to an ansistring. In the old
> (2.6.4, old delphi) logic it would simply move without conversion, and you
> would obtain an ansistring with utf8 in it and be converting forever.

Correct, and because 2.6.4 did no conversions I can accurately assume in
my application that an AnsiString contains UTF-8 encoded data, and work
with it appropriately. This is how fpGUI and LCL have been working for
many years.

But now with 3.0.0, auto-conversion occurs inside the RTL and FCL code,
corrupting the data before I can get to it.

That's a massive difference between 2.6.4 and 3.x.
As it stands now, I cannot see how anybody can actually switch to FPC
3.0 - it simply isn't ready to be used.


Regards,
  Graeme

Re: RTL and Unicode Strings

Michael Van Canneyt
In reply to this post by Graeme Geldenhuys-6


On Wed, 11 May 2016, Graeme Geldenhuys wrote:

> On 2016-05-11 10:48, Michael Van Canneyt wrote:
>> Honestly, I don't understand this preoccupation with {$modeswitch  unicodestrings}
>>
>> It just means that
>>
>> Var
>>   a : string;
>>
>> is read by the compiler as
>>
>> Var
>>   a : unicodestring;
>>
>> No more, no less.
>
>
> It saves you from data loss in the case where you use units that use the
> String data type and assign Unicode data to it -- and you run your
> program on a system where the locale is not UTF-8 or UTF-16. eg: Latin-1.

No, it does not save you. Where did you get that from?

Michael.

Re: RTL and Unicode Strings

Graeme Geldenhuys-6
On 2016-05-11 12:48, Michael Van Canneyt wrote:
> No, it does not save you, where did you get that from ?

It helps. Any encoding to UTF-16 (or UTF-8) is safe. The other way round
is not. There is no guarantee that String (or AnsiString) is using a
Unicode encoding. So depending on where you get your data from, in my
case that data is one of the Unicode encodings, doing a conversion to
anything other than another Unicode encoded variable (or RawByteString)
means I could lose data.
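
The asymmetry can be sketched with a round trip, assuming FPC 3.0 or later. `TCp1252String` is a hypothetical type used here as a stand-in for a non-Unicode system code page such as the Latin-1 box in the example.

```pascal
program LossyRoundTrip;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

type
  // Hypothetical stand-in for a non-Unicode code page (Windows-1252).
  TCp1252String = type AnsiString(1252);

var
  Original, RoundTrip: UnicodeString;
  Narrow: TCp1252String;
begin
  // CYRILLIC CAPITAL LETTER ZHE, which Windows-1252 cannot represent.
  Original := WideChar($0416);

  Narrow := Original;   // UTF-16 to Windows-1252: the unsafe direction
  RoundTrip := Narrow;  // Windows-1252 to UTF-16: the safe direction

  WriteLn('bytes in the narrow string: ', Length(Narrow));
  WriteLn('survived the round trip: ', RoundTrip = Original);
end.
```

Only the first assignment can drop information; once the character has been replaced there, converting back to UTF-16 cannot restore it.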

See my actual database example (with sample code) titled "code example
where AnsiString used in FCL (SqlDB) causes data loss" - whenever a
moderator releases that post to the mailing list. Otherwise I can
forward it to you in private.

Regards,
  Graeme
