FPC 3 regression: cannot use TStringList for UTF-8 data any more?

FPC 3 regression: cannot use TStringList for UTF-8 data any more?

tobiasgiesen
Hello,

disallowing "AnsiString" code for UTF-8 is a huge regression.

I use TStringList for UTF-8 strings. This is no longer possible, because
automatic conversions cause question marks and data loss.

I also use a large amount of third-party libraries that use the AnsiString
data type for UTF-8.

I really want to use FPC 3 due to other things, but this is a deal
breaker. Why not add a simple switch or even a run-time Boolean global
variable to turn off codepage conversions?

It behaves differently from Delphi too.

Cheers,
Tobias

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Mattias Gaertner
On Mon, 04 Apr 2016 10:18:18 +0200
[hidden email] wrote:

> Hello,
>
> disallowing "AnsiString" code for UTF-8 is a huge regression.
>
> I use TStringList for UTF-8 strings. This is no longer possible, because
> automatic conversions cause question marks and data loss.

Lazarus uses TStringList with UTF-8 all over the place.

Please post a complete example demonstrating the problem.

Mattias

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Michael Van Canneyt
In reply to this post by tobiasgiesen


On Mon, 4 Apr 2016, [hidden email] wrote:

> Hello,
>
> disallowing "AnsiString" code for UTF-8 is a huge regression.
>
> I use TStringList for UTF-8 strings. This is no longer possible, because
> automatic conversions cause question marks and data loss.

Same answer as in my other mail. Set DefaultSystemCodePage to CP_UTF8.
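
[Editor's note] A minimal sketch of what that suggestion could look like in a plain FPC 3.0 program, without Lazarus. It assumes the System routines SetMultiByteConversionCodePage and SetMultiByteRTLFileSystemCodePage are used to change the default code page, and that cwstring is needed on Unix to get real code page conversions. The program name and variables are illustrative only.

```pascal
program Utf8StringListSketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring, // widestring manager; assumed necessary on Unix for real conversions
  {$endif}
  Classes;

var
  List: TStringList;
  S: string;
  U: UnicodeString;
begin
  // Tell the RTL that plain AnsiString/String data is UTF-8.
  // This sets DefaultSystemCodePage; do it before any conversions happen.
  SetMultiByteConversionCodePage(CP_UTF8);
  // Optional: also treat file names passed to RTL routines as UTF-8.
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  List := TStringList.Create;
  try
    // 'a' followed by U+00E4, written as raw UTF-8 bytes so that the
    // example does not depend on the encoding of the source file.
    S := 'a'#$C3#$A4;
    List.Add(S);

    // Converting to UnicodeString now interprets the bytes as UTF-8.
    U := UnicodeString(List[0]);
    WriteLn('bytes in list item: ', Length(List[0]));
    WriteLn('UTF-16 code units:  ', Length(U));
  finally
    List.Free;
  end;
end.
```

Storing and retrieving the string in the TStringList does not by itself convert anything. The default code page only matters once the data is converted, for example to UnicodeString or to a string type with a different declared code page.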

>
> I also use a large amount of third-party libraries that use the AnsiString
> data type for UTF-8.
>
> I really want to use FPC 3 due to other things, but this is a deal
> breaker. Why not add a simple switch or even a run-time Boolean global
> variable to turn off codepage conversions?
>
> It behaves differently from Delphi too.

This depends on the version of Delphi :)

Michael.

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

tobiasgiesen
In reply to this post by Mattias Gaertner
> > I use TStringList for UTF-8 strings. This is no longer possible, because
> > automatic conversions cause question marks and data loss.
>
> Lazarus uses TStringList with UTF-8 all over the place.
>
> Please post a complete example demonstrating the problem.

Sorry - this was only theoretical, because of the Backward compatibility
section on the FPC Unicode Support page.

It says that a "defined way" to use strings is "you do not store data in
an ansistring that has been encoded using something else than the
system's default code page, and subsequently pass this string as-is to
an FPC RTL routine".

That would mean I cannot use TStringList for UTF-8.  The paragraph is
misleading, really. Very theoretical. What you really need to tell
people is something like this:

"Unicode aware Pascal code needs to set DefaultSystemCodePage to
CP_UTF8".

I am sorry but I was really shocked this morning when I saw the question
marks :)

Cheers,
Tobias


Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Graeme Geldenhuys-6
On 2016-04-04 09:43, [hidden email] wrote:
> Very theoretical. What you really need to tell
> people is something like this:

Then please update the wiki - it is user editable. Even a seasoned
developer like myself still needs to get his head around all this FPC
Unicode stuff. So any information and tips on the wiki would be greatly
appreciated.

I haven't moved to FPC 3.0 yet, but when I do, I too will have lots of
testing to do in my own code. I don't use LCL, but I have been storing
UTF-8 text inside AnsiStrings for years (on all platforms).

Regards,
  - Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

My public PGP key:  http://tinyurl.com/graeme-pgp

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

tobiasgiesen
> Then please update the wiki - it is user editable.

Done:
http://wiki.freepascal.org/FPC_Unicode_support#Backward_compatibility

I hope this is correct.

Cheers,
Tobias



Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Jonas Maebe-2

tobiasgiesen wrote on Mon, 04 Apr 2016:

>> Then please update the wiki - it is user editable.
>
> Done:
> http://wiki.freepascal.org/FPC_Unicode_support#Backward_compatibility
>
> I hope this is correct.

It is incorrect in the sense that there is nothing utf8-specific about  
the way your code (ab)used ansistrings. I will fix it, since that page  
is more or less part of the official FPC documentation (since it's  
linked from the FPC 3.0 release notes).


Jonas

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Juha Manninen
In reply to this post by tobiasgiesen
On Mon, Apr 4, 2016 at 11:18 AM,  <[hidden email]> wrote:
> I use TStringList for UTF-8 strings. This is no longer possible, because
> automatic conversions cause question marks and data loss.

You are completely lost with this issue. The automatic conversion of
encodings is a big step forward.
Just use the new UTF-8 mode provided by Lazarus and remove all
explicit conversion functions.
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus

Juha

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Graeme Geldenhuys-6
On 2016-04-04 10:27, Juha Manninen wrote:
> Just use the new UTF-8 mode provided by Lazarus and remove all
> explicit conversion functions.

This is the FPC mailing list. Not everybody here uses Lazarus or LCL, so
making such a suggestion is wishful thinking. For example, your
suggestion means nothing to me, I don't use LCL.

Regards,
  - Graeme -


Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Juha Manninen
On Mon, Apr 4, 2016 at 12:52 PM, Graeme Geldenhuys
<[hidden email]> wrote:
> This is the FPC mailing list. Not everybody here uses Lazarus or LCL, so
> making such a suggestion is wishful thinking. For example, your
> suggestion means nothing to me, I don't use LCL.

Yes, I should have mentioned that this feature does not require LCL.
It only requires LazUtils package and LazUTF8 unit in your uses
section.
It can be used in cmd line and server programs and I guess in fpGUI,
too, although I have not tested.
But yes, it requires Lazarus IDE because LazUtils is a Lazarus
package. At least you must create and compile the project using
Lazarus IDE.

Anyway, this UTF-8 mode does more than set the default String encoding.
It also provides proper UTF-8 functions as backends for RTL's
Ansi...() string functions.
It also uses cwstring although it pulls in clib.
Then typical users' code is amazingly Delphi compatible despite the
different encoding, because code only seldom deals with individual
codepoints beyond 7-bit ASCII.
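
[Editor's note] A rough sketch of the usage described above, assuming the LazUTF8 unit from LazUtils (or a copy of it together with FPCAdds) is on the unit search path, and that its initialization switches the default code page to UTF-8 as described. UTF8Length is a function of LazUTF8; the program name and variables are illustrative only.

```pascal
program LazUtf8Sketch;
{$mode objfpc}{$H+}

uses
  LazUTF8; // from the LazUtils package; assumed to set up the UTF-8 mode

var
  S: string;
begin
  // 'a' followed by U+00E4, written as raw UTF-8 bytes so that the
  // example does not depend on the encoding of the source file.
  S := 'a'#$C3#$A4;

  // Length counts bytes, UTF8Length counts codepoints.
  WriteLn('bytes:      ', Length(S));
  WriteLn('codepoints: ', UTF8Length(S));
end.
```

Code that only copies, compares, or concatenates whole strings does not need the UTF8...() functions. They matter when individual codepoints beyond 7-bit ASCII are inspected.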

Juha

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Mattias Gaertner
In reply to this post by Graeme Geldenhuys-6
On Mon, 4 Apr 2016 10:52:20 +0100
Graeme Geldenhuys <[hidden email]> wrote:

> On 2016-04-04 10:27, Juha Manninen wrote:
> > Just use the new UTF-8 mode provided by Lazarus and remove all
> > explicit conversion functions.
>
> This is the FPC mailing list. Not everybody here uses Lazarus or LCL, so
> making such a suggestion is wishful thinking. For example, your
> suggestion means nothing to me, I don't use LCL.

First of all it's part of LazUtils. So you don't have to use the LCL
for that. In fact you don't have to use LazUtils: some users simply
copied the two units FPCAdds and LazUTF8. It's all open source.

Second I find it funny that the statement comes from you - a notorious
promoter of software on forums/lists of competing projects.

And third setting the DefaultSystemCodePage is a good start, but not
enough. Instead of explaining all the gory details, Juha promoted a
more complete solution for UTF-8. This is useful for many users. They
don't have to reinvent the wheel.


Mattias

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Mattias Gaertner
In reply to this post by Juha Manninen
On Mon, 4 Apr 2016 13:27:05 +0300
Juha Manninen <[hidden email]> wrote:

>[...]
> But yes, it requires Lazarus IDE because LazUtils is a Lazarus
> package. At least you must create and compile the project using
> Lazarus IDE.

Or simply copy the two units FPCAdds, LazUTF8 or parts of them from
here:
http://svn.freepascal.org/svn/lazarus/tags/lazarus_1_6/components/lazutils/


Mattias

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Michael Schnell
In reply to this post by tobiasgiesen
On 04/04/2016 10:43 AM, [hidden email] wrote:
> "Unicode aware Pascal code needs to set DefaultSystemCodePage to
> CP_UTF8".

That can't be required this universally. I would expect the default value
to make sense in many cases.

OTOH, if - as you seem to suggest - there is any conversion at all when
using TStringList to store your UTF-8 strings (regardless of whether it
"works" or not), this will introduce a noticeable performance problem. I
don't know if that depends on the setting of DefaultSystemCodePage.

Please let us know what you find. (right now Lazarus does not seem to
compile for me with 3.1.1, so I can't easily check myself.)

-Michael

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Graeme Geldenhuys-6
In reply to this post by Mattias Gaertner
On 2016-04-04 11:40, Mattias Gaertner wrote:
> Or simply copy the two units FPCAdds, LazUTF8 or parts of them from
> here:

Thank you Juha and Mattias - I'll take a look at those to see what they do.

Regards,
  - Graeme -



Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Michael Schnell
In reply to this post by Juha Manninen
On 04/04/2016 11:27 AM, Juha Manninen wrote:
> On Mon, Apr 4, 2016 at 11:18 AM,  <[hidden email]> wrote:
>> I use TStringList for UTF-8 strings. This is no longer possible, because
>> automatic conversions cause question marks and data loss.
> You are completely lost with this issue. The automatic conversion of
> encodings is a big step forward.
> Just use the new UTF-8 mode provided by Lazarus and remove all
> explicit conversion functions.
(How) does the new UTF-8 mode in Lazarus change the way TStringList
works (as this is what Tobias is concerned about)  ?

-Michael

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Graeme Geldenhuys-6
In reply to this post by Mattias Gaertner
On 2016-04-04 11:34, Mattias Gaertner wrote:
> for that. In fact you don't have to use LazUtils: some users simply
> copied the two units FPCAdds and LazUTF8. It's all open source.

This was not made clear until you explicitly mentioned it. Juha's
initial comment was vague on the matter, and the original poster never
mentioned they used Lazarus or LCL.


> Second I find it funny that the statement comes from you

I simply wanted an answer or explanation that benefits anybody using FPC.


> more complete solution for UTF-8. This is useful for many users. They
> don't have to reinvent the wheel.

Not having looked at the two units you mentioned... but if this is a
general requirement for anybody using UTF-8 or similar with FPC 3.0,
then wouldn't it be best to see if those units can be contributed to
FPC's FCL? The ultimate "don't reinvent the wheel" location. ;-)


Regards,
  - Graeme -


Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Michael Van Canneyt


On Mon, 4 Apr 2016, Graeme Geldenhuys wrote:

>> more complete solution for UTF-8. This is useful for many users. They
>> don't have to reinvent the wheel.
>
> Not having looked at the two units you mentioned... but if this is a
> general requirement for anybody using UTF-8 or similar with FPC 3.0,
> then wouldn't it be best to see if those units can be contributed to
> FPC's FCL? The ultimate "don't reinvent the wheel" location. ;-)

One would think so but:

1. Using UTF-8 is a choice of Lazarus. Other people may prefer UnicodeString.
    On Windows, UnicodeString is more 'natural' or 'native'.

2. The release cycle of FPC is rather long, so updates will not be available
    as fast as the Lazarus team needs them.
    And in view of 1. that may be a problem.

If memory serves well, there was initially an attempt to get some of the
functionality into FPC by Felipe, but this was quickly abandoned due to above
arguments...

Michael.

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Graeme Geldenhuys-6
On 2016-04-04 12:06, Michael Van Canneyt wrote:
> 1. Using UTF8 is a choice of lazarus. Other people may prefer UnicodeString.
>     On Windows, UnicodeString is more 'natural' or 'native'.

Based on Internet standards and most popular OSes (mobile devices
included), UTF-8 is king - so we all know Windows backed the wrong horse
[encoding]. ;-)

   [...Graeme runs and hides...]



> 2. The release cycle of FPC is rather long, so updates will be available not
>     as fast as the lazarus team needs them.

That's a valid point.

Though it could probably be added as quickly as in FPC 3.0.2. It's simply
two new units that need to be explicitly used by somebody to have any
effect, so it will not break existing code otherwise [if not used].


Regards,
  - Graeme -


Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Michael Van Canneyt


On Mon, 4 Apr 2016, Graeme Geldenhuys wrote:

> On 2016-04-04 12:06, Michael Van Canneyt wrote:
>> 1. Using UTF8 is a choice of lazarus. Other people may prefer UnicodeString.
>>     On Windows, UnicodeString is more 'natural' or 'native'.
>
> Based on Internet standards and most popular OSes (mobile devices
> included), UTF-8 is king - so we all know Windows backed the wrong horse
> [encoding]. ;-)
>
>   [...Graeme runs and hides...]
>

Well, in 2016, I still only use UTF-8, even on windows.
It works without problems if you know what you're doing.


>> 2. The release cycle of FPC is rather long, so updates will be available not
>>     as fast as the lazarus team needs them.
>
> That's a valid point.
>
> Though it could probably be added as quickly as in FPC 3.0.2. It's simply
> two new units that need to be explicitly used by somebody to have any
> effect, so it will not break existing code otherwise [if not used].

They should at least be renamed, to avoid confusion.

Other than that, I personally see no objections.

Michael.

Re: FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Jonas Maebe-2

Michael Van Canneyt wrote on Mon, 04 Apr 2016:

> On Mon, 4 Apr 2016, Graeme Geldenhuys wrote:
>
[add LCL UTF-8 helper units to FPC]
>> Though it could probably be added as quickly as in FPC 3.0.2. It's simply
>> two new units that need to be explicitly used by somebody to have any
>> effect, so it will not break existing code otherwise [if not used].
>
> They should at least be renamed, to avoid confusion.
>
> Other than that, I personally see no objections.

I do: it's more units that we have to maintain, process bug reports  
and feature requests for, etc (or, in case they are supposed to remain  
copies of the Lazarus units, then it's extra work keeping them in sync  
and given the non-synchronised release cycles, they will almost never  
be in sync). We already have plenty of work with our own code.


Jonas