Unicode file routines proposal

Felipe Monteiro de Carvalho
Hello,

There is already another thread about that, but the thread got too
long, and I would like to make a concrete proposal about unicode file
routines.

It looks simple to me: there are just two ways to go, either UTF-8 or
UTF-16. Correct me if I am wrong, but I believe that FPC developers
prefer UTF-16, so we can have a widestring version of every routine in
the RTL which involves filenames.

So let me start with a concrete example:

http://www.freepascal.org/docs-html/rtl/system/assign.html

We would need to add a:

procedure Assign(
  var f: ;
  const Name: widestring
);

Also for all these routines:

http://www.freepascal.org/docs-html/rtl/sysutils/filenameroutines.html

Under Windows it can be implemented like this with Windows 9x support:

procedure AnyFileRoutineInWin32(AFileName: widestring);
begin
 if UnicodeEnabledOS then SomeWin32APIW()
 else AnsiToWideString(SomeWin32ApiA())
end;

One can initialize UnicodeEnabledOS by reading the operating system
version and the operating system type NT/9x very easily.

Under Windows 9x we won't support true unicode filenames, but this
doesn't matter, because the operating system doesn't support them
anyway. The widestring routines will keep working under Windows 9x for
most code.

This method is used with great success in the LCL. Extended information here:

http://wiki.lazarus.freepascal.org/LCL_Unicode_Support#Guidelines
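
A minimal sketch of that dispatch for one concrete case, compilable for
Windows targets only. WideFileExists and NoAttributes are hypothetical
names chosen for illustration and are not part of the RTL. Note that on
9x the wide name is converted to ANSI before the A function is called.

```pascal
program WideFileCheck;
{$mode objfpc}{$H+}
{ Windows-only sketch: uses the Win32 API directly. }
uses
  Windows, SysUtils;

const
  // Value the Win32 API returns when the attributes cannot be read.
  NoAttributes = DWORD($FFFFFFFF);

var
  UnicodeEnabledOS: Boolean;

// Hypothetical wide counterpart of FileExists, not part of the RTL.
function WideFileExists(const AFileName: WideString): Boolean;
var
  Attr: DWORD;
  AnsiName: AnsiString;
begin
  if UnicodeEnabledOS then
    Attr := GetFileAttributesW(PWideChar(AFileName))
  else
  begin
    // Windows 9x: convert to the ANSI code page, then call the A variant.
    AnsiName := AnsiString(AFileName);
    Attr := GetFileAttributesA(PAnsiChar(AnsiName));
  end;
  Result := (Attr <> NoAttributes) and
            ((Attr and FILE_ATTRIBUTE_DIRECTORY) = 0);
end;

var
  ExeName: WideString;
begin
  // NT-based systems implement the W entry points; 9x only the A ones.
  UnicodeEnabledOS := (Win32Platform = VER_PLATFORM_WIN32_NT);
  ExeName := ParamStr(0);  // path of the running executable
  WriteLn('Unicode OS: ', UnicodeEnabledOS);
  WriteLn('Executable found: ', WideFileExists(ExeName));
end.
```

The flag is set once at startup, so each call costs one boolean test
plus, on 9x only, one string conversion.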

--
Felipe Monteiro de Carvalho
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Unicode file routines proposal

Mattias Gaertner
Quoting Felipe Monteiro de Carvalho <[hidden email]>:

> Hello,
>
> There is already another thread about that, but the thread got too
> long, and I would like to make a concrete proposal about unicode file
> routines.
>
> It looks simple to me: there are just two ways to go, either UTF-8 or
> UTF-16. Correct me if I am wrong, but I believe that FPC developers
> prefer UTF-16, so we can have a widestring version of every routine in
> the RTL which involves filenames.
>
> So let me start with a concrete example:
>
> http://www.freepascal.org/docs-html/rtl/system/assign.html
>
> We would need to add a:
>
> procedure Assign(
>   var f: ;
>   const Name: widestring
> );
>
> Also for all these routines:
>
> http://www.freepascal.org/docs-html/rtl/sysutils/filenameroutines.html
>
> Under Windows it can be implemented like this with Windows 9x support:
>
> procedure AnyFileRoutineInWin32(AFileName: widestring);
> begin
>  if UnicodeEnabledOS then SomeWin32APIW()
>  else AnsiToWideString(SomeWin32ApiA())
> end;

But what about all existing code?
For example the FCL?
How will TStringList.LoadFromFile be converted?


> One can initialize UnicodeEnabledOS by reading the operating system
> version and the operating system type NT/9x very easily.
>
> Under Windows 9x we won't support true unicode filenames, but this
> doesn't matter, because the operating system doesn't support them
> anyway. The widestring routines will keep working under Windows 9x for
> most code.
>
> This method is used with great success in the LCL. Extended information here:
>
> http://wiki.lazarus.freepascal.org/LCL_Unicode_Support#Guidelines

Mattias


Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> There is already another thread about that, but the thread got too
> long, and I would like to make a concrete proposal about unicode file
> routines.
>
> It looks simple to me, there are just 2 ways to go, either utf-8 or
> utf-16.

There are more possibilities:
- native encoding (utf-8 on *nix, utf-16 on windows)
- have two types.
- a unified type (type contains encoding)

Even this has not been decided.
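
A minimal sketch of what the first option could look like at the type
level. The alias name RTLString is a hypothetical choice for
illustration; nothing like it is decided or present in the RTL.

```pascal
program NativeEncoding;
{$mode objfpc}{$H+}

type
{$ifdef WINDOWS}
  RTLString = WideString;   // UTF-16 code units
{$else}
  RTLString = UTF8String;   // UTF-8 bytes
{$endif}

var
  S: RTLString;
begin
  S := 'abc';
  // The element size differs per platform: 2 on Windows, 1 elsewhere.
  WriteLn('Bytes per code unit: ', SizeOf(S[1]));
  WriteLn('Code units in "abc": ', Length(S));
end.
```

Code that only passes such a string around stays portable; code that
indexes into it sees different units on each platform.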

Re: Unicode file routines proposal

Vincent Snijders
Marco van de Voort wrote:
>> It looks simple to me, there are just 2 ways to go, either utf-8 or
>> utf-16.
>
> There are more possibilities:
> - native encoding (utf-8 on *nix, utf-16 on windows)
> - have two types.

How can one write portable code with these options?

> - a unified type (type contains encoding)
>

Vincent

Re: Unicode file routines proposal

Graeme Geldenhuys-2
In reply to this post by Felipe Monteiro de Carvalho
2008/6/30 Felipe Monteiro de Carvalho <[hidden email]>:
> It looks simple to me: there are just two ways to go, either UTF-8 or
> UTF-16. Correct me if I am wrong, but I believe that FPC developers
> prefer UTF-16, so we can have a widestring version of every routine in
> the RTL which involves filenames.


I thought UTF-8 was preferred. That is why Lazarus followed the
UTF-8 route in the LCL and its Unicode support.


Regards,
 - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Mattias Gaertner
On Mon, Jun 30, 2008 at 9:31 AM, Mattias Gärtner
<[hidden email]> wrote:
> But what about all existing code?
> For example the FCL?
> How will TStringList.LoadFromFile be converted?

procedure TStringList.LoadFromFile(AFileName: widestring); overload;

The ansi version could call the wide version and just do the string conversion.
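
A minimal sketch of that forwarding pattern. DescribeFileName is a
hypothetical routine standing in for any call that takes a file name;
it is not an RTL or FCL function.

```pascal
program AnsiCallsWide;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical routine standing in for any call that takes a file name.
function DescribeFileName(const AFileName: WideString): string; overload;
begin
  Result := IntToStr(Length(AFileName)) + ' UTF-16 code units';
end;

// The ansi overload only converts its argument and forwards it.
function DescribeFileName(const AFileName: AnsiString): string; overload;
begin
  Result := DescribeFileName(WideString(AFileName));
end;

var
  A: AnsiString;
  W: WideString;
begin
  A := 'notes.txt';
  W := 'notes.txt';
  WriteLn(DescribeFileName(A));  // goes through the conversion
  WriteLn(DescribeFileName(W));  // calls the wide version directly
end.
```

Existing callers keep compiling unchanged and pay one conversion per
call; only the wide version contains real logic.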

--
Felipe Monteiro de Carvalho

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Graeme Geldenhuys-2
On Mon, Jun 30, 2008 at 9:55 AM, Graeme Geldenhuys
<[hidden email]> wrote:
> I thought UTF-8 was preferred. That is why Lazarus followed the
> UTF-8 route in the LCL and its Unicode support.

UTF-8 is much better for the LCL because it just fits much better in
our existing codebase.

For the RTL we would also like to have UTF-8, but in previous
conversations I got the impression that RTL developers prefer UTF-16.

--
Felipe Monteiro de Carvalho

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Vincent Snijders
> Marco van de Voort wrote:
> >> It looks simple to me, there are just 2 ways to go, either utf-8 or
> >> utf-16.
> >
> > There are more possibilities:
> > - native encoding (utf-8 on *nix, utf-16 on windows)
> > - have two types.
>
> How can one write portable code with these options?

How can you consider yourself portable by picking one system's encoding and
emulating it on others?

Note also that reliance on encoding is way less important, since fewer
people will be parsing through strings manually (simply because it is more
difficult).

Re: Unicode file routines proposal

Vincent Snijders
Marco van de Voort wrote:

>> Marco van de Voort wrote:
>>>> It looks simple to me, there are just 2 ways to go, either utf-8 or
>>>> utf-16.
>>> There are more possibilities:
>>> - native encoding (utf-8 on *nix, utf-16 on windows)
>>> - have two types.
>> How can one write portable code with these options?
>
> How can you consider yourself portable by picking one system's encoding and
> emulating it on others?
>

At the borders of my code I convert all strings to the 'internal type' and encoding
and use them like that. Kind of like we are doing nowadays to convert the line endings
in text files.
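
A minimal sketch of such a border conversion, assuming UTF-16
(WideString) as the internal type and UTF-8 as the outside encoding.
It uses only UTF8Encode and UTF8Decode from the system unit; the
variable names are illustrative.

```pascal
program BorderConversion;
{$mode objfpc}{$H+}

var
  Internal, RoundTrip: WideString;  // the 'internal type'
  Outside: UTF8String;              // what crosses the border
begin
  // Build 'Licao' with cedilla and tilde from code points, so the
  // example does not depend on the encoding of this source file.
  SetLength(Internal, 5);
  Internal[1] := WideChar($004C);  // L
  Internal[2] := WideChar($0069);  // i
  Internal[3] := WideChar($00E7);  // c with cedilla
  Internal[4] := WideChar($00E3);  // a with tilde
  Internal[5] := WideChar($006F);  // o

  Outside := UTF8Encode(Internal);   // leaving: UTF-16 -> UTF-8
  RoundTrip := UTF8Decode(Outside);  // entering: UTF-8 -> UTF-16

  WriteLn('UTF-16 code units: ', Length(Internal));
  WriteLn('UTF-8 bytes:       ', Length(Outside));
  WriteLn('Round trip intact: ', RoundTrip = Internal);
end.
```

The same five characters take 5 code units inside and 7 bytes outside,
which is why Length cannot be carried across the border unchanged.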

I see what you are trying to say, but having a string type that is UTF8 encoded on
one system and UTF16 encoded on another system, doesn't seem easy to work with to
me, even if you name it for example RTLString. Even widestring is an example of bad
portability, because they are refcounted everywhere except on windows.

> Note also that reliance on encoding is way less important, since fewer
> people will be parsing through strings manually (simply because it is more
> difficult)

Right, but they rely on not having to convert it all the time.

ATM, all the client libs above the RTL have chosen one encoding and string type:
LCL and fpGUI use UTF-8, MSEgui uses widestring.

So for those libs to interface with a platform dependent string type in the LCL,
they would have to write platform dependent code. I don't feel much like writing a
LCLSysutils.FileExists, like Graeme already has done, to hide these conversions.

Vincent

Re: Unicode file routines proposal

Marco van de Voort
> Marco van de Voort wrote:
> At the borders of my code I convert all strings to the 'internal type' and encoding
> and use them like that. Kind of like we are doing nowadays to convert the line endings
> in text files.

I don't like this. It makes, for example, processing a database export on Unix
unnecessarily costly.
 
> I see what you are trying to say, but having a string type that is UTF8
> encoded on one system and UTF16 encoded on another system, doesn't seem
> easy to work with to me, even if you name it for example RTLString.

It should be possible to work in the native encoding. One doesn't want to
wrap _every_ function in _every_ header with conversion procs.

> > Note also that reliance on encoding is way less important, since fewer
> > people will be parsing through strings manually (simply because it is more
> > difficult)
>
> Right, but they rely on not having to convert it all the time.

Well, they will have to do that with one string type too, at every external
barrier.

That also kills the benefit of choosing UTF-16 in the first place, since
Delphi code won't work on Unix without manually inserting a lot of
conversion code.
 
> ATM, all the client libs above the RTL have chosen one encoding and string type:
> LCL and fpGUI use UTF-8, MSEgui uses widestring.

That has nothing to do with these decisions. They chose that in the absence
of a good solution. This is about picking a good solution.
 
> So for those libs to interface with a platform dependent string type in
> the LCL, they would have to write platform dependent code.

You will have to anyway for any solution that only supports one encoding.
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
On Mon, Jun 30, 2008 at 10:32 AM, Marco van de Voort <[hidden email]> wrote:
> It should be possible to work in the native encoding. One doesn't want to
> wrap _every_ function in _every_ header with conversions procs.

It is not possible to work with an ever-changing encoding.

MyLabel.Caption := 'Lição';

How would that ever work with an ever-changing encoding? It would not.

If you go to the real implementation level, a changing encoding quickly
becomes unmanageable.

And what about the LFM files? In which encoding will they be? What if
you develop software on one system and try to build it on another?

OK, to go one step further: has anyone ever seen a fully Unicode
system which works with changing encodings? I believe none exists,
because this is not a good solution.

> Well, they will have to do that with one string type too, at every external
> barrier.

This is already necessary.

> That also kills the benefit of choosing UTF-16 in the first place, since
> Delphi code won't work on Unix without manually inserting a lot of
> conversion code.

Delphi code can use the ansi routines, which could just call the
utf-16 routines with a string conversion, or you can implement every
routine twice to maximize speed.

--
Felipe Monteiro de Carvalho

Re: Unicode file routines proposal

Marco van de Voort
> On Mon, Jun 30, 2008 at 10:32 AM, Marco van de Voort <[hidden email]> wrote:
> > It should be possible to work in the native encoding. One doesn't want to
> > wrap _every_ function in _every_ header with conversions procs.
>
> It is not possible to work with an ever-changing encoding.
>
> MyLabel.Caption := 'Lição';
>
> How would that ever work with an ever-changing encoding? It would not.

Encoding in source is something totally different. Such a literal is like
'\u1232\u2314' syntax, which the compiler can convert to UTF-8 or UTF-16.
In theory, I think; practice might be different.
 
> If you go to the real implementation level, a changing encoding quickly
> becomes unmanageable.

That's why I don't believe that one string type with two encodings helps.
But if FileExists is UTF-8 on Unix and UTF-16 on Windows, and any UTF-16 or
UTF-8 string that you pass from Lazarus is auto-converted, what exactly is
the problem? Everybody can maintain certain subsystems in a certain encoding
without forcing that choice upon others.

> And what about the LFM files? In which encoding will they be?

The one you annotate in it? The loading code can decode both, since both
systems have both.

> What if you develop software on one system and try to build it on
> another?

What does that mean for the fully UTF-16 system? First you may start with
wrapping all C APIs that use UTF-8 on Unix.

I understand the simplicity of one encoding is appealing, but you have to
look at all aspects, and that is not just representation in the GUI.

It will mean that _every_ string transaction to the outside will have to be
manually wrapped AND have a performance penalty. That is a heavy price to
pay for not touching a bit of lfm loading code.
 
> OK, to go one step further: has anyone ever seen a fully Unicode
> system which works with changing encodings? I believe none exists,
> because this is not a good solution.

How many systems do you know that have data files like .lfm's crossing
system borders?
 
> > Well, they will have to do that with one string type too, at every
> > external barrier.
>
> This is already necessary.

But if you properly type them, some conversions may be automatic. Something
you don't have with a single type.
 
> > That also kills the benefit of choosing UTF-16 in the first place, since
> > Delphi code won't work on Unix without manually inserting a lot of
> > conversion code.
>
> Delphi code can use the ansi routines, which could just call the
> utf-16 routines with a string conversion, or you can implement every
> routine twice to maximize speed.

If the unicode code is not compatible with Delphi (UTF-16), there is no
point in using UTF-16 in the first place.
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Unicode file routines proposal

John Coppens
In reply to this post by Felipe Monteiro de Carvalho
On Mon, 30 Jun 2008 10:03:18 -0300
"Felipe Monteiro de Carvalho" <[hidden email]> wrote:

> On Mon, Jun 30, 2008 at 9:55 AM, Graeme Geldenhuys
> <[hidden email]> wrote:
> > I thought UTF-8 was preferred. That is why Lazarus followed the
> > UTF-8 route in the LCL and its Unicode support.
>
> UTF-8 is much better for the LCL because it just fits much better in
> our existing codebase.

This may have been discussed before - but should the encoding not be
dependent on the locale? What would happen if I write a FPC program,
if the internal routines are, eg., UTF-16, and my locale is set to
en_US.UTF8?

Anyway, I have the impression that most of Linux is utf-8 oriented by now.

John
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Mon, Jun 30, 2008 at 11:35 AM, Marco van de Voort <[hidden email]> wrote:
> I understand the simplicity of one encoding is appealing, but you have to
> look at all aspects, and that is not just representation in the GUI.
>
> It will mean that _every_ string transaction to the outside will have to be
> manually wrapped AND have a performance penalty. That is a heavy price to
> pay for not touching a bit of lfm loading code.

It won't need to be wrapped on platforms which natively support the
chosen encoding. UTF-16 is natively supported in Windows and Windows
CE. Not sure about Unixes.

Because LCL uses a single encoding this performance difference
disappears as soon as you need to convert the string in LCL.

> How many systems do you know that have data files like .lfm's crossing
> system borders?

Gtk can load XML files, somewhat equivalent to our LFMs. They use
UTF-8 everywhere.

Java is cross-platform and uses UTF-16 everywhere.

wxWidgets uses UTF-16 everywhere.

Let me try to summarize my opinion on multiple encodings vs single encoding:

multiple encodings:

* More complex
* Innovative solution, no known example of an implementation of this
system exists = uncertainty whether it works at all, or whether it is
convenient for developers
* Depends on a not yet implemented string type
* Potentially will have a higher performance than a single encoding
system, but only if you use this new special string type

Single encoding:

* Simple, proven solution
* Does not need any new string type, can start being implemented immediately
* Potentially has a lower performance due to string conversions.

Actually, for Lazarus the only advantage I see in the multiple encoding
system does not exist: because we use a single encoding, on some
platforms we will need conversion and on others we won't, which
just makes things worse for us.

--
Felipe Monteiro de Carvalho

Re: Unicode file routines proposal

Luca Olivetti-2
In reply to this post by John Coppens
John Coppens wrote:

> This may have been discussed before - but should the encoding not be
> dependent on the locale? What would happen if I write a FPC program,
> if the internal routines are, eg., UTF-16, and my locale is set to
> en_US.UTF8?
>
> Anyway, I have the impression that most of Linux is utf-8 oriented by now.

Well, yes, but that's the external representation.
I'd say to take a look at how python managed to integrate unicode support:

http://www.google.com/search?domains=www.python.org&sitesearch=www.python.org&sourceid=google-search&q=unicode&submit=search

Bye
--
Luca


Re: Unicode file routines proposal

Martin Schreiber
On Monday 30 June 2008 22.19:49 Luca Olivetti wrote:

> John Coppens wrote:
> > This may have been discussed before - but should the encoding not be
> > dependent on the locale? What would happen if I write a FPC program,
> > if the internal routines are, eg., UTF-16, and my locale is set to
> > en_US.UTF8?
> >
> > Anyway, I have the impression that most of Linux is utf-8 oriented by
> > now.
>
> Well, yes, but that's the external representation.
> I'd say to take a look at how python managed to integrate unicode support:
>
> http://www.google.com/search?domains=www.python.org&sitesearch=www.python.org&sourceid=google-search&q=unicode&submit=search
>
They have a UTF-16/UCS-2 internal representation, same as MSEgui which works
very well and is fast and handy BTW.
What is missing is a reference counted widestring type on Windows. ;-)

Martin

Re: Unicode file routines proposal

Luca Olivetti-2
Martin Schreiber wrote:

>> I'd say to take a look at how python managed to integrate unicode support:
>>
>> http://www.google.com/search?domains=www.python.org&sitesearch=www.python.org&sourceid=google-search&q=unicode&submit=search
>>
> They have a UTF-16/UCS-2 internal representation, same as MSEgui which works
> very well and is fast and handy BTW.

And len, slicing, etc. work as expected.
Note that if you need characters beyond $ffff you have to compile it
with wide unicode support, and in that case every character will use 4
bytes.

http://www.python.org/dev/peps/pep-0261/

I think the default is still to compile without wide unicode support (at
least Python under Mandriva, which has no special configure option, is a
narrow Python build, while Debian etch has a wide one).

Bye
--
Luca



Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Mon, Jun 30, 2008 at 11:35 AM, Marco van de Voort <[hidden email]> wrote:
> > borders?
>
> Gtk can load XML files, somewhat equivalent to our LFMs. They use
> UTF-8 everywhere.

GTK is Unix-centric, even on other systems. They don't have a firm footing in
both the Unix and the Windows world as we do. I can't judge the wxWidgets
situation, since I know nobody who uses it.
 
> Java is cross-platform and uses UTF-16 everywhere.

Java has to emulate everything (read: put up a barrier) from the outside
anyway, and not doing that is one of our fortes.

> multiple encodings:
>
> * More complex
> * Innovative solution, no known example of an implementation of this
> system exists = uncertainty if it works at all, or if it is convenient
> for developers
> * Depends on a not yet implemented string type

Needs to be done anyway, since widestring on Windows is COM, and that must
also be retained. So it is about adding 1 vs 2, and the work will be huge
with UTF-16 too, so to make it worthwhile the best, not the quickest,
solution should be sought.

> * Potentially will have a higher performance than a single encoding
> system, but only if you use this new special string type

Certainly. Can you imagine loading a non trivial file in a tstringlist and
saving it again and the heaps of conversions?

Moreover, there is an important reason missing:

* Being able to declare the outside world in the right encoding, without
  manually inserting conversions in each header.

* Does not make one of the two core platforms (Unix/windows) effectively
  second rate.

* Can be done phased, IOW in the beginning lots of conversion, but later
  have more and more routines in the right encoding ready.

> Single encoding:
>
> * Simple, proved solution

Simple solution, complex implementation (needs conversions everywhere).

> * Does not need any new string type, can start being implemented immediately

It does. And you can start making UTF-16 routines anyway.

> * Potentially has a lower performance due to string conversions.



Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Luca Olivetti-2
> > They have a UTF-16/UCS-2 internal representation, same as MSEgui which works
> > very well and is fast and handy BTW.
>
> And len, slicing, etc. work as expected.
> Note that if you need characters beyond $ffff you have to compile it
> with wide unicode support, and in that case every character will use 4
> bytes.
>
That's IMHO a faulty system. It requires you to choose between an incomplete
solution or making strings a horrible memory hog. But maybe that doesn't
matter for mere scripting languages (though I wonder then why they didn't
choose UTF-32 directly).

Surrogates are not nice, but they were invented for a reason.


Re: Unicode file routines proposal

Luca Olivetti-2
Marco van de Voort wrote:
>>> They have a UTF-16/UCS-2 internal representation, same as MSEgui which works
>>> very well and is fast and handy BTW.
>> And len, slicing, etc. work as expected.
>> Note that if you need characters beyond $ffff you have to compile it
>> with wide unicode support, and in that case every character will use 4
>> bytes.
>>
> That's IMHO a faulty system. It requires you to choose between an incomplete
> solution or making strings a horrible memory hog.

OTOH using variable length characters will make string operations
expensive (since you can't just multiply the index by 2 or 4 but you
have to examine the string from the beginning, and the length in bytes
isn't the same as the length in characters).
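
A minimal sketch of why that is, assuming UTF-16 in a WideString.
CodePointCount is a hypothetical helper written for illustration, not
an RTL routine. It has to walk the whole string because a surrogate
pair occupies two code units but counts as one character.

```pascal
program SurrogateScan;
{$mode objfpc}{$H+}

// Counts code points by scanning; linear in the length of the string.
function CodePointCount(const S: WideString): Integer;
var
  I: Integer;
  U: Word;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    U := Ord(S[I]);
    // A high surrogate followed by a low surrogate is one code point.
    if (U >= $D800) and (U <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: WideString;
begin
  // 'a', U+1D11E (encoded as the pair D834 DD1E), 'b'
  SetLength(S, 4);
  S[1] := WideChar($0061);
  S[2] := WideChar($D834);
  S[3] := WideChar($DD1E);
  S[4] := WideChar($0062);
  WriteLn('Code units:  ', Length(S));
  WriteLn('Code points: ', CodePointCount(S));
end.
```

Length reports 4 code units while the scan finds 3 characters, so
S[3] is the second half of a pair rather than the third character.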

> But maybe that doesn't
> matter for mere scripting languages (though I wonder then why they didn't
> choose UTF-32 directly).
>
> Surrogates are not nice, but they were invented for a reason.

Well, yes, they're a trade-off between performance and memory
consumption, but I fear we're losing one of the advantages that pascal
has over C: fast and simple string handling.

Bye
--
Luca