Unicode file routines proposal

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort

> On Tue, Jul 1, 2008 at 9:30 AM, Marco van de Voort <[hidden email]> wrote:
> > My proposal is to have both a UTF8string and a UTF16string on all platforms that support
> > unicode. So I don't get this remark.
>
> Unless I understood your proposal wrong it involves a TMarcoString
> which will be declared like this:
>
> {$ifdef Linux}
>   If SystemEncoding = utf-8 then TMarcoString = Utf8String
>   else TMarcoString = ansistring;
> {$endif}
> {$ifdef Windows}
>   If WindowsNT then TMarcoString = utf16string
>   else TMarcoString = ansistring;
> {$endif}
> {$ifdef Darwin}
>   TMarcoString = utf8string;
> {$endif}
>
> Just how do you implement a string routine with TMarcoString? Fill it
> with ifdefs?

No. Just utf8string and utf16string, with tutf16string aliased to whatever
Tiburon names it.

But on Linux the RTL is mostly utf-8, and on Delphi the RTL is mostly
utf-16. And if you pass the utf-16 filename that you got from the Lazarus
.dfm (which is apparently already set to remain utf-16) to a filename routine,
a conversion will automatically happen.

People that make an FPC distro can decide if their Linux version contains a
full complement of overloaded utf16 routines or not. If not, more conversions
will happen if you have pure utf-16 code, but that can be worthwhile for
embedded/minimalist distributions if those ever emerge.

And only the few special cases with var are a problem, as you correctly
pointed out, and these few cases can be fixed for the RTL by overloading
them also for utf16.

> > It is just that on unix, the fileroutines will be defined as utf8string
> So you are going to convert in non utf8 unix?

Maybe I should have said "in the native encoding" then. So if it's a
utf-16 unix it will be utf-16.  In principle at least. We will have to see
how this fares with the shared character of the unix rtl.

Note that both are also possible: the most-used string routines (like
extractfilename etc.) can simply be overloaded to support both.
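
A minimal sketch of the overloading idea, not code from the FPC RTL. The routine name BaseName is hypothetical, and the sketch assumes a Free Pascal version that offers UTF8String and UnicodeString as distinct types. The byte-wise scan in the UTF-8 variant relies on the fact that the '/' byte never occurs inside a UTF-8 multi-byte sequence.

```pascal
program OverloadSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, not an RTL routine: returns the part after the last '/'.
function BaseName(const S: UTF8String): UTF8String; overload;
var
  I: Integer;
begin
  I := Length(S);
  // Byte-wise scan: '/' never appears inside a UTF-8 multi-byte sequence.
  while (I > 0) and (S[I] <> '/') do
    Dec(I);
  Result := Copy(S, I + 1, Length(S) - I);
end;

// Same routine for UTF-16 data; the overload is chosen from the argument type.
function BaseName(const S: UnicodeString): UnicodeString; overload;
var
  I: Integer;
begin
  I := Length(S);
  while (I > 0) and (S[I] <> '/') do
    Dec(I);
  Result := Copy(S, I + 1, Length(S) - I);
end;

var
  Narrow: UTF8String;
  Wide: UnicodeString;
begin
  Narrow := '/tmp/report.txt';
  Wide := '/tmp/report.txt';
  // Each call matches one overload exactly, so no implicit conversion is needed.
  WriteLn(Length(BaseName(Narrow)));  // 10
  WriteLn(Length(BaseName(Wide)));    // 10
end.
```

With only one of the two overloads present, the other call would still compile but would go through an implicit conversion, which is the trade-off described above for minimalist distributions.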

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Florian Klaempfl
On Tue, Jul 1, 2008 at 9:56 AM, Florian Klaempfl <[hidden email]> wrote:
> Because using utf-16 on linux is very unnatural, same for utf-8 on
> windows. Platforms like go32 don't even have any unicode. Coding
> platform independent but fast applications is really ugly having fixed
> types.

Well, then you mean that it requires conversion on some platforms,
rather than that it is not cross-platform.

What I am trying to say is that the newly proposed system will be
harder to use, trying to please everyone everywhere with a perceived
performance gain without any indication that this gain will actually
be significant in real world applications. It uses an exotic solution,
never tested before.

The speed difference between LCL-Qt apps and LCL-Gtk apps is negligible,
although we do string conversions when using Qt, because the
manipulation of strings is usually not a bottleneck.

And the ansi routines will not be removed, they will be kept for those
really interested in speed.

I for one prefer simplicity and ease of use to speed.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
>> > It is just that on unix, the fileroutines will be defined as utf8string
>> So you are going to convert in non utf8 unix?
>
> Maybe I should have said "in the native encoding" then. So if it's a
> utf-16 unix it will be utf-16.  In principle at least. We will have to see
> how this fares with the shared character of the unix rtl.

I mean Unixes with an ISO encoding, not utf-16.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Tue, Jul 1, 2008 at 9:56 AM, Florian Klaempfl <[hidden email]> wrote:
> > platform independent but fast applications is really ugly having fixed
> > types.
>
> Well, then you mean that it requires conversion on some platforms,
> rather than that it is not cross-platform.
>
> What I am trying to say is that the newly proposed system will be
> harder to use, trying to please everyone everywhere with a perceived
> performance gain without any indication that this gain will actually
> be significant in real world applications. It uses an exotic solution,
> never tested before.

C/C++ support the native encoding on all platforms.
 
> The speed difference between LCL-Qt apps and LCL-Gtk apps is negligible,
> although we do string conversions when using Qt, because the
> manipulation of strings is usually not a bottleneck.

That's because it doesn't do that much string processing, compared to
e.g. iterating through a db-export and transforming it. That should be the
norm for a native unicode type, not a UI.
 

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt
In reply to this post by Felipe Monteiro de Carvalho


On Tue, 1 Jul 2008, Felipe Monteiro de Carvalho wrote:

> On Tue, Jul 1, 2008 at 9:56 AM, Florian Klaempfl <[hidden email]> wrote:
> > Because using utf-16 on linux is very unnatural, same for utf-8 on
> > windows. Platforms like go32 don't even have any unicode. Coding
> > platform independent but fast applications is really ugly having fixed
> > types.
>
> Well, then you mean that it requires conversion on some platforms,
> rather than that it is not cross-platform.
>
> What I am trying to say is that the newly proposed system will be
> harder to use, trying to please everyone everywhere with a perceived
> performance gain without any indication that this gain will actually
> be significant in real world applications. It uses an exotic solution,
> never tested before.
>
> The speed difference between LCL-Qt apps and LCL-Gtk apps is negligible,
> although we do string conversions when using Qt, because the
> manipulation of strings is usually not a bottleneck.
>
> And the ansi routines will not be removed, they will be kept for those
> really interested in speed.
>
> I for one prefer simplicity and ease of use to speed.

I don't see what is difficult about Florian's proposal.
On the contrary, it is the simplest possible solution,
and quite elegant in my eyes.

For the LCL/fpGUI/MSEGui programmers, nothing changes,
you can even throw away your own conversion routines.
You need only a single call just prior to passing a string
to the OS/GUI system: ForceEncoding(). No ifdefs needed,
all is transparent.

Michael.

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> >> > It is just that on unix, the fileroutines will be defined as utf8string
> >> So you are going to convert in non utf8 unix?
> >
> > Maybe I should have said "in the native encoding" then. So if it's a
> > utf-16 unix it will be utf-16.  In principle at least. We will have to see
> > how this fares with the shared character of the unix rtl.
>
> I mean Unixes with an ISO encoding, not utf-16.

In my opinion: no automated support, simply because the target "encoding"
can't represent all characters. In theory one could throw an exception, but
that would only require guarding all string routines with even more exception
handling; moreover, there is not really an alternate path to take then.

So all ansistring to unicode handling must be done by proper conversion
procedures, manually.

And I assume that in the long term these will die out anyway.
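
To make the "can't represent all characters" point concrete, here is a small sketch under stated assumptions: TryToLatin1 is a hypothetical helper, not an RTL routine, and the input is given directly as Unicode code points to avoid any dependency on a platform conversion library. ISO-8859-1 maps exactly the code points U+0000..U+00FF, so anything above that has nowhere to go.

```pascal
program LossyConversionSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: converts Unicode code points to ISO-8859-1 bytes.
// Returns False as soon as a code point has no ISO-8859-1 equivalent.
function TryToLatin1(const CodePoints: array of LongWord;
  out Dest: AnsiString): Boolean;
var
  I: Integer;
begin
  Result := False;
  SetLength(Dest, Length(CodePoints));
  for I := 0 to High(CodePoints) do
  begin
    // ISO-8859-1 covers exactly the code points U+0000..U+00FF.
    if CodePoints[I] > $FF then
    begin
      Dest := '';
      Exit;
    end;
    Dest[I + 1] := Chr(Byte(CodePoints[I]));
  end;
  Result := True;
end;

var
  Converted: AnsiString;
begin
  // c, a, f, e-acute (U+00E9): every code point fits.
  WriteLn(TryToLatin1([$63, $61, $66, $E9], Converted));  // TRUE
  // 5, space, euro sign (U+20AC): the euro sign is not representable.
  WriteLn(TryToLatin1([$35, $20, $20AC], Converted));     // FALSE
end.
```

The caller has to decide what to do on False, which is the "alternate path" problem: an automatic conversion inside the RTL has no sensible fallback, whereas an explicit conversion call puts that decision in the hands of the programmer.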

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Michael Van Canneyt
> On Tue, 1 Jul 2008, Felipe Monteiro de Carvalho wrote:
> I don't see what is difficult about Florian's proposal.
> On the contrary, it is the simplest possible solution,
> and quite elegant in my eyes.

To be honest, I am flabbergasted that the two of you agreed on such a runtime
construct. It goes IMHO against Pascal principles.

> For the LCL/fpGUI/MSEGui programmers, nothing changes,
> you can even throw away your own conversion routines.
> You need only a single call just prior to passing a string
> to the OS/GUI system: ForceEncoding(). No ifdefs needed,
> all is transparent.

That's one of the problems: having to check and insert code for a Tiburon
solution (which will simply expect UTF-16). The least it should do is have a
way to flag a routine (e.g. by directive/Tiburon mode) to only accept UTF16
and insert that call itself.

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Tue, Jul 1, 2008 at 10:28 AM, Marco van de Voort <[hidden email]> wrote:
> C/C++ support the native encoding on all platforms.

I did some googling and they don't support unicode filenames. So we
are back to zero systems using this method again =)

http://www.google.com/search?q=C%2B%2B+unicode+filename&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:pt-PT:official&client=firefox-a

I think that we should keep in mind that we already have a fast ansi
set of routines which fits the operating system encoding (but is never
utf-16).

What we wish to have is a unicode set of routines; the ansi routines
will keep existing if you really need speed.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
On Tue, Jul 1, 2008 at 10:50 AM, Felipe Monteiro de Carvalho
> I did some googling and they don't support unicode filenames. So we
> are back to zero systems using this method again =)

Actually I think that Carbon uses a system very similar to the one
proposed by Florian. The string is an opaque type, which can be in any
encoding. All string routines use this opaque type and if you wish to
get the string contents in a certain encoding you use a routine for
that.

But Carbon is not cross-platform.

A very similar system is used in Cocoa, but the string is a class there.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt
In reply to this post by Marco van de Voort


On Tue, 1 Jul 2008, Marco van de Voort wrote:

> > On Tue, 1 Jul 2008, Felipe Monteiro de Carvalho wrote:
> > I don't see what is difficult about Florian's proposal.
> > On the contrary, it is the simplest possible solution,
> > and quite elegant in my eyes.
>
> To be honest, I am flabbergasted that the two of you agreed on such a runtime
> construct. It goes IMHO against Pascal principles.

Why?
In your opinion, must we get rid of Array of Const and Variants as well,
and of RTTI too? They all serve the same purpose.

No-one will be forced to use the new type, so...
 

> > For the LCL/fpGUI/MSEGui programmers, nothing changes,
> > you can even throw away your own conversion routines.
> > You need only a single call just prior to passing a string
> > to the OS/GUI system: ForceEncoding(). No ifdefs needed,
> > all is transparent.
>
> That's one of the problems: having to check and insert code for a Tiburon
> solution (which will simply expect UTF-16). The least it should do is have a
> way to flag a routine (e.g. by directive/Tiburon mode) to only accept UTF16
> and insert that call itself.

Let's first wait to see what Codegear comes up with, and then worry about
compatibility. In my opinion we'll have to write exactly 0 lines of code
for this compatibility, with the solution of Florian.

Michael.

Re: Summary on Re: Unicode file routines proposal

Jeff Wormsley
In reply to this post by Marco van de Voort

Marco van de Voort wrote:

> I don't understand how this can work, how can I have a compiletime solution
> for a runtime problem?
>
> procedure mystringproc (s:FlorianUnicodeString);
>
> begin
>   if encodingof(s)=utf-16 then
>     begin
>       // utf-16 code here with shiftsize 2 [] needed
>     end
>   else
>     begin
>       // utf-8 code here with shiftsize 1 [] needed
>     end;
> end;
>  
If compiler magic is at work, wouldn't all this reduce to s[1] giving
the first char no matter the char size?  If you do something like c :=
s[1] and c is defined as char, it gets converted to a standard 0-255
value, but c could be defined as FlorianChar and be the native char
size.  Or am I smoking crack?

Jeff.

--
I haven't smoked for 1 year, 10 months and 2 weeks, saving $3,080.02 and
not smoking 20,533.47 cigarettes.

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt


On Tue, 1 Jul 2008, Jeff Wormsley wrote:

>
> Marco van de Voort wrote:
> > I don't understand how this can work, how can I have a compiletime solution
> > for a runtime problem?
> >
> > procedure mystringproc (s:FlorianUnicodeString);
> >
> > begin
> >   if encodingof(s)=utf-16 then
> >     begin
> >       // utf-16 code here with shiftsize 2 [] needed
> >     end
> >   else
> >     begin
> >       // utf-8 code here with shiftsize 1 [] needed
> >     end;
> > end;
> >  
> If compiler magic is at work, wouldn't all this reduce to s[1] giving the
> first char no matter the char size?  If you do something like c := s[1] and c
> is defined as char, it gets converted to a standard 0-255 value, but c could
> be defined as FlorianChar and be the native char size.  Or am I smoking crack?

No, you understand it correctly.

Obviously, with Florian's type, simple low-level access is out of the question.
That's the price you pay.

Michael.

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Michael Van Canneyt
> On Tue, 1 Jul 2008, Marco van de Voort wrote:
>
> > > On Tue, 1 Jul 2008, Felipe Monteiro de Carvalho wrote:
> > > I don't see what is difficult about Florian's proposal.
> > > On the contrary, it is the simplest possible solution,
> > > and quite elegant in my eyes.
> >
> > To be honest, I am flabbergasted that the two of you agreed on such a runtime
> > construct. It goes IMHO against Pascal principles.
>
> Why?
> In your opinion, must we get rid of Array of Const and Variants as well,
> and of RTTI too? They all serve the same purpose.

Those are explicitly meant as a layer over existing compile-time systems, to
handle the exceptional cases where those are not sufficient (respectively: normal
parameter arrays, normal typed vars and pointer-based method execution).

Here we are talking about adding a runtime construct without a typed
alternative.

> No-one will be forced to use the new type, so...

No one is forced to use FPC either, but that is stating the obvious. We both know
that this is a pretty crucial decision, since the only alternative is
handcoding using pointers.

> > That's one of the problems: having to check and insert code for a Tiburon
> > solution (which will simply expect UTF-16). The least it should do is have a
> > way to flag a routine (e.g. by directive/Tiburon mode) to only accept UTF16
> > and insert that call itself.
>
> Let's first wait to see what Codegear comes up with, and then worry about
> compatibility.

Codegear has a UTF16 type, for .NET compatibility. See also
http://blogs.codegear.com/abauer/2008/01/28/38853

> In my opinion we'll have to write exactly 0 lines of code
> for this compatibility, with the solution of Florian.

Why don't you point out what I misunderstood above? If I only have a UTF16
routine with a parameter of type "unicodestring" (like Tiburon has), how do I
make sure that it only gets passed UTF-16 strings without code changes?

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Jeff Wormsley
> Marco van de Voort wrote:
> >  
> If compiler magic is at work, wouldn't all this reduce to s[1] giving
> the first char no matter the char size?

Where the "magic" gets its information from is my point.


Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Michael Van Canneyt
> On Tue, 1 Jul 2008, Jeff Wormsley wrote:
> > is defined as char, it gets converted to a standard 0-255 value, but c could
> > be defined as FlorianChar and be the native char size.  Or am I smoking crack?
>
> No, you understand it correctly.
>
> Obviously, with Florian's type, simple low-level access is out of the question.
> That's the price you pay.

Huh, since when is [] low-level access? It is a normal string operation. And
Tiburon will keep supporting it.

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt


On Tue, 1 Jul 2008, Marco van de Voort wrote:

> > On Tue, 1 Jul 2008, Jeff Wormsley wrote:
> > > is defined as char, it gets converted to a standard 0-255 value, but c could
> > > be defined as FlorianChar and be the native char size.  Or am I smoking crack?
> >
> > No, you understand it correctly.
> >
> > Obviously, with Florian's type, simple low-level access is out of the question.
> > That's the price you pay.
>
> Huh, since when is [] low-level access? It is a normal string operation. And
> Tiburon will keep supporting it.

You can still do C:=S[i]. What you cannot do is

  P:=PChar(S);
  While (P^<>#0) do
   SomeByteSizedOperation;

If the string type has some multibyte values, obviously the code for
  C:=S[i];
will be rather cumbersome. But it will guarantee you the I-th character
from the string. Since C will be FlorianChar, it'll be at least an integer
(2 bytes encoding info, 2 bytes actual value)

Michael.
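
A rough sketch of why C:=S[i] becomes cumbersome once the payload is multibyte, shown for the UTF-8 case only. Utf8CodePointAt is a hypothetical name, the input is assumed to be well-formed UTF-8 (no validation is done), and the function returns a plain code point rather than the FlorianChar layout mentioned above. The point is that the I-th character cannot be found by address arithmetic; the string has to be walked from the start.

```pascal
program Utf8IndexSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: code point of the Index-th character (1-based) of a
// UTF-8 byte string, or -1 if Index is out of range. Assumes valid UTF-8.
function Utf8CodePointAt(const S: AnsiString; Index: Integer): LongInt;
var
  P, SeqLen, K: Integer;
  B: Byte;
begin
  Result := -1;
  P := 1;
  while (Index >= 1) and (P <= Length(S)) do
  begin
    B := Ord(S[P]);
    // The lead byte tells how many bytes this character occupies.
    if B < $80 then SeqLen := 1
    else if B < $E0 then SeqLen := 2
    else if B < $F0 then SeqLen := 3
    else SeqLen := 4;
    if Index = 1 then
    begin
      if P + SeqLen - 1 > Length(S) then
        Exit;  // truncated sequence at the end of the string
      // Keep only the payload bits of the lead byte.
      case SeqLen of
        1: Result := B;
        2: Result := B and $1F;
        3: Result := B and $0F;
      else
        Result := B and $07;
      end;
      // Each continuation byte contributes six more bits.
      for K := 1 to SeqLen - 1 do
        Result := (Result shl 6) or (Ord(S[P + K]) and $3F);
      Exit;
    end;
    Dec(Index);
    Inc(P, SeqLen);
  end;
end;

const
  // "h", e-acute (C3 A9), euro sign (E2 82 AC), "o": 4 characters in 7 bytes.
  Raw: array[1..7] of Byte = ($68, $C3, $A9, $E2, $82, $AC, $6F);
var
  S: AnsiString;
begin
  SetLength(S, SizeOf(Raw));
  Move(Raw, S[1], SizeOf(Raw));
  WriteLn(Length(S));              // 7 (bytes, not characters)
  WriteLn(Utf8CodePointAt(S, 2));  // 233  = U+00E9
  WriteLn(Utf8CodePointAt(S, 3));  // 8364 = U+20AC
  WriteLn(Utf8CodePointAt(S, 5));  // -1, there are only 4 characters
end.
```

Each lookup is linear in the position, so a loop that indexes every character this way is quadratic. A byte-sized PChar loop over the same string would run 7 times, not 4, which is the mismatch the PChar example above warns about.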

Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Marco van de Voort
Marco van de Voort wrote:
>> Marco van de Voort wrote:
>>>  
>> If compiler magic is at work, wouldn't all this reduce to s[1] giving
>> the first char no matter the char size?
>
> Where the "magic" gets its information from is my point.

I described this already in detail in my first mail: just in one of the
four bytes available for storing the encoding.
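
As a loose illustration of a string that carries its encoding with it, here is a sketch with made-up names. TEncodingTag and TTaggedString are hypothetical, and the record layout is an assumption for the example only; the actual header layout of the proposal is described in the earlier mail, not here. The sketch shows the run-time dispatch that Marco's mystringproc pseudocode spells out by hand.

```pascal
program TaggedStringSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical encoding tag; stands in for the stored encoding field.
  TEncodingTag = (etUtf8, etUtf16);

  // Hypothetical stand-in for the proposed string: raw bytes plus a tag.
  TTaggedString = record
    Encoding: TEncodingTag;
    Bytes: array of Byte;
  end;

// Size in bytes of one code unit, decided at run time from the tag.
function CodeUnitSize(const S: TTaggedString): Integer;
begin
  if S.Encoding = etUtf16 then
    Result := 2
  else
    Result := 1;
end;

// Number of code units (not characters) held by the string.
function CodeUnitCount(const S: TTaggedString): Integer;
begin
  Result := Length(S.Bytes) div CodeUnitSize(S);
end;

var
  A, B: TTaggedString;
begin
  A.Encoding := etUtf8;
  SetLength(A.Bytes, 6);
  B.Encoding := etUtf16;
  SetLength(B.Bytes, 6);
  // Same payload size, different interpretation depending on the tag.
  WriteLn(CodeUnitCount(A));  // 6
  WriteLn(CodeUnitCount(B));  // 3
end.
```

The tag check happens on every call, which is the cost Marco objects to; in exchange, one routine signature serves both encodings.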


Re: Summary on Re: Unicode file routines proposal

Paul Ishenin-2
In reply to this post by Michael Van Canneyt
Michael Van Canneyt wrote:
> You can still do C:=S[i]. What you cannot do is
>
>   P:=PChar(S);
>   While (P^<>#0) do
>    SomeByteSizedOperation;
>  
Why can't you? PChar(S) should represent S as raw bytes. If you know
what you are doing, it will do no harm. Otherwise, if you corrupt the
string, you are responsible for all the problems you get.

Best regards,
Paul Ishenin.

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt


On Tue, 1 Jul 2008, Paul Ishenin wrote:

> Michael Van Canneyt wrote:
> > You can still do C:=S[i]. What you cannot do is
> >
> >   P:=PChar(S);
> >   While (P^<>#0) do
> >    SomeByteSizedOperation;
> >  
> Why can't you? PChar(S) should represent S as raw bytes. If you know what you
> are doing, it will do no harm. Otherwise, if you corrupt the string, then
> you are responsible for all the problems you get.

Obviously you can :-)
But what I meant was that you shouldn't expect old code
that relied on 1-byte characters to work.

Michael.

Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
Michael Van Canneyt wrote:

>
> On Tue, 1 Jul 2008, Paul Ishenin wrote:
>
>> Michael Van Canneyt wrote:
>>> You can still do C:=S[i]. What you cannot do is
>>>
>>>   P:=PChar(S);
>>>   While (P^<>#0) do
>>>    SomeByteSizedOperation;
>>>  
>> Why can't you? PChar(S) should represent S as raw bytes. If you know what you
>> are doing, it will do no harm. Otherwise, if you corrupt the string, then
>> you are responsible for all the problems you get.
>
> Obviously you can :-)
> But what I meant was that you shouldn't expect old code
> that relied on 1-byte characters to work.

It is supposed to break on utf-xx or whatever anyways.