Unicode file routines proposal

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
On Tue, Jul 1, 2008 at 9:02 AM, Marco van de Voort <[hidden email]> wrote:
> A solution for unicode should be for everything, not just for UIs and
> filenames. I should be able to carry data within it also, because otherwise
> we are having this discussion again next week if Joost needs unicode for DB
> related issues etc.

Ok, but how do you know that everyone wants to store data in the
"system" encoding?

What if I want to store data using ansistring in Windows because my
file is UTF-8?

In my system I propose that simply a TWideStringList be implemented,
so both ways of storing data are available everywhere.

> How? I can't express the foreign encoding because I have no type for it. I
> only have ansistring that can mean pretty much everything, and that
> constitutes no compiletime safety.

ansistrings don't mean everything. They mean either ISO or utf-8. They
can never hold a utf-16 string (or at least there are no routines to
cover this case).

>> I bet you would convert automatically from whatever to ansi when going
>> to an ansistring, but Lazarus uses utf-8 in ansistrings.
>
> But that is lazarus specific.

Lazarus is by far the largest project using Free Pascal?

> Because the decision to put utf-8 in ansistrings is too fundamentally flawed
> to implement such a thing, since it is perfectly legal if an ansistring does
> not contain utf8

We concluded that utf-8 in ansistrings is a very convenient solution
for us which works very well today. It provided a smooth migration
path and keeps the vast majority of code working.

We may some day migrate to a possible utf8string type when it gets implemented.

--
Felipe Monteiro de Carvalho
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt
In reply to this post by Florian Klaempfl


On Tue, 1 Jul 2008, Florian Klaempfl wrote:

> Marco van de Voort wrote:
> >> On Tue, 1 Jul 2008, Florian Klaempfl wrote:
> >>
> >>> I read most of the discussion and I think there is no way around a
> >>> string type containing an encoding field.
> >> [cut]
> >>
> >>> I know this approach contains some hacks and requires some work but I
> >>> think this is the only way to solve things for once and ever.
> >> I think it is the most promising and extensible proposal,
> >> so I'm all for it.
> >
> > I read it briefly, and I still don't like it. I need more time to prepare a
> > response though.
>
> Keep in mind in your response that we also want to handle other formats
> than utf-8 or utf-16 if needed :)

I think that if you put the encoding field at a negative offset, like the
length field of ansistrings, this should be relatively compatible with current
code, provided you assume that encoding=0 (or whatever tag value) means ansistring.
You just have an extra field. You could even make that 2 fields: in addition
to the byte length, add the character length. It should keep operations fast, as most
conversions and other operations will end up computing a character length of some
kind anyway.
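
To make the layout concrete, here is a minimal sketch of such a header, assuming a manually allocated block rather than the real RTL string layout. The names TEncHeader, HeaderOf, NewEncStr, FreeEncStr and the ENC_* tag values are hypothetical and only serve the illustration; the caller supplies the character length here instead of the runtime computing it.

```pascal
program EncodedHeaderSketch;
{$mode objfpc}{$H+}

const
  ENC_ANSI = 0;  // hypothetical tag: treat as a plain ansistring
  ENC_UTF8 = 1;  // hypothetical tag

type
  // Hypothetical header stored directly in front of the character data.
  TEncHeader = record
    Encoding: LongInt;
    CharLen: SizeInt;
    ByteLen: SizeInt;
  end;
  PEncHeader = ^TEncHeader;

// Step back from the data pointer to the header at the negative offset.
function HeaderOf(Data: PChar): PEncHeader;
begin
  Result := PEncHeader(Data);
  Dec(Result);
end;

// Allocate header and payload in one block; return a pointer to the payload.
function NewEncStr(const Bytes: AnsiString; AEncoding: LongInt;
  ACharLen: SizeInt): PChar;
var
  Block: PByte;
  H: PEncHeader;
begin
  Block := GetMem(SizeOf(TEncHeader) + Length(Bytes) + 1);
  Result := PChar(Block + SizeOf(TEncHeader));
  if Length(Bytes) > 0 then
    Move(Bytes[1], Result^, Length(Bytes));
  Result[Length(Bytes)] := #0;
  H := HeaderOf(Result);
  H^.Encoding := AEncoding;
  H^.CharLen := ACharLen;
  H^.ByteLen := Length(Bytes);
end;

// Release the whole block, starting at the header.
procedure FreeEncStr(Data: PChar);
begin
  FreeMem(HeaderOf(Data));
end;

var
  S: PChar;
begin
  // "cafe" with an accented e, written as UTF-8 bytes: 5 bytes, 4 characters.
  S := NewEncStr('caf'#$C3#$A9, ENC_UTF8, 4);
  WriteLn('encoding=', HeaderOf(S)^.Encoding,
          ' bytes=', HeaderOf(S)^.ByteLen,
          ' chars=', HeaderOf(S)^.CharLen);
  FreeEncStr(S);
end.
```

Code that only ever looks at the data pointer and the byte length keeps working unchanged, which is the compatibility argument above.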

Michael.

Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Felipe Monteiro de Carvalho
Felipe Monteiro de Carvalho wrote:
>
> ansistrings don't mean everything. They mean either ISO or utf-8.

This assumption is wrong. ansistring means the system encoding which
uses 8 bit chars.

Re: Unicode file routines proposal

Vincent Snijders
Florian Klaempfl schreef:
> Felipe Monteiro de Carvalho wrote:
>> ansistrings don't mean everything. They mean either ISO or utf-8.
>
> This assumption is wrong. ansistring means the system encoding which
> uses 8 bit chars.

Even if the system encoding is UTF8?

Vincent

Re: Unicode file routines proposal

Graeme Geldenhuys-2
In reply to this post by Felipe Monteiro de Carvalho
2008/7/1 Felipe Monteiro de Carvalho <[hidden email]>:
>
> In my system I propose that simply a TWideStringList be implemented,
> so both ways of storing data are available everywhere.

I have a TWideStringList implementation if you are interested. I got
the code somewhere and kept it for a rainy day.

>> But that is lazarus specific.
>
> Lazarus is by far the largest project using Free Pascal?

If I dare comment - that's a bold statement to make. ;-) I know a few
more "large" projects using Free Pascal.  Off the top of my head
Pixel32 being one.


> We may some day migrate to a possible utf8string type when it gets implemented.

In which case I suggest Lazarus start using a custom/alias string type
to ease migration. Something like what was done with TTranslateString.
Maybe TLCLString = String


Regards,
 - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Tue, Jul 1, 2008 at 9:02 AM, Marco van de Voort <[hidden email]> wrote:
> > A solution for unicode should be for everything, not just for UIs and
> > filenames. I should be able to carry data within it also, because otherwise
> > we are having this discussion again next week if Joost needs unicode for DB
> > related issues etc.
>
> Ok, but how do you know that everyone wants to store data in the
> "system" encoding?

Well, euh, the main reason is that euh, most programs and data on the system use
the system encoding?

> What if I want to store data using ansistring in Windows because my
> file is UTF-8?

Then I'd say you convert. But that is the point. The need for conversion should be
the exception (different from the default system encoding), not the rule.
 
> In my system I propose that simply a TWideStringList be implemented,
> so both ways of storing data are available everywhere.

But I don't have a utf-8 type in your system to operate on.

> > How? I can't express the foreign encoding because I have no type for it. I
> > only have ansistring that can mean pretty much everything, and that
> > constitutes no compiletime safety.
>
> ansistrings don't mean everything. They mean either ISO or utf-8.

Yes. Which is why there is a need for a separate UTF-8 type as well as the
UTF-16 type. So that the compiler knows for sure something is UTF-8, and can
insert conversions. And can error/hint/warn you to insert manual conversions if you
assign a unicode type (either one) to an ansistring.

> >> I bet you would convert automatically from whatever to ansi when going
> >> to an ansistring, but Lazarus uses utf-8 in ansistrings.
> >
> > But that is lazarus specific.
>
> Lazarus is by far the largest project using Free Pascal?

FPC itself?

Anyway that doesn't matter. A solution for FPC must be carried broadly, not
just by lazarus.
 
> > Because the decision to put utf-8 in ansistrings is too fundamentally flawed
> > to implement such a thing, since it is perfectly legal if an ansistring does
> > not contain utf8
>
> We concluded that utf-8 in ansistrings is a very convenient solution
> for us which works very well today. It provided a smooth migration
> path and keeps the vast majority of code working.

Because you had no choice.
 

Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Vincent Snijders
Vincent Snijders wrote:
> Florian Klaempfl schreef:
>> Felipe Monteiro de Carvalho wrote:
>>> ansistrings don't mean everything. They mean either ISO or utf-8.
>>
>> This assumption is wrong. ansistring means the system encoding which
>> uses 8 bit chars.
>
> Even if the system encoding is UTF8?

Then it means utf-8 of course. What I wanted to say is that it _means_
something: 8 bit system encoding.

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Michael Van Canneyt
Why not just introduce a set of utf-16 routines with utf16string type
like the new Delphi?

This proposal is at least better than the one from Marco, as we can at
least get the encoding somehow, but it is still inconvenient for
cross-platform software.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
Felipe Monteiro de Carvalho wrote:
> Why not just introduce a set of utf-16 routines with utf16string type
> like the new Delphi?

Because it's not cross platform.

Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Felipe Monteiro de Carvalho
On Tuesday 01 July 2008 14.03:19 Felipe Monteiro de Carvalho wrote:

> About UCS-2 this is absurd. We certainly cannot have half the Chinese
> characters ignored in the Free Pascal RTL.

???
Where did you get the information that half of the Chinese characters won't
fit in the base plane? And utf-16 surrogate handling is much simpler than utf-8
variable character length handling.
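
For comparison, a minimal sketch of both operations: combining a UTF-16 surrogate pair into a code point, and reading the sequence length from a UTF-8 lead byte. The function names CombineSurrogates and Utf8SeqLen are hypothetical and not RTL routines; validation of the input is left out.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

// Combine a UTF-16 high/low surrogate pair into a code point.
// Assumes Hi is in $D800..$DBFF and Lo is in $DC00..$DFFF.
function CombineSurrogates(Hi, Lo: Word): LongWord;
begin
  Result := $10000 + ((LongWord(Hi) - $D800) shl 10) + (LongWord(Lo) - $DC00);
end;

// Number of bytes in a UTF-8 sequence, judged from its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then
    Result := 1
  else if (Lead and $E0) = $C0 then
    Result := 2
  else if (Lead and $F0) = $E0 then
    Result := 3
  else if (Lead and $F8) = $F0 then
    Result := 4
  else
    Result := 0;
end;

begin
  // U+20000 lies outside the base plane: D840 DC00 in UTF-16,
  // F0 A0 80 80 in UTF-8.
  WriteLn(HexStr(LongInt(CombineSurrogates($D840, $DC00)), 5));
  WriteLn(Utf8SeqLen($F0));
end.
```

UTF-16 has a single two-unit case to detect, while UTF-8 has sequences of one to four bytes.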

Martin

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
On Tue, Jul 1, 2008 at 9:28 AM, Martin Schreiber <[hidden email]> wrote:
> Where did you get the information that half of the Chinese characters won't
> fit in the base plane?

http://unicode.org/roadmaps/sip/index.html

CJK means Chinese Japanese Korean

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> Why not just introduce a set of utf-16 routines with utf16string type
> like the new Delphi?
>
> This proposal is at least better than the one from Marco

Mine has both a UTF8string and a UTF16string, on all platforms that support
unicode. So I don't get this remark.

It is just that on unix, the fileroutines will be defined as utf8string and
on windows with utf16 strings.

So if I pass a different type, the compiler will insert a conversion.

The only problem with this scheme is that it doesn't support utf-32, or it
needs yet another additional type. But at least it is a proper compiletime
option, and in a procedure I can see from the declaration what unicode type
I'm gonna get.

I had hoped that the implementation of a clean orthogonal set of unicode
types (2 or 3, 4 if you count the COMstring) might not be as bad as 3 new
string types, since we know they can always be converted to each other, so
maybe we could reuse and share implementation here and there.

But since Florian tries to cast it out of the compiler and to the runtime, I
suspect he thinks that is not possible.

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Tue, Jul 1, 2008 at 9:21 AM, Marco van de Voort <[hidden email]> wrote:
> Well, euh, the main reason is that euh, most programs and data on the system use
> the system encoding?

So you are saying that FPC should privilege platform-specific software
development over cross-platform software development? This goes in the
opposite direction of all other cross-platform development platforms in
existence.

If you are writing cross-platform software you will wish to avoid as
much as possible the system routines, and a known encoding is good.

Florian's proposal shines here. You get the string with no conversion
and a marker for the encoding, so you can convert it to whatever you
want easily.

But it doesn't solve the TStringList problem, because there you have
no parameters to know the encoding of the file being loaded.

> Then I'd say you convert. But that is the point. The need for conversion should be
> the exception (different from the default system encoding), not the rule.

I think there should be no conversion at all (unless explicitly asked)
in the contents of the stringlist.

>> In my system I propose that simply a TWideStringList be implemented,
>> so both ways of storing data are available everywhere.
>
> But I don't have a utf-8 type in your system to operate on.

How do you know what I want to do with the data? What if I just want
to use some string routines on it to extract data? Or save them back
to another file? (or any operations which don't involve system
routines which need a specific string encoding)

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Florian Klaempfl

Ok a quick pointwise comment then.

> I read most of the discussion and I think there is no way around a
> string type containing an encoding field. First, it allows also to
> support non utf encodings or utf-32 encoding. Having the encoding field
> does not mean that all targets support all encodings. In case an encoding
> is not supported, the target might either use some default operation as
> the current widestring manager does or it might spit out an exception.
> Having such a string type requires some manager which does not only
> store the procedures to handle this string type but which also contains
> some information which encoding to prefer or even use solely. Combining
> this with several ifdefs  and compiler switches makes this approach very
> flexible and fast and allows everybody (FPC people, Lazarus, MSE) to
> adapt things to their needs.

I don't like the runtime nature. At all. I want to be able to say "hey look,
I've a bunch of units here, and they only accept utf16, (e.g. because they were
ported Tiburon code). Convert if necessary"

So we need at least one directive in that case, one that says "all
unicodestrings under this directive are in encoding type <n>, convert if
necessary".

> Just an example: to overcome the indexing problem efficiently when using
> an encoding field (this is not about surrogates), we could do the
> following: introduce a compiler switch {$unicodestringindex
> default,byte,word,dword}. In default mode the compiler gets a shifting
> value from the encoding field (this is 4 bytes anyways and could be
> split into 1 byte shifting, 2 bytes encoding, 1 bytes reserved). In the
> other modes the compiler uses the given size when indexing. For example,
> a Tuberion (or how is it called?) switch could set this to word.

I don't understand how this can work, how can I have a compiletime solution
for a runtime problem?

procedure mystringproc(s: FlorianUnicodeString);
begin
  if EncodingOf(s) = encUTF16 then
    begin
      // utf-16 code here, with shiftsize 2 [] needed
    end
  else
    begin
      // utf-8 code here, with shiftsize 1 [] needed
    end;
end;
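
For reference, a minimal sketch of what indexing driven by a runtime shift value could look like. This is only one reading of the quoted idea, not the proposed implementation. TEncStr and CodeUnitAt are hypothetical names, indices are 0-based, and code units are stored in native byte order.

```pascal
program ShiftIndexSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical string descriptor. Shift is log2 of the code unit size:
  // 0 = byte units, 1 = word units, 2 = dword units.
  TEncStr = record
    Shift: Byte;
    Data: array of Byte;
  end;

// Read the code unit at a 0-based index. The byte offset is computed
// from the shift value stored in the string itself, so it is only
// known at run time.
function CodeUnitAt(const S: TEncStr; Index: SizeInt): LongWord;
var
  Offset: SizeInt;
begin
  Offset := Index shl S.Shift;
  case S.Shift of
    0: Result := S.Data[Offset];
    1: Result := PWord(@S.Data[Offset])^;
  else
    Result := PLongWord(@S.Data[Offset])^;
  end;
end;

var
  S: TEncStr;
begin
  // Two 16-bit code units: 'A' and the euro sign (U+20AC).
  S.Shift := 1;
  SetLength(S.Data, 4);
  PWord(@S.Data[0])^ := $0041;
  PWord(@S.Data[2])^ := $20AC;
  WriteLn(HexStr(LongInt(CodeUnitAt(S, 1)), 4));
end.
```

Every indexing operation pays for the shift and the dispatch at run time. A fixed {$unicodestringindex word} setting would let the compiler hard-code the unit size, but then the string passed in has to match it.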

> The approach has the big advantage that you really need all procedures
> only once if desired. For example, linux would get only utf-8
> routines by default; utf-16 is converted to utf-8 at the entry of the
> helper procedures if needed. Usually, no conversion would be necessary
> because you seldom see utf-16 in linux applications, so only the check
> that the input strings are really utf-8 is necessary. This is very cheap
> because the data is already in a cache line anyway.

> Even more, this variable encoding approach allows also people using
> languages where utf-8 is more memory expensive than utf-16 (this is in
> numbers the majority of mankind) to use utf-8/utf-16 as needed to save
> memory only with a few modifications.
>
> I know this approach contains some hacks and requires some work but I
> think this is the only way to solve things for once and ever.

I wonder if having 2 or 3 (utf-8, utf-16 and maybe utf-32) straight simple unicode types
isn't easier than this polymorphic beast.

At least then you have one procedure, one encoding, and since they are all
guaranteed to convert to each other (and to comstring too), the conversion code might
not be as painful as when ansistring and widestring were introduced. It could
be parameterisable in the compiler. With the added advantage of compiletime
decisions.

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Florian Klaempfl
On Tue, Jul 1, 2008 at 9:24 AM, Florian Klaempfl <[hidden email]> wrote:
>> Why not just introduce a set of utf-16 routines with utf16string type
>> like the new Delphi?
>
> Because it's not cross platform.

Why isn't it cross-platform?

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Tue, Jul 1, 2008 at 9:42 AM, Marco van de Voort <[hidden email]> wrote:
> I don't like the runtime nature. At all. I want to be able to say "hey look,
> I've a bunch of units here, and they only accept utf16, (e.g. because they were
> ported Tiburon code). Convert if necessary"

Tiburon code will never run in this case because in var parameters the
exact type must match. And it will not match, no matter how many
compile directives you use.

--
Felipe Monteiro de Carvalho

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Tue, Jul 1, 2008 at 9:21 AM, Marco van de Voort <[hidden email]> wrote:
> > Well, euh, the main reason is that euh, most programs and data on the system use
> > the system encoding?
>
> So you are saying that FPC should privilege platform-specific software
> development over cross-platform software development?

No, we should privilege cross-platform software development over a portable emulation
of a 3rd platform (the Java principle).

> This goes in the opposite direction of all other cross-platform development
> platforms in existence.

I'm only having FPC and Lazarus requirements on the table here. I don't care
about the others. They have other starting points (being very unix or
windows centric, or work with portable sandboxes)
 
> If you are writing cross-platform software you will wish to avoid as
> much as possible the system routines, and a known encoding is good.
>
> Florian's proposal shines here. You get the string with no conversion
> and a marker for the encoding, so you can convert it to whatever you
> want easily.

And in my case you specify you want UTF-16 by making the parameter
"utfstring16", and the compiler inserts a conversion for you if somebody calls it
with a utfstring8. No manual runtime check necessary.
 
> But it doesn't solve the TStringList problem, because there you have
> no parameters to know the encoding of the file being loaded.

No, there is no solution for that except making the string type really fat.
Which is not our way.
 
> > Then I'd say you convert. But that is the point. The need for conversion should be
> > the exception (different from the default system encoding), not the rule.
>
> I think there should be no conversion at all (unless explicitly asked)
> in the contents of the stringlist.

Well, that means the tstringlist is a blind store without any methods. It
isn't since any operation requires knowledge about the insides.

> >> In my system I propose that simply a TWideStringList be implemented,
> >> so both ways of storing data are available everywhere.
> >
> > But I don't have a utf-8 type in your system to operate on.
>
> How do you know what I want to do with the data?

Does it matter? I just want to be able to tailor to the most common
scenarios. See my other msg that restates the proposal in simpler terms.

> Or save them back to another file? (or any operations which don't involve
> system routines which need a specific string encoding)

You've really lost me now. I think you are still confusing general
unicodestrings with unicodifying a few filename-using routines.

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Tue, Jul 1, 2008 at 9:30 AM, Marco van de Voort <[hidden email]> wrote:
> Mine has both a UTF8string and a UTF16string, on all platforms that support
> unicode. So I don't get this remark.

Unless I understood your proposal wrong it involves a TMarcoString
which will be declared like this:

{$ifdef Linux}
  If SystemEncoding = utf-8 then TMarcoString = Utf8String
  else TMarcoString = ansistring;
{$endif}
{$ifdef Windows}
  If WindowsNT then TMarcoString = utf16string
  else TMarcoString = ansistring;
{$endif}
{$ifdef Darwin}
  TMarcoString = utf8string;
{$endif}

Just how do you implement a string routine with TMarcoString? Fill it
with ifdefs?

> It is just that on unix, the fileroutines will be defined as utf8string

So you are going to convert in non utf8 unix?

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Tue, Jul 1, 2008 at 9:42 AM, Marco van de Voort <[hidden email]> wrote:
> > I don't like the runtime nature. At all. I want to be able to say "hey look,
> > I've a bunch of units here, and they only accept utf16, (e.g. because they were
> > ported Tiburon code). Convert if necessary"
>
> Tiburon code will never run in this case because in var parameters the
> exact type must match. And it will not match, no matter how many
> compile directives you use.

So I must add some glue to the outside of the system for those few cases.
When you start tying code together with different encodings this is always
the case.

But at least you can do this, contrary to your proposal, where I must do all
my communication with an UTF-8 system with heaps of slow manual conversions,
or even worse, manually using pchars.

Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Felipe Monteiro de Carvalho
Felipe Monteiro de Carvalho wrote:
> On Tue, Jul 1, 2008 at 9:24 AM, Florian Klaempfl <[hidden email]> wrote:
>>> Why not just introduce a set of utf-16 routines with utf16string type
>>> like the new Delphi?
>> Because it's not cross platform.
>
> Why isn't it cross-platform?
>

Because using utf-16 on linux is very unnatural, same for utf-8 on
windows. Platforms like go32 don't even have any unicode. Coding
platform-independent but fast applications is really ugly with fixed
types.