Unicode file routines proposal

Re: Unicode file routines proposal

Mattias Gaertner
On Tue, 01 Jul 2008 09:35:35 +0200
Luca Olivetti <[hidden email]> wrote:

> En/na Marco van de Voort ha escrit:
> >>> They have a UTF-16/UCS-2 internal representation, same as MSEgui
> >>> which works very well and is fast and handy BTW.
> >> And len, slicing, etc. work as expected.
> >> Note that if you need characters beyond $ffff you have to compile
> >> it with wide unicode support, and in that case every character
> >> will use 4 bytes.
> >>
> > That's IMHO a faulty system. It requires you to choose between an
> > incomplete solution or making strings a horrible memory hog.
>
> OTOH using variable length characters will make string operations
> expensive (since you can't just multiply the index by 2 or 4 but you
> have to examine the string from the beginning, and the length in
> bytes isn't the same as the length in characters).

It's amazing that this argument comes up again and again. But I know
hardly any code that needs this index-to-char mapping. And the code
that needs it is seldom time-critical.
(I must admit, I feared the same some years ago. But the extra cost is
practically a myth.)
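To make the disputed cost concrete, here is a minimal sketch (Python, purely illustrative and not from the thread) of what the index-to-code-point mapping costs in UTF-8: finding the N-th code point requires a linear scan that skips continuation bytes.

```python
def utf8_codepoint_offset(data: bytes, index: int) -> int:
    """Byte offset of the index-th code point in UTF-8 data.
    Continuation bytes match the bit pattern 10xxxxxx, so every
    byte that does NOT match starts a new code point."""
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # start of a code point
            if count == index:
                return offset
            count += 1
    raise IndexError(index)

s = "héllo".encode("utf-8")            # 'é' occupies two bytes
print(utf8_codepoint_offset(s, 2))     # -> 3: the first 'l' starts at byte 3
```

Whether this linear scan matters in practice is exactly what is being debated here.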


> > But maybe that doesn't
> > matter for mere scripting languages (though I wonder then why they
> > didn't chose UTF-32 directly)
> >
> > Surrogates are not nice, but they were invented for a reason.
>
> Well, yes, they're a trade-off between performance and memory
> consumption, but I fear we're losing one of the advantages that
> pascal has over C: fast and simple string handling.

Most code only needs the number of bytes. And under Pascal this still
costs O(1).
In fact, if a UTF8String or UTF16String were added, I would say it
would be a waste of memory to store an extra PtrInt for the number of
characters.
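A small illustration of the distinction (Python, illustrative only, not from the thread): the byte count and the code-point count differ as soon as non-ASCII characters appear, and only the former can be stored and read back in O(1).

```python
s = "naïve"                  # 'ï' (U+00EF) needs 2 bytes in UTF-8
encoded = s.encode("utf-8")
print(len(encoded))          # 6 -- byte length, cheap to store with the string
print(len(s))                # 5 -- code points, requires a scan in UTF-8
```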


Mattias
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Luca Olivetti-2
> En/na Marco van de Voort ha escrit:
> >> with wide unicode support, and in that case every character will use 4
> >> bytes.
> >>
> > That's IMHO a faulty system. It requires you to choose between an incomplete
> > solution or making strings a horrible memory hog.
>
> OTOH using variable length characters will make string operations
> expensive (since you can't just multiply the index by 2 or 4 but you
> have to examine the string from the beginning, and the length in bytes
> isn't the same as the length in characters).

Yes. In the routines where you do random access on elements. Half of that
can be gained back since most string routines iterate from start to end
anyway.
 
> > But maybe that doesn't
> > matter for mere scripting languages (though I wonder then why they didn't
> > chose UTF-32 directly)
> >
> > Surrogates are not nice, but they were invented for a reason.
>
> Well, yes, they're a trade-off between performance and memory
> consumption, but I fear we're losing one of the advantages that pascal
> has over C: fast and simple string handling.

We also don't want to slip off the other end and turn into a scripting
language unsuitable for major programming.

Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Marco van de Voort
On Tue, 1 Jul 2008 09:23:52 +0200 (CEST)
[hidden email] (Marco van de Voort) wrote:

>[...]
> > multiple encodings:

Are we talking about one encoding per platform or two encodings for
all platforms?
Under Unix the encoding preference is clear: UTF-8.
Under Windows there are a lot of texts in legacy code pages plus the
UTF-16 W functions. So what encoding is preferred under Windows?
UTF-16 plus Ansi, like the A and W functions?


> > * More complex
> > * Innovative solution, no known example of a implementation of this
> > system exists = uncertainty if it works at all, or if it is
> > convenient for developers
> > * Depends on a not yet implemented string type
>
> Needs to be done anyway, since widestring on windows is COM, and that
> must also be retained. So it is about adding 1 vs 2, and the work
> will be huge with UTF-16 too, and to make it worthwhile the best,
> not the quickest solution should be sought.
>
> > * Potentially will have a higher performance than a single encoding
> > system, but only if you use this new special string type
>
> Certainly. Can you imagine loading a non trivial file in a
> tstringlist and saving it again and the heaps of conversions?

Auto conversion of the strings in a TStringList does not make much
sense (and will break a lot of code). That's why I propose to keep one
default string type. If almost everything uses one string type, then no
conversion will take place.

I think the main problem is that the RTL calls the Ansi functions
under Windows. Maybe we should not lose focus.

 
> Moreover, there is an important reason missing:
>
> * Being able to declare the outside world in the right encoding,
> without manually inserting conversions in each header.
>
> * Does not make one of the two core platforms (Unix/windows)
> effectively second rate.

Windows needs at least two encodings per se. So whatever is decided, the
Windows part needs some more work.

 

> * Can be done phased, IOW in the beginning lots of conversion, but
> later have more and more routines in the right encoding ready.
>
> > Single encoding:
> >
> > * Simple, proved solution
>
> Simple solution, complex implementation (needs conversions everywhere).
>
> > * Does not need any new string type, can start being implemented
> > immediately
>
> It does. And you can start making UTF-16 routines anyway
>
> > * Potentially has a lower performance due to string conversions.


Mattias

Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Mattias Gaertner
On Tuesday 01 July 2008 09.56:29 Mattias Gaertner wrote:

> On Tue, 01 Jul 2008 09:35:35 +0200
>
> Luca Olivetti <[hidden email]> wrote:
> > OTOH using variable length characters will make string operations
> > expensive (since you can't just multiply the index by 2 or 4 but you
> > have to examine the string from the beginning, and the length in
> > bytes isn't the same as the length in characters).
>
> It's amazing that this argument comes up again and again. But I know
> hardly any code that needs this index-to-char mapping. And the code
> that needs it is seldom time-critical.
> (I must admit, I feared the same some years ago. But the extra cost is
> practically a myth.)
>
A good example is text layout calculation, where it is necessary to iterate
over characters (glyphs) over and over again. MSEgui uses widestrings
directly; fpGUI converts to widestrings before processing (or do they use
the slow UTF-8 routines?). I once switched MSEgui to UTF-8 because of the
widestring problems in FPC; one or two months later, when I implemented
complex layout calculation with tabulators and justified text, I switched
back to widestrings...
This applies to a GUI framework; an RTL possibly has other priorities.

>
> Most code only needs the number of bytes. And under Pascal this still
> costs O(1).
> In fact, if a UTF8String or UTF16String were added, I would say it
> would be a waste of memory to store an extra PtrInt for the number of
> characters.
>
Agreed.
I think the best compromise for a GUI framework is reference-counted
widestrings where normally physical index = code point index. If one needs
characters which are not in the base plane, one must use surrogate pairs
and more complicated, slower processing. I assume this will be seldom
needed.
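For reference, a short sketch (Python, illustrative only, not from the thread) of the surrogate-pair case Martin expects to be rare: a code point outside the Basic Multilingual Plane occupies two UTF-16 code units, so physical index and code-point index diverge.

```python
clef = "\U0001D11E"                      # MUSICAL SYMBOL G CLEF, outside the BMP
units = clef.encode("utf-16-be")
print(len(units) // 2)                   # 2 UTF-16 code units for 1 code point
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")
print(hex(high), hex(low))               # 0xd834 0xdd1e -- a surrogate pair
# decoding the pair recovers the original code point
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(cp))                           # 0x1d11e
```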

Martin

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Mattias Gaertner
> On Tue, 1 Jul 2008 09:23:52 +0200 (CEST)

(note that this is all IMHO, not necessarily core viewpoint)

> Are we talking about one encoding per platform or two encodings for
> all platforms?

My proposition was: Two encodings, two stringtypes for all.

Florian's stance was to think about one string type that supports both
encodings. I don't like this, but we can only discuss that once Florian
has more details about his ideas.

Maybe, to keep the RTL from getting unwieldy, don't implement the more
exotic routines (like soundex and some of the more complicated formatting
routines) for all encodings. This will also allow optimizing binary size a
bit. (Don't want all the routines twofold? Add a define when compiling the
RTL, so that the relevant encoding's include files are not included and the
compiler inserts conversions.) Not that I think that is that important
(except for CE maybe).

Note that I want to do some (if not most) of the RTL string routine work. I
like doing them; they are like little puzzles. Though it will take some time
to get proficient in Unicode string coding.

> Under Unix the encoding preference is clear: UTF-8.
> Under Windows there are a lot of current code page texts and the
> UTF-16 W functions. So, what encoding is the preference under windows?
> UTF-16 plus Ansi like the A and W functions?

Split the win32 target into
- a Win9x-compatible target with legacy code pages, using the A functions
- an NT Unicode port that strictly uses the W functions.

The ports can share nearly all code, and use/perfect the already existing
IFDEF UNICODE blocks. (IOW the NT/UNICODE port defines UNICODE, the other
does not.)

> > > * Potentially will have a higher performance than a single encoding
> > > system, but only if you use this new special string type
> >
> > Certainly. Can you imagine loading a non trivial file in a
> > tstringlist and saving it again and the heaps of conversions?
>
> Auto conversion of the strings in a TStringList does not make much
> sense (and will break a lot of code). That's why I propose to keep one
> default string type.

> If almost everything uses one string type, then no
> conversion will take place.

It will on every communication with the external world. IOW all my db
exports will generally be UTF-8 on Unix and UTF-16 on Windows.

This one-size-fits-all attitude works fine for Lazarus, with only human
latency to worry about and small amounts of data (and that is already a
challenge to keep performant), but not for FPC as a whole, as all
processing is hit severely by it.

Most notably, in the single-string case, the only way to avoid the forced
encoding is to do everything OS-specific and manual. That is IMHO too poor.

> I think the main problem is that the RTL calls the Ansi functions
> under windows. Maybe we should not lose focus.

This is not about losing focus but gaining it. Out with the evolutionary
workarounds and start making decisions.

> > * Does not make one of the two core platforms (Unix/windows)
> > effectively second rate.
>
> Windows need per se at least two encodings. So whatever is decided, the
> windows part need some more work.

See above. If we have to support two totally different OS APIs (A and W),
they are two different targets. Period.

This also avoids the mess of changing all windows routines to be dynloaded,
and hopefully lessens the mutual breakage a bit.


Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Martin Schreiber
On Tue, 1 Jul 2008 10:23:32 +0200
Martin Schreiber <[hidden email]> wrote:

> On Tuesday 01 July 2008 09.56:29 Mattias Gaertner wrote:
> > On Tue, 01 Jul 2008 09:35:35 +0200
> >
> > Luca Olivetti <[hidden email]> wrote:
> > > OTOH using variable length characters will make string operations
> > > expensive (since you can't just multiply the index by 2 or 4 but
> > > you have to examine the string from the beginning, and the length
> > > in bytes isn't the same as the length in characters).
> >
> > It's amazing that this argument comes up again and again. But I know
> > hardly any code that needs this index-to-char mapping. And the code
> > that needs it is seldom time-critical.
> > (I must admit, I feared the same some years ago. But the extra cost
> > is practically a myth.)
> >
> A good example is text layout calculation where it is necessary to
> iterate over characters (glyphs) over and over again.

Text layout nowadays needs to consider font widths and Unicode specials.
Iterating from character to character should be hardly measurable
compared to this. For example, synedit does not yet care much about font
widths and Unicode specials, and the UTF-8 stepping is negligible.


> MSEgui uses
> widestrings directly, fpGUI converts to widestrings before processing
> (or use they the slow utf-8 routines ?). I once switched MSEgui to
> utf-8 because of the widestring problems in FPC, one or two months
> later when I implemented complex layout calculation with tabulators
> and justified text I switched back to widestrings...
> This belongs to a GUI framework, for a RTL are possibly other
> priorities.
>
> >
> > Most code only needs the number of bytes. And under Pascal this
> > still costs O(1).
> > In fact, if a UTF8String or UTF16String were added, I would say it
> > would be a waste of memory to store an extra PtrInt for the number
> > of characters.
> >
> Agreed.
> I think the best compromise for a GUI framework is reference-counted
> widestrings where normally physical index = code point index. If one
> needs characters which are not in the base plane, one must use
> surrogate pairs and more complicated, slower processing. I assume
> this will be seldom needed.

It depends on whether your code solves a special problem or is a library
that should work for everyone. The RTL and FCL should work for everyone.
So they must support full UTF-16 and cannot use a limited widestring.


Mattias

Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Marco van de Voort
On Tue, 1 Jul 2008 10:33:28 +0200 (CEST)
[hidden email] (Marco van de Voort) wrote:

> > On Tue, 1 Jul 2008 09:23:52 +0200 (CEST)
>
> (note that this is all IMHO, not necessarily core viewpoint)

Same for me: mine is not the Lazarus core viewpoint.

 
> > Are we talking about one encoding per platform or two encodings for
> > all platforms?
>
> My proposition was: Two encodings, two stringtypes for all.

Both at the same time?


> Florian's stand was thinking about one stringtype that supports both
> encodings. I don't like this, but we can only discuss that if Florian
> has more details about his ideas.

I think Marc had a similar idea: adding an encoding field (e.g. in
front of the length). But IMO it has some drawbacks.

 

> Maybe, to keep the RTL from getting unwieldy, don't implement the more
> exotic routines (like soundex and some of the more complicated
> formatting routines) for all encodings. This will also allow
> optimizing binary size a bit. (Don't want all the routines twofold?
> Add a define when compiling the RTL, so that the relevant encoding's
> include files are not included and the compiler inserts conversions.)
> Not that I think that is that important (except for CE maybe).
>
> Note that I want to do some (if not most) of the RTL stringroutine
> work. I like doing them, they are like little puzzles. Though it will
> take some time to get proficient in unicode string coding.
>
> > Under Unix the encoding preference is clear: UTF-8.
> > Under Windows there are a lot of current code page texts and the
> > UTF-16 W functions. So, what encoding is the preference under
> > windows? UTF-16 plus Ansi like the A and W functions?
>
> Split the win32 target into
> - a Win9x-compatible target with legacy code pages, using the A
> functions
> - an NT Unicode port that strictly uses the W functions.
>
> The ports can share nearly all code, and use/perfect the already
> existing IFDEF UNICODE blocks. (IOW the NT/UNICODE port defines
> UNICODE, the other does not.)

I guess, that means only one at a time.

 

>[...]
> > Auto conversion of the strings in a TStringList does not make much
> > sense (and will break a lot of code). That's why I propose to keep
> > one default string type.
>
> > If almost everything uses one string type, then no
> > conversion will take place.
>
> It will on every communication with the external world. IOW all my db
> exports will generally be UTF-8 on Unix and UTF-16 on Windows.

Maybe you misunderstood me here. This section is about the
multiple-encodings proposal. So I was proposing to use only one string
type in the RTL/FCL. It can be a different one for each platform.
As long as only one string type is used almost everywhere, no conversion
will take place, and you can therefore store UTF-8 in widestrings, or
UTF-16 in strings, or whatever binary data. Just as it is at the moment.
Strings are not only text. I think this concept is very important in
Pascal, and breaking it will create a bigger incompatibility than
CodeGear does with its string-to-widestring move.

 

> This one size fits all attitude works fine for Lazarus, with only
> human latency to worry about, and small amounts of data, (and that is
> already a challenge to keep performant) but not for FPC as a whole,
> as all processing is hit severely by it.
>
> Most notably, in the single string case, the only way to avoid the
> forced encoding is to everything OS specific and manual. That is IMHO
> too poor.
>
> > I think the main problem is that the RTL calls the Ansi functions
> > under windows. Maybe we should not lose focus.
>
> This is not about losing focus but gaining it. Out with the
> evolutionary workarounds and start making decisions.

ok

 

> > > * Does not make one of the two core platforms (Unix/windows)
> > > effectively second rate.
> >
> > Windows needs at least two encodings per se. So whatever is decided,
> > the Windows part needs some more work.
>
> See above. If we have to support two totally different OS api's (A
> and W) they are two different targets. Period.
>
> This also avoids the mess of changing all windows routines to be
> dynloaded, and hopefully lessen the mutual breaking a bit.

Two different windows targets. Wow, a big step.

Mattias

Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Mattias Gaertner
On Tuesday 01 July 2008 10.35:00 Mattias Gaertner wrote:
> > A good example is text layout calculation where it is necessary to
> > iterate over characters (glyphs) over and over again.
>
> Text layout nowadays needs to consider font widths and Unicode
> specials. Iterating from character to character should be hardly
> measurable compared to this. For example, synedit does not yet care
> much about font widths and Unicode specials, and the UTF-8 stepping
> is negligible.
>
I did it with UTF-8 and UCS-2; believe me, it was not negligible.

> > I think the best compromise for a GUI framework is reference-counted
> > widestrings where normally physical index = code point index. If one
> > needs characters which are not in the base plane, one must use
> > surrogate pairs and more complicated, slower processing. I assume
> > this will be seldom needed.
>
> It depends if your code should solve a special problem or if you
> write a library that should work for all. The RTL and FCL should work
> for all. So they must support UTF-16 and can not use a
> limited widestring.
>
That's why I wrote "for a GUI framework". There we always have the
possibility to access the OS with optimized routines, independent of the
RTL and FCL, and to provide optimized string-handling routines for the
chosen internal string representation. What is necessary for the toolkit
user is automatic conversion from the GUI framework's internal string
type to the system encoding. That already exists for widestrings.

Martin

Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Mattias Gaertner
> On Tue, 1 Jul 2008 10:33:28 +0200 (CEST)
> > > all platforms?
> >
> > My proposition was: Two encodings, two stringtypes for all.
>
> Both at the same time?

Yes, utf8string and utf16string. Whatever Tiburon introduces would be
aliased to utf16string, so that will be compatible on non-Windows too. And
the UTF-16 Tiburon code can easily communicate with the outside world.

> > Florian's stand was thinking about one stringtype that supports both
> > encodings. I don't like this, but we can only discuss that if Florian
> > has more details about his ideas.
>
> I think, Marc had a similar idea. Adding an encoding field (e.g. in
> front of the length). But IMO it has some drawbacks.

Yes. Any manual string handling, which already gets more difficult, gets
more expensive. Also because array dereferencing (which ignores surrogates,
but is still a basic building block for string routine implementations)
becomes expensive, or needs to be done with pointers.

> > It will on every communication with the external world. IOW all my db
> > exports will generally be UTF-8 on Unix and UTf-16 on Windows.
>
> Maybe you misunderstood me here. This section is about multiple encoding
> proposal. So I was proposing to use only one string type in
> RTL/FCL.

> It can be a different one for each platform.

Ok. That is somewhat different. One size fits all (UTF-16 everywhere) is not
an option for me. It's the path of least resistance, but is more for
languages that have an ivory-tower concept and want to keep the real world
at arm's length.

So then different platforms, different encodings. Actually that was my first
thought/proposal too, but that precludes any possible solution for Tiburon
compatibility before we even start, and introduces a portability barrier.
(Want to recompile for Linux? First fix all your UTF-16 string routines so
that they support UTF-8 under ifdef. That is a hard sell.)

IMHO that is not a sustainable long-term situation, which is why I changed
to the two-string-types solution.

That has some disadvantages too, most notably adding even more string types
and possible auto-conversion pitfalls. But I think it is an experiment that
should at least have been tried.

Note that this is totally separate from what Lazarus should do. Lazarus can
IMHO happily use the UTF16 string type exclusively. I'm concerned with the
base system.

> As long as almost everywhere only one string is used no conversion can
> take place and you can therefore store UTF8 in widestrings or UTF-16 in
> strings or whatever binary data.

It still requires manual conversion at the borders (any input or output to
the system, libraries, disk). But a lot less, since only sources in an
encoding "foreign" to the system need manual conversion code inserted.

> Just as it is at the moment. Strings are not only text. I think this
> concept is very important in pascal and breaking this will create a bigger
> incompatibility than Codegear does with it string to widestring move.

???

> > See above. If we have to support two totally different OS api's (A
> > and W) they are two different targets. Period.
> >
> > This also avoids the mess of changing all windows routines to be
> > dynloaded, and hopefully lessen the mutual breaking a bit.
>
> Two different windows targets. Wow, a big step.

Yes, but long-term unavoidable IMHO, to avoid the situation we had with DOS
in years past, where the port is always trailing the Tier 1 ports.
(Though I saw Giulio and Tomas managed to keep it working again, but only
after releases of it were postponed.)

W9x support is being dropped on all sides. However, for me that is not
necessary if we split the stuff now, while the w9x support is still
qualitatively OK. Even though w9x and NT are both Windows, in some ways
they differ more than e.g. FreeBSD and Linux.

Doing the split before major NT-requiring changes (read: Unicode, but also
e.g. symlink support?) will make the change more evolutionary, and
branching from a moment where the codebase is still proven to work on w32
will ensure that it has decent quality for quite some time.

In the long term it will also save a lot of work, like crazy attempts to
maintain the status quo with insane workarounds like dynloading all API
routines, etc.

Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Martin Schreiber
Quoting Martin Schreiber <[hidden email]>:

> On Tuesday 01 July 2008 10.35:00 Mattias Gaertner wrote:
> > > A good example is text layout calculation where it is necessary to
> > > iterate over characters (glyphs) over and over again.
> >
> > Text layout nowadays need to consider font widths and unicode specials.
> > Iterating from character to character should be hardly measurable
> > compared to this. For example synedit does not yet care much about font
> > widths and unicode specials and the UTF-8 stepping is negligible.
> >
> I did it with UTF-8 and UCS-2; believe me, it was not negligible.

Where is the code in msegui? (the code that was formerly UTF-8, not the old
UTF-8 code)


> > > I think the best compromise for a GUI framework is
> > > reference-counted widestrings where normally physical index =
> > > code point index. If one needs characters which are not in the
> > > base plane, one must use surrogate pairs and more complicated,
> > > slower processing. I assume this will be seldom needed.
> >
> > It depends if your code should solve a special problem or if you
> > write a library that should work for all. The RTL and FCL should work
> > for all. So they must support UTF-16 and can not use a
> > limited widestring.
> >
> That's why I wrote "for a GUI framework". There we always have the
> possibility to access the OS with optimized routines, independent of
> the RTL and FCL, and to provide optimized string-handling routines
> for the chosen internal string representation. What is necessary for
> the toolkit user is automatic conversion from the GUI framework's
> internal string type to the system encoding. That already exists for
> widestrings.

Ah, OK. Yes, a GUI framework is a special case.


Mattias


Re: Unicode file routines proposal

Martin Schreiber
On Tuesday 01 July 2008 12.19:26 Mattias Gärtner wrote:
> Quoting Martin Schreiber <[hidden email]>:
> >
> > I did it with UTF-8 and UCS-2; believe me, it was not negligible.
>
> Where is the code in msegui? (the code that was formerly UTF-8, not the old
> UTF-8 code)
>
lib/common/kernel/msedrawtext.pas, mserichstring.pas, msestrings.pas.

http://sourceforge.net/projects/mseide-msegui

Martin

Re: Unicode file routines proposal

Mattias Gaertner
Quoting Martin Schreiber <[hidden email]>:

> On Tuesday 01 July 2008 12.19:26 Mattias Gärtner wrote:
> > Quoting Martin Schreiber <[hidden email]>:
> > >
> > > I did it with UTF-8 and UCS-2; believe me, it was not negligible.
> >
> > Where is the code in msegui? (the code that was formerly UTF-8, not the old
> > UTF-8 code)
> >
> lib/common/kernel/msedrawtext.pas, mserichstring.pas, msestrings.pas.
>
> http://sourceforge.net/projects/mseide-msegui

Thanks. Can you be a little bit more specific? I see a lot of functions.
Most of them can treat UTF-8 as an 8-bit encoding, unless you want to do
something special.


Mattias


Summary on Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Marco van de Voort
I read most of the discussion and I think there is no way around a
string type containing an encoding field. First, it also allows
supporting non-UTF encodings or UTF-32. Having the encoding field does
not mean that all targets support all encodings. In case an encoding is
not supported, the target might either use some default operation, as
the current widestring manager does, or it might spit out an exception.
Having such a string type requires a manager which not only stores the
procedures to handle this string type but also contains information
about which encoding to prefer or even use solely. Combining this with
several ifdefs and compiler switches makes this approach very flexible
and fast, and allows everybody (FPC people, Lazarus, MSE) to adapt
things to their needs.

Just an example: to overcome the indexing problem efficiently when using
an encoding field (this is not about surrogates), we could do the
following: introduce a compiler switch {$unicodestringindex
default,byte,word,dword}. In default mode the compiler gets a shifting
value from the encoding field (this is 4 bytes anyway and could be
split into 1 byte shifting, 2 bytes encoding, 1 byte reserved). In the
other modes the compiler uses the given size when indexing. For example,
a Tiburon (or whatever it is called) compatibility switch could set this
to word.
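A rough sketch of how such a packed field could work (Python, purely illustrative; the field layout and the encoding ids below are hypothetical, following the 1-byte-shift / 2-byte-encoding split described here): the shift byte turns indexing into a constant-time shift per element size.

```python
# Hypothetical 4-byte field: 1 byte shift | 2 bytes encoding id | 1 byte reserved.
# Encoding ids are invented for illustration only.
UTF8, UTF16, UTF32 = 1, 2, 3

def pack_field(shift: int, encoding: int) -> int:
    """Pack the element-size shift and the encoding id into one 32-bit value."""
    return (shift & 0xFF) | ((encoding & 0xFFFF) << 8)

def index_to_offset(field: int, index: int) -> int:
    """Byte offset of element i: element size is 1 << shift bytes."""
    shift = field & 0xFF
    return index << shift

utf16_field = pack_field(1, UTF16)       # shift=1 -> 2-byte elements
print(index_to_offset(utf16_field, 5))   # -> 10: element 5 starts at byte 10
```

In "default" mode the compiler would read the shift at run time from this field; the byte/word/dword modes would hard-code it.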

The approach has the big advantage that you really need all procedures
only once if desired. For example, Linux would get only UTF-8 routines
by default; UTF-16 is converted to UTF-8 at the entry of the helper
procedures if needed. Usually no conversion would be necessary, because
UTF-16 is seldom seen in Linux applications, so only the check whether
the input strings are really UTF-8 is necessary; this is very cheap
because the data is already in a cache line anyway.

Even more, this variable-encoding approach also allows people using
languages where UTF-8 is more memory-expensive than UTF-16 (numerically,
the majority of mankind) to use UTF-8/UTF-16 as needed to save memory,
with only a few modifications.

I know this approach contains some hacks and requires some work, but I
think this is the only way to solve things once and for all.


Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Mattias Gaertner
On Tuesday 01 July 2008 13.13:19 Mattias Gärtner wrote:

> Quoting Martin Schreiber <[hidden email]>:
> > On Tuesday 01 July 2008 12.19:26 Mattias Gärtner wrote:
> > > Quoting Martin Schreiber <[hidden email]>:
> > > > I did it with UTF-8 and UCS-2; believe me, it was not negligible.
> > >
> > > Where is the code in msegui? (the code that was formerly UTF-8, not the
> > > old UTF-8 code)
> >
> > lib/common/kernel/msedrawtext.pas, mserichstring.pas, msestrings.pas.
> >
> > http://sourceforge.net/projects/mseide-msegui
>
> Thanks. Can you be a little bit more specific? I see a lot of functions. Most
> of them can treat UTF-8 as an 8-bit encoding. Unless you want to do something
> special.
>
In these routines, length(widestring), widestring[index], pwidechar^,
pwidechar[index], pwidechar + offset, pwidechar - pwidechar and
inc(pwidechar)/dec(pwidechar) are used often. This can't be done with UTF-8
strings.
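The cost Martin describes can be made concrete: widestring[Index] is a shift and an add, while finding the Index-th code point of a UTF-8 string means scanning from the start. A sketch (not the MSEgui code):

```pascal
{ O(n) search for the byte position of the Index-th code point (1-based)
  in a UTF-8 string; widestring[Index] does the same job in O(1). }
function Utf8CodePointStart(const S: AnsiString; Index: SizeInt): SizeInt;
var
  P, Count: SizeInt;
begin
  P := 1;
  Count := 1;
  while (P <= Length(S)) and (Count < Index) do
  begin
    Inc(P);  { move past the lead byte of the current code point }
    { skip its continuation bytes (10xxxxxx) }
    while (P <= Length(S)) and ((Ord(S[P]) and $C0) = $80) do
      Inc(P);
    Inc(Count);
  end;
  Result := P;
end;
```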

Martin

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Marco van de Voort
On Tue, Jul 1, 2008 at 4:23 AM, Marco van de Voort <[hidden email]> wrote:
> Certainly. Can you imagine loading a non-trivial file in a tstringlist and
> saving it again, and the heaps of conversions?

And how do you know that the file to be loaded will be in the system
encoding? We should simply not do any conversion or make any assumption
when loading a file into a TStringList, so nothing changes here.

We are talking about strings like the filename in LoadFromFile, not
about the string holding the contents. That would always be an
ansistring, and if someone needs to load a UTF-16 file he needs to
build a TWideStringList.

> Moreover, there is an important reason missing:
>
> * Being able to declare the outside world in the right encoding, without
>  manually inserting conversions in each header.

This has nothing to do with it. With a fixed encoding you can also
have automatic conversions.

I bet you would convert automatically from whatever to ANSI when going
to an ansistring, but Lazarus uses UTF-8 in ansistrings.

We do manual conversions in Lazarus because FPC lacks a solution for
automatic conversion using UTF-8 in ansistrings.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt
In reply to this post by Florian Klaempfl


On Tue, 1 Jul 2008, Florian Klaempfl wrote:

> I read most of the discussion and I think there is no way around a
> string type containing an encoding field.

[cut]

> I know this approach contains some hacks and requires some work, but I
> think this is the only way to solve things once and for all.

I think it is the most promising and extensible proposal,
so I'm all for it.

Michael.


Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Felipe Monteiro de Carvalho
> On Tue, Jul 1, 2008 at 4:23 AM, Marco van de Voort <[hidden email]> wrote:
> > Certainly. Can you imagine loading a non-trivial file in a tstringlist and
> > saving it again, and the heaps of conversions?
>
> And how do you know that the file to be loaded will be in the system
> encoding?

Not at all. That is the programmer's problem and depends on what he
knows (or can detect); it is not the system's choice.

The point is that if I only have UTF-16, I must manually convert incoming
UTF-8 data, do some minor processing (like appending a string to a
tstringlist), and then convert it back before writing.

And I have to, since UTF-16 is my only type.
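Spelled out, the round trip looks roughly like this (UTF8Decode and UTF8Encode are real RTL functions; TWideStringList and the file helpers are hypothetical names):

```pascal
{ The double conversion described above, assuming a UTF-16-only string list.
  TWideStringList, ReadUtf8Line and WriteUtf8Line are hypothetical. }
var
  Lines: TWideStringList;
  Utf8Line: AnsiString;
begin
  Utf8Line := ReadUtf8Line;             { data arrives as UTF-8 }
  Lines.Add(UTF8Decode(Utf8Line));      { conversion 1: UTF-8 -> UTF-16 }
  { ... minor processing, e.g. appending more strings ... }
  WriteUtf8Line(UTF8Encode(Lines[0]));  { conversion 2: UTF-16 -> UTF-8 }
end;
```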

> We should simply not do any conversion or any assumption when loading a
> file in a TStringList, so nothing changes here.

You have an assumption, since your TStringList would hold UTF-16 strings only.
 
> We are talking about strings like the filename in LoadFromFile, and
> not about the string to hold the contents. This would always be an
> ansistring and if someone needs to load a utf-16 file he needs to
> build a TWideStringList.

A solution for Unicode should be for everything, not just for UIs and
filenames. I should be able to carry data within it as well, because otherwise
we will be having this discussion again next week when Joost needs Unicode
for DB-related issues, etc.
 
> > Moreover, there is an important reason missing:
> >
> > * Being able to declare the outside world in the right encoding, without
> >  manually inserting conversions in each header.
>
> This has nothing to do with this. With a fixed encoding you can also
> have automatic conversions.

How? I can't express the foreign encoding because I have no type for it. I
only have ansistring, which can mean pretty much anything, and that
constitutes no compile-time safety.
 
> I bet you would convert automatically from whatever to ansi when going
> to a ansistring, but Lazarus uses utf-8 in ansistrings.

But that is lazarus specific.
 
> We do manual conversions in Lazarus because FPC misses a solution for
> automatic conversion using utf-8 in ansistrings.

Because the decision to put UTF-8 in ansistrings is too fundamentally flawed
to implement such a thing, since it is perfectly legal for an ansistring not
to contain UTF-8.

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Michael Van Canneyt
> On Tue, 1 Jul 2008, Florian Klaempfl wrote:
>
> > I read most of the discussion and I think there is no way around a
> > string type containing an encoding field.
>
> [cut]
>
> > I know this approach contains some hacks and requires some work, but I
> > think this is the only way to solve things once and for all.
>
> I think it is the most promising and extensible proposal,
> so I'm all for it.

I read it briefly, and I still don't like it. I need more time to prepare a
response, though.

Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
In reply to this post by Felipe Monteiro de Carvalho
A string type whose encoding you don't know is very inconvenient,
because you need to convert it to something else any time you want to
use a routine that requires knowing the encoding.

How will Pos be implemented? And UpperCase? Any cross-platform string
manipulation routine will suddenly become single-platform. Anyone
implementing a string routine will need to implement it several times,
not to mention that no string routines currently exist for such a new
type.

The more I think about it, the more I am sure that a string type whose
encoding we know nothing about is very inconvenient for developers.
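A sketch of what this would mean for Pos on an encoding-tagged string: every routine either dispatches per encoding or converts first (all names here are hypothetical):

```pascal
{ Hypothetical dispatch a Pos implementation would need for a string type
  carrying an encoding field. None of these types or helpers exist. }
function TaggedPos(const Needle, Haystack: TaggedString): SizeInt;
begin
  case EncodingOf(Haystack) of
    encUtf8:  Result := Utf8Pos(Needle, Haystack);
    encUtf16: Result := Utf16Pos(Needle, Haystack);
  else
    { unknown or mixed encodings: convert to a common one first }
    Result := Utf16Pos(ToUtf16(Needle), ToUtf16(Haystack));
  end;
end;
```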

As for UCS-2, this is absurd. We certainly cannot have half the Chinese
characters ignored in the Free Pascal RTL.

--
Felipe Monteiro de Carvalho

Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Marco van de Voort
Marco van de Voort wrote:

>> On Tue, 1 Jul 2008, Florian Klaempfl wrote:
>>
>>> I read most of the discussion and I think there is no way around a
>>> string type containing an encoding field.
>> [cut]
>>
>>> I know this approach contains some hacks and requires some work, but I
>>> think this is the only way to solve things once and for all.
>> I think it is the most promising and extensible proposal,
>> so I'm all for it.
>
> I read it briefly, and I still don't like it. I need more time to prepare a
> response, though.

Keep in mind in your response that we also want to handle formats other
than UTF-8 or UTF-16 if needed :)