Unicode file routines proposal

Re: Summary on Re: Unicode file routines proposal

Felipe Monteiro de Carvalho
I think you can still do the byte-size operations this way:

ForceEncoding(S, iso-xxxx)
P:=PChar(S);
While (P^<>#0) do
 SomeByteSizedOperation;

Similarly for any other code supposing an encoding.
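
A minimal sketch of the byte-wise loop above, made runnable. ForceEncoding is only a proposal in this thread, so it is left out here and the string is assumed to hold single-byte data already. CountHighBytes is a made-up name standing in for SomeByteSizedOperation. Note the Inc(P): without it the loop never advances.

```pascal
program ByteLoop;
{$mode objfpc}{$H+}

// Hypothetical byte-sized operation: counts the bytes of S that lie
// outside 7-bit ASCII (high bit set).
function CountHighBytes(const S: AnsiString): Integer;
var
  P: PChar;
begin
  Result := 0;
  P := PChar(S);            // raw, #0-terminated view of the string's bytes
  while P^ <> #0 do
  begin
    if Ord(P^) >= $80 then
      Inc(Result);
    Inc(P);                 // step to the next byte
  end;
end;

begin
  WriteLn(CountHighBytes('abc'));        // 0
  WriteLn(CountHighBytes('a'#$E4'b'));   // 1
end.
```

A loop of this shape stops at the first #0 byte, so it only sees the whole string if the data contains no embedded zero bytes.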

--
Felipe Monteiro de Carvalho
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Summary on Re: Unicode file routines proposal

Michael Van Canneyt


On Tue, 1 Jul 2008, Felipe Monteiro de Carvalho wrote:

> I think you can still do the byte-size operations this way:
>
> ForceEncoding(S, iso-xxxx)
> P:=PChar(S);
> While (P^<>#0) do
>  SomeByteSizedOperation;
>
> Similarly for any other code supposing an encoding.

Absolutely.

Michael.

Re: Summary on Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Florian Klaempfl
On Tuesday 01 July 2008 17.06:34 Florian Klaempfl wrote:

> Michael Van Canneyt wrote:
> > On Tue, 1 Jul 2008, Paul Ishenin wrote:
> >> Michael Van Canneyt wrote:
> >>> You can still do C:=S[i]. What you cannot do is
> >>>
> >>>   P:=PChar(S);
> >>>   While (P^<>#0) do
> >>>    SomeByteSizedOperation;
> >>
> >> Why can't you? PChar(S) should represent S as raw bytes. If you know
> >> what you are doing, it will do no harm. Otherwise, if you corrupt the
> >> string, then you are responsible for all the problems you get.
> >
> > Obviously you can :-)
> > But what I meant was that you shouldn't expect old code
> > that relied on 1-byte characters to work.
>
> It is supposed to break on utf-xx or whatever anyways.

Would this new multi-encoding string replace a reference counted widestring
type on Windows?
I'd like to repeat the need for an "as fast as possible" (reference counted)
widestring on all platforms, one which offers all the possibilities of optimized
low-level pointer stuff like widestrings on Linux, which are ideal for MSEgui
Unicode handling.

Martin

Re: Summary on Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Florian Klaempfl
Quoting Florian Klaempfl <[hidden email]>:

> Michael Van Canneyt wrote:
> >
> > On Tue, 1 Jul 2008, Paul Ishenin wrote:
> >
> >> Michael Van Canneyt wrote:
> >>> You can still do C:=S[i]. What you cannot do is
> >>>
> >>>   P:=PChar(S);
> >>>   While (P^<>#0) do
> >>>    SomeByteSizedOperation;
> >>>
> >> Why can't you? PChar(S) should represent S as raw bytes. If you know what
> >> you are doing, it will do no harm. Otherwise, if you corrupt the string,
> >> then you are responsible for all the problems you get.
> >
> > Obviously you can :-)
> > But what I meant was that you shouldn't expect old code
> > that relied on 1-byte characters to work.
>
> It is supposed to break on utf-xx or whatever anyways.

The above works normally for UTF-8. UTF-8 was designed for this. That's why most
ansistring code works with UTF-8. Switching to UTF-8 was easy. Switching to
UTF-16 needs more work.
And a multi encoded string will break even more things. Means: more work. The
question is: how much more?


Mattias


Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Martin Schreiber
Quoting Martin Schreiber <[hidden email]>:

> On Tuesday 01 July 2008 13.13:19 Mattias Gärtner wrote:
> > Quoting Martin Schreiber <[hidden email]>:
> > > On Tuesday 01 July 2008 12.19:26 Mattias Gärtner wrote:
> > > > Quoting Martin Schreiber <[hidden email]>:
> > > > > I did it with utf-8 and UCS-2, believe me, it was not negligible.
> > > >
> > > > Where is the code in msegui? (the code that was formerly UTF-8, not the
> > > > old UTF-8 code)
> > >
> > > lib/common/kernel/msedrawtext.pas, mserichstring.pas, msestrings.pas.
> > >
> > > http://sourceforge.net/projects/mseide-msegui
> >
> > Thanks. Can you be a little bit more specific? I see a lot of functions. Most
> > of them can treat UTF-8 as 8bit encoding. Unless you want to do something
> > special.
> >
> In these routines length(widestring), widestring[index], pwidechar^,
> pwidechar[index], pwidechar + offset, pwidechar - pwidechar and
> inc(pwidechar)/dec(pwidechar) are used often. This can't be done with utf-8
> strings.

Ehm, do you know, that UTF-8 has the advantage, that many ascii functions work
without change?
For example ReplaceChar or searching a substring?
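
Why this works can be sketched as follows: in UTF-8 every byte of a multi-byte sequence has its high bit set, and lead bytes never look like continuation bytes, so an ASCII byte or a complete UTF-8 needle cannot match in the middle of another character. ReplaceAsciiChar is a hypothetical stand-in for the ReplaceChar mentioned above. Pos is the standard RTL function and returns a byte index, not a code point index.

```pascal
program Utf8Bytewise;
{$mode objfpc}{$H+}

const
  // 'Gärtner, Mattias' encoded as UTF-8: ä is the two bytes $C3 $A4.
  Sample = 'G'#$C3#$A4'rtner, Mattias';

// Replaces every occurrence of the ASCII character OldCh by NewCh, byte by byte.
// Safe on UTF-8 as long as both characters are below #$80.
function ReplaceAsciiChar(const S: AnsiString; OldCh, NewCh: Char): AnsiString;
var
  i: Integer;
begin
  Result := S;
  for i := 1 to Length(Result) do
    if Result[i] = OldCh then
      Result[i] := NewCh;
end;

begin
  WriteLn(Pos('rtner', Sample));     // 4: a byte index (G, $C3, $A4, then 'r')
  WriteLn(Pos(#$C3#$A4, Sample));    // 2: the UTF-8 needle is found byte-wise too
  WriteLn(ReplaceAsciiChar(Sample, ',', ';') = 'G'#$C3#$A4'rtner; Mattias');  // TRUE
end.
```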

Mattias


Re: Summary on Re: Unicode file routines proposal

Florian Klaempfl
In reply to this post by Mattias Gaertner
Mattias Gärtner wrote:

> Quoting Florian Klaempfl <[hidden email]>:
>
>> Michael Van Canneyt wrote:
>>> On Tue, 1 Jul 2008, Paul Ishenin wrote:
>>>
>>>> Michael Van Canneyt wrote:
>>>>> You can still do C:=S[i]. What you cannot do is
>>>>>
>>>>>   P:=PChar(S);
>>>>>   While (P^<>#0) do
>>>>>    SomeByteSizedOperation;
>>>>>
>>>> Why can't you? PChar(S) should represent S as raw bytes. If you know what
>>>> you are doing, it will do no harm. Otherwise, if you corrupt the string,
>>>> then you are responsible for all the problems you get.
>>> Obviously you can :-)
>>> But what I meant was that you shouldn't expect old code
>>> that relied on 1-byte characters to work.
>> It is supposed to break on utf-xx or whatever anyways.
>
> The above works normally for UTF-8. UTF-8 was designed for this. That's why most
> ansistring code works with UTF-8. Switching to UTF-8 was easy. Switching to
> UTF-16 needs more work.
> And a multi encoded string will break even more things. Means: more work. The
> question is: how much more?

What will break? As I said, the tflorianstring manager will get some
variables which allow you to control the behaviour of this string. For
example you could tell it that all strings should be utf-8 encoded. Of
course, you get into trouble if some user plays unfair but you could
still protect your code with some EnforceUTF8Encoding. It's exactly the
same as with the current lazarus solution. If the user messes with the
abused ansistrings, you're in trouble but with the tflorianstring you
have a runtime means to detect the mess (wrong encoding).
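
For comparison, later FPC releases (3.0 and up) attach a code page to every AnsiString, which allows the kind of runtime check described here. This is a sketch against that later RTL, not the tflorianstring design itself: StringCodePage, RawByteString, UTF8String and CP_UTF8 come from the System unit of those releases, while IsTaggedUTF8 is a hypothetical helper name.

```pascal
program EncodingTag;
{$mode objfpc}{$H+}

// Hypothetical helper: reports whether S carries the UTF-8 code page tag.
// It only inspects the tag; it does not validate the bytes themselves.
function IsTaggedUTF8(const S: RawByteString): Boolean;
begin
  Result := StringCodePage(S) = CP_UTF8;
end;

var
  U: UTF8String;
begin
  U := 'abc';   // UTF8String is declared with the UTF-8 code page
  if IsTaggedUTF8(U) then
    WriteLn('tagged as UTF-8')
  else
    WriteLn('some other code page: ', StringCodePage(U));
end.
```

A guard in the spirit of EnforceUTF8Encoding could build on such a check and convert or reject strings that carry a different tag.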

Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Mattias Gaertner
On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:

> >
> > In these routines length(widestring), widestring[index], pwidechar^,
> > pwidechar[index], pwidechar + offset, pwidechar - pwidechar and
> > inc(pwidechar)/dec(pwidechar) are used often. This can't be done with
> > utf-8 strings.
>
> Ehm, do you know, that UTF-8 has the advantage, that many ascii functions
> work without change?
> For example ReplaceChar or searching a substring?
>
Sure, but for layout calculation and the like we need fast access to
codepoints.

Martin

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Florian Klaempfl
> Mattias Gärtner wrote:
> example you could tell it that all strings should be utf-8 encoded. Of
> course, you get into trouble if some user plays unfair but you could
> still protect your code with some EnforceUTF8Encoding. It's exactly the

See earlier mail. Tiburon code shouldn't need mods. That will make sharing
projects with Delphi too difficult (and make FPC look cumbersome).

Note that while I don't like the polymorphic string, it doesn't have to kill
that, just in some compiler mode (possibly Tib compat) it should be possible
to control the encoding that goes into a routine. IOW if a module is declared
in this mode, the compiler will call the force routine before the call or in
the routine.

Re: Summary on Re: Unicode file routines proposal

Marc Weustink
In reply to this post by Florian Klaempfl
Florian Klaempfl wrote:

[..some of my thoughts..]


this suits a construct I saw somewhere:

type
   SomeString = type String(CP_KOI8);

Marc




Re: Unicode file routines proposal

Marc Weustink
In reply to this post by Martin Schreiber
Martin Schreiber wrote:

> On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:
>>> In these routines length(widestring), widestring[index], pwidechar^,
>>> pwidechar[index], pwidechar + offset, pwidechar - pwidechar and
>>> inc(pwidechar)/dec(pwidechar) are used often. This can't be done with
>>> utf-8 strings.
>> Ehm, do you know, that UTF-8 has the advantage, that many ascii functions
>> work without change?
>> For example ReplaceChar or searching a substring?
>>
> Sure, but for layout calculation and the like we need fast access to
> codepoints.

The only way to be sure is using utf-32 in this case. (or not supporting
unicode)
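
The cost being argued about can be sketched like this: in UTF-8, finding the N-th code point means scanning from the start and skipping continuation bytes (those of the form 10xxxxxx), while a fixed-width encoding gets there with plain indexing. Utf8CodePointStart is a hypothetical helper and assumes valid UTF-8 input.

```pascal
program Utf8Index;
{$mode objfpc}{$H+}

const
  // 'Gärtner' encoded as UTF-8: 8 bytes, 7 code points.
  Sample = 'G'#$C3#$A4'rtner';

// Returns the 1-based byte index at which the N-th code point (1-based)
// starts, or 0 if S holds fewer than N code points. Linear in Length(S).
function Utf8CodePointStart(const S: AnsiString; N: Integer): Integer;
var
  i, Seen: Integer;
begin
  Result := 0;
  Seen := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then   // not a continuation byte
    begin
      Inc(Seen);
      if Seen = N then
        Exit(i);
    end;
end;

begin
  WriteLn(Utf8CodePointStart(Sample, 2));   // 2: 'ä' starts at byte 2
  WriteLn(Utf8CodePointStart(Sample, 3));   // 4: 'r' starts after the two bytes of 'ä'
  WriteLn(Utf8CodePointStart(Sample, 8));   // 0: there are only 7 code points
end.
```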

Marc



Re: Unicode file routines proposal

Martin Schreiber
On Tuesday 01 July 2008 22.23:12 Marc Weustink wrote:

> Martin Schreiber wrote:
> > On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:
> >>> In these routines length(widestring), widestring[index], pwidechar^,
> >>> pwidechar[index], pwidechar + offset, pwidechar - pwidechar and
> >>> inc(pwidechar)/dec(pwidechar) are used often. This can't be done with
> >>> utf-8 strings.
> >>
> >> Ehm, do you know, that UTF-8 has the advantage, that many ascii
> >> functions work without change?
> >> For example ReplaceChar or searching a substring?
> >
> > Sure, but for layout calculation and the like we need fast access to
> > codepoints.
>
> The only way to be sure is using utf-32 in this case. (or not supporting
> unicode)
>
I'd like to repeat:
We are talking about the MSEgui framework here, not about the FPC RTL or FCL.

In MSEgui we need fast internal string and character handling routines which
support UCS-2. UCS-2 is enough even for the single active Chinese user I know
of. I don't want to slow down MSEgui for 100% of the MSEgui users because of
the theoretical possibility that someone needs code points which don't fit
into the base plane. If someone needs the whole Unicode range he can use
surrogate pairs. They will not show correctly on screen, but all other tasks
can be done. It is the same situation as with ansistring/utf8string.

The use of 16bit instead of 8bit as the storage base of the MSEgui string
representation has the big advantage that 100% of the MSEgui users can
access characters by a simple linear index. Because MSEgui is mainly used by
Russian speaking people, this would probably be less than 20% in the case of
8bit. Most of the European users would be out of luck because of the umlauts
and accents.

Another need of the MSEgui users and the MSEgui routines is converting the
internal string representation to the current 8bit system encoding. FPC
supports this perfectly by the widestringmanager already.

Xlib and gdi both have a widestring interface. The only drawback I see is that
there is no reference counted FPC widestring type on Windows at the moment.
The upcoming new Delphi version uses a simple reference counted widestring as
its string base type too, AFAIK.
So if FPC decides to implement a reference counted widestring on Windows for
Delphi compatibility, it should be available in OBJFPC mode too.

Conclusion:
MSEgui, and probably most of the MSEgui users too, has no need for a multi
encoding string type at the expense of slower code and more memory
consumption; a reference counted widestring on Windows would be enough.
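
To illustrate the trade-off described above: a 16bit string gives constant-time S[i] access, but S[i] is a UTF-16 code unit, and a code point outside the base plane takes two of them (a surrogate pair). The helper names below are made up for this sketch; only Length and indexing on WideString are standard.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Counts code points; a high surrogate followed by a low one counts once.
function CodePointCount(const S: WideString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    if IsHighSurrogate(S[i]) and (i < Length(S)) and IsLowSurrogate(S[i + 1]) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

var
  S: WideString;
begin
  // 'A' followed by U+1D538, which UTF-16 encodes as the pair $D835 $DD38.
  S := 'A';
  S := S + WideChar($D835) + WideChar($DD38);
  WriteLn(Length(S));           // 3 code units
  WriteLn(CodePointCount(S));   // 2 code points
end.
```

For text that stays inside the base plane the two numbers are equal, which is the case the linear-index argument relies on.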

Martin

Re: Unicode file routines proposal

Mattias Gaertner
In reply to this post by Martin Schreiber
On Tue, 1 Jul 2008 18:55:44 +0200
Martin Schreiber <[hidden email]> wrote:

> On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:
> > >
> > > In these routines length(widestring), widestring[index],
> > > pwidechar^, pwidechar[index], pwidechar + offset, pwidechar -
> > > pwidechar and inc(pwidechar)/dec(pwidechar) are used often. This
> > > can't be done with utf-8 strings.
> >
> > Ehm, do you know, that UTF-8 has the advantage, that many ascii
> > functions work without change?
> > For example ReplaceChar or searching a substring?
> >
> Sure, but for layout calculation and the like we need fast access to
> codepoints.

Can you point me to an example function, where this is critical?

Mattias

Re: Summary on Re: Unicode file routines proposal

Marco van de Voort
In reply to this post by Marc Weustink
> Florian Klaempfl wrote:
>
> [..some of my thoughts..]
>
> this suits a construct I saw somewhere:
>
> type
>    SomeString = type String(CP_KOI8);

This isn't the case with Florian's type? Because the first copy from a
source with another encoding would force it to the encoding of the source?

Re: Unicode file routines proposal

Martin Schreiber
In reply to this post by Mattias Gaertner
On Wednesday 02 July 2008 09.32:17 Mattias Gaertner wrote:

> On Tue, 1 Jul 2008 18:55:44 +0200
>
> Martin Schreiber <[hidden email]> wrote:
> > On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:
> > > > In these routines length(widestring), widestring[index],
> > > > pwidechar^, pwidechar[index], pwidechar + offset, pwidechar -
> > > > pwidechar and inc(pwidechar)/dec(pwidechar) are used often. This
> > > > can't be done with utf-8 strings.
> > >
> > > Ehm, do you know, that UTF-8 has the advantage, that many ascii
> > > functions work without change?
> > > For example ReplaceChar or searching a substring?
> >
> > Sure, but for layout calculation and the like we need fast access to
> > codepoints.
>
> Can you point me to an example function, where this is critical?
>
For example lib/common/kernel/msedrawtext.pas:223, procedure layouttext.

Martin

Re: Unicode file routines proposal

Mattias Gaertner
Quoting Martin Schreiber <[hidden email]>:

> On Wednesday 02 July 2008 09.32:17 Mattias Gaertner wrote:
> > On Tue, 1 Jul 2008 18:55:44 +0200
> >
> > Martin Schreiber <[hidden email]> wrote:
> > > On Tuesday 01 July 2008 18.32:30 Mattias Gärtner wrote:
> > > > > In these routines length(widestring), widestring[index],
> > > > > pwidechar^, pwidechar[index], pwidechar + offset, pwidechar -
> > > > > pwidechar and inc(pwidechar)/dec(pwidechar) are used often. This
> > > > > can't be done with utf-8 strings.
> > > >
> > > > Ehm, do you know, that UTF-8 has the advantage, that many ascii
> > > > functions work without change?
> > > > For example ReplaceChar or searching a substring?
> > >
> > > Sure, but for layout calculation and the like we need fast access to
> > > codepoints.
> >
> > Can you point me to an example function, where this is critical?
> >
> For example lib/common/kernel/msedrawtext.pas:223, procedure layouttext.

Nice code.
As far as I can see, it handles tabs, linebreaks, c_softhyphen and charwidth. It
uses single array element per character optimizations, like the charwidths
array.
I think I would simply keep that and define the characterwidth for the follow up
elements as 0.
Then you only need to change the places where you check for the c_softhyphen.
And because this is a constant you can even use some tricks here.
I don't see how this has a big impact on the performance.
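
The idea can be sketched roughly as follows, assuming the per-element width array is kept and simply indexed by byte instead of by character. Everything here is hypothetical: FixedWidth stands in for a real font metric lookup, and none of the names come from msedrawtext.pas.

```pascal
program Utf8Widths;
{$mode objfpc}{$H+}

type
  TWidthArray = array of Integer;

const
  FixedWidth = 8;   // made-up width; a real layouter would ask the font

// One array element per byte: the lead byte of each code point carries
// the character width, every follow-up (continuation) byte gets 0.
function Utf8ByteWidths(const S: AnsiString): TWidthArray;
var
  i: Integer;
begin
  SetLength(Result, Length(S));
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) = $80 then
      Result[i - 1] := 0
    else
      Result[i - 1] := FixedWidth;
end;

// Summing over bytes still yields the width of the visible characters.
function TotalWidth(const S: AnsiString): Integer;
var
  W: TWidthArray;
  i: Integer;
begin
  W := Utf8ByteWidths(S);
  Result := 0;
  for i := 0 to High(W) do
    Inc(Result, W[i]);
end;

begin
  // 'Gärtner' in UTF-8: 8 bytes, 7 code points, so 7 * 8 = 56.
  WriteLn(TotalWidth('G'#$C3#$A4'rtner'));   // 56
end.
```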


Mattias


Re: Unicode file routines proposal

Martin Schreiber
On Wednesday 02 July 2008 11.08:31 Mattias Gärtner wrote:

> Quoting Martin Schreiber <[hidden email]>:
> > For example lib/common/kernel/msedrawtext.pas:223, procedure layouttext.
>
> Nice code.
> As far as I can see, it handles tabs, linebreaks, c_softhyphen and
> charwidth. It uses single array element per character optimizations, like
> the charwidths array.
> I think I would simply keep that and define the characterwidth for the
> follow up elements as 0.
> Then you only need to change the places where you check for the
> c_softhyphen. And because this is a constant you can even use some tricks
> here.
> I don't see how this has a big impact on the performance.
>
The code is complicated enough, don't you think? ;-)

Martin

Re: Unicode file routines proposal

Mattias Gaertner
Quoting Martin Schreiber <[hidden email]>:

> On Wednesday 02 July 2008 11.08:31 Mattias Gärtner wrote:
> > Quoting Martin Schreiber <[hidden email]>:
> > > For example lib/common/kernel/msedrawtext.pas:223, procedure layouttext.
> >
> > Nice code.
> > As far as I can see, it handles tabs, linebreaks, c_softhyphen and
> > charwidth. It uses single array element per character optimizations, like
> > the charwidths array.
> > I think I would simply keep that and define the characterwidth for the
> > follow up elements as 0.
> > Then you only need to change the places where you check for the
> > c_softhyphen. And because this is a constant you can even use some tricks
> > here.
> > I don't see how this has a big impact on the performance.
> >
> The code is complicated enough, don't you think? ;-)

Ah, sorry, now I understand.
You meant, the performance penalty of the *programmer* is not negligible
comparing ASCII/UCS-2 and UTF-8/16.
Yes, it's true, that for layout code it is often better (readability, maintenance)
to refactor the code first, before extending it for UTF. Especially if it is
optimized code.
BTW, where is the layout code that handles Right-To-Left, kerning and sub pixel
rendering?


Mattias


Re: Unicode file routines proposal

Martin Schreiber
On Wednesday 02 July 2008 12.44:46 Mattias Gärtner wrote:
> > > I don't see how this has a big impact on the performance.
> >
> > The code is complicated enough, don't you think? ;-)
>
> Ah, sorry, now I understand.
> You meant, the performance penalty of the *programmer* is not negligible
> comparing ASCII/UCS-2 and UTF-8/16.

:-))))

> Yes, it's true, that for layout code it is often better (readability,
> maintenance) to refactor the code first, before extending it for UTF.
> Especially if it is optimized code.
> BTW, where is the layout code that handles Right-To-Left, kerning and sub
> pixel rendering?

Not supported. For WYSIWYG layout calculations of TrueType and PostScript
printer fonts a scaled up shadow font is used for emulation of sub pixel
placement.
Full Unicode handling is much more than having the ability to encode all
possible Unicode code points...

Martin