Weird string behavior


Re: Weird string behavior

Petr Kohut
Hello,
here are results:

...
begin
   Writeln('--------------');
   Writeln;

   s1 := 'A';   // 1250
   s2 := 'Aä';  // 1250
   Writeln('s1 = "', s1, '" cp = ', StringCodePage(s1));
   Writeln('s2 = "', s2, '" cp = ', StringCodePage(s2));
   r1 := AnsiToUTF8(s1); // 65001
   r2 := AnsiToUTF8(s2); // 65001
   Writeln('r1 = "', r1, '" cp = ', StringCodePage(r1));
   Writeln('r2 = "', r2, '" cp = ', StringCodePage(r2));

   r3 := s1 + r2; // 1250
   Writeln('r3 = "', r3, '" cp = ', StringCodePage(r3));
   r3 := r1 + s2; // 65001
   Writeln('r3 = "', r3, '" cp = ', StringCodePage(r3));

   s3 := s1 + r2; // 1250
   Writeln('s3 = "', s3, '" cp = ', StringCodePage(s3));
   s3 := r1 + s2; // 65001
   Writeln('s3 = "', s3, '" cp = ', StringCodePage(s3));

   SetCodePage(RawByteString(s1), 65001, True);

   r3 := s1 + r2; // 65001
   Writeln('r3="', r3, '" cp=', StringCodePage(r3));
   r3 := r1 + s2; // 65001
   Writeln('r3="', r3, '" cp=', StringCodePage(r3));

   s3 := s1 + r2; // 65001
   Writeln('s3="', s3, '" cp=', StringCodePage(s3));
   s3 := r1 + s2; // 65001
   Writeln('s3="', s3, '" cp=', StringCodePage(s3));

   Readln;
end.

(*
--------------

s1 = "A" cp = 1250
s2 = "Aä" cp = 1250
r1 = "A" cp = 65001
r2 = "Aä" cp = 65001
r3 = "AAä" cp = 1250
r3 = "AA?" cp = 65001
s3 = "AAä" cp = 1250
s3 = "AA?" cp = 65001
r3="AAä" cp=65001
r3="AA?" cp=65001
s3="AAä" cp=65001
s3="AA?" cp=65001

*)



------ Original message ------
From: "Jonas Maebe" <[hidden email]>
To: "FPC-Pascal users discussions" <[hidden email]>
Sent: 23.07.2016 13:03:33
Subject: Re: [fpc-pascal] Weird string behavior

>On 23/07/16 12:58, [hidden email] wrote:
>>  On 07/23/2016 06:13 AM, Jonas Maebe wrote:
>>  [...]
>>>  var
>>>    s1,s2,s3: AnsiString;
>>>    r1,r2,r3: RawByteString;
>>>  begin
>>>    s1:='A';   // 1252
>>>    s2:='Aä';  // 1252
>>>    writeln('s1="',s1,'" cp=',StringCodePage(s1));
>>>    writeln('s2="',s1,'" cp=',StringCodePage(s2));
>>
>>  writeln('s2="',s2,'" cp=',StringCodePage(s2));
>>
>>
>>  you're not the only one to have missed that...
>
>The only thing that matters for this test is the stringcodepage value,
>which is the correct one.
>
>
>Jonas
>_______________________________________________
>fpc-pascal maillist - [hidden email]
>http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: Weird string behavior

Jonas Maebe-2
On 23/07/16 13:31, Petr Kohut wrote:
> Hello,
> here are results:

Thanks a lot. Could you test one more? I think I will have all
information I need then.


Jonas

{$APPTYPE CONSOLE}

type
   tcp866 = type ansistring(866);
var
   s1, s2, s3: tcp866;
begin
   s1:='abc';
   setcodepage(rawbytestring(s1),65001,false);
   s2:='def';
   setcodepage(rawbytestring(s2),437,false);
   s3:=s1+s2;
   Writeln('DefaultSystemCodePage = ',DefaultSystemCodePage);
   Writeln('s3 = "', s3, '" cp = ', StringCodePage(s3));
   Readln;
end.



Re: Weird string behavior

Mattias Gaertner
On Mon, 25 Jul 2016 22:25:59 +0200
Jonas Maebe <[hidden email]> wrote:

> On 23/07/16 13:31, Petr Kohut wrote:
> > Hello,
> > here are results:  
>
> Thanks a lot. Could you test one more? I think I will have all
> information I need then.
>
>
> Jonas
>
> {$APPTYPE CONSOLE}
>
> type
>    tcp866 = type ansistring(866);
> var
>    s1, s2, s3: tcp866;
> begin
>    s1:='abc';
>    setcodepage(rawbytestring(s1),65001,false);
>    s2:='def';
>    setcodepage(rawbytestring(s2),437,false);
>    s3:=s1+s2;
>    Writeln('DefaultSystemCodePage = ',DefaultSystemCodePage);
>    Writeln('s3 = "', s3, '" cp = ', StringCodePage(s3));
>    Readln;
> end.

DefaultSystemCodePage = 1252
s3 = "abcdef" cp = 65001

Mattias

Re: Weird string behavior

Jonas Maebe-2
On 25/07/16 23:07, Mattias Gaertner wrote:
> DefaultSystemCodePage = 1252
> s3 = "abcdef" cp = 65001

Thanks. So the rule for concatenation appears to be:
* the dynamic code page of the result of a string concatenation is that
of the left operand (except if it's an empty string, then it's that of
the right operand)
* the declared code page of the final concatenation result is that of
the left operand

You then process the assignment as if you are assigning a string with
the above declared/dynamic code page to whatever you are assigning the
result of the concatenation to (which means no code page conversion in
case the declared code pages match, like in the above case).

That's indeed not what FPC does currently. It's mainly complicated by
the fact that FPC contains optimised helpers to avoid the final
assignment if possible and directly concatenate into the destination
when possible (those helpers right now only get passed the declared code
page of the final destination, and not that of the concatenation, and
hence virtually always convert the result to the dynamic code page of
the destination at this time -- there are some exceptions for rawbytestring)
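
A minimal sketch for checking the concatenation rule stated above, including the empty-left-operand case. It only reuses calls already seen in this thread (SetCodePage, StringCodePage); the variable names are illustrative, and no output is shown because the result depends on the compiler (Delphi versus the current FPC behaviour described here).

```pascal
{$APPTYPE CONSOLE}

type
   tcp866 = type ansistring(866);
var
   lhs, rhs, none, res: tcp866;
begin
   lhs:='abc';
   setcodepage(rawbytestring(lhs),65001,false); // retag only, bytes are not converted
   rhs:='def';
   setcodepage(rawbytestring(rhs),437,false);
   none:='';

   res:=lhs+rhs;  // the rule predicts the dynamic code page of lhs (65001)
   Writeln('lhs+rhs  cp = ', StringCodePage(res));
   res:=rhs+lhs;  // the rule predicts the dynamic code page of rhs (437)
   Writeln('rhs+lhs  cp = ', StringCodePage(res));
   res:=none+rhs; // empty left operand: the rule predicts that of rhs (437)
   Writeln('none+rhs cp = ', StringCodePage(res));
end.
```

Only the code pages are printed, so the console encoding does not influence what is shown.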


Jonas

Re: Weird string behavior

Mattias Gaertner
On Mon, 25 Jul 2016 23:23:23 +0200
Jonas Maebe <[hidden email]> wrote:

> On 25/07/16 23:07, Mattias Gaertner wrote:
> > DefaultSystemCodePage = 1252
> > s3 = "abcdef" cp = 65001  
>
> Thanks. So the rule for concatenation appears to be:
> * the dynamic code page of the result of a string concatenation is that
> of the left operand (except if it's an empty string, then it's that of
> the right operand)
> * the declared code page of the final concatenation result is that of
> the left operand

Here are some more hints:

{$APPTYPE CONSOLE}

type
   tcp866 = type ansistring(866);

var
   s1, s2: tcp866;
   u1: UTF8String;
   r1: RawByteString;
begin
   s1:='abc';
   setcodepage(rawbytestring(s1),65001,false);
   Writeln('s1 = "', s1, '" cp = ', StringCodePage(s1));
   u1:='nop';
   Writeln('u1 = "', u1, '" cp = ', StringCodePage(u1));
   s2:=s1+u1;
   Writeln('s2 = "', s2, '" cp = ', StringCodePage(s2));
   s2:=u1+s1;
   Writeln('s2 = "', s2, '" cp = ', StringCodePage(s2));
   r1:=s1+u1;
   Writeln('r1 = "', r1, '" cp = ', StringCodePage(r1));
   r1:=u1+s1;
   Writeln('r1 = "', r1, '" cp = ', StringCodePage(r1));
   readln;
end.

s1 = "abc" cp = 65001
u1 = "nop" cp = 65001
s2 = "abcnop" cp = 866
s2 = "nopabc" cp = 866
r1 = "abcnop" cp = 1252
r1 = "nopabc" cp = 1252


Mattias

Re: Weird string behavior

Jonas Maebe-2

Mattias Gaertner wrote on Tue, 26 Jul 2016:

> On Mon, 25 Jul 2016 23:23:23 +0200
> Jonas Maebe <[hidden email]> wrote:
>
>> Thanks. So the rule for concatenation appears to be:
>> * the dynamic code page of the result of a string concatenation is that
>> of the left operand (except if it's an empty string, then it's that of
>> the right operand)
>> * the declared code page of the final concatenation result is that of
>> the left operand
>
> Here are some more hints:

Could you try the same program with u1 as plain ansistring instead of  
utf8string? (with an additional  
"setcodepage(rawbytestring(u1),65001,false);" after assigning u1)

Thanks,


Jonas

Re: Weird string behavior

Mattias Gaertner
On Tue, 26 Jul 2016 11:01:28 +0200
Jonas Maebe <[hidden email]> wrote:

>[...]
> Could you try the same program with u1 as plain ansistring instead of  
> utf8string? (with an additional  
> "setcodepage(rawbytestring(u1),65001,false);" after assigning u1)

Sure:

{$APPTYPE CONSOLE}

type
   tcp866 = type ansistring(866);

var
   s1, s2: tcp866;
   u1: UTF8String;
   r1: RawByteString;
   a1, a2: AnsiString;
begin
   s1:='cp866';
   setcodepage(rawbytestring(s1),65001,false);
   Writeln('s1 = "', s1, '" cp = ', StringCodePage(s1));
   a1:='acp';
   setcodepage(rawbytestring(a1),65001,false);
   Writeln('a1 = "', a1, '" cp = ', StringCodePage(a1));
   u1:='utf8';
   Writeln('u1 = "', u1, '" cp = ', StringCodePage(u1));

   s2:=s1+u1;
   Writeln('s2:=s1+u1 = "', s2, '" cp = ', StringCodePage(s2));
   s2:=u1+s1;
   Writeln('s2:=u1+s1 = "', s2, '" cp = ', StringCodePage(s2));

   r1:=s1+u1;
   Writeln('r1:=s1+u1 = "', r1, '" cp = ', StringCodePage(r1));
   r1:=u1+s1;
   Writeln('r1:=u1+s1 = "', r1, '" cp = ', StringCodePage(r1));

   a2:=s1+u1;
   Writeln('a2:=s1+u1 = "', a2, '" cp = ', StringCodePage(a2));
   a2:=u1+s1;
   Writeln('a2:=u1+s1 = "', a2, '" cp = ', StringCodePage(a2));

   s2:=s1+a1;
   Writeln('s2:=s1+a1 = "', s2, '" cp = ', StringCodePage(s2));
   s2:=a1+s1;
   Writeln('s2:=a1+s1 = "', s2, '" cp = ', StringCodePage(s2));

   r1:=s1+a1;
   Writeln('r1:=s1+a1 = "', r1, '" cp = ', StringCodePage(r1));
   r1:=a1+s1;
   Writeln('r1:=a1+s1 = "', r1, '" cp = ', StringCodePage(r1));

   a2:=s1+a1;
   Writeln('a2:=s1+a1 = "', a2, '" cp = ', StringCodePage(a2));
   a2:=a1+s1;
   Writeln('a2:=a1+s1 = "', a2, '" cp = ', StringCodePage(a2));

   readln;
end.


s1 = "cp866" cp = 65001
a1 = "acp" cp = 65001
u1 = "utf8" cp = 65001
s2:=s1+u1 = "cp866utf8" cp = 866
s2:=u1+s1 = "utf8cp866" cp = 866
r1:=s1+u1 = "cp866utf8" cp = 1252
r1:=u1+s1 = "utf8cp866" cp = 1252
a2:=s1+u1 = "cp866utf8" cp = 1252
a2:=u1+s1 = "utf8cp866" cp = 1252
s2:=s1+a1 = "cp866acp" cp = 866
s2:=a1+s1 = "acpcp866" cp = 866
r1:=s1+a1 = "cp866acp" cp = 1252
r1:=a1+s1 = "acpcp866" cp = 1252
a2:=s1+a1 = "cp866acp" cp = 1252
a2:=a1+s1 = "acpcp866" cp = 1252

It seems the Delphi rules for non rawbytestrings are:
- Concatenate two same declared strings: append bytes, copy dyn. cp
  from left operand. Declared cp of result is left operand.
- Assign same declared: no conversion, only refcount.
- Concatenate two different declared strings: convert both to
  UnicodeString and append. Maybe there is an optimization for same dyn
  cp.
- Assign different declared strings: convert to LHS.
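
The two assignment rules in the list above can be probed with a small sketch like the following. It is an illustration only, not code from the thread: the variable names are made up, and the pointer comparison is just one assumed way to see whether an assignment copied anything.

```pascal
{$APPTYPE CONSOLE}

type
   tcp866 = type ansistring(866);
var
   s1, s2: tcp866;
   u1: UTF8String;
begin
   s1:='abc';
   setcodepage(rawbytestring(s1),1252,false); // dynamic cp 1252, declared cp 866

   s2:=s1; // same declared type: the rule says no conversion, only a refcount increase
   Writeln('s2 cp = ', StringCodePage(s2),
           ', shares buffer with s1: ', pointer(s2)=pointer(s1));

   u1:=s1; // different declared type: the rule says convert to the code page of u1
   Writeln('u1 cp = ', StringCodePage(u1));
end.
```

If the rules hold, s2 should keep the 1252 tag and share its buffer with s1, while u1 should report 65001.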


Mattias

Re: Weird string behavior

Jonas Maebe-2

Mattias Gaertner wrote on Tue, 26 Jul 2016:

> It seems the Delphi rules for non rawbytestrings are:
> - Concatenate two same declared strings: append bytes, copy dyn. cp
>   from left operand. Declared cp of result is left operand.

Are you sure it's "append bytes" here and not "append bytes if same  
dyn cp, otherwise convert to unicodestring, concatenate, and convert  
back to dyn cp of left operand"?
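
One way to tell the two hypotheses apart is to concatenate strings that have the same declared type but different dynamic code pages and contain a non-ASCII character, then dump the raw bytes of the result. The sketch below is a hypothetical test, not one posted in this thread. It assumes a Windows system whose default ANSI code page is 1252, and the DumpBytes helper is invented for illustration.

```pascal
{$APPTYPE CONSOLE}

uses
   SysUtils;

// Hypothetical helper: print the dynamic code page and the raw bytes of a string.
procedure DumpBytes(const name: string; const s: RawByteString);
var
   i: Integer;
begin
   Write(name, ' cp = ', StringCodePage(s), ' bytes =');
   for i:=1 to Length(s) do
      Write(' ', IntToHex(Ord(s[i]),2));
   Writeln;
end;

var
   a1, a2, a3: AnsiString;
begin
   // "ä" as a single byte, left in the default ANSI code page (assumed 1252)
   SetLength(a1,1);
   a1[1]:=AnsiChar($E4);

   // "ä" as UTF-8 bytes, then retagged as 65001 without conversion
   SetLength(a2,2);
   a2[1]:=AnsiChar($C3);
   a2[2]:=AnsiChar($A4);
   setcodepage(rawbytestring(a2),65001,false);

   // A plain byte append would give E4 C3 A4 (and C3 A4 E4 for the reverse);
   // conversion to the left operand's code page would give E4 E4 (and C3 A4 C3 A4).
   a3:=a1+a2;
   DumpBytes('a1+a2', a3);
   a3:=a2+a1;
   DumpBytes('a2+a1', a3);
end.
```

The bytes are written directly rather than through string literals so that the source file encoding cannot affect the test.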


Jonas

Re: Weird string behavior

Mattias Gaertner
On Tue, 26 Jul 2016 12:03:21 +0200
Jonas Maebe <[hidden email]> wrote:

> Mattias Gaertner wrote on Tue, 26 Jul 2016:
>
> > It seems the Delphi rules for non rawbytestrings are:
> > - Concatenate two same declared strings: append bytes, copy dyn. cp
> >   from left operand. Declared cp of result is left operand.  
>
> Are you sure it's "append bytes" here and not "append bytes if same  
> dyn cp, otherwise convert to unicodestring, concatenate, and convert  
> back to dyn cp of left operand"?

Now I'm sure. See attachments.


Mattias


Attachments: cp_test2.dpr (546 bytes), cp_test2_result.txt (135 bytes)

Re: Weird string behavior

Santiago A.
On 26/07/2016 at 12:27, Mattias Gaertner wrote:
> a3:=a1+a2 => cp = 1252
> a3:=a2+a1 => cp = 65001
Is that the expected behavior?

IMHO the result should be the same. And the only way is to make it
depend on a3, no matter what is in the left side. That's the way things
are done in Pascal


--
Regards

Santi
[hidden email]


Re: Weird string behavior

Michael Van Canneyt


On Tue, 26 Jul 2016, Santiago A. wrote:

> On 26/07/2016 at 12:27, Mattias Gaertner wrote:
>> a3:=a1+a2 => cp = 1252
>> a3:=a2+a1 => cp = 65001
> Is that the expected behavior?
>
> IMHO the result should be the same. And the only way is to make it
> depend on a3, no matter what is in the left side. That's the way things
> are done in Pascal

This is not correct. In pascal the right-hand side of an assignment has a well-defined type.
The compiler checks whether the type on the right is assignment-compatible to the left side.

Michael.

Re: Weird string behavior

Jonas Maebe-2
In reply to this post by Santiago A.

Santiago A. wrote on Tue, 26 Jul 2016:

> On 26/07/2016 at 12:27, Mattias Gaertner wrote:
>> a3:=a1+a2 => cp = 1252
>> a3:=a2+a1 => cp = 65001
> Is that the expected behavior?
>
> IMHO the result should be the same. And the only way is to make it
> depend on a3, no matter what is in the left side. That's the way things
> are done in Pascal

String concatenations are not commutative operations, even if no code  
pages are involved (even in Pascal). I think it is logical that the  
code page of the left operand is kept, because it is the "base string"  
to which you add other data.

The fact that the concatenated data is not converted in this scenario  
makes some sense in the context of the "same declared code page -> no  
conversion" rule that is also used for assignments, but it's something  
I'm less happy about. Maintaining multiple helpers, compiler  
behaviours and documentation is even less enticing though, so I will  
implement the Delphi behaviour. It will also slightly speed up string  
concatenations when all strings have the same declared code page,  
because no checks need to be performed at run time regarding whether  
or not data needs to be converted.


Jonas

Re: Weird string behavior

Santiago A.
In reply to this post by Michael Van Canneyt
On 26/07/2016 at 16:19, Michael Van Canneyt wrote:
>
>
> On Tue, 26 Jul 2016, Santiago A. wrote:
>
>> On 26/07/2016 at 12:27, Mattias Gaertner wrote:
>>> a3:=a1+a2 => cp = 1252
>>> a3:=a2+a1 => cp = 65001
>> Is that the expected behavior?
>>
>> IMHO the result should be the same. And the only way is to make it
>> depend on a3, no matter what is in the left side. That's the way things
>> are done in Pascal
>
> This is not correct. In pascal the right-hand side of an assignment
> has a well-defined type. The compiler checks whether the type on the
> right is assignment-compatible to the left side.

Sorry, I meant no matter what is in the right side. Otherwise my statement
makes no sense ;-)

> Michael.


-- 
Regards

Santi
[hidden email]


Re: Weird string behavior

Michael Van Canneyt


On Tue, 26 Jul 2016, Santiago A. wrote:

> On 26/07/2016 at 16:19, Michael Van Canneyt wrote:
>>
>>
>> On Tue, 26 Jul 2016, Santiago A. wrote:
>>
>>> On 26/07/2016 at 12:27, Mattias Gaertner wrote:
>>>> a3:=a1+a2 => cp = 1252
>>>> a3:=a2+a1 => cp = 65001
>>> Is that the expected behavior?
>>>
>>> IMHO the result should be the same. And the only way is to make it
>>> depend on a3, no matter what is in the left side. That's the way things
>>> are done in Pascal
>>
>> This is not correct. In pascal the right-hand side of an assignment
>> has a well-defined type. The compiler checks whether the type on the
>> right is assignment-compatible to the left side.
> Sorry, I meant no matter what is in the right side. Otherwise my statement
> makes no sense ;-)
Well, I'm also not sure what left/right sides you are referring to:

left/right of := or left/right of +

But as long as Jonas is on top of things, all will be well =-)

Michael.

Re: Weird string behavior

Michael Schnell
In reply to this post by Michael Van Canneyt
On 07/26/2016 04:19 PM, Michael Van Canneyt wrote:
>
> This is not correct. In pascal the right-hand side of an assignment
> has a well-defined type. The compiler checks whether the type on the
> right is assignment-compatible to the left side.

Hmm.

if you do

x := y + z;

with x a real and y and z integers, the type of x will not change to be
an integer, but the value will be converted.
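
The numeric case described above can be written out as a complete program. This is only an illustration of the analogy; the concrete values are arbitrary.

```pascal
program RealFromIntegers;

var
   x: Real;
   y, z: Integer;
begin
   y:=2;
   z:=3;
   x:=y+z;         // integer addition; the result is converted to Real on assignment
   Writeln(x:0:1); // prints 5.0
end.
```

The declared type of x stays Real throughout; only the value of the right-hand side is converted.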


Now I understand that with strings the encoding is a kind of "sub-type"
and hence (usually) static and not convertible, which allows the compiler
to decide whether a conversion is necessary.

This was discussed a long time ago, and the argument was that
_fully_ dynamically typed strings are too costly in terms of CPU demand.

I was not aware that this design decision had been changed for
the normal usage case (although there seem to be ways to use certain kinds
of strings in a fully dynamic way).

Changing the encoding of the left-side operand of ":=" would only be
logical if the encoding is never an attribute of the string's type but
always a dynamic attribute of the string's content.

-Michael.

Re: Weird string behavior

Santiago A.
On 27/07/2016 at 16:10, Michael Schnell wrote:

> On 07/26/2016 04:19 PM, Michael Van Canneyt wrote:
>>
>> This is not correct. In pascal the right-hand side of an assignment
>> has a well-defined type. The compiler checks whether the type on the
>> right is assignment-compatible to the left side.
>
> Hmm.
>
> if you do
>
> x := y + z;
>
> with x a real and y and z integers, the type of x will not change to
> be an integer, but the value will be converted.
>
>
> Now I understand that with strings the encoding is a kind of
> "sub-type" and hence (usually) static and not convertible, which allows
> the compiler to decide whether a conversion is necessary.
>
> This was discussed a long time ago, and the argument was that
> _fully_ dynamically typed strings are too costly in terms of CPU demand.
>
> I was not aware that this design decision had been changed for
> the normal usage case (although there seem to be ways to use certain
> kinds of strings in a fully dynamic way).
>
> Changing the encoding of the left-side operand of ":=" would only be
> logical if the encoding is never an attribute of the string's type but
> always a dynamic attribute of the string's content.

And what are the rules for changing the left-side operand? It looks like
they are a little complicated.

Free Pascal needed code pages, so a string with a code page was needed.
Should it need a dynamic code page for backward compatibility? IMHO, no.
String should have been an alias of rawbytestring, and "codepage-aware
strings" should have been a separate new type, but with a static code page.

Legacy programs could then compile and run perfectly, and you could start
using the codepage-aware string type.

Automatic conversions? Well, I'm not for them, but in any case the left side
shouldn't change its code page.

Nevertheless, that's my two cents. It looks like there is some pressure
to be Delphi XX compatible. I left Delphi a long, long time ago (Delphi 5),
so these compatibility issues are not on my radar.

--
Regards

Santiago A.


Re: Weird string behavior

Michael Schnell
On 07/28/2016 11:38 AM, Santiago A. wrote:
> And what are the rules for changing left side operand?
Extremely hard to define when trying to follow any decent logic, unless
the decision is either "never" (i.e. strictly static typing) or
"always" (strictly dynamic typing).

A way out could be to make all "normal" string types behave with
strictly static typing (like any other type in Pascal) and have a single
dedicated string type (intended to be used wherever appropriate, e.g. in
"TStrings") behave with strictly dynamic typing.

For analysis and suggestions, see:
http://wiki.lazarus.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support

-Michael
