String theory


Tony Whyman
My first thought on the "String Type" and "End of World" threads was that
this was another "how many angels on a pinhead" type of discussion.
However, having worked through it, I believe that there is an issue here
and that Pascal could be improved by including (for string types) the code
page as part of the string data itself rather than having to infer it.

As a programmer, I want the freedom to choose the appropriate
character encoding for my application - or even to mix encodings in the
same application.

- I would always choose UTF-8 for database columns as that is the best
compromise between international support and compact encoding (and hope
that my RDBMS was not so dumb as to allocate four times the max
character width for every UTF-8 string).

- If I was doing a lot of CPU-intensive processing of strings
with international support then UTF-16 is what I would want to use for
internal representation - as long as the cost of UTF-8 to UTF-16
conversion was justified when reading/writing to disk.

- On the other hand, if I am working on an in-house application that I
know is always going to be working in English (or Western Europe) then
use of a National Character set (or more likely ISO 8859-1) seems the
obvious choice.

Pascal does seem to support what I want. It has the unicodestring type
for UTF-16 and the string type (with code page) for UTF-8 and national
character sets. However, the problem is that Pascal (or FPC) permits an
ambiguity between the use of UTF-8 and national character sets.

If your program is in English and your data is in English then UTF-8 and
Ansistrings (or even different 8-bit code pages) look the same, and it is
very easy to get sloppy, use the basic string type all over the place,
and get very confused as to what your string code page really is. The
whole thing then just falls apart when you try and internationalise it.

I would argue that this problem would be avoided if the code page was
part of the string data (just as the byte count is already) and that
strings defined without an explicit code page could have a string with
any code page assigned to them, while strings with an explicit code page
as part of their type could only be assigned a string of that code page
(perhaps with automatic conversion on assignment from another code
page). Also, byte length and character length could then be returned by
standard routines.
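
To make the byte length versus character length distinction concrete, here is a minimal sketch of what FPC 3.x reports today (an illustration, not part of the original message). It assumes FPC 3.0 or later; the program name is hypothetical, and the `cwstring` unit is only there to install a widestring manager on Unix. Note that `Length` counts code units (bytes for an AnsiString, UTF-16 code units for a UnicodeString), not user-perceived characters.

```pascal
program lengthsketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager on Unix (assumption: FPC 3.x)
  SysUtils;

var
  u: UnicodeString;
  s8: UTF8String;
begin
  // Build "cafe" with an acute accent on the e without relying on the
  // encoding of this source file.
  u := UnicodeString('caf') + WideChar($00E9);
  s8 := UTF8Encode(u);

  // The accented letter takes two bytes in UTF-8 but one UTF-16 code unit.
  WriteLn('UTF-8 bytes:       ', Length(s8));
  WriteLn('UTF-16 code units: ', Length(u));
end.
```

For this particular string the two counts differ by one, which is exactly the kind of difference that goes unnoticed while all data is plain ASCII.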

This is in contrast to the current situation where strings without an
explicit code page setting are simply assumed to use the
DefaultSystemCodePage with limited run time checking (often none).

Indeed, if the code page was part of the string data, then the "string"
type should be able to unify both wide string and ansistrings.


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: String theory

Bart-48
On 5/10/16, Tony Whyman <[hidden email]> wrote:
> .. Pascal could be improved by including (for string types) the code
> page as part of the string data itself rather than having to infer it.

It already is part of the string type.
See the StringCodePage function.

Bart

Re: String theory

Tony Whyman
I don't think this is what I meant as StringCodePage is a unicode string
function. I am looking at the single byte string types.

On 10/05/16 14:15, Bart wrote:
> It already is part of the string type.
> See the StringCodePage function.


Re: String theory

Jürgen Hestermann


On 2016-05-10 at 17:48, Tony Whyman wrote:
> I don't think this is what I meant as StringCodePage is a unicode string
> function. I am looking at the single byte string types.
>
> On 10/05/16 14:15, Bart wrote:
>> It already is part of the string type.
>> See the StringCodePage function.


Code pages are not restricted to Unicode.
They can be others too (although a non-Unicode code page should only be used if Unicode is not an option for some reason).
AnsiString is a single-byte string type and can hold non-Unicode code pages.
From http://wiki.freepascal.org/FPC_Unicode_support:
-----------------------------------------------------------------------------------------------------------
Shortstring

The code page of a shortstring is implicitly CP_ACP and hence will always be equal to the current value of DefaultSystemCodePage.

PAnsiChar/AnsiChar

These types are the same as the old PChar/Char types. In all compiler modes except for {$mode delphiunicode}, PChar/Char are also still aliases for PAnsiChar/AnsiChar. Their code page is implicitly CP_ACP and hence will always be equal to the current value of DefaultSystemCodePage.

PWideChar/PUnicodeChar and WideChar/UnicodeChar

These types remain unchanged. WideChar/UnicodeChar can contain a single UTF-16 code unit, while PWideChar/PUnicodeChar point to a single or an array of UTF-16 code units.

In {$mode delphiunicode}, PChar becomes an alias for PWideChar/PUnicodeChar and Char becomes an alias for WideChar/UnicodeChar.

UnicodeString/WideString

These types behave the same as in previous versions:

  • Widestring is the same as a "COM BSTR" on Windows, and an alias for UnicodeString on all other platforms. Its string data is encoded using UTF-16.
  • UnicodeString is a reference-counted string with a maximum length of high(SizeInt) UTF-16 code units.

Ansistring

AnsiStrings are reference-counted types with a maximum length of high(SizeInt) bytes. Additionally, they now also have code page information associated with them.

The most important thing to understand about the new AnsiString type is that it both has a declared/static/preferred/default code page (called declared code page from now on), and a dynamic code page. The declared code page tells the compiler that when assigning something to that AnsiString, it should first convert the data to that declared code page (except if it is CP_NONE, see RawByteString below). The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual code page of the data currently held by that AnsiString.
-----------------------------------------------------------------------------------------------------------

with

CP_ACP: this value represents the currently set "default system code page". See #Code page settings for more information.
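
The declared versus dynamic code page distinction described in the quoted wiki text can be sketched as follows (an illustration, not part of the original message). It assumes FPC 3.0 or later. `TLatin1String` and the program name are hypothetical names introduced here; 28591 is the code page number for ISO 8859-1. On Unix the conversion relies on the widestring manager installed by `cwstring`.

```pascal
program declaredvsdynamic;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // needed for code page conversion on Unix
  SysUtils;

type
  // Hypothetical type name: an AnsiString whose declared code page is ISO 8859-1.
  TLatin1String = type AnsiString(28591);

var
  u: UnicodeString;
  s8: UTF8String;
  l1: TLatin1String;
  raw: RawByteString;
begin
  u := UnicodeString('caf') + WideChar($00E9);
  s8 := UTF8Encode(u);

  // Different declared code pages: the compiler inserts a conversion.
  l1 := s8;

  // RawByteString is declared as CP_NONE: the bytes and the dynamic
  // code page of the source are kept as they are.
  raw := s8;

  WriteLn('l1:  code page ', StringCodePage(l1), ', bytes ', Length(l1));
  WriteLn('raw: code page ', StringCodePage(raw), ', bytes ', Length(raw));
end.
```

If the conversion is available on the platform, `l1` would be expected to report code page 28591 with one byte per character, while `raw` keeps the UTF-8 code page and byte count of `s8`.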


Re: String theory

Jonas Maebe-2
On 10/05/16 17:48, Tony Whyman wrote:
>
> On 10/05/16 14:15, Bart wrote:
>> It already is part of the string type.
>> See the StringCodePage function.
>
> I don't think this is what I meant as StringCodePage is a unicode
> string function. I am looking at the single byte string types.

StringCodePage() works for both ansistring and unicodestring.
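
A minimal sketch of that point, assuming FPC 3.0 or later (an illustration, not part of the original message; the program name is hypothetical). It calls `StringCodePage` on a UnicodeString and on a single-byte UTF8String and prints the numeric code page of each.

```pascal
program codepagequery;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager on Unix
  SysUtils;

var
  u: UnicodeString;
  s8: UTF8String;
begin
  u := UnicodeString('caf') + WideChar($00E9);
  s8 := UTF8Encode(u);

  // Same function name, resolved by overload for either string kind.
  WriteLn('UnicodeString code page: ', StringCodePage(u));
  WriteLn('UTF8String code page:    ', StringCodePage(s8));
end.
```

The first value should correspond to CP_UTF16 and the second to CP_UTF8, so the code page is already queryable at run time for single-byte strings as well.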


Jonas