Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK
Hi *,

Is there any smart way to read string data line by line from UCS2
encoded text files (lines delimited by $0A00)?

Using ReadLn(TextFile, UnicodeStringVariable) does not work, as per the
comment in text.inc:

// all standard input is assumed to be ansi-encoded

Reading into a WideChar variable does not work either.
Calling SetTextCodePage with CP_UTF16 did not help either.

I wonder whether Delphi supports ReadLn() for UTF-16 encoded text files ...?
Is there a way to add support for it in FPC? Maybe if
TextRec(T).CodePage=CP_UTF16 were set, the fpc_Read_Text_* procedures
could assume that the input file is UTF-16 encoded and not ANSI?

For now I work around it by reading two chars into an array[0..1] of Char
and then casting it to UnicodeChar.
(I know that this is not safe for UTF-16, but for UCS2 it works.)

Thanks

-Laco.

_______________________________________________
fpc-pascal maillist  -  [hidden email]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Bart-48
On Wed, Sep 4, 2019 at 7:46 AM LacaK <[hidden email]> wrote:

> is there any smart way how to read string data line by line from UCS2
> encoded text files (lines delimited by $0A00).

So, some LoadFromFile with a stream is no option for you?

> I wonder if Delphi supports ReadLn() for UTF-16 encoded text files ...?

From what I gather from the Embarcadero wiki and Google searches, it does not.
I only have D7, so I cannot test that myself though.

Seems you need to use LoadFromFile with a TEncoding specified, see:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Using_TEncoding_for_Unicode_Files

Bart

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Bart-48
Stupid and lazy workaround, probably not suitable for larger files.

{$mode objfpc}
{$h+}
uses
  sysutils;

type
  TUCS2TextFile = file of WideChar;

procedure ReadLine(var F: TUCS2TextFile; out S: UnicodeString);
var
  WC: WideChar;
begin
  // Assume the file is opened for reading
  S := '';
  while not Eof(F) do
  begin
    Read(F, WC);
    if WC = WideChar(#$000A) then
      exit
    else
      // Skip CR and the byte order mark (U+FEFF)
      if (WC <> WideChar(#$000D)) and (WC <> WideChar(#$FEFF)) then
        S := S + WC;
  end;
end;

var
  UFile: TUCS2TextFile;
  US: UnicodeString;
begin
  AssignFile(UFile, 'ucs2.txt');
  Reset(Ufile);
  while not Eof(UFile) do
  begin
    ReadLine(UFile, US);
    writeln('US = ',US);
  end;
  CloseFile(UFile);
end.

Outputs
US = Line1
US = Line2
US = Line3
which is correct for my test file (Unicode LE encoding created with Notepad).

--
Bart

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK
In reply to this post by Bart-48


>
>> is there any smart way how to read string data line by line from UCS2
>> encoded text files (lines delimited by $0A00).
> So, some LoadFromFile with a stream is no option for you?
It would be an option, but AFAIK LoadFromFile with an optional TEncoding
parameter is not part of FPC 3.0.4.

It is only in the upcoming 3.2.0 ...


>
>> I wonder if Delphi supports ReadLn() for UTF-16 encoded text files ...?
>  From what I gather from the Embarcadero wiki and google searches it does not.
> I only have D7 so I cannot test that myself though,
>
> Seems you need to use LoadFromFile with a TEncoding specified, see:
> http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Using_TEncoding_for_Unicode_Files

Yes, that was my impression too ... I was wondering whether there is
another way, ideally using ReadLn(): I could open the TextFile, read the
first 2-3 bytes (BOM) to detect which encoding the file has (UTF-8 or
UTF-16), and then use ReadLn with either an AnsiString (UTF-8 case) or a
UnicodeString (UTF-16 case).
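
The BOM check described above could look roughly like the following. This is only a sketch: the enumeration TTextEncodingKind and the function DetectBomKind are hypothetical names, not part of the RTL, and the file name 'ucs2.txt' is borrowed from Bart's example. It also does not distinguish a UTF-32 LE BOM (FF FE 00 00) from a UTF-16 LE one.

```pascal
program DetectBom;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical type, introduced only for this sketch
  TTextEncodingKind = (tekUnknown, tekUTF8, tekUTF16LE, tekUTF16BE);

// Inspects up to three leading bytes and reports which BOM, if any, is present.
function DetectBomKind(const FileName: string): TTextEncodingKind;
var
  Stream: TFileStream;
  Buf: array[0..2] of Byte;
  N: Integer;
begin
  Result := tekUnknown;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    N := Stream.Read(Buf, SizeOf(Buf));
    if (N >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
      Result := tekUTF8
    else if (N >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
      Result := tekUTF16LE
    else if (N >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
      Result := tekUTF16BE;
  finally
    Stream.Free;
  end;
end;

begin
  case DetectBomKind('ucs2.txt') of
    tekUTF8:    WriteLn('UTF-8 BOM found');
    tekUTF16LE: WriteLn('UTF-16 LE BOM found');
    tekUTF16BE: WriteLn('UTF-16 BE BOM found');
  else
    WriteLn('No known BOM found');
  end;
end.
```

Reading the BOM through a separate stream keeps the check independent of how the file is opened afterwards (TextFile, typed file or stream).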

L.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK
In reply to this post by Bart-48
Nice! Thank you very much.

As an alternative for F:TextFile I am using:

procedure UCS2ReadLn(var F: TextFile; out s: String);
var
   c: record
       case boolean of
        false: (a: array[0..1] of AnsiChar);
        true : (w: WideChar);
      end;
begin
   s:='';
   while not Eof(F) do begin
     System.Read(F,c.a[0]);
     System.Read(F,c.a[1]);
     if c.w in [#10,#13] then
       if s = '' then {begin of line} else break {end of line}
     else
       s := s + c.w;
   end;
end;

which also works for me, but I would like to have a better solution. I
will try LoadFromFile with TEncoding once FPC 3.2 is out.

-L.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tony Whyman
You may be able to improve on this using system.BlockRead.

Also, you are assuming low order byte first which may not be portable.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK

> You may be able to improve on this using system.BlockRead.
Probably yes, but then I must read into a local buffer and scan the
buffer for CR/LF.

My function UCS2ReadLn() would then return only the part of the string up
to CR/LF and return the rest of the string on the next call
(so I must keep the unprocessed part in a global buffer).
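
A minimal sketch of that buffered approach, under stated assumptions: the class TUtf16LineReader is hypothetical, a TFileStream is used instead of System.BlockRead, the unprocessed tail is kept in a field rather than a global buffer, and the file is assumed to start with a BOM. BEtoN/LEtoN from the System unit convert each code unit to the host byte order, which addresses the portability remark about byte order.

```pascal
program BufferedUtf16Lines;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  LineFeed: UnicodeString = #10;

type
  // Hypothetical helper: keeps the unprocessed tail between calls
  TUtf16LineReader = class
  private
    FStream: TStream;
    FBigEndian: Boolean;
    FPending: UnicodeString;
    function FillPending: Boolean;
  public
    constructor Create(AStream: TStream; ABigEndian: Boolean);
    function ReadLine(out Line: UnicodeString): Boolean;
  end;

constructor TUtf16LineReader.Create(AStream: TStream; ABigEndian: Boolean);
begin
  inherited Create;
  FStream := AStream;
  FBigEndian := ABigEndian;
end;

// Reads one chunk of 16-bit code units and appends it to FPending.
// Returns False when the stream is exhausted (a trailing odd byte is dropped).
function TUtf16LineReader.FillPending: Boolean;
var
  Chunk: array[0..2047] of Word;
  Count, I, OldLen: Integer;
begin
  Count := FStream.Read(Chunk, SizeOf(Chunk)) div SizeOf(Word);
  Result := Count > 0;
  if not Result then
    Exit;
  for I := 0 to Count - 1 do
    if FBigEndian then
      Chunk[I] := BEtoN(Chunk[I])
    else
      Chunk[I] := LEtoN(Chunk[I]);
  OldLen := Length(FPending);
  SetLength(FPending, OldLen + Count);
  Move(Chunk[0], FPending[OldLen + 1], Count * SizeOf(Word));
end;

// Returns the next line without its CR/LF; False when nothing is left.
function TUtf16LineReader.ReadLine(out Line: UnicodeString): Boolean;
var
  P: Integer;
begin
  P := Pos(LineFeed, FPending);
  while (P = 0) and FillPending do
    P := Pos(LineFeed, FPending);
  if P = 0 then
  begin
    // No terminator left: hand out the final unterminated line, if any
    Line := FPending;
    FPending := '';
    Result := Line <> '';
  end
  else
  begin
    Line := Copy(FPending, 1, P - 1);
    Delete(FPending, 1, P);
    Result := True;
  end;
  // Drop the CR of a CR/LF pair
  if (Line <> '') and (Line[Length(Line)] = #13) then
    SetLength(Line, Length(Line) - 1);
end;

var
  Stream: TFileStream;
  Reader: TUtf16LineReader;
  Bom: Word;
  Line: UnicodeString;
begin
  Stream := TFileStream.Create('ucs2.txt', fmOpenRead or fmShareDenyWrite);
  try
    // Assumption: the file starts with a BOM; bytes FE FF mean big-endian
    Bom := 0;
    Stream.Read(Bom, SizeOf(Bom));
    Reader := TUtf16LineReader.Create(Stream, BEtoN(Bom) = $FEFF);
    try
      while Reader.ReadLine(Line) do
        WriteLn('US = ', Line);
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
  end;
end.
```

Because the split happens on the code unit U+000A, which can never be half of a surrogate pair, the lines come out intact for full UTF-16 and not only for UCS2. Rescanning the pending buffer after every chunk is simple but not optimal for very long lines.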


>
> Also, you are assuming low order byte first which may not be portable.

Yes. In my case LE is sufficient, as long as I check for the presence of the BOM $FF$FE.

L.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
On 2019-09-04 13:39, LacaK wrote:

>> You may be able to improve on this using system.BlockRead.
> Probably yes, but then I must read in local buffer and examine buffer
> for CR/LF.
>
> And return from my function UCS2ReadLn() only portion of string up to
> CR/LF and rest of string return on next call to my function.
> (so I must keep unprocessed part in global buffer)
>
>
>> Also, you are assuming low order byte first which may not be portable.
>
> Yes, In my case LE is sufficient as far as I check presence of BOM
> $FF$FE

Just as a comment - a contribution allowing ReadLn to read UTF-16 files
(preferably complete from functional point of view, especially without
shortcuts like handling only UCS2 instead of complete Unicode) would be
obviously welcome.

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK




> Just as a comment - a contribution allowing ReadLn to read UTF-16 files
> (preferably complete from functional point of view, especially without
> shortcuts like handling only UCS2 instead of complete Unicode) would be
> obviously welcome.


Is there consensus on (or demand for) such a solution, and would a patch in this direction be accepted?
If yes, we must agree on the implementation details, and IMO someone must also check what the situation is in Delphi ... because I guess that if Delphi does not support this, FPC will not diverge?
Question 1: should "SetTextCodePage(CP_UTF16)" and "SetTextCodePage(CP_UTF16BE)" be supported?
Question 2: is this supported in Delphi?
If the answer to both questions is YES, then I will file a bug report as a starting point.

As I wrote, there is an explicit comment in the sources: "// all standard input is assumed to be ansi-encoded", which will no longer be true if we add UTF-16 support.

I can imagine that we could add a check for TextRec(T).CodePage=CP_UTF16 and CP_UTF16BE and handle these two situations specially (in the read and also the write procedures of text files).

But as Read[Ln]/Write[Ln] is core functionality, I think that one of the core developers should look at it ... ;-)

-Laco.



Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tony Whyman

A few points:

1. IMHO: This is currently a Windows problem, where the console buffer is UCS2. On Linux (and probably in all other cases) it is UTF8 - to be verified.

2. The following Microsoft blog post is interesting background on where MS are going with this:

https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/

3. The current Windows API includes "SetConsoleCP" which should (I haven't tested this) allow you to set transliteration to UTF-8 when you call the Windows ReadConsoleInput API function. This seems to imply that FTP can be a consistent UTF8 environment even when the Windows Console buffer is UCS2.

4. Because console input is buffered, you probably cannot have a situation where readln changes the console code page to fit the type (unicode or ansistring) of the variable that you are reading into.

5. You could change FTP so that under Windows, the console is always read using UCS2 with transliteration to ansistring happening when required and depending on the type of the variable that you are reading into. I think that is probably what you are asking for under Windows:

- The console code page is always UCS2.

- Console input is read into unicodestrings in native mode

- Console input is read into ansistrings with transliteration from UCS2 after the input buffer has been parsed.

- Conversion to integers, floats, etc. occurs after transliteration to ansistring in order to avoid too many changes to the RTL.

- Under other OSs, Console input is UTF8 (or a supported ANSI code page). Transliteration to unicodestrings occurs after parsing the input buffer.

6. The question is: is it worth having a different approach to Windows when Windows allows you to set the console input buffer to UTF8 and hence have a common input environment for all OSs?

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tony Whyman

Apologies: when I typed "FTP" in my previous message I meant "FPC" :( I'm currently drowning in acronym soup.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
In reply to this post by Tony Whyman
On 2019-09-05 10:24, Tony Whyman wrote:
> A few points:
>
> 1. IMHO: This is currently a Windows problem where the console buffer
> is UCS2. Linux (and probably all other cases its UTF8 - to be
> verified).
  .
  .

No, the subject refers to text files, not to console. Obviously, console
output has its caveats, but that's something else - the possibly added
functionality of being able to read and write text files with UTF-16
encoding using Read(Ln)/Write(Ln) does not imply that you might be able
to change the console to whatever codepage value directly (this is not
the case today either: you can perfectly write UTF-8 to a text file
under GO32v2 if using the fpWideString manager, but the underlying
communication is performed using the console encoding, not the text file
encoding, and translation is needed on the fly).

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
In reply to this post by LacaK
On 2019-09-05 09:00, LacaK wrote:
  .
  .
> Is there consensus/demand on such solution and any patch in this
> direction will be accepted?

I'm not aware of potential discussion about this so far, thus I cannot
talk about any existing consensus (let's hear others), but I believe
that such a consensus could be reached.


> If yes we must agree on implementation details and IMO also someone
> must check what situation is in Delphi ... because I guess, that if
> Delphi does not support this that also FPC will not diverge?

No, this is not necessarily the case. FPC certainly provides more
functionality in various areas. As long as the parts supported in both
Delphi and FPC are compatible, there should be no problem.


> Question1: should be supported "SetTextCodePage(CP_UTF16)" and
> "SetTextCodePage(CP_UTF16BE)"?

I don't know whether putting CP_UTF16 and CP_UTF16BE to the same level
as 8-bit encodings is the right solution. I can imagine that it might be
a completely new flag (e.g. CodepointSize) rather than relying on a
background knowledge that CP_UTF16 and CP_UTF16BE are 2-bytes, CP_UTF32
is 4-bytes and others are 1-byte encodings, because this knowledge would
need to be hardcoded in quite a few places and it would be too easy to
forget one.
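
One way to picture the concern about scattering this knowledge is to derive the code unit size in exactly one place and let everything else query it. The sketch below is an illustration only: the function CodeUnitSize and the cp* constants are hypothetical, and the constants are declared locally (using the Windows code page identifiers) so that the example does not depend on which CP_* constants a given RTL version exports. It is not the text record field proposed above.

```pascal
program CodeUnitSizeSketch;
{$mode objfpc}{$H+}

const
  // Windows code page identifiers, declared locally for this sketch
  cpUTF16LE = 1200;
  cpUTF16BE = 1201;
  cpUTF32LE = 12000;
  cpUTF32BE = 12001;
  cpUTF8    = 65001;

// Hypothetical helper: the single place that knows how wide a code unit is
function CodeUnitSize(CodePage: Word): Integer;
begin
  case CodePage of
    cpUTF16LE, cpUTF16BE: Result := 2;
    cpUTF32LE, cpUTF32BE: Result := 4;
  else
    Result := 1;
  end;
end;

begin
  WriteLn('UTF-16 LE: ', CodeUnitSize(cpUTF16LE), ' bytes per code unit');
  WriteLn('UTF-8:     ', CodeUnitSize(cpUTF8), ' byte per code unit');
end.
```

A dedicated field in the text record, filled once when the code page is set, would serve the same purpose without a lookup on every read or write.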


> Question2: is this supported in Delphi?
> If answer to both questions is YES then I will fill bug report as start
> point.

I have no idea about Delphi features (neither current nor future ones),
that is up to someone else.


> As I wrote there is in sources explicit comment: "// all standard
> input is assumed to be ansi-encoded" which will be no more true if we
> will add UTF-16 support.

Yes - checking the places where this assumption is used, as well as
providing an appropriate resolution, needs to be part of the potential
contribution.


> I can imagine, that we can add check for TextRec(T).CodePage=CP_UTF16
> and CP_UTF16BE and these two situations handle specially (in read and
> also in write procedures of text files)

See above regarding using this flag or some other.


> But as far as Read[Ln]/Write[Ln] is core functionality I think, that
> somebody of core developers should look at it ... ;-)

Yes, that's for sure. There's at least one person from the core team
list already involved. ;-) However, I'd be specifically interested in
the opinion of Jonas (who provided a great deal of the current Unicode
support), Michael and Marco; I guess that others may not have such strong
positions on this part of the RTL, but obviously any opinion needs to be
considered.

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tony Whyman
In reply to this post by Tomas Hajny-2
Text files and the console should look the same from the programmer's
point of view. The general principle is that you should be able to
redirect stdin to a file and, as far as the program is concerned, there
should be no difference between reading a file and reading console
input.

If there is a problem here, it is that the code page of text files
cannot be declared.  While under Linux, FPC seems to have an implicit
assumption that the console and text files are both UTF8, under Windows
it seems to assume UCS2 (and then transliterated to the current code
page) for the console and (by default) UTF8 for text files. A built-in
assumption of UCS2 for Windows text files would seem to be more consistent.

Under all OSs you should be able to set the actual code page for a text
file and the code page that input should be transliterated to.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Joost van der Sluis
In reply to this post by Tomas Hajny-2
On 05-09-19 at 12:06, Tomas Hajny wrote:
> On 2019-09-05 09:00, LacaK wrote:
>> Is there consensus/demand on such solution and any patch in this
>> direction will be accepted?
>
> I'm not aware of potential discussion about this so far, thus I cannot
> talk about any existing consensus (let's hear others), but I believe
> that such a consensus could be reached.

> Yes, that's for sure. There's at least one person from the core team
> list already involved. ;-)

I think that this question from LacaK was not that strange. For people
outside the core team, it is not always clear who is a member of core.

Sometimes there are discussions in the mailinglist between people
without any of the core-members joining in. Then it is really
frustrating when a decision or patch is not accepted by one of the
core-team members. (After all, only they can commit patches)

For me, it is clear that when Tomas welcomes a patch regarding the
file-output part of fpc, it will almost always be accepted. (Unless some
others can point to a flaw that Tomas did not foresee)

But for 'outsiders' this might be less clear.

Regards,

Joost.

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
On 2019-09-05 13:04, Joost van der Sluis wrote:

> On 05-09-19 at 12:06, Tomas Hajny wrote:
>> On 2019-09-05 09:00, LacaK wrote:
>>> Is there consensus/demand on such solution and any patch in this
>>> direction will be accepted?
>>
>> I'm not aware of potential discussion about this so far, thus I cannot
>> talk about any existing consensus (let's hear others), but I believe
>> that such a consensus could be reached.
>
>> Yes, that's for sure. There's at least one person from the core team
>> list already involved. ;-)
>
> I think that this question from LacaK was not that strange. For people
> outside the core team, it is not always clear who is member of core.
  .
  .

Absolutely, the question was perfectly valid, sorry if my response
sounded differently. In any case, I also explicitly mentioned people I'd
like to be involved in reaching the consensus. I will make sure to get
their opinion (either here or elsewhere) and provide the summary here
for LacaK and others as appropriate.

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

LacaK
From the user's POV we have this situation:
- on one side there is an input text file encoded in UTF-16 (either LE or BE)
- on the other side there is FPC, where RTL procedures like AssignFile,
SetTextCodePage, Reset, Read(Ln), Write(Ln) are available.

My original intention was simply to call the existing procedure
SetTextCodePage with the parameter CP_UTF16, which in my opinion would
simply signal to the RTL that the input/output text file is/should be
encoded as UTF-16.
Then any subsequent call to ReadLn with any destination variable
(ansistring, unicodestring, integer, etc.) would simply do something like:
- read a byte sequence from the file and interpret it as UTF-16, so
we have a UnicodeString on input
- transliterate this UnicodeString further to the requested
destination variable (as FPC has implicit conversions between
UnicodeString and AnsiString, this should be no problem)

(for Write(Ln) the same would happen in reverse order: source variable
-> UnicodeString -> write to file)
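
The read path outlined above (bytes, then UnicodeString, then the destination type) can be tried out by hand with existing RTL features. This is a sketch, not the proposed RTL change: DecodeUtf16LE is a hypothetical helper, the input bytes are hard-coded instead of coming from a file, and only little-endian input is handled.

```pascal
program DecodeThenConvert;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: interprets raw bytes as UTF-16 LE code units
function DecodeUtf16LE(const Bytes: array of Byte): UnicodeString;
var
  I: Integer;
begin
  SetLength(Result, Length(Bytes) div 2);
  for I := 1 to Length(Result) do
    Result[I] := WideChar(Bytes[2 * (I - 1)] or (Bytes[2 * (I - 1) + 1] shl 8));
end;

var
  U: UnicodeString;
  A: AnsiString;
  N: Integer;
begin
  // Step 1: the byte sequence for '42' in UTF-16 LE becomes a UnicodeString
  U := DecodeUtf16LE([$34, $00, $32, $00]);
  // Step 2: the UnicodeString is converted to the requested destination types
  A := AnsiString(U);
  N := StrToInt(A);
  WriteLn('AnsiString: ', A);
  WriteLn('Integer:    ', N);
end.
```

For characters outside ASCII, the UnicodeString to AnsiString step depends on the installed widestring manager and the target code page, which is one of the details an RTL-level implementation would have to settle.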

If SetTextCodePage(CP_UTF16) is not appropriate, then we must IMO
introduce a new procedure which lets the user signal
"I have a UTF-16 encoded text file" or "I want all writes to my
text file to be encoded as UTF-16".
(but personally I do not see a reason to introduce a new procedure, as
SetTextCodePage fits perfectly for me)

So first we need a design/proposal which is/will be accepted.
(deeper knowledge of RTL internals is probably needed here, which is the
reason why other core developers should also step in)

L.


> On 2019-09-05 13:04, Joost van der Sluis wrote:
>> Op 05-09-19 om 12:06 schreef Tomas Hajny:
>>> On 2019-09-05 09:00, LacaK wrote:
>>>> Is there consensus/demand on such solution and any patch in this
>>>> direction will be accepted?
>>>
>>> I'm not aware of potential discussion about this so far, thus I
>>> cannot talk about any existing consensus (let's hear others), but I
>>> believe that such a consensus could be reached.
>>
>>> Yes, that's for sure. There's at least one person from the core team
>>> list already involved. ;-)
>>
>> I think that this question from LacaK was not that strange. For people
>> outside the core team, it is not always clear who is member of core.
>  .
>  .
>
> Absolutely, the question was perfectly valid, sorry if my response
> sounded differently. In any case, I also explicitly mentioned people
> I'd like to be involved in reaching the consensus. I will make sure to
> get their opinion (either here or elsewhere) and provide the summary
> here for LacaK and others as appropriate.
>
> Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
On 2019-09-06 07:24, LacaK wrote:
> From the user's point of view, the situation is this:
> - on one side there is an input text file encoded in UTF-16 (either LE or BE)
> - on the other side there is FPC, whose RTL provides procedures such as
> AssignFile, SetTextCodePage, Reset, Read(Ln) and Write(Ln).
>
> My original intention was simply to call the existing procedure
> SetTextCodePage with the parameter CP_UTF16, which in my opinion would
> signal to the RTL that the input/output text file is (or should be)
> encoded in UTF-16.

Yes, I believe that extending SetTextCodePage to support UTF-16
makes sense (with certain caveats, e.g. that it should be called
before Rewrite when creating new files, or otherwise the
BOM will not be added to the beginning of the file). The other
question is what needs to happen within the text file record - as
mentioned in my other post, I'd prefer adding a new field specifying the
codepoint size rather than having to check for specific codepage values
in all code branches which would need to be created for handling the
difference.
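
To illustrate the idea of a dedicated size field, here is a rough sketch. TEncodingInfo, UnitSizeOf, SetEncoding and FindLineFeed are hypothetical names for illustration only; the real text file record has no such field today. The numeric values are the Windows code page identifiers for UTF-16 and UTF-32. The code page is examined once, when it is set, and the scanning code afterwards depends only on the stored size (what is called codepoint size above, i.e. bytes per code unit).

```pascal
program CodeUnitSizeSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for the encoding-related part of a text file record.
  TEncodingInfo = record
    CodePage: Word;
    UnitSize: Byte;  // bytes per code unit: 1, 2 or 4
  end;

// Maps a code page to its code unit size.
function UnitSizeOf(CodePage: Word): Byte;
begin
  case CodePage of
    1200, 1201:   Result := 2;  // UTF-16 LE / BE
    12000, 12001: Result := 4;  // UTF-32 LE / BE
  else
    Result := 1;                // 8-bit code pages and UTF-8
  end;
end;

// The only place where individual code page values are inspected.
procedure SetEncoding(var Info: TEncodingInfo; CodePage: Word);
begin
  Info.CodePage := CodePage;
  Info.UnitSize := UnitSizeOf(CodePage);
end;

// Returns the byte offset of the first line feed in Buf, or -1 if there is none.
// Handles 8-bit and little-endian data only, to keep the sketch short.
function FindLineFeed(const Buf: array of Byte; const Info: TEncodingInfo): Integer;
var
  I, J: Integer;
  IsLF: Boolean;
begin
  Result := -1;
  I := 0;
  while I + Info.UnitSize <= Length(Buf) do
  begin
    // A line feed is $0A followed by zero bytes up to the code unit size.
    IsLF := Buf[I] = $0A;
    for J := 1 to Info.UnitSize - 1 do
      IsLF := IsLF and (Buf[I + J] = 0);
    if IsLF then
      Exit(I);
    Inc(I, Info.UnitSize);
  end;
end;

var
  Info: TEncodingInfo;
begin
  SetEncoding(Info, 1200);
  // 'A' followed by a line feed, both encoded as UTF-16LE.
  WriteLn(FindLineFeed([$41, $00, $0A, $00], Info));
end.
```

With UnitSize = 1 the inner loop does not execute at all, so the 8-bit path stays close to a plain byte scan.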

Moreover, the case of opening a file is somewhat trickier, because the
file may have the encoding specified within the file itself. Would we
add code for reading the first bytes every time Reset is called for a
text file not associated with another device (console) and set the
fields in the text file record (possibly overriding an explicit setting
from SetTextCodePage)? Personally, I'd do so, but others may have a
different opinion.


> Then any subsequent call to ReadLn with any destination variable
> (ansistring, unicodestring, integer, etc.) would do something
> like:
> - read a byte sequence from the file and interpret it as UTF-16, so
> that the input is available as a UnicodeString

Just a comment - if already adding this support, we should IMHO allow
UTF-32 as well.


> - convert this UnicodeString to the requested destination variable (FPC
> already has implicit conversions between UnicodeString and AnsiString,
> so this should be no problem)

Yes.


> (for Write(Ln) the same would happen in reverse order: source variable
> -> UnicodeString -> write to file)
>
> If SetTextCodePage(CP_UTF16) is not appropriate, then IMO we must
> introduce a new procedure that lets the user signal "I have a UTF-16
> encoded text file" or "I want all writes to my text file to be encoded
> in UTF-16".
> (Personally I see no reason to introduce a new procedure, as
> SetTextCodePage fits perfectly for me.)

See above - a new procedure may not be needed, but I'd prefer a new text
file record field in the background for better efficiency and
maintainability.


> So first we need a design/proposal that will be accepted.
> (This probably requires deeper knowledge of the RTL internals, which is
> why other core developers should step in as well.)

Right. See my input above for my current thoughts. In the end, we should
preferably extend the FPC Unicode handling page in the Wiki; in the
meantime, a new page may be used for documenting the specification.
Before doing that, I'd still want to hear the opinion from Jonas, Marco
and Michael - I'll ask them.

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny-2
On 2019-09-06 15:22, Tomas Hajny wrote:
> On 2019-09-06 07:24, LacaK wrote:


Hi *,

As promised, I discussed the idea of adding support for UTF-16 encoded
text files (and preferably UTF-32 as well while at it) to the RTL with
other core team members. Overall, I didn't come across anybody opposing
this idea; the only (logical) requirement is taking care of the
performance implications of this change, i.e. avoiding a considerable
performance decrease in the processing of 8-bit encoded files (actually,
this is one of the reasons for my suggestion to add codepoint size
information to the text file record and use that instead of checking
individual values of the codepage variable to find out the codepoint
size implications every time the file is accessed - see below).


> Yes, I believe that extending SetTextCodePage to support UTF-16
> makes sense (with certain caveats, e.g. that it should be called
> before Rewrite when creating new files, or otherwise
> the BOM will not be added to the beginning of the file). The
> other question is what needs to happen within the text file record -
> as mentioned in my other post, I'd prefer adding a new field
> specifying the codepoint size rather than having to check for specific
> codepage values in all code branches which would need to be created
> for handling the difference.
>
> Moreover, the case of opening a file is somewhat trickier, because the
> file may have the encoding specified within the file itself. Would we
> add code for reading the first bytes every time Reset is called for a
> text file not associated with another device (console) and set the
> fields in the text file record (possibly overriding an explicit
> setting from SetTextCodePage)? Personally, I'd do so, but others may
> have a different opinion.
  .
  .

After the discussion with some people from the core team, I suggest the
following:

1) A new attribute for the codepoint size will be added to the text file
record, and all the text file I/O needs to be checked and possibly
extended to use this attribute instead of the current implicit
expectation that the codepoint size is always 1 byte.

2) Support for UTF-16BE/LE and UTF-32BE/LE will be added to
SetTextCodePage, the new codepoint size attribute will be updated as
appropriate.

3) New function 'DetectUtfBom (var T: text): boolean' will be added.
This function may be called after the call to 'Reset (T: text)' to check
for existence of BOM at the beginning of the text file. If it is found
(Result=true), SetTextCodePage is invoked automatically from
DetectUtfBom with the codepage value corresponding to the found BOM and
encoding variant. If BOM is not found (Result=false), nothing changes.

4) A new procedure 'SetUtfBom (var T: text; CodePage: word; BOM:
boolean)' will be added. This procedure may be called after the call to
Rewrite and allows writing BOM to the respective text file.
SetTextCodePage with the respective value will be called from SetUtfBom.

Comments, anybody?

Tomas

Re: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

DougC
Tomas-

Thanks for pursuing this! Generally a good approach.

I do not like item 3 in that the function, as described, is named DetectUtfBom but does more than detect. Side effects of functions are generally not good. I would at least rename it something like DetectAndHandleUtfBom.

But to fully correct the situation, I would also change it to a procedure since leaving it as a function still suggests it only returns a result and has no other side effects.
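
One possible shape for this split, as a minimal sketch only: all names below are hypothetical, TTextState merely stands in for the relevant part of the text file record, and only the two UTF-16 BOMs are handled to keep it short. The function answers a question and changes nothing; the procedure is the only routine that modifies state.

```pascal
program BomApiShape;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for the encoding state of a text file record.
  TTextState = record
    CodePage: Word;
  end;

// Pure query: reports the code page implied by a 2-byte UTF-16 BOM.
// It has no side effects beyond its out parameter.
function PeekUtf16Bom(B0, B1: Byte; out CodePage: Word): Boolean;
begin
  Result := True;
  if (B0 = $FF) and (B1 = $FE) then
    CodePage := 1200        // UTF-16LE
  else if (B0 = $FE) and (B1 = $FF) then
    CodePage := 1201        // UTF-16BE
  else
  begin
    CodePage := 0;
    Result := False;
  end;
end;

// Command: being a procedure, it makes the state change explicit.
procedure ApplyUtf16Bom(var State: TTextState; B0, B1: Byte);
var
  CP: Word;
begin
  if PeekUtf16Bom(B0, B1, CP) then
    State.CodePage := CP;
end;

var
  State: TTextState;
begin
  State.CodePage := 65001;  // start out as UTF-8
  ApplyUtf16Bom(State, $FE, $FF);
  WriteLn(State.CodePage);
end.
```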

Doug C.


---- On Sun, 15 Sep 2019 18:20:22 -0400 Tomas Hajny <[hidden email]> wrote ----

3) New function 'DetectUtfBom (var T: text): boolean' will be added.
This function may be called after the call to 'Reset (T: text)' to check
for existence of BOM at the beginning of the text file. If it is found
(Result=true), SetTextCodePage is invoked automatically from
DetectUtfBom with the codepage value corresponding to the found BOM and
encoding variant. If BOM is not found (Result=false), nothing changes.


