Case insensitive comparison of strings with non-ascii characters

Luiz Americo Pereira Camara
Hi,

I'm trying to fix bug http://bugs.freepascal.org/view.php?id=14135 but
could not find a way to do a case-insensitive comparison of UTF-8 strings
with non-ASCII characters (in the test even ANSI strings failed).

See the attached test program. I tried StrIComp, AnsiCompareText,
CompareText and even widestringmanager.CompareTextWideStringProc.

On Linux, only AnsiCompareText worked. On Windows none worked.

Any hints on how to do such a comparison, or is this a limitation of
current FPC? Or am I doing something wrong?

BTW:

Key := 'ç';

Key will be encoded in the current system encoding Ok?

Key := Utf8Encode('ç');

'ç' will be converted to widestring and then back to Utf8 Ok?
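
One way to sidestep the source-file encoding question is to build the
character from its code point and look at the bytes UTF8Encode produces.
A minimal sketch; the program and variable names are made up for
illustration, and U+00E7 is assumed as the code point of 'ç':

```pascal
program ShowUtf8Bytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

var
  W: WideString;
  U: UTF8String;
  i: Integer;
begin
  // Build 'ç' from its code point, so the result does not depend on
  // how this source file happens to be encoded.
  W := WideChar($00E7);
  U := UTF8Encode(W);

  // Dump the UTF-8 bytes; U+00E7 encodes as the two bytes C3 A7.
  for i := 1 to Length(U) do
    Write(IntToHex(Ord(U[i]), 2), ' ');
  WriteLn;
end.
```

A plain literal such as Key := 'ç' carries whatever bytes the editor saved,
which is why the test results can differ between machines.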

Luiz

program bugCompInsensitiveUTF8;

{$mode objfpc}{$H+}

uses
 {$ifdef unix}
 cwstring,
 {$endif}
 Classes, SysUtils;
var
  Key, Str: String;
begin
  Key := 'c';
  Str := 'C';
  WriteLn('Testing C/c');
  if stricomp(PChar(Key), PChar(Str)) = 0 then
    WriteLn('StrIComp OK');
  if AnsiCompareText(Key, Str) = 0 then
    WriteLn('AnsiCompareText OK');
  if CompareText(Key, Str) = 0 then
    WriteLn('CompareText OK');

  Key := 'ç';
  Str := 'Ç';
  WriteLn('Testing Ç/ç');
  if stricomp(PChar(Key), PChar(Str)) = 0 then
    WriteLn('StrIComp OK');
  if AnsiCompareText(Key, Str) = 0 then
    WriteLn('AnsiCompareText OK');
  if CompareText(Key, Str) = 0 then
    WriteLn('CompareText OK');

  Key := UpperCase('ç');
  Str := 'Ç';
  WriteLn('Testing Ç/Uppercase(ç)');
  if strcomp(PChar(Key), PChar(Str)) = 0 then
    WriteLn('StrComp OK');
  if AnsiCompareStr(Key, Str) = 0 then
    WriteLn('AnsiCompareStr OK');
  if CompareStr(Key, Str) = 0 then
    WriteLn('CompareStr OK');

  //test UTF8
  Key := UTF8Encode('ç');
  Str := UTF8Encode('Ç');
  WriteLn('Testing UTF8 Ç/ç');
  if stricomp(PChar(Utf8ToAnsi(Key)), PChar(Utf8ToAnsi(Str))) = 0 then
    WriteLn('StrIComp OK');
  if AnsiCompareText(Utf8ToAnsi(Key), Utf8ToAnsi(Str)) = 0 then
    WriteLn('AnsiCompareText OK');
  if CompareText(Utf8ToAnsi(Key), Utf8ToAnsi(Str)) = 0 then
    WriteLn('CompareText OK');
  if widestringmanager.CompareTextWideStringProc(UTF8Decode(Key), UTF8Decode(Str)) = 0 then
    WriteLn('WideStringManager OK');
end.


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Case insensitive comparison of strings with non-ascii characters

José Mejuto
Hello FPC-Pascal,

Tuesday, July 21, 2009, 6:45:03 AM, you wrote:

LAPC> I'm trying to fix bug
LAPC> http://bugs.freepascal.org/view.php?id=14135 but
LAPC> could not get a way to do case insensitive comparison of UTF8 strings
LAPC> with non ascii characters (in the test even ansi strings failed).

Unicode case-insensitive comparisons are not trivial; in fact they are
quite complex. No ANSI version will work properly, so the conversion
should be provided by the OS or, in the worst case, a "general case"
folding could be added to FPC as a fallback mechanism. Such a function
requires quite large tables and a non-trivial amount of CPU time
(depending on the number of folded code points).

Try to do the same with WideStrings instead of UTF-8.

--
Best regards,
 JoshyFun


Re: Case insensitive comparison of strings with non-ascii characters

Luiz Americo Pereira Camara-2
JoshyFun wrote:

> Hello FPC-Pascal,
>
> Tuesday, July 21, 2009, 6:45:03 AM, you wrote:
>
> LAPC> I'm trying to fix bug
> LAPC> http://bugs.freepascal.org/view.php?id=14135 but
> LAPC> could not get a way to do case insensitive comparison of UTF8 strings
> LAPC> with non ascii characters (in the test even ansi strings failed).
>
> Unicode case-insensitive comparisons are not trivial; in fact they are
> quite complex. No ANSI version will work properly, so the conversion
> should be provided by the OS

The AnsiCompare* functions already use the functions provided by the OS
through WideStringManager, but it seems that is not working (or I'm doing
something wrong).


[..]
> Try to do the same with WideStrings instead UTF8.
>  

I tried widestringmanager.CompareTextWideStringProc('Ç', 'ç')
It also fails.

Using FPC 2.2.4 under Windows XP SP3

Luiz

Re: Case insensitive comparison of strings with non-ascii characters

Luiz Americo Pereira Camara-2
Luiz Americo Pereira Camara wrote:

> JoshyFun wrote:
>> Hello FPC-Pascal,
>>
>> Tuesday, July 21, 2009, 6:45:03 AM, you wrote:
>>
>> LAPC> I'm trying to fix bug
>> LAPC> http://bugs.freepascal.org/view.php?id=14135 but
>> LAPC> could not get a way to do case insensitive comparison of UTF8 strings
>> LAPC> with non ascii characters (in the test even ansi strings failed).
>> Try to do the same with WideStrings instead UTF8.
>>  
>
> I tried widestringmanager.CompareTextWideStringProc('Ç', 'ç')
> It also fails.

It worked with WideCompareText(UTF8Decode(Key), UTF8Decode(Str)), where
Key and Str are UTF-8.
It did not work in my example because the strings were not UTF-8 encoded.

Luiz
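
The working variant can be sketched as a small stand-alone program. This is
a minimal sketch, not the code from the bug report: the program name is
made up, and the test strings are built from code points (U+00E7 'ç',
U+00C7 'Ç') so that they are UTF-8 regardless of how the source file is
saved. The result still depends on the widestring manager and, on Unix,
on the process locale.

```pascal
program Utf8CompareSketch;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring,  // installs a widestring manager backed by the C library
  {$endif}
  SysUtils;

var
  W: WideString;
  Key, Str: UTF8String;
begin
  // Build real UTF-8 strings from code points.
  W := WideChar($00E7);   // ç
  Key := UTF8Encode(W);
  W := WideChar($00C7);   // Ç
  Str := UTF8Encode(W);

  // Decode both sides to wide strings, then compare ignoring case.
  if WideCompareText(UTF8Decode(Key), UTF8Decode(Str)) = 0 then
    WriteLn('equal ignoring case')
  else
    WriteLn('different');
end.
```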

Re: Case insensitive comparison of strings with non-ascii characters

theo-6
In reply to this post by Luiz Americo Pereira Camara
@Luiz Americo

Your code
WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
will work, but if speed matters, then it's rather bad.

I've tried to make a faster function for UTF-8:

uses unicodeinfo, LCLProc;

function UTF8CompareText(s1, s2: UTF8String): Integer;
var u1, u2: Ucs4Char;
  u1l, u2l: longint;
  BytePos1, Len1, SLen1: integer;
  BytePos2, Len2, SLen2: integer;
begin
  Result := 0;
  BytePos1 := 1;
  BytePos2 := 1;
  SLen1 := System.Length(s1);
  SLen2 := System.Length(s2);

  if SLen1 <> SLen2 then  // Assuming lower/uppercase representations have the same byte length
  begin
    if SLen1 > SLen2 then Result := 1 else Result := -1;
    exit;
  end;

  repeat
    u1 := UTF8CharacterToUnicode(@s1[BytePos1], Len1);
    inc(BytePos1, Len1);
    u2 := UTF8CharacterToUnicode(@s2[BytePos2], Len2);
    inc(BytePos2, Len2);
    if u1 <> u2 then
    begin
      {$IFDEF useunicodinfo}
      u1l := unicodeinfo.utf8proc_get_property(u1)^.lowercase_mapping;
      if u1l <> -1 then u1 := u1l;
      u2l := unicodeinfo.utf8proc_get_property(u2)^.lowercase_mapping;
      if u2l <> -1 then u2 := u2l;
      {$ELSE}
      u1 := UCS4Char(WideUpperCase(WideChar(u1))[1]);
      u2 := UCS4Char(WideUpperCase(WideChar(u2))[1]);
      {$ENDIF}
      if u1 <> u2 then
      begin
        Result := u1 - u2;
        exit;
      end;
    end;
  until (BytePos1 > SLen1) or (BytePos2 > SLen2)
end;


Some numbers for my system (Linux) where WideCompareText is the function
you use now, WideUppercase is the above function and unicodeinfo is
the above function with useunicodinfo defined. See here
http://wiki.lazarus.freepascal.org/Theodp


Comparing identical Strings of 322 Chars 10000 times
WideCompareText: 785ms
unicodeinfo: 75ms
WideUpperCase: 74ms

Comparing Strings of 322 Chars 10000 times where the 3rd char differs
WideCompareText: 268ms
unicodeinfo: 3ms
WideUpperCase: 8ms

Comparing identical Text of 322 Chars 10000 times where one Text is all
uppercase
WideCompareText: 810ms
unicodeinfo: 121ms
WideUpperCase: 1076ms
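
For anyone who wants to reproduce numbers of this kind, a timing loop could
look roughly like the following. It is a minimal sketch under stated
assumptions: it is not the harness used for the figures above, the program
name and test data are invented, and only the WideCompareText variant is
timed. Now has limited resolution, so short runs are noisy.

```pascal
program CompareTiming;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring,
  {$endif}
  SysUtils, DateUtils;

const
  Rounds = 10000;
  Chars  = 322;

var
  W: WideString;
  A, B: UTF8String;
  Start: TDateTime;
  i, Matches: Integer;
begin
  // Two test strings of 322 characters that differ only in case:
  // A is all 'ç' (U+00E7), B is all 'Ç' (U+00C7).
  W := '';
  for i := 1 to Chars do
    W := W + WideChar($00E7);
  A := UTF8Encode(W);

  W := '';
  for i := 1 to Chars do
    W := W + WideChar($00C7);
  B := UTF8Encode(W);

  // Time the decode-and-compare approach.
  Matches := 0;
  Start := Now;
  for i := 1 to Rounds do
    if WideCompareText(UTF8Decode(A), UTF8Decode(B)) = 0 then
      Inc(Matches);

  WriteLn('WideCompareText: ', MilliSecondsBetween(Now, Start), 'ms (',
          Matches, ' matches)');
end.
```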

Regards Theo


Re: Case insensitive comparison of strings with non-ascii characters

Luiz Americo Pereira Camara-2
theo wrote:
> @Luiz Americo
>
> Your code
> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
> will work, but if speed matters, then it's rather bad.
>  

Hi, I'm aware that the performance is bad, although I had not tested it
like you did, but at this point I'd like to stick with a solution that FPC
provides natively, since it's being used in an FPC component
(TSqlite3Dataset).

In the last revision I switched to the ANSI version of the functions to
save the conversion of the Key at each comparison. See
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/fcl-db/src/sqlite/customsqliteds.pas?view=log#rev13431

Anyway, it is clear that functions to handle UTF-8 and Unicode in general
are missing in FPC...
> I've tried to make a faster function for UTF-8:
>  

... maybe your function can be used as a base for future development. Add
a new function to the widestringmanager?

Luiz


Re: Case insensitive comparison of strings with non-ascii characters

Jonas Maebe-2
In reply to this post by theo-6

On 25 Jul 2009, at 17:46, theo wrote:

>  if SLen1 <> SLen2 then  //Assuming lower/uppercase representations
> have the same byte length

That is a wrong assumption. E.g., the lowercase version of I  
(uppercase i, a single byte) in Turkish is ı (an "i" without a dot,  
definitely not a single-byte character).
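
The byte lengths are easy to check. A minimal sketch (program and variable
names are made up), encoding U+0049 'I' and U+0131 'ı' and printing how
many bytes each takes in UTF-8:

```pascal
program TurkishILength;

{$mode objfpc}{$H+}

var
  W: WideString;
  Upper, Lower: UTF8String;
begin
  W := WideChar($0049);        // 'I', LATIN CAPITAL LETTER I
  Upper := UTF8Encode(W);
  W := WideChar($0131);        // 'ı', LATIN SMALL LETTER DOTLESS I
  Lower := UTF8Encode(W);

  // Length of a UTF8String counts bytes, not characters.
  WriteLn('I:         ', Length(Upper), ' byte(s)');   // 1
  WriteLn('dotless i: ', Length(Lower), ' byte(s)');   // 2
end.
```

So two strings that are equal ignoring case under Turkish rules can have
different byte lengths, and the early exit on SLen1 <> SLen2 would report
them as different.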


Jonas

Re: Case insensitive comparison of strings with non-ascii characters

theo-6

>>  if SLen1 <> SLen2 then  //Assuming lower/uppercase representations
>> have the same byte length
>
> That is a wrong assumption. E.g., the lowercase version of I
> (uppercase i, a single byte) in Turkish is ı (an "i" without a dot,
> definitely not a single-byte character).

OK thanks. That's why I added the comment, because I was not sure  ;-)
So then one should compare UTF8Lengths or probably forget that shortcut,
because calculating UTF8Lengths is not cheap.

Do Turkish systems behave differently for WideLowerCase('I')?
Will they return $0131 instead of $0069?

Regards Theo


Re: Case insensitive comparison of strings with non-ascii characters

Jonas Maebe-2

On 25 Jul 2009, at 19:03, theo wrote:

> Do Turkish systems behave differently for WideLowerCase('I')?
> Will they return $0131 instead of $0069?

They should, since the uppercase version of i is İ there (i.e., a  
capital I with a dot on top). See e.g. http://www.i18nguy.com/unicode/turkish-i18n.html
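
Which of the two values a given system returns can be checked directly. A
minimal sketch, assuming the installed widestring manager (cwstring on
Unix) follows the process locale; the program name is made up and the
output is expected to differ between systems and locales:

```pascal
program LowerCaseOfI;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring,
  {$endif}
  SysUtils;

var
  W, L: WideString;
begin
  W := WideChar($0049);          // 'I'
  L := WideLowerCase(W);

  // Print the code point of the first resulting character:
  // $0069 is 'i', $0131 is the Turkish dotless 'ı'.
  if Length(L) > 0 then
    WriteLn('WideLowerCase(I) = $', IntToHex(Ord(L[1]), 4));
end.
```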


Jonas

Re: Case insensitive comparison of strings with non-ascii characters

theo-5

>
> They should, since the uppercase version of i is İ there (i.e., a
> capital I with a dot on top). See e.g.
> http://www.i18nguy.com/unicode/turkish-i18n.html
>
>

Oh, Goodness. ;-)

Thanks for the information.

Regards
Theo


Re[2]: Case insensitive comparison of strings with non-ascii characters

José Mejuto
In reply to this post by theo-6
Hello FPC-Pascal,

Saturday, July 25, 2009, 5:46:39 PM, you wrote:

t> @Luiz Americo

t> Your code
t> WideCompareText(UTF8Decode(Key), UTF8Decode(Str))
t> will work, but if speed matters, then it's rather bad.

That's not right. The assumption that:

lowercasemapping(a)=lowercasemapping(b)

is the same as:

IsSameText(a,b)

does not hold for Unicode.

--
Best regards,
 JoshyFun
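
Greek sigma is a small case where the two differ: in the Unicode data,
final sigma 'ς' (U+03C2) and sigma 'σ' (U+03C3) are both already lowercase,
but case folding maps both to U+03C3. A minimal sketch with toy tables
covering only these code points; ToyLower and ToyFold are hypothetical
helpers, not FPC or unicodeinfo functions:

```pascal
program LowerVersusFold;

{$mode objfpc}{$H+}

// Toy simple-lowercase mapping for a few Greek code points only.
function ToyLower(c: Cardinal): Cardinal;
begin
  case c of
    $03A3: Result := $03C3;   // capital sigma -> small sigma
  else
    Result := c;              // U+03C3 and U+03C2 are already lowercase
  end;
end;

// Toy case folding for the same code points.
function ToyFold(c: Cardinal): Cardinal;
begin
  case c of
    $03A3, $03C2: Result := $03C3;   // capital and final sigma both fold to U+03C3
  else
    Result := c;
  end;
end;

begin
  // Compare small sigma (U+03C3) with final sigma (U+03C2).
  WriteLn('lowercase equal: ', ToyLower($03C3) = ToyLower($03C2));   // FALSE
  WriteLn('case fold equal: ', ToyFold($03C3) = ToyFold($03C2));     // TRUE
end.
```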


Re[2]: Case insensitive comparison of strings with non-ascii characters

José Mejuto
In reply to this post by Luiz Americo Pereira Camara-2
Hello FPC-Pascal,

Thursday, July 23, 2009, 2:02:38 PM, you wrote:

LAPC> Hi, i'm aware that the performance is bad although had not tested like
LAPC> you did, but at this point i'd like to stick with a solution that fpc
LAPC> provides natively since it's being used in a fpc component
LAPC> (TSqlite3Dataset).

Writing Unicode functions on UTF-8 is almost nonsense. Most Unicode
operations are not like what we are used to in the ANSI world. In Unicode
there is also a language context: for example, in Spanish 'á' uppercases
to 'Á', but in other languages they are different letters.

There are some functions named "general case" which perform a
reasonable job for most used languages and only introduce errors in
non widespread ones.

I have some implementations for the general case, not heavily tested,
like sametext, upper, lower and a bit more.

The code is not optimized but if somebody wants to use them please ask
:)

SameText in particular is CPU-hungry, as each string must be transformed
several times before the comparison if some complex characters are
present.

--
Best regards,
 JoshyFun


Re[2]: Case insensitive comparison of strings with non-ascii characters

theo-6
In reply to this post by José Mejuto

> lowercasemapping(a)=lowercasemapping(b)
>
> is the same as:
>
> IsSameText(a,b)
>
> is wrong at unicode levels.
>
>  
@JoshyFun

It depends on what you expect from such a function and what sort of input
data you have to deal with.
I wouldn't say it is wrong. It is not really accurate for all possible
language and Unicode details, but it's fast.

In your strict sense, AnsiCompareText didn't work either.

Even Swiss German de_CH (my language) differs from de_DE.
For example, if a Swiss German user is looking for the word "schließlich"
(finally) in a German text, he will type "schliesslich" in the search box,
because the letter "ß" does not exist on Swiss German keyboards.

Does AnsiCompareText report these strings as equal? No. Same with 'ö' ->
'oe' or 'à'->'A'
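
For comparison, Unicode full case folding does map 'ß' (U+00DF) to 'ss',
which also means a folded string can be longer than its input. A minimal
sketch with a hypothetical helper, ToyFold, that handles only ASCII
letters and U+00DF; a real implementation would use the full Unicode
tables:

```pascal
program FoldEszett;

{$mode objfpc}{$H+}

// Hypothetical helper: lowercases ASCII letters and expands U+00DF
// to 'ss'. Every other character is passed through unchanged.
function ToyFold(const S: WideString): WideString;
var
  i: Integer;
  code: Word;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    code := Ord(S[i]);
    if code = $00DF then
      Result := Result + 'ss'
    else if (code >= Ord('A')) and (code <= Ord('Z')) then
      Result := Result + WideChar(code + 32)
    else
      Result := Result + S[i];
  end;
end;

var
  A, B: WideString;
begin
  // "schließlich", built from parts so the source stays plain ASCII.
  A := 'schlie';
  A := A + WideChar($00DF);
  A := A + 'lich';
  B := 'Schliesslich';

  WriteLn(Length(A), ' vs ', Length(B));   // 11 vs 12
  WriteLn(ToyFold(A) = ToyFold(B));        // TRUE
end.
```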


> Write unicode functions in UTF8 is almost non-sense, most unicode
> operations are not like we are used in the ANSI world

The latter is certainly true, but I don't understand what it has to do
with UTF-8 or UTF-16.

> The code is not optimized but if somebody wants to use them please ask

Yes please!

Regards Theo




Re[3]: Case insensitive comparison of strings with non-ascii characters

José Mejuto
Hello FPC-Pascal,

Sunday, July 26, 2009, 9:43:06 AM, you wrote:

t> In your strict sense, AnsiCompareText didn't work either.

Yes, ANSI does not work well either for some common languages. But we are
used to "simulating" a CompareText with lowercase(a)=lowercase(b), where
the same character could be represented by different foldings.

t> Does AnsiCompareText report these strings as equal? No. Same with 'ö' ->
t> 'oe' or 'à'->'A'

Yeah, you are completely right.

>> Write unicode functions in UTF8 is almost non-sense, most unicode
>> operations are not like we are used in the ANSI world
t> The latter is certainly true, but I don't understand what it has to do
t> with UTF-8 or UTF-16.

Because Unicode operations often need to scan forward and back, rescan,
make several passes, etc., processing the text in native UTF-8 is a waste
of CPU rather than a gain, except for some trivial operations. It is
usually faster to convert the string to UTF-16 or UTF-32 and then perform
all the operations than to encode and decode UTF-8 constantly.
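
The decode-once idea can be sketched as follows. This is a minimal
illustration with made-up names, not code from the uploaded sources: the
UTF-8 string is decoded a single time, after which any position can be
reached by plain indexing, including walking backwards. Note that UTF-16
still needs surrogate pairs for characters outside the BMP, which this
sketch does not handle.

```pascal
program DecodeOnce;

{$mode objfpc}{$H+}

uses
  SysUtils;

var
  W, Decoded: WideString;
  S: UTF8String;
  i: Integer;
begin
  // Build the UTF-8 text "açb" from code points.
  W := 'a';
  W := W + WideChar($00E7);
  W := W + 'b';
  S := UTF8Encode(W);              // 4 bytes: 61 C3 A7 62

  // Decode once ...
  Decoded := UTF8Decode(S);
  WriteLn(Length(S), ' bytes, ', Length(Decoded), ' UTF-16 code units');

  // ... then walk backwards by index, with no need to search for
  // the start byte of each multi-byte sequence.
  for i := Length(Decoded) downto 1 do
    WriteLn('$', IntToHex(Ord(Decoded[i]), 4));
end.
```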

>> The code is not optimized but if somebody wants to use them please ask
t> Yes please!

I'll try to make it compile :) The code is mostly experimental, so
some functions are implemented simply to work and return a value, not
to get the value fast.

I'll post a link on the list as soon as I can confirm that it at least
compiles.

--
Best regards,
 JoshyFun


Re[3]: Case insensitive comparison of strings with non-ascii characters

José Mejuto
In reply to this post by theo-6
Hello FPC-Pascal,

Sunday, July 26, 2009, 9:43:06 AM, you wrote:

>> The code is not optimized but if somebody wants to use them please ask
t> Yes please!

I have uploaded the sources to the zshare server at:

http://www.zshare.net/download/63185150b6099e5d/

The code is not an example of how things must be done, and there are
some data files mixed with source code files (not well organized).

It was developed with Lazarus, so some code may need the Lazarus LCL
or related functions.

--
Best regards,
 JoshyFun


Re: Case insensitive comparison of strings with non-ascii characters

theo-6

> I had uploaded the sources to zshare server with the link:
>
>  


Thanks, I'll have a look at it.


Re: DBus interface needs an update

theo-6
In reply to this post by Jonas Maebe-2
If somebody is still interested in HAL (I know that DeviceKit is coming):
I've made a header translation and a demo (translation of lshal.c)
yesterday.
It seems to work for me. http://www.theo.ch/lazarus/dbus/hal.tar.gz



Re: DBus interface needs an update

Florian Klämpfl
theo wrote:
> If somebody is still interested in HAL (I know that DeviceKit is coming):
> I've made a header translation and a demo (translation of lshal.c)
> yesterday.
> It seems to work for me. http://www.theo.ch/lazarus/dbus/hal.tar.gz

Please create and attach it to an issue report so we don't forget it.

Re: DBus interface needs an update

theo-6
Florian Klaempfl wrote:
> Please create and attach it to an issue report so we don't forget it.
>  

You mean using the bugtracker?

If you like, you could also add the files to packages/dbus

What do you prefer?

Regards
Theo


Re: DBus interface needs an update

Vincent Snijders-2
theo wrote:
> Florian Klaempfl wrote:
>> Please create and attach it to an issue report so we don't forget it.
>>  
>
> You mean using the bugtracker?
>
> If you like, you could also add the files to packages/dbus
>

But if they are not added to the bug tracker, chances are that they will be
forgotten, before somebody has added them to packages/dbus.

> What do you prefer?

Vincent