Split stream into words

Michael Van Canneyt

Hi,

What's the easiest way to split a stream into words?
Words are just that: words, but - here is the caveat - they must support unicode.
So Michael and Michaël are both words.

I tried the regexpr unit (the obvious choice), but that does not seem to do the trick:

{$mode objfpc}
{$H+}
uses cwstring, sysutils, classes, regexpr;

Var
   Split : TStringList;
   S : String;
   R : TRegexpr;

begin
   Split:=TStringList.Create;
   Split.LoadFromFile(ParamStr(1));
   S:=Split.Text;
   Split.Clear;
   r := TRegExpr.Create;
   try
     r.Expression :='[\w]+';
     r.Split (S, Split);
     for S in Split do
       Writeln('Found: ',S);
   finally
     r.Free;
   end;
end.

This simply prints nonsense...

Michael.
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: Split stream into words

Michael Van Canneyt


On Tue, 3 Jul 2018, Michael Van Canneyt wrote:

>
> Hi,
>
> What's the easiest way to split a stream into words ?
> Words are just that: words, but - here is the caveat - they must support
> unicode.
> So Michael and Michaël are both words.
>
> Tried regexpr unit (the obvious choice), but that does not seem to do the
> trick:

Correction: regexpr can handle it if you compile it for unicode and use the
correct regexp...

Michael.

Re: Split stream into words

Marco van de Voort
In reply to this post by Michael Van Canneyt
In our previous episode, Michael Van Canneyt said:
>
> What's the easiest way to split a stream into words ?

Doesn't strutils have some word extraction and count functions?


Re: Split stream into words

Michael Van Canneyt


On Tue, 3 Jul 2018, Marco van de Voort wrote:

> In our previous episode, Michael Van Canneyt said:
>>
>> What's the easiest way to split a stream into words ?
>
> Doesn't strutils have some word extraction and count functions?

It does: WordCount and ExtractWord, but they are very inefficient.
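
For illustration (not part of the original post), a minimal sketch of word extraction with strutils. The delimiter set and the sample text are assumptions made for the example. As far as I can tell from the strutils sources, each ExtractWord call locates word N by scanning from the start of the string again, which is where the inefficiency on large inputs comes from.

```pascal
{$mode objfpc}
{$H+}
// Sketch only: the delimiter set and the sample text are assumptions.
uses sysutils, strutils;

const
  // ASCII delimiters only; the bytes of UTF-8 encoded letters such as
  // 'ë' are never in this set, so they stay inside their word.
  Delims = [' ', ',', '.', ';', ':', '!', '?', #9, #10, #13];

var
  S : String;
  I, N : Integer;

begin
  S:='Michael and Michaël are both words.';
  // WordCount scans the whole string once ...
  N:=WordCount(S,Delims);
  // ... and every ExtractWord(I,...) call has to find word I again,
  // so the loop as a whole does far more scanning than a single pass.
  for I:=1 to N do
    Writeln('Found: ',ExtractWord(I,S,Delims));
end.
```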

Michael.

Re: Split stream into words

Marco van de Voort
In our previous episode, Michael Van Canneyt said:
>
> > In our previous episode, Michael Van Canneyt said:
> >>
> >> What's the easiest way to split a stream into words ?
> >
> > Doesn't strutils have some word extraction and count functions?
>
> It does: WordCount and ExtractWord, but they are very inefficient.


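// Needs strutils (PosEx), sysutils (Trim) and classes (TStringList).
// Splits s on the single delimiter c and trims each piece.
// The caller owns the returned list and must free it.
function splitstring(const s:string;c:char):TStringList;

var i,i2,j : integer;
    x : string;
begin
  result:=TStringlist.create;
  i:=0;                            // position of the previous delimiter (0: none yet)
  repeat
    j:=PosEx(c,s,i+1);             // next delimiter after i, or 0 if there is none
    i2:=j;
    if i2=0 then i2:=length(s)+1;  // no delimiter left: the last piece runs to the end
    x:=trim(copy(s,i+1,i2-i-1));
    result.add(x);
    i:=j;
  until j=0;
end;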

AFAIK I must also have a variant using PosSet somewhere. In another variant
I use a class around an array of string, which keeps a count of valid
entries. This avoids SetLength calls on repeated use. All fairly trivial.
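
For illustration (not part of the original post), a rough sketch of what such a class could look like. TWordBuffer and all of its members are hypothetical names made up for this example. The point is only that Clear resets the count and keeps the array, so refilling the buffer does not reallocate.

```pascal
{$mode objfpc}
{$H+}
// Sketch only: TWordBuffer is a hypothetical class written for this example.
uses sysutils;

type
  TWordBuffer = class
  private
    FItems : array of String;
    FCount : Integer;
    function GetItem(Index: Integer): String;
  public
    procedure Clear;
    procedure Add(const S: String);
    property Count : Integer read FCount;
    property Items[Index: Integer] : String read GetItem; default;
  end;

function TWordBuffer.GetItem(Index: Integer): String;
begin
  // Entries at or beyond FCount are stale and must not be handed out.
  if (Index<0) or (Index>=FCount) then
    raise ERangeError.CreateFmt('Index %d out of bounds',[Index]);
  Result:=FItems[Index];
end;

procedure TWordBuffer.Clear;
begin
  // Keep the allocated array; only forget how many entries are valid.
  FCount:=0;
end;

procedure TWordBuffer.Add(const S: String);
begin
  // Grow only when the existing capacity is used up.
  if FCount=Length(FItems) then
    SetLength(FItems,2*FCount+16);
  FItems[FCount]:=S;
  Inc(FCount);
end;

var
  Buf : TWordBuffer;
  I : Integer;

begin
  Buf:=TWordBuffer.Create;
  try
    Buf.Add('first');
    Buf.Add('line');
    Buf.Clear;           // reuse: the next Add needs no SetLength
    Buf.Add('second');
    for I:=0 to Buf.Count-1 do
      Writeln(Buf[I]);
  finally
    Buf.Free;
  end;
end.
```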

Re: Split stream into words

Michael Van Canneyt


On Tue, 3 Jul 2018, Marco van de Voort wrote:

> AFAIK I must also have a variant using PosSet somewhere. In another variant
> I use a class around an array of string, which keeps a count of valid
> entries. This avoids SetLength calls on repeated use. All fairly trivial.

Trivial indeed, until you need more fine-grained control,
e.g. when C needs to be an array of chars that mark word boundaries.
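
For illustration (not part of the original post), a minimal sketch of a single-pass splitter that takes a set of boundary characters instead of one char. SplitWords and the delimiter set are hypothetical, not something strutils provides. It works on bytes, so it only suits UTF-8 input as long as every delimiter is an ASCII character.

```pascal
{$mode objfpc}
{$H+}
// Sketch only: SplitWords is a hypothetical helper written for this example.
uses sysutils, classes;

// Appends every maximal run of non-delimiter characters in S to Words.
procedure SplitWords(const S: String; const Delims: TSysCharSet; Words: TStrings);
var
  I, Start, Len : Integer;
begin
  Len:=Length(S);
  I:=1;
  while I<=Len do
    begin
    // Skip the delimiters in front of the next word.
    while (I<=Len) and (S[I] in Delims) do
      Inc(I);
    Start:=I;
    // Advance to the first delimiter after the word.
    while (I<=Len) and not (S[I] in Delims) do
      Inc(I);
    if I>Start then
      Words.Add(Copy(S,Start,I-Start));
    end;
end;

var
  Words : TStringList;
  W : String;

begin
  Words:=TStringList.Create;
  try
    SplitWords('Michael, Michaël; and others.',[' ',',',';','.'],Words);
    for W in Words do
      Writeln('Found: ',W);
  finally
    Words.Free;
  end;
end.
```

Unlike splitstring, this drops empty pieces, so consecutive delimiters do not produce empty words.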

But I managed to solve the problem with regexps...

Michael.

Re: Split stream into words

Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
<[hidden email]> wrote:
>
> On Tue, 3 Jul 2018, Marco van de Voort wrote:
> Trivial indeed, till you need more fine-grained control.
> e.g. C needs to be an array of chars that mark word boundaries etc.
>
> But I managed to solve the problem with regexps...

How?

Re: Split stream into words

Michael Van Canneyt


On Tue, 3 Jul 2018, Marcos Douglas B. Santos wrote:

> On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
> <[hidden email]> wrote:
>>
>> On Tue, 3 Jul 2018, Marco van de Voort wrote:
>> Trivial indeed, till you need more fine-grained control.
>> e.g. C needs to be an array of chars that mark word boundaries etc.
>>
>> But I managed to solve the problem with regexps...
>
> How?

I misunderstood how Split works: in that function, the regex is the
'word separator'.

The following correctly gives me all words. Unit uregexpr is the regexpr unit
compiled for unicode.

Michael.

--------------

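{$mode objfpc}
{$H+}
// uregexpr is the regexpr unit compiled for unicode.
uses cwstring, sysutils, classes, uregexpr;

Var
   Split : TStringList;
   S : String;
   R : TRegexpr;
   E : TEncoding;

begin
   // Read the whole input file as UTF-8 text.
   Split:=TStringList.Create;
   try
     E:=TEncoding.UTF8;
     Split.LoadFromFile(ParamStr(1),E);
     S:=Split.Text;
   finally
     Split.Free;
   end;
   R:=TRegExpr.Create;
   try
     // Count punctuation as whitespace so that it never ends up inside a word.
     R.SpaceChars:=R.SpaceChars+'|&@#"''(§^!{})-[]*%`=+/.;:,?';
     R.LineSeparators:=#10;
     // A word: a run of characters that are neither digits nor whitespace.
     R.Expression:='(\b[^\d\s]+\b)';
     // Walk over the matches instead of calling Split, which would treat
     // the expression as the separator.
     if R.Exec(S) then
       repeat
         Writeln('Found: ',System.Copy(S,R.MatchPos[0],R.MatchLen[0]));
       until not R.ExecNext;
   finally
     R.Free;
   end;
end.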
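
For illustration (not part of the original post), a small sketch of the Split behaviour described above, using the stock regexpr unit and an ASCII sample so that no unicode build is needed. SplitJoined is a hypothetical helper that joins the pieces with '|' to keep empty pieces visible.

```pascal
{$mode objfpc}
{$H+}
// Sketch only: SplitJoined is a hypothetical helper written for this example.
uses classes, regexpr;

// Splits Input with Expr as the separator and joins the pieces with '|'.
function SplitJoined(const Expr, Input: String): String;
var
  R : TRegExpr;
  Pieces : TStringList;
  I : Integer;
begin
  Result:='';
  Pieces:=TStringList.Create;
  R:=TRegExpr.Create;
  try
    R.Expression:=Expr;
    R.Split(Input,Pieces);
    for I:=0 to Pieces.Count-1 do
      begin
      if I>0 then
        Result:=Result+'|';
      Result:=Result+Pieces[I];
      end;
  finally
    R.Free;
    Pieces.Free;
  end;
end;

begin
  // The expression matches the words, so Split returns what lies between them.
  Writeln(SplitJoined('[\w]+','one, two; three'));
  // The expression matches the gaps, so Split returns the words.
  Writeln(SplitJoined('[^\w]+','one, two; three'));
end.
```

The first call is the shape of the original attempt: the pieces are the punctuation and spaces between the words, which explains the nonsense output.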

Re: Split stream into words

Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 10:26 AM, Michael Van Canneyt
<[hidden email]> wrote:

> I misunderstood how Split works. The regex is the 'word separator' in that
> function.
>
> The following correctly gives me all words. unit uregexp is the regexp unit
> compiled for unicode.

Thanks.
But, is uregexp part of FPC?

Marcos Douglas

Re: Split stream into words

Michael Van Canneyt


On Tue, 3 Jul 2018, Marcos Douglas B. Santos wrote:

> Thanks.
> But, is uregexp part of FPC?

Not yet, but I intend to make it so.

Michael.

Re: Split stream into words

Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 11:55 AM, Michael Van Canneyt
<[hidden email]> wrote:
>
>> Thanks.
>> But, is uregexp part of FPC?
>
>
> Not yet, but I intend to make it so.

All right! Thanks.

Marcos Douglas

PS. Please, don't forget the XPath Unicode implementation too.
We have talked about it months ago... but I can imagine that your
to-do list is huge.