fast text processing

Jeff Pohlmeyer
> > this kludge is about 25% faster than your perl script
> > on my machine....

> Nope. It's still more or less twice slower. :-D


I guess it depends on the hardware:

% time koleksi.pl   # perl
Word count: 126944
Unique word count: 11793

real    0m1.019s
user    0m0.992s
sys     0m0.028s


% time koleksi   # fpc
Word count:126944
Unique word count:11793

real    0m0.817s
user    0m0.784s
sys     0m0.020s


AMD-K6-700 / SuSE-10.3 / Linux-2.6.22  / perl-5.8.8 / fpc-2.2.0


 - Jeff
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: fast text processing

Vincent Snijders
Jeff Pohlmeyer wrote:

>>> this kludge is about 25% faster than your perl script
>>> on my machine....
>
>> Nope. It's still more or less twice slower. :-D
>
>
> I guess it depends on the hardware:
>
> % time koleksi.pl   # perl
> Word count: 126944
> Unique word count: 11793
>
> real    0m1.019s
> user    0m0.992s
> sys     0m0.028s
>
>
> % time koleksi   # fpc
> Word count:126944
> Unique word count:11793
>
> real    0m0.817s
> user    0m0.784s
> sys     0m0.020s
>
>
> AMD-K6-700 / SuSE-10.3 / Linux-2.6.22  / perl-5.8.8 / fpc-2.2.0
>
>

Thanks, Jeff, for writing that parser code; I am not good at doing that.

I made it three times as fast on my computer (Windows 2000, fpc 2.3.1, P4 1.5 GHz)
by using a hash table for the unique word count. Using a larger text buffer gave an
additional 10% speedup:

program project1;
{$MODE OBJFPC} {$H+}

uses classes, strings, contnrs;

const
   bufsize = $1FFF;

var
   f: text;
   s:ansistring;
   wc:longint=0;
   wl:TStringList;
   uhl: TFPStringHashTable;
   i,n:LongInt;
   textbuf: array[0..bufsize-1] of byte;

begin
   assign(f, 'Koleksi.dat');
   reset(f);
   SetTextBuf(f, textbuf, sizeof(textbuf));
   wl:=TStringList.Create();
   uhl:=TFPStringHashTable.Create;
   while not eof(f) do begin
     readln(f,s);
     n:=length(s);
     if (n>0) then begin
       StrLower(@s[1]);
       if (s[1]='<') then begin
         if StrLComp(@s[1], '<title>',7) = 0 then begin
           delete(s,1,7);
           n:=length(s);  // refresh n: the line is now 7 characters shorter
         end else continue;
       end;
       for i:=1 to n do if not (s[i] in ['a'..'z','0'..'9']) then begin
         if ( s[i] <> '<' ) then begin
           s[i]:=#10
         end else begin
           s[i]:=#0;
           SetLength(s,StrLen(@s[1]));
           break;
         end;
       end;
       wl.Text:=s;
       for i:=0 to wl.Count-1 do begin
         s:=wl[i];
         for n:=1 to length(s) do if (s[n] in ['0'..'9']) then begin
           s:='';
           break;
         end;
         if (s<>'') then begin
           inc(wc);
           if uhl.Find(s) = nil then
             uhl.Add(s,'');
         end;
       end;
     end;
   end;
   close(f);
   WriteLn('Word count:',wc, #10'Unique word count:', uhl.Count);
end.

Re: fast text processing

Bee-6
In reply to this post by Jeff Pohlmeyer
> AMD-K6-700 / SuSE-10.3 / Linux-2.6.22  / perl-5.8.8 / fpc-2.2.0

Probably because of the different fpc version, no? I'm using fpc 2.0.4.
However, this is good news. :)

-Bee-

has Bee.ography at:
http://beeography.wordpress.com


Re: fast text processing

Bee-6
In reply to this post by Vincent Snijders
> I made it three times as fast on my computer (windows 2000, fpc 2.3.1,
> P4 1.5 Ghz) using a hashlist for the unique word count. Using a larger
> textbuf gave an additional 10% speed up:

Arrrggghhhh, I hate myself for not being able to upgrade to fpc v.2.2.0! I
can't find TFPStringHashTable in fpc v.2.0.4! :((

-Bee-

has Bee.ography at:
http://beeography.wordpress.com

Re: fast text processing

Graeme Geldenhuys-2
In reply to this post by Vincent Snijders
Well done Vincent!!  :-)  I can confirm your results...


graemeg@graemeg:word_parser$ time ./project1
Word count:126944
Unique word count:11793

real    0m0.185s
user    0m0.140s
sys     0m0.000s

graemeg@graemeg:word_parser$ time perl project1.perl
Word count: 126944
Unique word count: 11793

real    0m0.281s
user    0m0.244s
sys     0m0.016s


Hardware:  Intel P4 CPU 2.40GHz with 1Gig RAM
FPC Compiler:   v2.2.0


Regards,
  - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/

Re: fast text processing

Bee-6
> graemeg@graemeg:word_parser$ time ./project1
> Word count:126944
> Unique word count:11793
>
> real    0m0.185s
> user    0m0.140s
> sys     0m0.000s
>
> graemeg@graemeg:word_parser$ time perl project1.perl
> Word count: 126944
> Unique word count: 11793
>
> real    0m0.281s
> user    0m0.244s
> sys     0m0.016s

Vincent said it was 3 times faster. I expected the result would be about
0.10s. Or am I wrong?

-Bee-

has Bee.ography at:
http://beeography.wordpress.com

Re: fast text processing

Graeme Geldenhuys-2
On 31/10/2007, Bee <[hidden email]> wrote:
>
> Vincent said it was 3 times faster. I expected the result would be about
> 0.10s. Or am I wrong?

Maybe that's machine dependent... I'll try the one without the hash
table to see the difference.  Otherwise, let's just compare the sys
time and say it's 16x faster. ;-)


Regards,
  - Graeme -



Re: fast text processing

Vincent Snijders
In reply to this post by Bee-6
Bee wrote:

>> graemeg@graemeg:word_parser$ time ./project1
>> Word count:126944
>> Unique word count:11793
>>
>> real    0m0.185s
>> user    0m0.140s
>> sys     0m0.000s
>>
>> graemeg@graemeg:word_parser$ time perl project1.perl
>> Word count: 126944
>> Unique word count: 11793
>>
>> real    0m0.281s
>> user    0m0.244s
>> sys     0m0.016s
>
> Vincent said it was 3 times faster. I expected the result would be about
> 0.10s. Or am I wrong?

Maybe I have a relatively slow computer, so I get more speedup. Keep in mind that
disk time is constant.

So for example, total time is disk time + processing time
For me:
StringList: 10 + 50 = 60
Hashtable: 10 + 10 = 20
Speedup: 3 times

For Graeme with his faster computer (2x processor time):
StringList: 10 + 25 = 35
Hashtable: 10 + 5 = 15
Speedup: 2.3

Please choose appropriate numbers to get to the result :-)
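
The arithmetic above can be written out as a tiny program. This is only a sketch of the model (constant disk time plus processing time); the function name Speedup is made up for illustration and the figures are the example numbers from this message, not measurements.

```pascal
program speedup_model;
{$MODE OBJFPC}

// Total run time is modelled as constant disk time plus processing time.
// Speedup is the ratio of the slow total to the fast total.
function Speedup(disk, procSlow, procFast: double): double;
begin
  Result := (disk + procSlow) / (disk + procFast);
end;

begin
  // Slower machine: StringList 10 + 50, hash table 10 + 10.
  WriteLn('Slow CPU: ', Speedup(10, 50, 10):0:2);
  // Faster machine (processing takes half as long): 10 + 25 versus 10 + 5.
  WriteLn('Fast CPU: ', Speedup(10, 25, 5):0:2);
end.
```

With these inputs the ratios are 60/20 = 3.00 and 35/15 = 2.33, so the faster the CPU, the more the constant disk time dominates and the smaller the visible speedup.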

Vincent

Re: fast text processing

Vincent Snijders
In reply to this post by Bee-6
Bee wrote:

>> graemeg@graemeg:word_parser$ time ./project1
>> Word count:126944
>> Unique word count:11793
>>
>> real    0m0.185s
>> user    0m0.140s
>> sys     0m0.000s
>>
>> graemeg@graemeg:word_parser$ time perl project1.perl
>> Word count: 126944
>> Unique word count: 11793
>>
>> real    0m0.281s
>> user    0m0.244s
>> sys     0m0.016s
>
> Vincent said it was 3 times faster. I expected the result would be about
> 0.10s. Or am I wrong?
>

It was three times faster than Jeff's string list version. I don't have a perl
interpreter :-).

Vincent

Re: fast text processing

Graeme Geldenhuys-2
In reply to this post by Vincent Snijders
On 31/10/2007, Vincent Snijders <[hidden email]> wrote:
>
> Maybe I have a relatively slow computer, so I get more speedup. Keep in mind, that
> disk time is constant.
>

I'm also not sure if FPC compiler parameters were used. I did.
I compiled as:    fpc project1.pas


Anyway, here are the Hash Table vs No Hash Table results.  Quite a
difference in speed when using the hash table.


graemeg@graemeg:word_parser$ time ./project1_nohashtable
Word count:126944
Unique word count:11793

real    0m0.291s
user    0m0.276s
sys     0m0.004s


graemeg@graemeg:word_parser$ time ./project1
Word count:126944
Unique word count:11793

real    0m0.196s
user    0m0.132s
sys     0m0.008s


graemeg@graemeg:word_parser$ time perl ./project1.perl
Word count: 126944
Unique word count: 11793

real    0m0.292s
user    0m0.268s
sys     0m0.000s
graemeg@graemeg:word_parser$
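
For anyone wondering what the two approaches look like side by side, here is a minimal sketch. It assumes the string-list variant deduplicates through a sorted TStringList with dupIgnore, which may not match the actual project1_nohashtable source; the word list and both function names are invented for the example.

```pascal
program unique_count;
{$MODE OBJFPC} {$H+}

uses classes, contnrs;

const
  // Hypothetical sample input: six words, three of them distinct.
  Words: array[0..5] of string = ('satu', 'dua', 'satu', 'tiga', 'dua', 'satu');

// Deduplicate with a sorted string list (binary search on every Add).
function CountWithStringList: longint;
var
  sl: TStringList;
  i: longint;
begin
  sl := TStringList.Create;
  try
    sl.Sorted := True;
    sl.Duplicates := dupIgnore;   // repeated words are silently dropped
    for i := Low(Words) to High(Words) do
      sl.Add(Words[i]);
    Result := sl.Count;
  finally
    sl.Free;
  end;
end;

// Deduplicate with the hash table used in project1.
function CountWithHashTable: longint;
var
  ht: TFPStringHashTable;
  i: longint;
begin
  ht := TFPStringHashTable.Create;
  try
    for i := Low(Words) to High(Words) do
      if ht.Find(Words[i]) = nil then
        ht.Add(Words[i], '');
    Result := ht.Count;
  finally
    ht.Free;
  end;
end;

begin
  WriteLn('String list: ', CountWithStringList);
  WriteLn('Hash table:  ', CountWithHashTable);
end.
```

Both functions should report the same count; the difference is only in how much work each lookup costs as the number of unique words grows.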




Regards,
  - Graeme -



Re: fast text processing

Marco van de Voort
> On 31/10/2007, Vincent Snijders <[hidden email]> wrote:
> >
> > Maybe I have a relatively slow computer, so I get more speedup. Keep in mind, that
> > disk time is constant.
> >
>
> I'm also not sure if FPC compiler parameters where used. I did.
> I compiled as:    fpc project1.pas

It could be wise to add -O3 for anything considered a benchmark :-)

Re: fast text processing

Graeme Geldenhuys-2
On 31/10/2007, Marco van de Voort <[hidden email]> wrote:
>
> It could be wise to add -O3 for anything considered a benchmark :-)


It squeezed another 0.015s out of the time making it even faster. :-)


Regards,
  - Graeme -



Re: fast text processing

Florian Klämpfl
In reply to this post by Vincent Snijders
Vincent Snijders wrote:

> Jeff Pohlmeyer wrote:
>>>> this kludge is about 25% faster than your perl script
>>>> on my machine....
>>
>>> Nope. It's still more or less twice slower. :-D
>>
>>
>> I guess it depends on the hardware:
>>
>> % time koleksi.pl   # perl
>> Word count: 126944
>> Unique word count: 11793
>>
>> real    0m1.019s
>> user    0m0.992s
>> sys     0m0.028s
>>
>>
>> % time koleksi   # fpc
>> Word count:126944
>> Unique word count:11793
>>
>> real    0m0.817s
>> user    0m0.784s
>> sys     0m0.020s
>>
>>
>> AMD-K6-700 / SuSE-10.3 / Linux-2.6.22  / perl-5.8.8 / fpc-2.2.0
>>
>>
>
> Thanks Jeff, for writing that parser code, I am not good in doing that.
>
> I made it three times as fast on my computer (windows 2000, fpc 2.3.1,
> P4 1.5 Ghz) using a hashlist for the unique word count. Using a larger
> textbuf gave an additional 10% speed up:
>
> program project1;
> {$MODE OBJFPC} {$H+}
>
> uses classes, strings, contnrs;
>
> const
>   bufsize = $1FFF;
>
> var
>   f: text;
>   s:ansistring;
>   wc:longint=0;
>   wl:TStringList;
>   uhl: TFPStringHashTable;
>   i,n:LongInt;
>   textbuf: array[0..bufsize-1] of byte;
>
> begin
>   assign(f, 'Koleksi.dat');
>   reset(f);
>   SetTextBuf(f, textbuf, sizeof(textbuf));
>   wl:=TStringList.Create();
>   uhl:=TFPStringHashTable.Create;
>   while not eof(f) do begin
>     readln(f,s);
>     n:=length(s);
>     if (n>0) then begin
>     StrLower(@s[1]);
>       if (s[1]='<') then begin
>         if StrLComp(@s[1], '<title>',7) = 0 then begin
>           delete(s,1,7);
>         end else continue;
>       end;
>       for i:=1 to n do if not (s[i] in ['a'..'z','0'..'9']) then begin
>         if ( s[i] <> '<' ) then begin
>           s[i]:=#10
>         end else begin
>           s[i]:=#0;
>           SetLength(s,StrLen(@s[1]));

Why not SetLength(s,i)? StrLen is _very_ expensive. I don't see how
another #0 could occur earlier in the string.
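
A small sketch of the point, under the assumption that i is the index of the first '<' and that no #0 occurs before it. One detail worth noting: since s[i] itself becomes the #0, the length that StrLen returns is i-1, so the direct call that gives the same string would be SetLength(s,i-1). The two function names below are made up for the comparison.

```pascal
program truncate_at_tag;
{$MODE OBJFPC} {$H+}

uses strings;

// Original approach: write a terminator at i, then rescan to find the length.
function CutWithStrLen(s: ansistring; i: longint): ansistring;
begin
  s[i] := #0;
  SetLength(s, StrLen(@s[1]));   // StrLen walks the string up to the #0
  Result := s;
end;

// Direct approach: i is already known, so no scan is needed.
function CutDirect(s: ansistring; i: longint): ansistring;
begin
  SetLength(s, i - 1);           // keep everything before position i
  Result := s;
end;

var
  line: ansistring;
begin
  // 'hello' (1..5), #10 (6), 'world' (7..11), '<' at position 12.
  line := 'hello'#10'world</title>';
  WriteLn(CutWithStrLen(line, 12) = CutDirect(line, 12));
end.
```

Both functions should return the same 11-character string, so the comparison is expected to print TRUE.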

Re: fast text processing

Vincent Snijders
Florian Klaempfl wrote:
> Vincent Snijders wrote:
>> Jeff Pohlmeyer wrote:
>>           s[i]:=#0;
>>           SetLength(s,StrLen(@s[1]));
>
> Why not SetLength(s,i)? StrLen is _very_ expensive. I don't see a way
> how another #0 can be before.

That is right, I am working on a version which does not do that anymore.

Vincent

Re: fast text processing

L-9
In reply to this post by Bee-6
> > Word count: 126944
> > Unique word count: 11793
> >
> > real    0m0.281s
> > user    0m0.244s
> > sys     0m0.016s
>

Can someone do a test for 5 minutes of parsing and see if things slow down or
speed up for one of the programs?

That takes away process load time too..

example: the time it takes to fork the process.

Not sure if perl scripts are initially faster since you don't have to fork a
process, assuming perl is already in memory waiting for the script.
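
One way to take process start-up out of the picture, sketched here as an assumption rather than something anyone in the thread ran: repeat the work inside a single process and average. ParseOnce is a hypothetical stand-in that only counts lines; the real parser body would go there, and Runs is an arbitrary repetition count. The program expects Koleksi.dat in the current directory.

```pascal
program repeat_timing;
{$MODE OBJFPC} {$H+}

uses sysutils;

const
  Runs = 100;  // hypothetical repetition count; raise it for longer tests

// Hypothetical stand-in for the real parser: just counts the lines of a file.
function ParseOnce(const FileName: string): longint;
var
  f: text;
  s: ansistring;
begin
  Result := 0;
  assign(f, FileName);
  reset(f);
  while not eof(f) do begin
    readln(f, s);
    inc(Result);
  end;
  close(f);
end;

var
  started: TDateTime;
  i: longint;
  total: longint = 0;
  elapsed: double;
begin
  started := Now;
  for i := 1 to Runs do
    inc(total, ParseOnce('Koleksi.dat'));
  // TDateTime is measured in days, so scale the difference to seconds.
  elapsed := (Now - started) * SecsPerDay;
  WriteLn('Lines read in total: ', total);
  WriteLn('Average seconds per run: ', (elapsed / Runs):0:4);
end.
```

After the first pass the file will normally sit in the OS cache, so this mostly measures processing time, not disk or start-up time.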


Re: fast text processing

Vincent Snijders
In reply to this post by Florian Klämpfl
Florian Klaempfl wrote:
> Vincent Snijders wrote:
>
> Why not SetLength(s,i)? StrLen is _very_ expensive. I don't see a way
> how another #0 can be before.

No more strlen:
http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=1432

Vincent

Re: fast text processing

Bugzilla from daniel.mantione@freepascal.org


On Wed, 31 Oct 2007, Vincent Snijders wrote:

> Florian Klaempfl wrote:
> > Vincent Snijders wrote:
> >
> > Why not SetLength(s,i)? StrLen is _very_ expensive. I don't see a way
> > how another #0 can be before.
>
> No more strlen:
> http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=1432

One more possible speedup: Why are you using strlower and strlcomp instead
of lowercase/pos? The latter ones are probably faster and they don't care
about code pages, which are not relevant in this example.
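
Roughly what that would look like, as an untimed sketch; the function name StartsWithTitle is invented for the example, and whether it really beats StrLower/StrLComp would need measuring.

```pascal
program title_check;
{$MODE OBJFPC} {$H+}

uses sysutils;

// Returns True when the lowercased line starts with '<title>'.
// LowerCase only touches ASCII letters, which is all this data needs.
function StartsWithTitle(const line: ansistring): boolean;
begin
  Result := Pos('<title>', LowerCase(line)) = 1;
end;

begin
  WriteLn(StartsWithTitle('<TITLE>Some Heading</TITLE>'));
  WriteLn(StartsWithTitle('plain text line'));
end.
```

Note that Pos keeps scanning the whole line when the tag is not at the start, so in a real loop it may be worth lowercasing once per line and checking the first character before calling Pos.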

Daniël

Re: fast text processing

Graeme Geldenhuys-2
In reply to this post by Vincent Snijders
On 01/11/2007, Vincent Snijders <[hidden email]> wrote:
>
> No more strlen:
> http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=1432


Wow, that version improved quite a bit from the previous one!!

graemeg@graemeg:word_parser$ time ./project1_fast
Word count:126944
Unique word count:11793

real    0m0.107s
user    0m0.100s
sys     0m0.000s


graemeg@graemeg:word_parser$ time perl ./project1.perl
Word count: 126944
Unique word count: 11793

real    0m0.271s
user    0m0.248s
sys     0m0.008s



Regards,
  - Graeme -



Re: fast text processing

L-9


> >
> > No more strlen:
> > http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=1432
>

This doesn't work if you have spaces in front of the < tags >

  <sometag>
      <sometag>

I'm not sure if the Perl one fails too though.
I don't have perl installed and can't test it ;-)

A real parser doesn't care  about whitespace in front.
And will be a bit slower.. because of that check.
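
A sketch of the extra check, assuming only spaces and tabs count as leading whitespace. Both helper names are hypothetical, and this is not taken from the pastebin version.

```pascal
program skip_leading_space;
{$MODE OBJFPC} {$H+}

// Returns the index of the first character that is not a space or tab,
// or length(s)+1 when the line holds only whitespace.
function FirstNonBlank(const s: ansistring): longint;
begin
  Result := 1;
  while (Result <= length(s)) and (s[Result] in [' ', #9]) do
    inc(Result);
end;

// True when the first non-blank character opens a tag.
function StartsWithTag(const s: ansistring): boolean;
var
  i: longint;
begin
  i := FirstNonBlank(s);
  Result := (i <= length(s)) and (s[i] = '<');
end;

begin
  WriteLn(StartsWithTag('<sometag>'));
  WriteLn(StartsWithTag('      <sometag>'));
  WriteLn(StartsWithTag('   plain words'));
end.
```

The cost is one short scan per line, which is the "bit slower" part mentioned above.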

L505

Re: fast text processing

S. Fisher

--- L <[hidden email]> wrote:

> > > No more strlen:
> > > http://www.hu.freepascal.org/fpcircbot/cgipastebin?msgid=1432
> >
>
> This doesn't work if you have spaces in front of the < tags >
>
>   <sometag>
>       <sometag>
>
> I'm not sure if the Perl one fails too though.
> I don't have perl installed and can't test it ;-)
>
> A real parser doesn't care  about whitespace in front.
> And will be a bit slower.. because of that check.

{$MODE OBJFPC} {$H+}

uses sysutils, strings, contnrs;


const
  chars : set of char = ['a'..'z','0'..'9'];

var
  f: text;
  line : ansistring;
  p, pword : pchar;
  saved: char;
  wc : longint;
  counting, good : boolean;
  unique: TFPStringHashTable;
  textbuf: array[1..4096] of byte;
  when : tDateTime;

function do_tag( var s: ansistring; var p: pchar):boolean;
var
  pword: pchar;
begin
  pword := p;
  while p^ <> '>' do
    inc(p);
  p^ := #0;
  result := ('<title'=pword) or ('<text'=pword);
end;

 
begin
  when := time;

  assign(f, 'Koleksi.dat');
  reset(f);
  SetTextBuf(f, textbuf, sizeof(textbuf));
  wc := 0;  counting := false;
  unique := TFPStringHashTable.Create;
  while not eof(f) do
  begin
    readln(f, line );
    if '' = line then continue;
    line := lowercase( line );
    p := pchar( line );
    repeat
      // Skip junk.
      while (p^ <> #0) and (not (p^ in chars)) do
      begin
        if '<' = p^ then
          counting := do_tag( line, p );
        inc(p);
      end;
      // Build word.
      pword := p;
      good := true;
      while p^ in chars do
      begin
        if not (p^ in ['a'..'z']) then good := false;
        inc(p);
      end;
      if counting and good then
        if pword <> p then
        begin
          saved := p^;
          p^ := #0;
          inc( wc );
          if unique.Find( pword) = nil then
            unique.Add( pword,'');
          p^ := saved;
        end
    until #0 = p^;
  end;

  close(f);
  writeln( ((time-when)*secsPerDay):0:3 );
  WriteLn('Word count:',wc, #10'Unique word count:', unique.Count);
end.
{
Word count: 126944
Unique word count: 11793
}

