Porter Stemming for FPC 2.0

Alan Mead
It's great that 2.0 is out!  Unfortunately it seems to break some
code I used for Porter Stemming because the code sometimes reads data
from a pchar at negative indexes.  Reportedly this works fine in
Delphi 5 and I don't seem to have trouble with Delphi 7 but it
generates RTEs using fpc 2.0.  (If this is a FAQ, forgive me, I've
been away from Free Pascal for a while...)

So, here is a file containing my patched code and a test program that
seems to run fine:

http://www.alanmead.org/downloads/fpc2_PorterStem.zip

The homepage for this algorithm, which may offer my fix someday, is
here:

http://www.tartarus.org/~martin/PorterStemmer/

-Alan
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: Porter Stemming for FPC 2.0

Matt Emson
> It's great that 2.0 is out!  Unfortunately it seems to break some
> code I used for Porter Stemming because the code sometimes reads data
> from a pchar at negative indexes.  Reportedly this works fine in
> Delphi 5 and I don't seem to have trouble with Delphi 7 but it
> generates RTEs using fpc 2.0.  (If this is a FAQ, forgive me, I've
> been away from Free Pascal for a while...)

Reading PChars at negative indexes? Buffer underrun in other words... This
is absolutely not a good thing. If FPC is preventing buffer underruns and
overruns, then it is actually right, for once, and Delphi is wrong,
wrong, wrong!

A question... how do you know the memory at the negative index is valid?
Various factors (memory management, record alignment and poor consistency
in long-term projects) can drastically alter what you are reading, even if
you *believe* you know what it is. This is the kind of horror story I see
sometimes in legacy code that makes me wonder how the darn thing _ever_
worked all these years. You ask the guy maintaining it and he goes all
mystical - "It just works, but no one remembers why.. we dare not change
it because last time somebody did anything to it the entire project A/V'd
every 30 seconds and died in a puff of green smoke."

M

M

Re: Porter Stemming for FPC 2.0

Lance Boyle
On Sep 16, 2005, at 4:55 PM, memsom wrote:

>> It's great that 2.0 is out!  Unfortunately it seems to break some
>> code I used for Porter Stemming because the code sometimes reads data
>> from a pchar at negative indexes.  Reportedly this works fine in
>> Delphi 5 and I don't seem to have trouble with Delphi 7 but it
>> generates RTEs using fpc 2.0.  (If this is a FAQ, forgive me, I've
>> been away from Free Pascal for a while...)
>>
>
> Reading PChars at negative indexes? Buffer underrun in other words... This
> is absolutely not a good thing. If FPC is preventing buffer underruns and
> overruns, then it is actually right, for once, and Delphi is wrong,
> wrong, wrong!
>
> A question... how do you know the memory at the negative index is valid?
> Various factors (memory management, record alignment and poor consistency
> in long-term projects) can drastically alter what you are reading, even if
> you *believe* you know what it is. This is the kind of horror story I see
> sometimes in legacy code that makes me wonder how the darn thing _ever_
> worked all these years. You ask the guy maintaining it and he goes all
> mystical - "It just works, but no one remembers why.. we dare not change
> it because last time somebody did anything to it the entire project A/V'd
> every 30 seconds and died in a puff of green smoke."
>
> M
>
> M

This is very interesting. I've always wondered if anyone did this on  
purpose, and I've always wondered what the big deal is with just  
adding array range checking to C. A company with tons of internal  
software development, and whose existence is made miserable by buffer  
under/over flows, could surely pull this off. And surely it wouldn't  
break _that_ much code. For example, Microsoft could change the  
compilers that they use internally, and any programmers found to be  
depending on the persistence of memory "next to" an array would be  
taken out and shot.

I'm sure someone will respond, telling me why this would be a bad  
idea or impossible, but that's my two cents worth.

Lance

Re: Porter Stemming for FPC 2.0

Bugzilla from elio@mixtk.com
On Fri 16 Sep 2005 21:06, Lance Boyle wrote:
> This is very interesting. I've always wondered if anyone did this on
> purpose, and I've always wondered what the big deal is with just
> adding array range checking to C. A company with tons of internal
> software development, and whose existence is made miserable by buffer
> under/over flows, could surely pull this off. And surely it wouldn't
> break _that_ much code. For example, Microsoft could change the
> compilers that they use internally, and any programmers found to be
> depending on the persistence of memory "next to" an array would be
> taken out and shot.

Microsoft already added this feature to its own C compiler (range checking,
not shooting people); if I'm not mistaken, the newest versions of XP have this
enabled. It's a sure thing that Vista is compiled with this.

>
> I'm sure someone will respond, telling me why this would be a bad
> idea or impossible, but that's my two cents worth.

I don't think it's impossible or a bad idea. In fact, major compilers
(including GCC) support range checking; it's just not enabled by default.
It has also been done in hardware, so this is indeed a big deal.

>
> Lance
>

Regards.
Elio

Re: Porter Stemming for FPC 2.0

Alan Mead
In reply to this post by Matt Emson
memsom <[hidden email]> wrote:

> Reading PChars at negative indexes? Buffer underrun in other words... This
> is absolutely not a good thing. If FPC is preventing buffer underruns and
> overruns, then it is actually right, for once, and Delphi is wrong,
> wrong, wrong!

Yeah, it seems dumb.  Well, I'm more a Turbo Pascal than a Delphi
programmer and not a software professional.  I don't actually know
what a pchar is...  I guess it's a pointer to a string that was added
to Delphi to talk to the Windows API?

Anyway, if you look at this guy's code, I'm convinced that he falls
into the "power user" category.  He has no problem writing ASM but he
also provides a "pure Pascal" solution (chosen at compile-time).
And he's apparently benchmarked his code and is aggressively trying to
optimize it.

And it's not a buffer under-run per se.  He's checking the ends of
words against a series of word endings.  He calculates a negative
index when he checks a long ending like '-ization' against a short
word like 'word' ... he calculates that he has to start checking at
character -3 of 'word' [length('word') - length('ization') = 4 - 7
= -3] ... The outcome of this checking is "true, the word ends in the
ending" or "false" and of course 99% of the time it's false.  I think
it's impossible that he could get a wrong result because even if the
garbage at memory p[-3] to p[-1] matches the word ending, the word
itself will not.
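
To make the arithmetic above concrete, here is a minimal sketch in Free Pascal. The program and the function name SuffixStart are hypothetical and are not taken from the stemmer unit; the sketch only computes the 0-based position at which an ending would have to begin, and it reads no memory.

```pascal
program SuffixStartDemo;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of the stemmer unit: the 0-based
  (PChar-style) index at which an ending of SuffixLen characters
  would have to begin inside a word of WordLen characters. }
function SuffixStart(WordLen, SuffixLen: Integer): Integer;
begin
  Result := WordLen - SuffixLen;
end;

begin
  { 'word' is shorter than 'ization': 4 - 7 gives a negative start }
  WriteLn(SuffixStart(Length('word'), Length('ization')));
  { 'organization' is long enough: 12 - 7 gives a valid start }
  WriteLn(SuffixStart(Length('organization'), Length('ization')));
end.
```

A negative result means the ending cannot possibly match, so the comparison could be skipped without looking at the word at all.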

So, I mention all this because it is an obscure point of
incompatibility between FPC 2.0 and Delphi 5-7 (and FPC 1.x) ... In
my case, this code worked fine and then it broke... just turning off
range-checking isn't an answer for me, as I need $R+ to catch my own
errors.  Luckily, it was easy enough to wrap IF statements around
these bits.
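
The kind of guard described above might look roughly like the following. This is only a sketch under stated assumptions: EndsWith and its parameters are hypothetical names, the patched code in the zip file is not reproduced here, and the real unit may structure the comparison quite differently.

```pascal
program GuardedSuffixDemo;
{$mode objfpc}{$H+}
{$R+}  { range checking stays enabled, as the post requires }

{ Hypothetical helper, not taken from the stemmer unit.
  Returns True if the first WordLen characters at P end with Suffix. }
function EndsWith(P: PChar; WordLen: Integer; const Suffix: string): Boolean;
var
  Start, I: Integer;
begin
  Result := False;
  Start := WordLen - Length(Suffix);
  { The guard: an ending longer than the word can never match,
    so return before any negative index is used. }
  if Start < 0 then
    Exit;
  for I := 1 to Length(Suffix) do
    if P[Start + I - 1] <> Suffix[I] then
      Exit;
  Result := True;
end;

begin
  WriteLn(EndsWith('word', 4, 'ization'));          { ending longer than word }
  WriteLn(EndsWith('organization', 12, 'ization')); { ending fits and matches }
end.
```

Because the early exit fires before P is touched, the result is the same as the unguarded comparison would have produced on "garbage" memory, but no address in front of the buffer is ever read.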

> A question... how do you know the memory at the negative index is
> valid?

I've explained why garbage won't goose the algorithm.  As to why it
does not GPF, I suppose this pchar is always pointed into the
"middle" of the data segment and the negative indices are always
single digits.  In the zip file containing my fix, I included the
test program that drives this guy's unit and the test data.  It
compiles and runs fine in Delphi 7 (you may have to comment out the
"{$mode DELPHI}") on the test data.

-Alan

Re: Porter Stemming for FPC 2.0

Jonas Maebe-2

On 19 Sep 2005, at 17:40, Alan Mead wrote:


>  I think it's
> impossible that he could get a wrong result because even if the
> garbage at memory p[-3] to p[-1] matches the word ending, the word
> itself will not.
>

There's not necessarily garbage at those locations, those addresses  
may simply not be mapped (i.e. cause an access violation/general  
protection fault when accessed).


Jonas
