for loops performance problems?

for loops performance problems?

Anthony Walter-3
I recall earlier this year some people in this mailing list were discussing surprising performance problems with fpc and for loops. I wanted to know if this is still an existing problem as I am experiencing some unusual performance degradation related to a for loop in one of my test applications.

Here is a description of my test application:

http://cache.getlazarus.org/videos/fonts.mp4 (vsync on for recording purposes)

An opengl window which renders example text of various fonts. The user can press a key to cycle through the available fonts to see how they look as textured billboard sprites. The text displays in a few paragraphs.

The performance issue:

Adding a paragraph of sample text greatly reduces the OpenGL frame rate. On some systems, like the Raspberry Pi, the frame rate can drop to 10 frames a second. That seems too low given that it's actually not a lot of geometry (4 vertices or colors per character).

When I turn on geometry buffering, that is, storing the vertex information and then drawing from a user-memory vertex buffer, the frame rate skyrockets to 200+ fps (vsync is off) on a Raspberry Pi.

I think the code to generate the geometry each frame isn't that complex, and I pre-allocate room in my buffer for all the geometry just once, so it seems that doing the calculations for the geometry is what's killing the performance. The calculations are simple multiplications of "Single" type, and I am thinking maybe the "for looping" part is what's degrading performance.

Here is the gist of the loop that generates the text vertex buffer:

https://gist.github.com/sysrpl/8af6e5a9d62cc2f2a1c40f9a9ae13b64

I can convert to static buffers and get good performance (if I know the text isn't changing), but I'm now curious if this specific performance issue is related to fpc's for loop code generation.

What do you think?

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: for loops performance problems?

Karoly Balogh (Charlie/SGR)
Hi,

On Tue, 4 Jul 2017, Anthony Walter wrote:

> I think the code to generate the geometry each frame isn't that complex,
> and I pre-allocate room in my buffer for all the geometry just once, so
> it seems that doing the calculations for the geometry is what's killing
> the performance. The calculations are simple multiplications of "Single"
> type, and I am thinking maybe the "for looping" part is what's degrading
> performance.
>
> Here is the gist of the loop that generates the text vertex buffer:
>
> https://gist.github.com/sysrpl/8af6e5a9d62cc2f2a1c40f9a9ae13b64

Well, first, please provide a compilable and runnable example for further
investigation.

> I can convert to static buffers and get good performance (if I know the
> text isn't changing), but I'm now curious if this specific performance
> issue is related to fpc's for loop code generation. 

No, it's probably the fact that you're doing 10 function calls per glyph
setup in the "World." part of your for loop, each involving its own set
of register save/restore, etc. I'd say that's probably much slower than
any performance degradation which might arise from the fact that fpc
doesn't do SSA in for loops.

But because the example you provided is not compilable, I cannot give
further hints, and the above is just speculation.

Charlie

Re: for loops performance problems?

José Mejuto
In reply to this post by Anthony Walter-3
On 04/07/2017 at 11:09, Anthony Walter wrote:

> I can convert to static buffers and get good performance (if I know the
> text isn't changing), but I'm now curious if this specific performance
> issue is related to fpc's for loop code generation.
> What do you think?

Hello,

AFAIK the problem was/is with some floating-point math, not loops, and
with the partial/full SSA that is missing in fpc.


Re: for loops performance problems?

Peter
In reply to this post by Anthony Walter-3
I usually start performance investigations by compiling with '-al', and
looking at the generated assembler.

Regards,
Peter

P.S. From what we know so far, I'm inclined to agree with Charlie.
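
A small, self-contained way to try the '-al' suggestion is sketched below. The program, its names, and the array size are made up for illustration and are not the asker's code; the point is only to have a float loop short enough that its generated assembler can be read in full.

```pascal
program LoopAsm;
{$mode objfpc}

{ Compile with:  fpc -O2 -al loopasm.pas
  -a keeps the generated assembler file (loopasm.s) instead of deleting it,
  and the 'l' suffix lists the Pascal source lines inside that file. }

const
  Count = 1024;  // hypothetical size, chosen only for the demo

var
  Widths: array[0..Count - 1] of Single;
  Total: Single;
  I: Integer;

begin
  // Fill the array with a known value
  for I := 0 to Count - 1 do
    Widths[I] := 0.5;

  // The loop to look for in loopasm.s: one Single multiply and add per element
  Total := 0;
  for I := 0 to Count - 1 do
    Total := Total + Widths[I] * 2.0;

  WriteLn(Total:0:1);
end.
```

In the resulting .s file, each source line appears as a comment above the instructions generated for it, which makes it easier to see what the loop body compiles to on a given target.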

Re: for loops performance problems?

fredvs
Hello.

Please take a look at this:

https://www.mail-archive.com/fpc-pascal%40lists.freepascal.org/msg46162.html

and this:

http://www.mail-archive.com/mseide-msegui-talk@lists.sourceforge.net/msg11078.html

fpc has a huge problem with floating-point calculation.

It makes fpc not competitive for audio libraries/programs that use DSP.
Decent present-day audio software uses 32-bit float resolution for samples.

Fre;D
Many thanks ;-)

Re: for loops performance problems?

Karoly Balogh (Charlie/SGR)
Hi,

On Wed, 5 Jul 2017, fredvs wrote:

> Please take a look at this:
>
> https://www.mail-archive.com/fpc-pascal%40lists.freepascal.org/msg46162.html
>
> and this:
>
> http://www.mail-archive.com/mseide-msegui-talk@.../msg11078.html
>
> fpc has a huge problem with floating-point calculation.
>
> It makes fpc not competitive for audio libraries/programs that use DSP.
> Decent present-day audio software uses 32-bit float resolution for samples.

This problem has nothing to do with the topic. No, the asker's posted code
is not slow because of this or a similar problem, but because it does 10+
function calls inside a tight loop.

Charlie

Re: for loops performance problems?

Anthony Walter-3
Karoly,

I replaced the calls to World.Vertex/.TexCoord/.Color with a local vertex buffer (an array of TColorTexVertex), eliminating the function calls you mentioned. The frames per second with vsync off is identical, so I'm pretty sure that's not causing the slowdown. It's either that the addition/multiplication of floats given the font map (heights/widths stored in an array) is inefficient, or that there is something about the nature of a for loop that is causing it to be slow.

I will investigate further.
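
For readers following along, here is a minimal sketch of the kind of change described above: vertices are written straight into an array that is allocated once, with no per-vertex method calls. The record layout, the constants, and the fixed glyph size are assumptions for illustration only; the real TColorTexVertex and the font map are not shown in this thread.

```pascal
program VertexFill;
{$mode objfpc}

type
  { Hypothetical layout; the real TColorTexVertex is not shown in the thread. }
  TColorTexVertex = record
    X, Y: Single;      // position
    U, V: Single;      // texture coordinate
    Color: LongWord;   // packed RGBA
  end;
  PColorTexVertex = ^TColorTexVertex;

const
  GlyphCount = 256;    // assumed number of glyphs in the paragraph
  GlyphWidth = 8.0;    // assumed fixed glyph size instead of a font map
  GlyphHeight = 16.0;

var
  { Allocated once; four vertices per glyph. }
  Buffer: array[0..GlyphCount * 4 - 1] of TColorTexVertex;
  P: PColorTexVertex;
  I, Corner: Integer;
  Left: Single;

begin
  P := @Buffer[0];
  for I := 0 to GlyphCount - 1 do
  begin
    Left := I * GlyphWidth;
    for Corner := 0 to 3 do
    begin
      { Corners 1 and 2 sit on the right edge, corners 2 and 3 on the bottom. }
      P^.U := Ord((Corner = 1) or (Corner = 2));
      P^.V := Ord(Corner >= 2);
      P^.X := Left + P^.U * GlyphWidth;
      P^.Y := P^.V * GlyphHeight;
      P^.Color := $FFFFFFFF;
      Inc(P);  // advance to the next vertex slot
    end;
  end;

  { Print one value so the fill loop has an observable result. }
  WriteLn(Buffer[High(Buffer)].X:0:1);
end.
```

Timing this loop on its own, separately from the OpenGL draw call, is one way to tell whether the geometry generation or the submission of the buffer dominates the frame time.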


Re: for loops performance problems?

Karoly Balogh (Charlie/SGR)
Hi,

On Wed, 5 Jul 2017, Anthony Walter wrote:

> I replaced the calls to World.Vertex/.TexCoord/.Color with a local
> vertex buffer (an array of TColorTexVertex), eliminating the function
> calls you mentioned. The frames per second with vsync off is identical,
> so I'm pretty sure that's not causing the slowdown. It's either that
> the addition/multiplication of floats given the font map (heights/widths
> stored in an array) is inefficient, or that there is something about the
> nature of a for loop that is causing it to be slow.

If you still think that loop causes the slowdown, can you post the
generated assembly of it with -al? Otherwise it's really just guesswork.

Also, since you're compiling for ARM, if I'm correct, make sure that:

A. you're using the hardfloat target, and not actually using the softfpu...

B. your data structures are properly aligned, and any underlying records
are *NOT* declared as packed.

C. you're actually doing aligned accesses, so there are no hidden
exceptions involved on the kernel side, handling the load/store of your
values.
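
To illustrate the point about packed records, the sketch below compares two hypothetical vertex records that differ only in the "packed" keyword. The type and field names are invented for the example. In the packed record, the Single fields start at byte offset 1, so they are not 4-byte aligned; the size and offsets of the non-packed record depend on the target and the active {$PACKRECORDS} setting, so the program prints them rather than assuming them.

```pascal
program AlignDemo;
{$mode objfpc}

type
  { Hypothetical vertex records, for illustration only. }
  TPackedVertex = packed record
    Kind: Byte;
    X, Y, Z: Single;   // start right after the byte, at offset 1
  end;

  TNaturalVertex = record
    Kind: Byte;
    X, Y, Z: Single;   // compiler may insert padding before X
  end;

var
  PV: TPackedVertex;
  NV: TNaturalVertex;

begin
  WriteLn('packed  size:     ', SizeOf(TPackedVertex));  // 13 bytes, no padding
  WriteLn('packed  X offset: ', PtrUInt(@PV.X) - PtrUInt(@PV));
  WriteLn('natural size:     ', SizeOf(TNaturalVertex));
  WriteLn('natural X offset: ', PtrUInt(@NV.X) - PtrUInt(@NV));
end.
```

If the X offset of a record used in the vertex buffer is not a multiple of 4, every float load and store in the fill loop is a misaligned access, which is the case item C warns about.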

The other example, which bubbles up again and again, is down to the fact
that FPC doesn't do autovectorization of that example, while other
compilers, mainly LLVM, do. With scalar code, FPC is not that far behind,
if at all.

Charlie