Effective memory allocation

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Effective memory allocation

Xiangrong Fang
Hi All,

I am programming a Bloom Filter and need a high-performance way to allocate and wipe large block of memory.  I did the following test:

program getmem;
{$mode objfpc}{$H+}
uses epiktimer;
const
  SIZE = 1024 * 1024 * 1024;
  CNT = 10;
var
  a: array of Byte;
  p: Pointer;
  et: TEpikTimer;
  i: Integer;
  t1, t2: TimerData;
begin
  et := TEpikTimer.Create(nil);
  et.Clear(t1); et.Clear(t2);
  for i := 1 to CNT do begin
    et.Start(t1);
    p := GetMemory(SIZE);
//    SetLength(a, SIZE);
    et.Stop(t1);
    et.Start(t2);
    FillQWord(p^, SIZE div 8, 0);
//    FillQWord(a[0], SIZE div 8, 0);
    et.Stop(t2);
    FreeMem(p);
//    a := nil;
  end;
  WriteLn('Alloc: ', et.Elapsed(t1) / CNT);
  WriteLn('Clear: ', et.Elapsed(t2) / CNT);
end.

The result is:

Using GetMemory:

Alloc:  9.4078169999999997E-0001
Clear:  2.1342020000000002E-0001

Using SetLength:

Alloc:  2.8100000000000000E-0005
Clear:  7.7497550000000004E-0001

It is understandable that GetMemory() is faster than SetLength(), but why the difference in FillQWord()?

Also, I usually use pointer to pass block of memory to functions.  How do I implement a function with the following signature:

procedure MyProc(var Buf; Len: Integer):

I mean, how to handle "var Buf" inside the procedure body?

Thanks!

Xiangrong

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Xiangrong Fang
Sorry, the results in previous mail was mis-labeled.

The result is:

Using SetLength:

Alloc:  9.4078169999999997E-0001
Clear:  2.1342020000000002E-0001

Using GetMemory:

Alloc:  2.8100000000000000E-0005
Clear:  7.7497550000000004E-0001



_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Nikolay Nikolov-2
In reply to this post by Xiangrong Fang
On 11/02/2014 03:54 PM, Xiangrong Fang wrote:
Hi All,

...

Also, I usually use pointer to pass block of memory to functions.  How do I implement a function with the following signature:

procedure MyProc(var Buf; Len: Integer):

I mean, how to handle "var Buf" inside the procedure body?
You can take the address of the Buf variable by using @Buf. For example:

procedure MyProc(var Buf; Len: Integer);
var
  I: Integer;
  P: PByte;
begin
  P := @Buf;
  for I := 0 to Len - 1 do
    (P+I)^ := 0;
end;

Nikolay

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Sven Barth-2
In reply to this post by Xiangrong Fang

Am 02.11.2014 14:55 schrieb "Xiangrong Fang" <[hidden email]>:
>
> Hi All,
>
> I am programming a Bloom Filter and need a high-performance way to allocate and wipe large block of memory.  I did the following test:
>
> program getmem;
> {$mode objfpc}{$H+}
> uses epiktimer;
> const
>   SIZE = 1024 * 1024 * 1024;
>   CNT = 10;
> var
>   a: array of Byte;
>   p: Pointer;
>   et: TEpikTimer;
>   i: Integer;
>   t1, t2: TimerData;
> begin
>   et := TEpikTimer.Create(nil);
>   et.Clear(t1); et.Clear(t2);
>   for i := 1 to CNT do begin
>     et.Start(t1);
>     p := GetMemory(SIZE);
> //    SetLength(a, SIZE);
>     et.Stop(t1);
>     et.Start(t2);
>     FillQWord(p^, SIZE div 8, 0);
> //    FillQWord(a[0], SIZE div 8, 0);
>     et.Stop(t2);
>     FreeMem(p);
> //    a := nil;
>   end;
>   WriteLn('Alloc: ', et.Elapsed(t1) / CNT);
>   WriteLn('Clear: ', et.Elapsed(t2) / CNT);
> end.
>
> The result is:
>
> Using GetMemory:
>
> Alloc:  9.4078169999999997E-0001
> Clear:  2.1342020000000002E-0001
>
> Using SetLength:
>
> Alloc:  2.8100000000000000E-0005
> Clear:  7.7497550000000004E-0001
>
> It is understandable that GetMemory() is faster than SetLength(), but why the difference in FillQWord()?

If you use SetLength the dynamic array consists not only of the array data, but also of an information record in front of it. This will likely lead to the data not being aligned correctly (FillQWord works best with 8-Byte alignment). So what about testing FillDWord or FillChar? Just to see whether they would be faster.

>
> Also, I usually use pointer to pass block of memory to functions.  How do I implement a function with the following signature:
>
> procedure MyProc(var Buf; Len: Integer):
>
> I mean, how to handle "var Buf" inside the procedure body?

You need to cast Buf to a pointer if I remember correctly. Take a look at the implementation of a TStream's descendant's Write method (e.g. TStringStream) for an example.

Regards,
Sven


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Constantine Yannakopoulos
In reply to this post by Xiangrong Fang
On Sun, Nov 2, 2014 at 3:54 PM, Xiangrong Fang <[hidden email]> wrote:
Also, I usually use pointer to pass block of memory to functions.  How do I implement a function with the following signature:

procedure MyProc(var Buf; Len: Integer):

I mean, how to handle "var Buf" inside the procedure body?

Use "@Buf" to get a pointer to the address of the untyped variable. If you want to cast that variable to an internal "known" type and the compiler complains you can always use a pointer cast: PMyKnownType(@Buf)^. You can also use pointer arithmetic (e.g. (PElementType(@Buf) + n)^) if the variable is known to be an array.

In Delphi (and probably FPC) it is allowed to pass a nil pointer to an untyped variable if you need to:

begin
  MyProc(nil^, 0);
end;

The compiler understands what you are trying to do and does not complain about the blatant nil dereference. So "var Buf;" is functionally equivalent to "Buf: Pointer;". OTOH "const Buf;" is slightly different in the sense that the compiler will not let you pass the untyped variable to another routine that may change its value.

--
Constantine

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Xiangrong Fang
In reply to this post by Sven Barth-2
2014-11-03 2:50 GMT+08:00 Sven Barth <[hidden email]>:

If you use SetLength the dynamic array consists not only of the array data, but also of an information record in front of it. This will likely lead to the data not being aligned correctly (FillQWord works best with 8-Byte alignment). So what about testing FillDWord or FillChar? Just to see whether they would be faster.

​It is quite strange that if I use SetLength+FillByte, it is really faster (for the FillByte), but if I use GetMemory+FillByte, it is not faster than using FillDWord or FillQWord.

I found this in FPC doc (for $ALIGN switch):

This switch is recognized for Turbo Pascal Compatibility, but is not yet implemented. The alignment of data will be different in any case, since Free Pascalis a 32-bit compiler.

​It is a pity that this switch is not recognized yet. But I need to understand the last statement, will alignment be different or not?

Thanks!

Xiangrong​


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Sven Barth-2
On 03.11.2014 02:59, Xiangrong Fang wrote:

> 2014-11-03 2:50 GMT+08:00 Sven Barth <[hidden email]
> <mailto:[hidden email]>>:
>
>     If you use SetLength the dynamic array consists not only of the
>     array data, but also of an information record in front of it. This
>     will likely lead to the data not being aligned correctly (FillQWord
>     works best with 8-Byte alignment). So what about testing FillDWord
>     or FillChar? Just to see whether they would be faster.
>
> ​It is quite strange that if I use SetLength+FillByte, it is really
> faster (for the FillByte), but if I use GetMemory+FillByte, it is not
> faster than using FillDWord or FillQWord.

Then it's indeed as I suspected. The memory buffer itself that is
allocated by SetLength is allocated, but the pointer returned to you
(what you get as a dynamic array) has - at least for FillQWord - a bad
alignment. Looking at the source though it should still be 8-Byte
aligned though (the information record has a size of 8 on 32-bit
systems)... *sigh*

Please also note that not on every system the Fill* methods are
implemented in assembler. FillChar mostly is, FillWord and FillDWord
perhaps, but FillQWord mostly seldomly (it's e.g. not even implemented
that way on x86_64). So especially on 32-bit system 2 32-bit moves are
used each. Thus FillQWord is not always the optimal choice and I'd
suggest that you time the different Fill* functions especially on
different platforms.

Would you mind to show the timings that you got for FillChar? :)

> I found this in FPC doc (for $ALIGN switch):
>
> This switch is recognized for Turbo Pascal Compatibility, but is not yet
> implemented. The alignment of data will be different in any case, since
> Free Pascalis a 32-bit compiler.
>
> ​It is a pity that this switch is not recognized yet. But I need to
> understand the last statement, will alignment be different or not?

I think that last statement is ment regarding that Turbo Pascal is a
16-bit compiler which had different alignment values available than FPC
needs. I don't know what TP supported back then, but I don't imagine
that it supported e.g. 16-Byte alignment.

Additionally this switch won't help you. The memory buffer that is
allocated by SetLength is also allocated using GetMemory. So the buffer
itself *is* aligned. But you don't get the start byte of the buffer in
case of SetLength, but the first byte after the information record which
might not be aligned correctly and *no* compiler switch will help you
with that.

Regards,
Sven
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Nico Erfurth
In reply to this post by Xiangrong Fang
On 02.11.14 15:05, Xiangrong Fang wrote:

> Sorry, the results in previous mail was mis-labeled.
>
> ​
> The result is:
>
> Using SetLength:
>
> Alloc:  9.4078169999999997E-0001
> Clear:  2.1342020000000002E-0001
>
> Using GetMemory:
>
> Alloc:  2.8100000000000000E-0005
> Clear:  7.7497550000000004E-0001


SetLength will already clear the array before returning it, so the time
for SetLength is approximate time(GetMem) + time(Clear).

As your test allocates a gigabyte of memory the OS will not immediatly
assign physical memory to your process. The pagetable after a the
allocation will contain entries which create faults when accessing the
virtual addresses and then the OS will fill those pages in. So on first
access a lot of time will be spend inside the OSes memory subsystem.

For Linux the newly allocated memory will most likely point to a read
only shared physical page which only contains zeros. On the first write
to one of those virtual addresses the real page allocation will trigger
and Linux assigns a new empty physical page to the virtual memory area.
On Linux you can avoid this by passing MAP_POPULATE to mmap when
allocating the memory.

For SetLength this first access is already done when clearing the memory
inside SetLength. For GetMemory it is done when your explicitly access
the memory by calling FillQWord later.

HTH
Nico
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Xiangrong Fang
In reply to this post by Sven Barth-2
2014-11-03 14:39 GMT+08:00 Sven Barth <[hidden email]>:

Would you mind to show the timings that you got for FillChar? :)

​Using FillChar is always about 5% (or less) faster than FillQWord when used with GetMemory, but will be around 20%-40% faster if the memory is allocated by SetLength.

Additionally this switch won't help you. The memory buffer that is allocated by SetLength is also allocated using GetMemory. So the buffer itself *is* aligned. But you don't get the start byte of the buffer in case of SetLength, but the first byte after the information record which might not be aligned correctly and *no* compiler switch will help you with that.

​Then the problem remains with SetLength vs. GetMemory... Why SetLength is about 10000x (!) slower than GetMemory?​
 

​allocating 1G memory took about 0.1 second by SetLength​, but is about 1E-5 via GetMemory.



_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

BrunoK


My results :

_Ptr:=GetMem(100000000)        18 mus, 824 ns / GetMem
_Ptr:=GetMem(100000000) + FillChar(_Ptr^,100000000,0)); 81 ms / GetMem + FillChar

var
  ArInt:array of int32;
.
SetLength(ArInt, 100000000 shr 2);         81 ms / SetLength

All timings are variable within [time, time+8%] on repeated runs as is generally the case for system timings.

SetLength 0's (Internally calls FillChar) the array ArInt so to do a identical comparison you need to add the FillChar to the GetMem if it is your intention to 0 the array before using it.

You have to compare things that do the same thing in the end.

Tested on XP Pro Win32 Sp3
Intel Core 2CPU
6400 #@ 2.13GHz
2.05 GHr., 1.00 Gb RAM


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Sven Barth-2
In reply to this post by Nico Erfurth

Am 03.11.2014 08:32 schrieb "Nico Erfurth" <[hidden email]>:
>
> On 02.11.14 15:05, Xiangrong Fang wrote:
> > Sorry, the results in previous mail was mis-labeled.
> >
> > ​
> > The result is:
> >
> > Using SetLength:
> >
> > Alloc:  9.4078169999999997E-0001
> > Clear:  2.1342020000000002E-0001
> >
> > Using GetMemory:
> >
> > Alloc:  2.8100000000000000E-0005
> > Clear:  7.7497550000000004E-0001
>
>
> SetLength will already clear the array before returning it, so the time
> for SetLength is approximate time(GetMem) + time(Clear).

Ah, right. I forgot that :)

Regards,
Sven


_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Adriaan van Os-2
In reply to this post by Xiangrong Fang
Xiangrong Fang wrote:
> Hi All,
>
> I am programming a Bloom Filter and need a high-performance way to

On what platform are you doing this ?

Regards,

Adriaan van Os

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Xiangrong Fang

2014-11-03 23:40 GMT+08:00 Adriaan van Os <[hidden email]>:
Xiangrong Fang wrote:
Hi All,

I am programming a Bloom Filter and need a high-performance way to

On what platform are you doing this ?

​I am programming on Linux, but it will be used on both Windows and Linux, Windows is the primary target.​
 

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Nico Erfurth
Hi,

>         Hi All,
>
>         I am programming a Bloom Filter and need a high-performance way to
>
>     On what platform are you doing this ?
>
> ​I am programming on Linux, but it will be used on both Windows and
> Linux, Windows is the primary target.​


Well, the first thing you should ask yourself is "Do I REALLY need such
a large bloom filter". Everything larger than the last level cache will
seriously harm your performance as you are going to trigger a lot of
cache and TLB misses. In general you should target for typical L1-Cache
sizes (16-32KByte) or if REALLY necessary L2-Cache-sizes
(256KByte-32MByte). Everything above that will make the performance go
down rapidly, especially in hashing-application the CPU can't prefetch
data properly as the access-patterns are erratic.

Nico
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

Xiangrong Fang
2014-11-04 6:35 GMT+08:00 Nico Erfurth <[hidden email]>:

Well, the first thing you should ask yourself is "Do I REALLY need such
a large bloom filter". Everything larger than the last level cache will
seriously harm your performance as you are going to trigger a lot of
cache and TLB misses. In general you should target for typical L1-Cache
sizes (16-32KByte) or if REALLY necessary L2-Cache-sizes
(256KByte-32MByte). Everything above that will make the performance go
down rapidly, especially in hashing-application the CPU can't prefetch
data properly as the access-patterns are erratic.

​I didn't think of things like cache and TLB misses. Because I try to use BloomFilter and HASH to avoid repeated calculation ​and/or lookups which are time consuming. So, I don't think it will harm performance, unless bloom filter lookup is slower than the calculation, am I right?

Thanks!

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Reply | Threaded
Open this post in threaded view
|

Re: Effective memory allocation

BrunoK
In reply to this post by BrunoK

Relooking at your timings and mine, it appears that you allocate 10x my count of register-size count of items and require 10x the FillChar which you need to initialize your filter array.
 
My timing is about 80 ms and yours looks like 900 ms for 10x more register sized data, which look like the reasonable ratio since we may have difference in the way we get the timings  (my timing routines beeing maybe a bit optimistic).
 
Here I think the speed limit is the time it takes to effectively transfer the data to RAM and that whether you have a FillChar or FillQWord that ends up beeing STOSB or STOSQ, that is fully cached inside the CPU thus the limiting factor is the time it takes to move the initialized CPU cached data to the RAM.

_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal