carga's picture

Cloo performance? [ OpenCL CPU vs Pure .NET ]

Hello!

I succeeded to run VectorAdd sample from Cloo project. In my particular environment there is no GPGPU available for OpenCL, so it uses CPU only.

I was interested to compare Cloo performance with what .NET provides out of the box. Here is my result for vector with 10,000,000 elements:
------------------| Start VectorAdd |------------------
Dim(a)=10000000 GPU Time: 290 msec
Dim(a)=10000000 .NET Time: 87 msec
-------------------| End VectorAdd |-------------------

Pure .NET is 3 times faster.

Please, provide here result of this test executed in environment with GPGPU available?

I would like to see at least 10 times OpenCL speed up, otherwise it's just a waste of time to use such complicated technology.

Best regards,
Anton.
http://kyta.spb.ru

PS I had observed similar situation when using Mono-to-SSE bindings: if SSE is available -- we have 2 times speed up. If not -- then 2 times slow down.


Comments

nythrix's picture

First of all thanks for testing it out.
I've been running similar tests and I can confirm this behaviour for the VectorAdd example. With a GeForce 9600GT it was running 8-9x slower than a simple for(;;) through an array of 50 000 items.

You see, Cloo and the underlying OpenCL carry a little overhead. With such a small kernel (just a simple c[i] = a[i] + b[i]) there is no performance gain. Actually, there is a huge performance drop.

VectorAdd is a dummy test that is supposed to tell you things work. I'll add a more complex kernel to the examples so that we have a better match against the real world.

nythrix's picture

I'm posting some numbers that might make you happy. This test intersects a ray with a number of triangles and it was run in both OpenCL (on GF 9600 GT) and .NET (release config).

10 triangles:

Cloo ticks: 952072,             milliseconds: 0
.NET ticks: 5891232,            milliseconds: 2

100 triangles

Cloo ticks: 950576,             milliseconds: 0
.NET ticks: 6017176,            milliseconds: 2

1 000 triangles

Cloo ticks: 1767872,            milliseconds: 0
.NET ticks: 7309984,            milliseconds: 2

10 000 triangles

Cloo ticks: 2695160,            milliseconds: 1
.NET ticks: 19414624,           milliseconds: 7

65 535 triangles

Cloo ticks: 9008704,            milliseconds: 3
.NET ticks: 107045968,          milliseconds: 40

The test will be included in the next release of Cloo.

carga's picture

Yes, this is much, much better! Thank you for the good news!

I would like to test more "general-purpose" task such as standard NBody problem, so I will post my results and src asap.

Best regards,
Anton.

viewon01's picture

Hi nythrix,

Can you provide the code you use for your performance test ? I would like to do some tests too .

Thx

nythrix's picture

Yes. Cloo 0.2.1 is out.

viewon01's picture

It sounds great...

I have run it and I got the following error :

"InvalidKernelNameComputeException"

Here is the stack trace :

Cloo.dll!Cloo.ComputeException.ThrowIfError(OpenTK.Compute.CL10.ErrorCode errorCode = InvalidKernelName) Line 150 C#
Cloo.dll!Cloo.ComputeKernel.ComputeKernel(string functionName = "intersect", Cloo.ComputeProgram program = {ComputeProgram(92622517)}) Line 87 + 0x8 bytes C#
Cloo.dll!Cloo.ComputeProgram.CreateKernel(string functionName = "intersect") Line 230 + 0x1b bytes C#
ClooTester.exe!ClooTester.TriangleIntersector.Run() Line 97 + 0x10 bytes C#
ClooTester.exe!ClooTester.Program.Main() Line 20 + 0x1d bytes C#

I run the ATI SDK... with :
- Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
- Graphic card : Intel Eaglelake

viewon01's picture

Here is the ouput I got :

------------------| Start OpenCL platform info |------------------
For test only: Expires on Sun Feb 28 00:00:00 2010
name: ATI Stream
version: OpenCL 1.0 ATI-Stream-v2.0-beta4
profile: FULL_PROFILE
vendor: Advanced Micro Devices, Inc.
extensions:
+

devices:
name: Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
driver: 1.0
vendor: GenuineIntel
extensions:
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
+ cl_khr_local_int32_base_atomics
+ cl_khr_local_int32_extended_atomics
+ cl_khr_byte_addressable_store
-------------------| End OpenCL platform info |-------------------

------------------| Start Program binary |------------------
file://C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACC.tmp.dll
-------------------| End Program binary |-------------------

------------------| Start Memory Object mapping |------------------
Original content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
Mapped content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
-------------------| End Memory Object mapping |-------------------

------------------| Start VectorAdd |------------------
33,86629 + 96,28593 = 130,1522
59,61269 + 27,52616 = 87,13885
31,76719 + 9,762908 = 41,53009
92,57583 + 97,41137 = 189,9872
19,51515 + 12,30295 = 31,8181
21,562 + 42,41216 = 63,97417
64,71769 + 8,006876 = 72,72456
54,36967 + 78,10004 = 132,4697
28,51088 + 84,06319 = 112,5741
25,838 + 5,153872 = 30,99187
-------------------| End VectorAdd |-------------------

------------------| Start Triangle intersection |------------------
C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl(3): error: a parameter cannot
be allocated in a named address space
global float4 dir,
^

1 error detected in the compilation of "C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl".

nythrix's picture

Strange, this doesn't occur with nVidia drivers. I have to find out whether this is caused by the kernel or by the ATI drivers. Anyway, I can't tell for sure until I get home tonight.

viewon01's picture

Thanks,

I have do a post on the AMD forums : http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123037...

I'll try to do my best to help you... if you don't find before ;-)

pceric's picture

I get the same error on nVidia drivers:

------------------| Start OpenCL platform info |------------------
name: NVIDIA CUDA
version: OpenCL 1.0 CUDA 3.0.1
profile: FULL_PROFILE
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query

devices:
name: GeForce 9600 GT
driver: 195.62
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query
+
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
-------------------| End OpenCL platform info |-------------------

------------------| Start Program binary |------------------
//
// Generated by NVIDIA NVPTX Backend for LLVM
//

.version 1.5
.target sm_11, texmode_independent, map_f64_to_f32

// Global Launch Offsets
.const[0] .s32 %_global_block_offset[3];
.const[0] .s32 %_global_launch_offset[3];
.const[0] .s32 %_global_num_groups[3];
.const[0] .s32 %_global_size[3];
.const[0] .u32 %_work_dim;

.const .align 8 .b8 def___internal_i2opi_d[144] = { 0x08, 0x5D, 0x8D, 0x1F, 0xB1, 0x5F, 0xFB, 0x6B, 0xEA, 0x92, 0x52, 0x8A, 0xF7, 0x39, 0x07, 0x3D, 0x7B, 0xF1, 0xE5, 0xEB, 0xC7, 0xBA, 0x27, 0x75, 0x2D, 0xEA, 0x5F, 0x9E, 0x66, 0x3F, 0x46, 0x4F, 0xB7, 0x09, 0xCB, 0x27, 0xCF, 0x7E, 0x36, 0x6D, 0x1F, 0x6D, 0x0A, 0x5A, 0x8B, 0x11, 0x2F, 0xEF, 0x0F, 0x98, 0x05, 0xDE, 0xFF, 0x97, 0xF8, 0x1F, 0x3B, 0x28, 0xF9, 0xBD, 0x8B, 0x5F, 0x84, 0x9C, 0xF4, 0x39, 0x53, 0x83, 0x39, 0xD6, 0x91, 0x39, 0x41, 0x7E, 0x5F, 0xB4, 0x26, 0x70, 0x9C, 0xE9, 0x84, 0x44, 0xBB, 0x2E, 0xF5, 0x35, 0x82, 0xE8, 0x3E, 0xA7, 0x29, 0xB1, 0x1C, 0xEB, 0x1D, 0xFE, 0x1C, 0x92, 0xD1, 0x09, 0xEA, 0x2E, 0x49, 0x06, 0xE0, 0xD2, 0x4D, 0x42, 0x3A, 0x6E, 0x24, 0xB7, 0x61, 0xC5, 0xBB, 0xDE, 0xAB, 0x63, 0x51, 0xFE, 0x41, 0x90, 0x43, 0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57, 0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };

.entry vectorAdd
(
.param .b32 vectorAdd_param_0,
.param .b32 vectorAdd_param_1,
.param .b32 vectorAdd_param_2
)
{
.reg .f32 %f<4>;
.reg .s32 %r<9>;

_vectorAdd:
{
// get_global_id(0)
.reg .u32 %vntidx;
.reg .u32 %vctaidx;
.reg .u32 %vtidx;
cvt.u32.u16 %vntidx, %ntid.x;
cvt.u32.u16 %vctaidx, %ctaid.x;
cvt.u32.u16 %vtidx, %tid.x;
mad.lo.s32 %r1, %vntidx, %vctaidx, %vtidx;
.reg .u32 %temp;
ld.const.u32 %temp, [%_global_launch_offset+0];
add.u32 %r1, %r1, %temp;
}

shl.b32 %r2, %r1, 2;
ld.param.u32 %r3, [vectorAdd_param_1];
ld.param.u32 %r4, [vectorAdd_param_0];
add.s32 %r5, %r4, %r2;
add.s32 %r6, %r3, %r2;
ld.param.u32 %r7, [vectorAdd_param_2];
ld.global.f32 %f1, [%r5];
ld.global.f32 %f2, [%r6];
add.rn.ftz.f32 %f3, %f1, %f2;
add.s32 %r8, %r7, %r2;
st.global.f32 [%r8], %f3;
ret;
}

-------------------| End Program binary |-------------------

------------------| Start Memory Object mapping |------------------
Original content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
Mapped content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
-------------------| End Memory Object mapping |-------------------

------------------| Start VectorAdd |------------------
56.57155 + 58.45175 = 115.0233
58.53479 + 70.96561 = 129.5004
7.859627 + 63.40628 = 71.26591
6.03061 + 52.21336 = 58.24397
49.79776 + 95.02283 = 144.8206
46.15215 + 2.900972 = 49.05312
57.18979 + 83.5145 = 140.7043
72.97466 + 0.7222652 = 73.69693
8.511994 + 2.004705 = 10.5167
75.73506 + 20.94896 = 96.68402
-------------------| End VectorAdd |-------------------

------------------| Start Triangle intersection |------------------
:3: error: invalid address space for argument to __kernel function
global float4 dir,

viewon01's picture

Someone at AMD has answer to the problem :

Take a look at the solution at :
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123037...

Regards

viewon01's picture

So,

I have do some test and in my "Test" application I receive the following message "Compilation failed".

But when I put my "CL" code in your application I receive a complete error description by calling :
program.GetBuildLog(context.Devices[0])

Here is the code I use :
ComputeContext context = new ComputeContext(DeviceTypeFlags.DeviceTypeDefault, null, null);
ComputeProgram program = new ComputeProgram(context, kernelSource);

try
{
program.Build(context.Devices, null, null, IntPtr.Zero);
}
catch (ComputeException e)
{
LogManager.Trace(LogSeverity.CriticalError, LogType.Rendering, program.GetBuildLog(context.Devices[0]));
return;
}

Why I receive a "short" description of the error and not the complete one like in your application ?

Thx

viewon01's picture

I have find the problem, but don' t know how to solve it.

When I run the application as a console application all is fine... but here it is a WPF application that launch several threads.
Does the OpenCL call must be done from the main thread ? or something related ?

Thx

nythrix's picture

Since this topic has grown a bit off-topic, please direct further posts to the bug report I've created. Thanks.

PS.: I will probably not fix this issue before weekend. I'm out of time for the next two days, guys.

carga's picture

I could not run speed tests on my nVidia GPU since Sony is too slow updating drivers for their VAIO notebooks, so TriangleIntersection example is computed on general CPU. Here is my result:

------------------| Start OpenCL platform info |------------------
For test only: Expires on Sun Feb 28 00:00:00 2010
name: ATI Stream
version: OpenCL 1.0 ATI-Stream-v2.0-beta4
profile: FULL_PROFILE
vendor: Advanced Micro Devices, Inc.
extensions:
+

devices:
name: Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
driver: 1.0
vendor: GenuineIntel
extensions:
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
+ cl_khr_local_int32_base_atomics
+ cl_khr_local_int32_extended_atomics
+ cl_khr_byte_addressable_store
-------------------| End OpenCL platform info |-------------------

------------------| Start Triangle intersection |------------------
Cloo ticks: 17484, milliseconds: 1
.NET ticks: 666924, milliseconds: 46
-------------------| End Triangle intersection |-------------------

I am very happy to see 50 times acceleration! Basically it means that OpenCL executes parallel kernels on general CPU in much, much more effective way then .NET does for it's sequential code.

Could anybody rewrite .NET intersect(...) method using Parallel LINQ? Will it give comparable speed-up for .NET performance?

Best regards,
Anton.

PS Compilation error on my system was due to the 'dir' declaration in kernel. The correct method signature looks like

kernel void intersect(
    float4 dir,
    global float4 * pointA,
    global float4 * pointB,
    global float4 * pointC,
    global float *    hits )

I suggest that any non-array argument that will be passed through kernel.SetValueArg(...) should not have global or local modifier.

viewon01's picture

Hi,

1 - I think that there is an error in the way we compute the "Performance". When using OpenCL we must :
a) take account of the "Data transfer" between .NET memory and the CPU or GPU memory (Theses are 2 kinds of different transfers.
b) we must take into account the fact that the code is compiled... in some case we have to create a new context, a new compilation and a new data transfer
c) we need to free some OpenCL memory, in some cases too

2 - Using the .NET Parallel LINQ will not speed up the processing up to 50 times... except if you have at least 50 core on your CPU....

So, it is very difficult to compare both !

What I suggest is that we should put the 'CL data creation + CL data transfer' in the speed test.

Regards

carga's picture
viewon01 wrote:

What I suggest is that we should put the 'CL data creation + CL data transfer' in the speed test.

It is very common to have all the data already prepared before starting any hard cpu-intensive processing, so it is not required to measure 'preparation' step. Otherwise we will measure the speed of our Random generator together with the speed of the algo itself. It is also wise to create context and to initialize kernels at early stages of the program, so this overhead should not be measured as well.

But I absolutely agree that timer must be started just before first kernel.Set*Arg() and it must be stopped just after last kernel.Read*(). I measure performance of VectorAdd exactly in this way. And TriangleIntercept also does it in this way, doesn't it?

Enjoy IT!
Anton.

viewon01's picture

Hi Anto,

I understand your point of view and agree...

Except that in some cases it is not possible to directly convert .NET data to OpenCL data, in fact, we can do some "basic performance tests" but in the reality there are "some cases" where we must take "more" parameters into account to compute the "real" performance gain ;-)

Even, in some case we can send all the "triangles" information at the startup of the application and doing all the "intersections tests" without doing the "transfer" (Set*Arg) at each call...

When playing with this kind of technology we must "adapt" the "algorithms" for massive parallelism and take account of all the constraints :-)

nythrix's picture

Could anybody rewrite .NET intersect(...) method using Parallel LINQ? Will it give comparable speed-up for .NET performance?
I have no experience with LINQ whatsoever, I'm sorry. All of my projects up till now were based on .NET 2.0 for compatibility reasons. I think same applies to OpenTK and by extent to Cloo. I will take any advise on this.

[intersection kernel]
That has been fixed and will be released along with Cloo-0.3.1 in a couple of hours.
Because of certain limitations it wasn't possible to pass "__local" arguments to kernels. That's been fixed as well.

[speed comparisons]
Generally speaking, OpenCL vs. .NET is hard to compare. You have to take into account whether the result finishes onscreen (no readback penalty) or the data travel back and forth through different memory stacks. It is therefore obvious that the determining factor for speedups is the nature of the algorithm. Other players, like HW, drivers or OpenTK+Cloo, are far less important (asymptotically speaking).

Edit: read-back operation is not included in the timer because raytracing results usually end up on screen and not on the CPU. Oh, and one more thing. Most of the OpenCL commands can run asynchronously. So, direct comparison with real world apps is compromised once again :)

viewon01's picture

Right,

Are you trying to do a "raytracer" based on OpenCL ?

Thx

nythrix's picture

Yes. Never been a fan of rasterized graphics. My hobby engine uses a SW raytracer and a couple of GL rasterizers but I'm not happy with any of them. So I want to try out an OpenCL raytracer. If that works out, my next move is pushing the whole engine on top of OpenCL. A lot of work there.

But first, I have to polish Cloo. I'm not going anywhere without a decent base.

viewon01's picture

Great...

It is what I'm doing too...

Doing intersection per primitive is slow...so currently I'm working on implementing the acceleration structure with OpenCL... in order to handle millions of polygons ;-)

Take a look at :
- http://www.cs.unc.edu/~lauterb/GPUBVH/paper.pdf
- http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf

carga's picture

I just realized that this huge advantage of the TriangleIntersect sample is just the difference between implementations of scalar and cross multiplications in .NET (OpenTK Vector4 multiplications) and in OpenCL.

OpenCL kernel and .NET Intersect() code look very similar, but there are a lot of math under the cover. Difference in execution time is the difference between OCL's float uw = dot( EdgeAB, w ); and OTK's float uw = Vector4.Dot( EdgeAB, w );, between implementations of vector substraction and normalisation, between OCL's float4 Normal = normalize( cross( EdgeAB, EdgeAC ) ); and OTK's Vector4 Normal = Vector4.Normalize( new Vector4( Vector3.Cross( EdgeAB.Xyz, EdgeAC.Xyz ), 0 ) );.

In my environment the sample is executed on CPU in both cases, so the results should be more or less equal. The 50 times acceleration is (mostly) the result of different overheads and inefficiencies in OpenTK implementation of all these complex vector operators.

I also expect that on GPUs with hardware support of vector operators this kernel will be executed much faster. But nythrix and I we both have similar results... Strange...

Mr nythrix, could you please execute TriangleIntersector sample in two different ways:
1. Forcing OpenCL to use CPU as computing device (here we compare OpenCL CPU vs .NET CPU);
2. Forcing OpenCL to use GPU as computing device (here we compare OpenCL GPU vs .NET CPU);

having these 2 results we will be able to compare OpenCL CPU vs OpenCL GPU. THIS challenge is even more interesting!!!

I even suggest you to include one more sample in next release: all the kernels we saw before are executed via OpenCL one-by-one on all devices available in system sequentially in a loop. My environment is very limited, but it is very interesting to see the results of guys with several devices available.

Best regards,
Anton.

the Fiddler's picture

I haven't seen the example code yet, but this is very slow code:

Vector4 Normal = Vector4.Normalize( new Vector4( Vector3.Cross( EdgeAB.Xyz, EdgeAC.Xyz ), 0 ) );.

It constructs and copies 5 temporary vectors (the Xyz properties, the Cross and Normalize methods and the Vector4 constructor).

Rearranging the code to reduce the number of temporaries will have a positive impact to performance. Using OpenTK methods that take ref overloads will also help a lot:

float uw = Vector4.Dot(EdgeAB, w); // slow
 
float uw;
Vector4.Dot(ref EdgeAB, ref w, out uw); // fast

Unfortunately, it is very hard to write performant math code in .Net (this is a common complaint against XNA too). The lack of const references means you have to choose between multiple temporary objects and ugly syntax. (Even worse, sometimes it's impossible to avoid tempoararies.) Then, there's the lack of generic operators and other similar annoyances that make math code very tedious to implement.

I believe that even a CPU-powered OpenCL implementation will be faster for non-trivial applications.

viewon01's picture

Also,

Take a look at this sample : http://kioku.sys-k.net/archives/code/ , it is an OpenCL port of the AO Bench ;-)

Quite simple, but can be usefull to do some test.

Regards

nythrix's picture

I would say that CL vector operations are better optimized even on a CPU. There's no match for them in the managed world unless you get down to assembler/SSE. Still, the gap is so big that this is not the only difference. Depending on the implementation of OpenCL, multithreading capabilities of CPUs are exploited darn well, I'd say.

Right now, the only way of trying out the CPU vs GPU test is this:
1) First run the test on a CL capable nVidia card.
2) Uninstall nVidia drivers and run the test with ATI Stream on the CPU.

As of now, clGetPlatformIDs will not list both of these platforms. The vendors haven't reached an agreement in interoperability, yet. So we're stuck in one platform for now.

But you could still use multiple devices in one platform, right? Well, not so fast:
1) nVidia needs to cooperate with CPU vendors to release CPU capable OpenCL. [Edit: never mind the sentence here. I'm still asleep]
2) ATI releases GPU capable drivers. I love their HW since it's usually ahead of the competition. But the drivers have been a pain in the arse since I was a kid.
3) Get a MAC. It looks like you can run kernels on multiple devices with their integrated drivers. Not having one, I cannot confirm this.

nythrix's picture
the Fiddler wrote:

...Then, there's the lack of generic operators and other similar annoyances that make math code very tedious to implement...

Apparently one simple thing that could've helped here is patented.

the Fiddler's picture

2) AMD Stream 2.0 beta 4 is both CPU- and GPU-capable, so that's a viable route.

If you have a Nvidia GPU, there might be a way to hack a CPU/GPU testing environment together: first install AMD's Stream SDK (CPU only). Afterwards, install Nvidia's OpenCL-capable drivers (GPU only). If you run OpenCL now, you should get Nvidia's GPU implementation. If you wish to test with AMD's CPU implementation, simply copy the relevant dlls and exes from %Program Files%\Amd\Stream 2.0 to the test program's folder. To go back to Nvidia, simply rename AMD's OpenCL.dll to something else.

Edit: about that patent, I wouldn't worry about it too much. There's a *lot* of prior art (the oldest I could find is from 2004: http://www.codeproject.com/KB/cs/genericnumerics.aspx, but it's likely you can find even older implementations) and it probably won't hold up in court.

nythrix's picture

2) AMD Stream 2.0 beta 4 is both CPU- and GPU-capable, so that's a viable route.
Thanks for the pointer, I'll update the docs.

Copying dll's here and there is a faster route, yes. But it remains an "external" procedure. I was playing with the idea of querying the CL platforms using home-baked methods. I'll reconsider it, if nVidia doesn't come up with CPU support till January.

carga's picture
viewon01 wrote:

Take a look at this sample : http://kioku.sys-k.net/archives/code/ , it is an OpenCL port of the AO Bench ;-)

Quite simple, but can be usefull to do some test.

Extremely interesting performance comparison sample!

The AOBench kernel for OpenCL has the only output parameter of type global uint * result.

1. What is the size of this array for default picture size 256x256? 65536 I guess, isn't it?

2. How should I interpret this uint[] output to reproduce the picture? I am able to get it using AOBench for C#, and I wish to reproduce it for OpenCL implementation as well (just to be sure that both results are exactly identical).

After this step I will be ready to contribute AOBenchTest for Cloo (if anybody interested =) ).

Best regards,
Anton.

PS What is about Cloo 0.3.x announced a few days ago?

Who's online

There are currently 3 users and 10 guests online.

Online users

  • JTalton
  • Shad
  • AnorZaken