Chris The Avatar's picture

How to improve performance...?

Hello All,

I am getting ready to launch the first beta of my game (www.fantasyrealmonline.com) and I am trying to refine all methods in my game to get the best possible performance for users. I have run my game through a profiler and optimized it as much as I can without more knowledge. My game primarily draws textured quads and uses lighting and alpha blending. You had previously helped me with lighting, so I thank you for that.

I have tested my game on numerous PCs so far. Most Nvidia chips seem to yield the highest performance (between 80 - 200 FPS depending on the computer), although I was disappointed in the Nvidia ION; I really thought it might do better on my netbook. The ATI chips I had were also in a relatively acceptable range for the age of the PCs. However, lighting on my Linux desktop, which has an older Radeon in it, was haywire to say the least, but that may be the Linux driver as well... On the Intel integrated chipset I was disappointed in performance: a new i7 with integrated graphics yielded 15 - 40 FPS, which even for an integrated chip seems extremely low for the kind of graphics processing I am doing. The integrated graphics chips seemed to be 20 - 30 frames higher under DirectX...

I would really appreciate some pointers, tips or recommendations that I could pursue to try to get FPS up on less than stellar chipsets...

Some information about my game:

My game generally draws between 1000 - 3000 textured quads each frame (there is really no way to decrease this based on how the isometric game view is laid out). Very few texture rotations are performed each frame, so it is generally a straight draw from the texture to the screen. A viewport is used to clip around my map, although the profiler I ran really points to the texture rendering as the main time consumer...

I am initializing with the following flags:

    GL.ClearColor(Color.Black);
    GL.Enable(EnableCap.Texture2D);
    GL.Enable(EnableCap.Blend);
    GL.ShadeModel(ShadingModel.Smooth);
    GL.BlendFunc(BlendingFactorSrc.SrcAlpha, BlendingFactorDest.OneMinusSrcAlpha);

Vsync is not used in my measurements above...

My main drawing routine for images looks like the following:

        public void DrawImage(Surface Surf, System.Drawing.Rectangle SourceRect, System.Drawing.Rectangle DestRect, System.Drawing.Color Blender)
        {
            if (Surf != null)
            {
                GL.BindTexture(TextureTarget.Texture2D, Surf.GlTextureID);
 
 
                GL.TexParameter(TextureTarget.Texture2D,
                                TextureParameterName.TextureMinFilter,
                                (int)TextureMinFilter.Linear);
                GL.TexParameter(TextureTarget.Texture2D,
                                TextureParameterName.TextureMagFilter,
                                (int)TextureMagFilter.Linear);
 
 
                Point[] dstVertex = new Point[4];
                PointF[] srcVertex = new PointF[4];
                RectangleF srcCoords = new RectangleF(
                    SourceRect.X / (float)Surf.Width,
                    SourceRect.Y / (float)Surf.Height,
                    SourceRect.Right / (float)Surf.Width,
                    SourceRect.Bottom / (float)Surf.Height);
 
                srcVertex[0] = new PointF(srcCoords.Left, srcCoords.Top);
                srcVertex[1] = new PointF(srcCoords.Width, srcCoords.Top);
                srcVertex[2] = new PointF(srcCoords.Width, srcCoords.Height);
                srcVertex[3] = new PointF(srcCoords.Left, srcCoords.Height);
 
                dstVertex[0] = new Point(DestRect.X, DestRect.Y);
                dstVertex[1] = new Point(DestRect.X + DestRect.Width, DestRect.Y);
                dstVertex[2] = new Point(DestRect.X + DestRect.Width, DestRect.Y + DestRect.Height);
                dstVertex[3] = new Point(DestRect.X, DestRect.Y + DestRect.Height);
 
                float colorA = Blender.A / 255.0f;
                float colorR = Blender.R / 255.0f;
                float colorG = Blender.G / 255.0f;
                float colorB = Blender.B / 255.0f;
                GL.Begin(BeginMode.Quads);
 
 
                for (int i = 0; i < 4; i++)
                {
                    GL.Color4(colorR, colorG, colorB, colorA);
 
                    GL.TexCoord2(srcVertex[i].X, srcVertex[i].Y);
                    GL.Vertex2(dstVertex[i].X, dstVertex[i].Y);
                }
                GL.End();
            }
        }

Thanks for your help in advance! Let me know if there are more details I can provide.

Chris


Comments

the Fiddler's picture

There's lots of room for improvement here that will hopefully bring you to a steady 60fps on all modern hardware.

A few ideas:

  1. Disable blending when you don't need it. Blending requires lots of horsepower and can hurt lower-end GPUs. (Check performance on the i7 with blending completely disabled - it should be measurably better).
  2. Set texture parameters once before calling GL.TexImage2D. Don't change unless absolutely necessary, because texture parameters can invalidate texture data (causing a very costly copy/revalidation cycle in the driver).
  3. Try to minimize state changes. Most GL.Enable/Disable and Bind calls cause state changes, which take time. In general, sort your resources by blend-mode (opaque first), material/shader and texture, in that order. This way, you can draw whole batches without changing states.
  4. If possible, try to use a texture atlas. This way, you can replace relatively costly BindTexture calls by cheaper TexCoord changes.
  5. Avoid immediate mode rendering (GL.Begin-GL.End), because it forces the GPU to work in step with the CPU and massacres performance. Display lists and/or vertex buffer objects can improve performance tremendously - they will often perform orders of magnitude faster than immediate mode.
  6. Additionally, store as much of the map as possible in a single display list / VBO and draw it with a single command (GL.CallList or GL.DrawElements, respectively). This typically requires the use of a texture atlas.
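
To make point 3 a little more concrete, here is a minimal sketch of sorting draw requests so that opaque quads come first and quads sharing a texture end up next to each other. It is plain C# with no GL calls. `DrawCommand`, `DrawSorter` and their members are hypothetical names made up for this illustration; they are not part of OpenTK or of the game's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one quad to draw; not an OpenTK type.
public struct DrawCommand
{
    public int TextureId;      // GL texture handle the quad samples from
    public bool UsesBlending;  // true if the quad needs alpha blending

    public DrawCommand(int textureId, bool usesBlending)
    {
        TextureId = textureId;
        UsesBlending = usesBlending;
    }
}

public static class DrawSorter
{
    // Opaque commands first, then blended ones; within each group,
    // commands are ordered by texture so each texture is bound once.
    public static List<DrawCommand> Sort(IEnumerable<DrawCommand> commands)
    {
        return commands
            .OrderBy(c => c.UsesBlending)   // false sorts before true
            .ThenBy(c => c.TextureId)
            .ToList();
    }

    // Counts how many BindTexture calls a list would need if a bind is
    // issued only when the texture differs from the previous command.
    public static int CountTextureBinds(IList<DrawCommand> commands)
    {
        int binds = 0;
        int current = -1;   // assumes real texture ids are never -1
        foreach (DrawCommand c in commands)
        {
            if (c.TextureId != current)
            {
                binds++;
                current = c.TextureId;
            }
        }
        return binds;
    }
}
```

One caveat: alpha-blended sprites that overlap generally have to be drawn back to front, so sorting purely by texture is only safe for opaque quads or for blended quads that do not overlap each other.
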
cody's picture

Looks like you found the slowest way to do it. :)

A few thoughts:

1. Get rid of the texture parameter stuff and glBegin/glEnd in your render function.
2. Do you render the quads outside the viewport? If yes, don't do it.
3. Don't use immediate mode. Use vertex buffer objects instead. Even if you just put your whole map in one buffer and render it with one glDrawElements call, you may well be faster.
4. Use a quadtree to throw away invisible parts of the map fast.

(Where 1 is the easiest solution and 4 the hardest.)
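
For point 4, a very small quadtree could look roughly like the sketch below. It stores tile positions as points and returns only those inside a query rectangle. The class name, the leaf capacity of 8, and the use of `System.Drawing.Rectangle` are assumptions chosen for illustration; a real map would more likely store tile or object references than bare points.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Minimal point quadtree (illustrative sketch, not production code).
public class QuadTree
{
    private const int Capacity = 8;    // items a leaf holds before it splits
    private readonly Rectangle bounds;
    private readonly List<Point> items = new List<Point>();
    private QuadTree[] children;       // null while this node is a leaf

    public QuadTree(Rectangle bounds) { this.bounds = bounds; }

    public bool Insert(Point p)
    {
        if (!bounds.Contains(p)) return false;
        if (children == null)
        {
            // Store here while there is room, or if the node is too small to split.
            if (items.Count < Capacity || bounds.Width <= 1 || bounds.Height <= 1)
            {
                items.Add(p);
                return true;
            }
            Split();
        }
        foreach (QuadTree child in children)
            if (child.Insert(p)) return true;
        return false;
    }

    private void Split()
    {
        int hw = bounds.Width / 2, hh = bounds.Height / 2;
        children = new[]
        {
            new QuadTree(new Rectangle(bounds.X, bounds.Y, hw, hh)),
            new QuadTree(new Rectangle(bounds.X + hw, bounds.Y, bounds.Width - hw, hh)),
            new QuadTree(new Rectangle(bounds.X, bounds.Y + hh, hw, bounds.Height - hh)),
            new QuadTree(new Rectangle(bounds.X + hw, bounds.Y + hh,
                                       bounds.Width - hw, bounds.Height - hh)),
        };
        // Push the items held so far down into the new children.
        foreach (Point p in items)
            foreach (QuadTree child in children)
                if (child.Insert(p)) break;
        items.Clear();
    }

    // Appends every stored point that lies inside 'view' to 'result'.
    public void Query(Rectangle view, List<Point> result)
    {
        if (!bounds.IntersectsWith(view)) return;   // skip whole subtree
        foreach (Point p in items)
            if (view.Contains(p)) result.Add(p);
        if (children != null)
            foreach (QuadTree child in children)
                child.Query(view, result);
    }
}
```

The gain comes from the early return in `Query`: whole subtrees that do not touch the visible area are skipped without looking at their contents.
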

I think quite a lot can be done regarding performance. Even an old graphics card should be able to render this. Also, the ION is quite fast but the Atom is slow, so I guess graphics is not the bottleneck there.

Keep up the good work.

Chris The Avatar's picture

Thank you for the tips

I humbly admit my lack of knowledge of OpenGL/3D programming techniques, so I may ask for elaboration on some of what's written :-)

In response to Fiddler:

1. I couldn't really disable blending, even when not changing colors; as it turns out, once I did, PNGs no longer rendered properly...
2. I eliminated one redundant location; however, it actually wasn't changing the mode (so no real improvement here).
3/4. The only thing that gets enabled or disabled during my process is lighting. However, I took what you said about bind calls into consideration and was able to boost performance on my netbook by 8 - 15 FPS (see below); I haven't tried the i7 laptop yet. I have no idea what you mean by texture atlas (this may just be terminology for me). If the idea is "rendering/copying a bunch of smaller textures to a larger one" and binding that one, it will not work in my case: the system loads textures on demand because it truly doesn't know what it will need next, so it is actually more inefficient to render everything twice when the textures change frequently. I tested the idea with a technique that AgateLib uses (it runs on top of an older version of OpenTK).
5/6. How can I use the other techniques you mention (display lists and/or vertex buffer objects)?

In response to Cody:
1. Are you suggesting the same thing as Fiddler did (display lists/vertex buffer objects), or are you suggesting I call glBegin/glEnd at another point in the code? Doesn't begin/end need to be called around each quad unless one of those other techniques is used? Nothing renders without them, obviously :-)
2. If you are asking whether I render quads outside the viewport while the viewport is active: that is avoided. I use the viewport to clip the edges of the tiles around the map since they are not square. (See below.)
3/4. Examples would be helpful.

I was able to increase performance on my netbook a good bit (again 8 - 15 FPS) by reducing calls to bind. I only call bind if the texture isn't already bound, and (I am not sure whether this helped) I only call GL.Color4 when it is necessary to do so...

    Surface lastSurf;
    ...
    public void DrawImage(Surface Surf, System.Drawing.Rectangle SourceRect, System.Drawing.Rectangle DestRect, System.Drawing.Color Blender)
    {
        if (Surf != null)
        {
            if (lastSurf != Surf)
            {
                GL.BindTexture(TextureTarget.Texture2D, Surf.GlTextureID);
            }

            GL.TexParameter(TextureTarget.Texture2D,
                            TextureParameterName.TextureMinFilter,
                            (int)TextureMinFilter.Linear);
            GL.TexParameter(TextureTarget.Texture2D,
                            TextureParameterName.TextureMagFilter,
                            (int)TextureMagFilter.Linear);

            Point[] dstVertex = new Point[4];
            PointF[] srcVertex = new PointF[4];
            RectangleF srcCoords = new RectangleF(
                SourceRect.X / (float)Surf.Width,
                SourceRect.Y / (float)Surf.Height,
                SourceRect.Right / (float)Surf.Width,
                SourceRect.Bottom / (float)Surf.Height);

            srcVertex[0] = new PointF(srcCoords.Left, srcCoords.Top);
            srcVertex[1] = new PointF(srcCoords.Width, srcCoords.Top);
            srcVertex[2] = new PointF(srcCoords.Width, srcCoords.Height);
            srcVertex[3] = new PointF(srcCoords.Left, srcCoords.Height);

            dstVertex[0] = new Point(DestRect.X, DestRect.Y);
            dstVertex[1] = new Point(DestRect.X + DestRect.Width, DestRect.Y);
            dstVertex[2] = new Point(DestRect.X + DestRect.Width, DestRect.Y + DestRect.Height);
            dstVertex[3] = new Point(DestRect.X, DestRect.Y + DestRect.Height);

            float colorA = Blender.A / 255.0f;
            float colorR = Blender.R / 255.0f;
            float colorG = Blender.G / 255.0f;
            float colorB = Blender.B / 255.0f;

            GL.Begin(BeginMode.Quads);

            for (int i = 0; i < 4; i++)
            {
                if (Blender != Color.White)
                {
                    GL.Color4(colorR, colorG, colorB, colorA);
                }

                GL.TexCoord2(srcVertex[i].X, srcVertex[i].Y);
                GL.Vertex2(dstVertex[i].X, dstVertex[i].Y);
            }
            lastSurf = Surf;
            GL.End();
        }
    }

the Fiddler's picture

Try moving the two GL.TexParameter calls from the rendering code (where they are now) to the texture loading code. These calls are persistent - you only need to call them once for each bound texture and they will "stick" with that texture forever. This might give a speed up on some drivers.
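
As a rough illustration of what "texture loading code" could look like with the filter parameters set there, here is a minimal sketch. `TextureLoader` is a hypothetical helper, the 32-bit BGRA upload format is an assumption about how the bitmaps are stored, and the code needs a current OpenGL context to run.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using OpenTK.Graphics.OpenGL;

public static class TextureLoader
{
    // Uploads a bitmap and returns the GL texture id.
    // Must be called on a thread with a current GL context.
    public static int Load(Bitmap bitmap)
    {
        int id = GL.GenTexture();
        GL.BindTexture(TextureTarget.Texture2D, id);

        // Filter settings are stored in the texture object itself,
        // so they are set once here and never touched in the draw loop.
        GL.TexParameter(TextureTarget.Texture2D,
                        TextureParameterName.TextureMinFilter,
                        (int)TextureMinFilter.Linear);
        GL.TexParameter(TextureTarget.Texture2D,
                        TextureParameterName.TextureMagFilter,
                        (int)TextureMagFilter.Linear);

        BitmapData data = bitmap.LockBits(
            new Rectangle(0, 0, bitmap.Width, bitmap.Height),
            ImageLockMode.ReadOnly,
            System.Drawing.Imaging.PixelFormat.Format32bppArgb);
        try
        {
            // Format32bppArgb is laid out as B, G, R, A in memory.
            GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba,
                          data.Width, data.Height, 0,
                          OpenTK.Graphics.OpenGL.PixelFormat.Bgra,
                          PixelType.UnsignedByte, data.Scan0);
        }
        finally
        {
            bitmap.UnlockBits(data);
        }
        return id;
    }
}
```

With something like this in place, the draw routine would only bind the texture and submit vertices.
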

A "texture atlas" is a single large texture that holds multiple smaller ones, which can be *much* faster than separate small textures (as you've noticed, GL.BindTexture is quite costly). It can be made to work with dynamic textures using some eviction policy, but that requires a lot of fine-tuning (e.g. different atlases for short- and long-lived textures, or even no atlas at all for completely dynamic ones).
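
To show how drawing from an atlas differs from drawing from separate textures, here is a small sketch that turns a pixel rectangle inside an atlas into normalized texture coordinates. The `Atlas` class and the fixed grid of equally sized cells are assumptions for illustration only; real atlases often pack sprites of different sizes.

```csharp
using System;
using System.Drawing;

public static class Atlas
{
    // Converts a pixel rectangle inside the atlas into normalized
    // texture coordinates. The result's Left/Top/Right/Bottom are the
    // u0/v0/u1/v1 values to pass to GL.TexCoord2.
    public static RectangleF ToTexCoords(Rectangle region, int atlasWidth, int atlasHeight)
    {
        return RectangleF.FromLTRB(
            region.Left / (float)atlasWidth,
            region.Top / (float)atlasHeight,
            region.Right / (float)atlasWidth,
            region.Bottom / (float)atlasHeight);
    }

    // Pixel rectangle of cell 'index' in an atlas made of square cells
    // of 'cellSize' pixels, numbered row by row from the top left.
    public static Rectangle CellRegion(int index, int cellSize, int atlasWidth)
    {
        int columns = atlasWidth / cellSize;
        return new Rectangle((index % columns) * cellSize,
                             (index / columns) * cellSize,
                             cellSize, cellSize);
    }
}
```

Switching from one sprite to another then only changes the four texture coordinates; the bound texture stays the same.
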

What cody is suggesting is to avoid rendering tiles that fall outside the area visible by the camera. Viewport clipping will reject those tiles anyway but it's going to be much faster if you never draw them in the first place. In other words, if your map is 100x100 tiles large and your camera can view a 20x20 area, then draw only those 20x20 tiles rather than all 100x100. (The on-screen result is going to be identical but drawing 400 tiles is going to be faster than 10000).
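
For a plain rectangular grid, working out which tiles are visible is simple arithmetic, sketched below. `TileCulling` and its parameters are hypothetical, and an isometric view would first need its screen rectangle mapped into map coordinates, which is not shown here.

```csharp
using System;

public static class TileCulling
{
    // Computes the inclusive range [first, last] of tile indices along
    // one axis that overlap the camera view. Call once for X and once for Y.
    // If last < first, nothing is visible on that axis.
    public static void VisibleRange(int cameraOffsetPx, int viewSizePx,
                                    int tileSizePx, int mapSizeTiles,
                                    out int first, out int last)
    {
        first = Math.Max(0, FloorDiv(cameraOffsetPx, tileSizePx));
        last = Math.Min(mapSizeTiles - 1,
                        FloorDiv(cameraOffsetPx + viewSizePx - 1, tileSizePx));
    }

    // Integer division that rounds toward negative infinity, so a camera
    // positioned left of or above the map is handled correctly.
    private static int FloorDiv(int a, int b)
    {
        return (int)Math.Floor(a / (double)b);
    }
}
```

With 32-pixel tiles, a 640-pixel-wide view and the camera 320 pixels into a 100-tile map, this gives tiles 10 through 29 on that axis, i.e. 20 tiles, which matches the 20x20 example above.
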

The example browser has a few samples on display lists and vertex buffer objects. The idea is that you specify the geometry once at load time and then draw it with a single draw call (in contrast with immediate mode, which respecifies vertices/texcoords every frame). The downside is that each draw call can only bind a single texture (which is why it works best with a texture atlas). I don't think this can work without rewriting your renderer, though.
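
A bare-bones version of the "specify once, draw with one call" idea using a vertex buffer object might look like the sketch below. `StaticQuadBatch` is a hypothetical class, the interleaved x, y, u, v layout is an assumption, and the enum names (`ArrayCap`, `BeginMode`) follow the OpenTK 1.x bindings and may differ in other releases. It needs a current GL context.

```csharp
using System;
using OpenTK.Graphics.OpenGL;

public class StaticQuadBatch
{
    private const int Stride = 4 * sizeof(float);   // x, y, u, v per vertex
    private readonly int vbo;
    private readonly int vertexCount;

    // 'vertices' holds interleaved x, y, u, v values, four vertices per quad.
    // The data is uploaded to the GPU once, at load time.
    public StaticQuadBatch(float[] vertices)
    {
        vertexCount = vertices.Length / 4;
        int buffer;
        GL.GenBuffers(1, out buffer);
        vbo = buffer;
        GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
        GL.BufferData(BufferTarget.ArrayBuffer,
                      (IntPtr)(vertices.Length * sizeof(float)),
                      vertices, BufferUsageHint.StaticDraw);
        GL.BindBuffer(BufferTarget.ArrayBuffer, 0);
    }

    // Draws every quad in the buffer with a single draw call.
    // All quads must sample from the one texture given here.
    public void Draw(int textureId)
    {
        GL.BindTexture(TextureTarget.Texture2D, textureId);
        GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
        GL.EnableClientState(ArrayCap.VertexArray);
        GL.EnableClientState(ArrayCap.TextureCoordArray);

        // With a buffer bound, the last argument is a byte offset into it.
        GL.VertexPointer(2, VertexPointerType.Float, Stride, IntPtr.Zero);
        GL.TexCoordPointer(2, TexCoordPointerType.Float, Stride,
                           (IntPtr)(2 * sizeof(float)));
        GL.DrawArrays(BeginMode.Quads, 0, vertexCount);

        GL.DisableClientState(ArrayCap.TextureCoordArray);
        GL.DisableClientState(ArrayCap.VertexArray);
        GL.BindBuffer(BufferTarget.ArrayBuffer, 0);
    }
}
```
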

There is still room for optimization but the biggest performance gains will require a paradigm shift. Your code is treating the hardware as an old-school 2d blitter (load bitmaps and blit them to screen one by one). However, this model doesn't apply to modern hardware anymore. GPUs now operate (simplistically) on three sets of data: geometry, a set of textures and a shader that controls how these textures will be applied to the geometry.

  • Since you are using built-in vertex lighting, you can ignore shaders for now.
  • A modern card can render thousands of polygons per frame without a noticeable impact to framerate, as long as the geometry is submitted in large batches (>= 1024 vertices/batch for optimal performance).
  • Even IGPs can use at least 128MB of memory nowadays. An uncompressed 1024x1024 texture atlas will take up 4MB of that memory and will be able to hold every single texture in this screenshot, including animations.

Considering the above, would it be possible to split the map into, say, 32x32 regions, give each region its own atlas (or share atlases where possible) and render each region with a single draw call? This would help performance tremendously.

Chris The Avatar's picture

I use basic rectangular geometry to determine whether a tile/object should be drawn before it ever gets rendered to OpenGL, so only what you see is rendered (other than the tile/object edges that get chopped). Essentially I hold some of the logical map in memory, say a 100x100 XY area plus all the virtual Zs associated with it. As your character moves, a move request is added to a movement queue; another thread picks it up, loads more of the map from disk in that direction, and discards the opposite end of the map. In addition, all the dynamic objects (such as characters/objects in the game) are loaded from a remote server, and the client doesn't know about them until they are sent... You never really know what will be coming next, even the animations, which, if you take into account the animations on the screen, would easily overflow the 1024x1024 texture...

But if I read you right, you're saying to load all the static information in, say, my 100x100 buffer (essentially a single instance of each image used in that 100x100 buffer) into a single texture? What happens when you move one tile out? Unlike Diablo or Eschalon, for example, there are no areas to a map; it is a continual flow, and as you move the map loads and moves with you. There are other maps such as caves etc., but I am not sure how I would efficiently load and unload small chunks so fast, because in a sense that is what I am doing now... My loading style is probably very similar to how the Ultima VII/Online series works. I load textures on demand and they are discarded if 10 seconds go by without their use. Additionally, the first time a texture is loaded from disk I cache a Bitmap object in regular memory so it is faster to load the next time around. If system memory fills up too much I dump some of the cache...

On the vertex buffer... is it more efficient/possible to load multiple vertex buffers and draw each one using multiple textures, since you noted the geometry doesn't need to be redefined? (Help me out, I am just trying to understand... I realize that I probably sound like a fool with some of these questions. I am an OOP/communications-focused programmer; I deal with the server side of this in my job :-). When I sat down to write this I architected around the data, not the graphics, and that seems to be what is significantly different about 3D programming: the focus is more on the graphics than on the data behind those graphics.)

I will try moving the GL.TexParameter code you mentioned to the load... BTW, when I tried to take it out, performance actually dropped by 10 frames on my main PC, which usually gets about 150 frames per second... I should have mentioned that before.

Is there any way I can draw multiple quads within one begin/end statement using the architecture I have now? Basically, when does begin/end need to be called again?

Thanks Again!

Chris

cody's picture

You need to call glBegin/glEnd again when you bind another texture or change texture parameters. So basically you can draw all quads that use the same texture in one batch.
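
In code, batching all quads of one texture into a single begin/end pair could look roughly like this. `Quad` and `BatchedDraw` are hypothetical types for illustration, `BeginMode` follows the OpenTK 1.x naming used elsewhere in this thread, and a current GL context is assumed.

```csharp
using System.Collections.Generic;
using OpenTK.Graphics.OpenGL;

// Hypothetical container for one textured quad.
public struct Quad
{
    public float X, Y, W, H;        // destination rectangle
    public float U0, V0, U1, V1;    // texture coordinates (left, top, right, bottom)
}

public static class BatchedDraw
{
    // quadsByTexture maps a GL texture id to every quad that uses it.
    public static void Draw(Dictionary<int, List<Quad>> quadsByTexture)
    {
        foreach (KeyValuePair<int, List<Quad>> batch in quadsByTexture)
        {
            // Bind before Begin: BindTexture is not a legal call between
            // Begin and End and only raises a GL error there.
            GL.BindTexture(TextureTarget.Texture2D, batch.Key);

            GL.Begin(BeginMode.Quads);
            foreach (Quad q in batch.Value)
            {
                GL.TexCoord2(q.U0, q.V0); GL.Vertex2(q.X, q.Y);
                GL.TexCoord2(q.U1, q.V0); GL.Vertex2(q.X + q.W, q.Y);
                GL.TexCoord2(q.U1, q.V1); GL.Vertex2(q.X + q.W, q.Y + q.H);
                GL.TexCoord2(q.U0, q.V1); GL.Vertex2(q.X, q.Y + q.H);
            }
            GL.End();
        }
    }
}
```

Note that grouping by texture changes the order in which quads are drawn, which matters for overlapping alpha-blended sprites.
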

I think I misunderstood "viewport clipping". I thought you meant the viewport clipping that the graphics card does.

So the easiest way to get better performance might be using vertex buffers. The problem, of course, is that your data is dynamic.
So instead of rendering directly with immediate mode, you could write your data into a vertex buffer (once per frame). Then you have one draw call per texture, which should be much better than two calls per vertex, and you only have to bind each texture once. And this works without changing your whole data layout.
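
The CPU side of that idea, collecting each frame's quads into one vertex array per texture, might be sketched as follows. `QuadBatchBuilder` is a hypothetical helper and the interleaved x, y, u, v layout is an assumption; uploading the arrays to a vertex buffer and drawing them is left out.

```csharp
using System;
using System.Collections.Generic;

public class QuadBatchBuilder
{
    // One growing list of interleaved x, y, u, v floats per texture id.
    private readonly Dictionary<int, List<float>> batches =
        new Dictionary<int, List<float>>();

    public int BatchCount { get { return batches.Count; } }

    // Appends one quad (four vertices) to the batch for 'textureId'.
    public void Add(int textureId, float x, float y, float w, float h,
                    float u0, float v0, float u1, float v1)
    {
        List<float> data;
        if (!batches.TryGetValue(textureId, out data))
        {
            data = new List<float>();
            batches[textureId] = data;
        }
        data.AddRange(new[]
        {
            x,     y,     u0, v0,
            x + w, y,     u1, v0,
            x + w, y + h, u1, v1,
            x,     y + h, u0, v1,
        });
    }

    // Returns a copy of the vertex data collected for one texture.
    public float[] GetVertices(int textureId)
    {
        return batches[textureId].ToArray();
    }

    // Call at the start of each frame; keeps the lists to avoid reallocating.
    public void Clear()
    {
        foreach (List<float> data in batches.Values)
            data.Clear();
    }
}
```

The existing DrawImage call sites could keep their shape and simply forward to `Add`, with the actual GL work happening once per texture at the end of the frame.
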

Chris The Avatar's picture

Hi all. After thinking in the shower this morning (my best thoughts always seem to come out there :-), what I ended up doing, rather than trying to actively and dynamically manage these textures in giant textures (which didn't look all that feasible in the short term), was to take my most commonly used map graphics, such as structures/tiles/landscape, and lump them into a few 1024x1024 textures (it took 3 just to hold a small subset of the other, less common things that are drawn). My logic was that I would potentially reduce texture binds significantly with my current method and eventually use vertex buffers with these textures (which I hope to look into in the next few weeks). I would also be able to reduce disk reads as the map loads, since the textures would already be cached in memory for most static tiles on the map.

Ultimately, after implementing the change I gained about a 4% improvement in drawing performance (I was expecting a little more, but not bad) and reduced the occasional jitter during walking/running (due to disk loading) to almost nothing. So the netbook with the Atom processor went from 15 - 25 FPS to ~39 - 75 FPS (mostly playable), thanks to all the tips you have provided me with.

I also implemented the methods that are intended for batch drawing of coords, but they actually decreased performance by 10% in the end, so I took them out and went back to the begin/end method. (See below.)

    GL.TexCoordPointer(2, TexCoordPointerType.Float,
                       Marshal.SizeOf(typeof(TexCoord)), srcCoordList);
    GL.VertexPointer(2, VertexPointerType.Float,
                     Marshal.SizeOf(typeof(VertexCoord)), dstCoordList);
    GL.DrawArrays(BeginMode.Quads, 0, coordIndex);

Cody,

I tried to reduce the begin/end calls so that they are only made when a texture is bound and at the end of the frame... Unfortunately I must be missing something, because the first texture in the frame is drawn correctly and then I end up with white. If you have any ideas on why my result may come out this way, let me know.

Thank you both (Fiddler, Cody) very much! Your help has greatly improved my performance on the lesser PCs of the world, which is what I was shooting for!

the Fiddler's picture

White textures generally indicate an OpenGL error. Check GL.GetError or compile and test with the debug version of OpenTK.dll.
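
A small helper for checking the error flag could look like the sketch below. `GLDebug.Check` is a hypothetical name; `GL.GetError` and `ErrorCode` are the OpenTK bindings mentioned above. It needs a current GL context.

```csharp
using System;
using OpenTK.Graphics.OpenGL;

public static class GLDebug
{
    // Throws if OpenGL has recorded an error since the last check.
    // Call it outside Begin/End only: GetError is not valid in between.
    public static void Check(string where)
    {
        ErrorCode error = GL.GetError();
        if (error != ErrorCode.NoError)
        {
            throw new InvalidOperationException(
                "OpenGL error " + error + " at " + where);
        }
    }
}
```

Sprinkling calls such as `GLDebug.Check("after BindTexture")` through the draw routine narrows down which call fails. One thing worth checking in this case is whether GL.BindTexture ends up being called between GL.Begin and GL.End, since that is an invalid operation and leaves the previous texture state in place.
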

Nitpick: your code above is using vertex arrays, not vertex buffer objects. Unlike VBOs, vertex arrays are pretty slow.

Do take a look at display lists, though, as they would allow you to keep your current rendering architecture while improving performance significantly. Here is a good DL tutorial.
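
As a rough idea of the shape this takes, the sketch below records existing immediate-mode calls into a display list once and replays them later. `CompiledGeometry` is a hypothetical wrapper; the GL calls are the standard display list functions as exposed by OpenTK, and a current GL context is assumed.

```csharp
using System;
using OpenTK.Graphics.OpenGL;

public class CompiledGeometry : IDisposable
{
    private readonly int list;

    // 'record' should issue the usual GL.BindTexture / GL.Begin ... GL.End
    // calls for geometry that does not change from frame to frame.
    public CompiledGeometry(Action record)
    {
        list = GL.GenLists(1);
        GL.NewList(list, ListMode.Compile);   // record only, do not draw yet
        record();
        GL.EndList();
    }

    // Replays everything that was recorded with a single call.
    public void Draw()
    {
        GL.CallList(list);
    }

    public void Dispose()
    {
        GL.DeleteLists(list, 1);
    }
}
```

The recorded contents are fixed once the list is compiled, so anything that moves or animates has to be drawn separately or the list has to be rebuilt.
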

Chris The Avatar's picture

Thanks for the tutorial. I read through the first few pages; this method doesn't look too bad in their basic example. I'll see if I can get it implemented in the next couple of weeks.

Thanks for all your help again!

cody's picture

Display lists are quite outdated though; I think you shouldn't use them. Also, they only work with static geometry.