OpenGL Performance Optimization
YANG Jian (jyang@cad.zju.edu.cn)
This article is relatively long; I hope everyone can read it through. ^_^

Outline:
- OpenGL State Machine
- Typical D3D9 Hardware Architecture
- Less State Change
- GL_TRIANGLE_STRIP instead of GL_TRIANGLES
- Texture Loading, Texture Compositing, Texture MipMap
- Multi-Pass vs. Single Pass (Multitexture)
- Texture Compression
- Avoid Pixel Operations
- Vertex Array, Display List, Vertex Buffer Object
- Advanced tech: VS and PS
- Fewer Operations for Depth Test, Stencil Test, Alpha Test; Fast Shadow
- Misc: LOD, Cull, SwapBuffers, wglMakeCurrent, etc.
The last lecture was about the OpenGL driver architecture. I expect everyone still has a lot of questions. Rereading it, I found that some points were not explained clearly enough, and the purpose of discussing the driver architecture was never stated. The main purpose is this: once we understand how the driver is structured, we know how to structure and optimize our OpenGL applications.
Today I will discuss how to optimize the performance of OpenGL applications, starting from the OpenGL state machine and a typical D3D9 hardware architecture. MSDN's OpenGL help also covers performance optimization, but that article is many years old; many new techniques have appeared along with the development of graphics hardware, and we should keep pace with them. What I cover today is probably not comprehensive, so I hope everyone will correct and supplement it.
1 OpenGL State Machine
The OpenGL state machine discussed here is the 1.1 version, which is also the most classic one. You can refer to the following links:
ftp://ftp.sgi.com/opengl/doc/opengl1.1/state.pdf
ftp://ftp.sgi.com/opengl/doc/opengl1.1/state.ps
They describe the same state machine in different formats. The PostScript file is essentially one big diagram, and that diagram is in fact the hardware pipeline of the SGI RealityEngine.
First, the hardware accepts the vertex information submitted by the application (color, normal, texture coordinate, edge flag, vertex position). The vertices go through lighting and the model-view (world) transform (glTranslate, glRotate, glScale), then user clip planes, then the projection transform and clipping (projection matrix), then the viewport transform. Next come primitive setup and rasterization (flat or smooth shading), which generate fragments. Each fragment then passes through texture addressing, texture blending, the depth test, the stencil test, the alpha test, and alpha blending, and is finally written to the color, depth, and stencil buffers. The whole process is as follows:

Application
| Vertex Information (Material, Normal, TexCoord, EdgeFlag, Vertex Position)
| Lighting
| World Matrix Transform
| User Clip Plane Clipping
| Projection Matrix Transform and Clip
| Viewport Transform
| Primitive Setup (Point, Line, Triangle)
| Rasterization (Flat or Smooth) ==> Generate Fragments
| Fragment Texture Addressing == Texture in Video Memory
| Fragment Texture Blend (blend Diffuse, Specular and Texture of Fragment)
| Depth Test == with Depth Buffer
| Stencil Test == with Stencil Buffer
| Alpha Test == Fragment Alpha vs. Reference Value
| Alpha Blend == with Color Buffer
| Fragment Write to Frame Buffers
We can see that for OpenGL to process a single geometric primitive, a lot of work is required. Everyone should become familiar with each step of this pipeline. There are several concepts that need explanation.
The first concept is the fragment. Each fragment corresponds to a pixel position on the screen and is produced by the rasterization engine using flat or smooth (Gouraud) shading. A fragment generated by the rasterizer contains: screen coordinates; color information (diffuse and specular); depth and stencil information; and texture coordinates (u, v).
The second concept is texture blending, which refers to how the texture color is combined with the fragment color (diffuse and specular). This is what glTexEnv controls: depending on its parameters, the fragment either keeps the texel (texture element) color or mixes the texel with the fragment color. The third concept is alpha blending. If a fragment passes the depth test, stencil test, and alpha test, it is blended with the pixel already stored at the corresponding location in the color buffer.
I believe everyone now has some understanding of the OpenGL state machine. In fact, this was also the main reference model for graphics pipelines before Direct3D8.
If we can remove one operation from the pipeline, we gain performance; the precondition, of course, is that we still draw the correct image.
2 Typical D3D9 Hardware Architecture
The OpenGL state machine above is essentially the pipeline structure of SGI's RealityEngine and of graphics hardware up to the Direct3D7 generation. Let me now introduce the typical hardware architecture of D3D9 (that is, the Direct3D9 reference model).
Application
| IDirect3DDevice9::DrawIndexedPrimitive
| D3D Driver (Display Driver) sends commands to the hardware over AGP
| -- the following is hardware --
| Command Interpreter
| Fetch indexed primitive data into the vertex cache (accessing the index buffer and vertex buffer)
| Put cached data into the vertex shader input registers
| Vertex Shader does transform, lighting and vertex blending
| Vertex Shader outputs vertices in screen coordinate space: screen position, diffuse, texture coords
| User Clip Plane
| Guard-band Clip
| Primitive Setup (Point, Line, Triangle)
| Rasterization
| Pixel Shader
| Depth Test
| Stencil Test
| Alpha Test
| Alpha Blend
| Frame Buffers
We can see that the D3D9 pipeline differs considerably from the OpenGL 1.1 pipeline. OpenGL 1.1 vertex data is submitted in immediate mode by calling the OpenGL API, while D3D9 works by fetch and put: data is read from the vertex buffer into the vertex shader input registers. In OpenGL 1.1, lighting and geometric transformation are done by the traditional fixed-function pipeline (TnL: transform and lighting), while D3D9 implements them with the vertex shader, which is more complex than the fixed-function pipeline and can do much more. In OpenGL 1.1, texture mapping and texture blending are fixed stages, while in D3D9 they are programmable through the pixel shader. The vertex shader and pixel shader of D3D8/D3D9 are a huge step forward in graphics architecture; they make graphics programming more flexible, but also more difficult and complicated.
For the D3D8/D3D9 hardware architecture, program optimization involves more work: we also have to optimize the vertex shader and pixel shader.
Today my focus is on performance optimization of the traditional graphics pipeline (TnL).
3 Basic Optimization Methods
3.1 Reduce OpenGL State Changes
If our application keeps changing OpenGL state, the load on the driver, the AGP data transfer, and the graphics hardware all increases. Whenever we change an OpenGL state, several hardware registers may be involved; the driver must send the modified register values to the hardware over the AGP bus, which consumes a lot of CPU time, AGP bandwidth, and hardware command-interpretation time.
Advice 1: Group primitives that share similar state together, so that the number of OpenGL state changes is reduced.
Advice 2: Batch state settings together (state sets) to reduce the CPU time spent in the driver.
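As a rough illustration of Advice 1, here is a minimal sketch (the Mesh type and the commented-out draw_mesh helper are hypothetical, not from the original article) that sorts objects by their texture so glBindTexture is only called when the texture actually changes:

    #include <GL/gl.h>
    #include <stdlib.h>

    typedef struct { GLuint texture; /* ... vertex data ... */ } Mesh;

    static int by_texture(const void *a, const void *b)
    {
        const Mesh *ma = (const Mesh *)a, *mb = (const Mesh *)b;
        return (ma->texture > mb->texture) - (ma->texture < mb->texture);
    }

    void draw_scene(Mesh *meshes, size_t count)
    {
        GLuint bound = 0;
        qsort(meshes, count, sizeof(Mesh), by_texture);   /* batch by state */
        for (size_t i = 0; i < count; ++i) {
            if (meshes[i].texture != bound) {             /* change state only when needed */
                glBindTexture(GL_TEXTURE_2D, meshes[i].texture);
                bound = meshes[i].texture;
            }
            /* draw_mesh(&meshes[i]);  -- issue the geometry here */
        }
    }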
3.2 Avoid Unnecessary Lighting Calculations
Specular (highlight) computation in particular is one of the most expensive operations in lighting. Diffuse computation is comparatively common, and graphics hardware generally optimizes the diffuse path.
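A minimal sketch of one way to follow this advice, assuming a current GL context and purely illustrative material values: keep diffuse lighting but set the specular material and light color to black, so the hardware has no highlight term to evaluate.

    #include <GL/gl.h>

    void setup_cheap_lighting(void)
    {
        static const GLfloat diffuse[] = { 0.8f, 0.8f, 0.8f, 1.0f };
        static const GLfloat black[]   = { 0.0f, 0.0f, 0.0f, 1.0f };

        glEnable(GL_LIGHTING);
        glEnable(GL_LIGHT0);
        glMaterialfv(GL_FRONT, GL_DIFFUSE,  diffuse);
        glMaterialfv(GL_FRONT, GL_SPECULAR, black);   /* no specular highlight   */
        glMaterialf(GL_FRONT, GL_SHININESS, 0.0f);    /* nothing to exponentiate */
        glLightfv(GL_LIGHT0, GL_SPECULAR, black);     /* light emits no specular */
    }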
3.3 Primitive Type Optimization
The primitive type we use most is the triangle. If we always use GL_TRIANGLES, we waste a lot of CPU time, AGP bandwidth, and graphics hardware resources. The reasons are:
(1) With GL_TRIANGLES, drawing one triangle sends three vertices; with GL_TRIANGLE_FAN or GL_TRIANGLE_STRIP, each additional triangle costs roughly one vertex on average.
(2) Hardware designs generally include a small vertex cache. Using GL_TRIANGLES makes poor use of this cache and wastes a lot of the hardware's TnL time.
(3) GL_TRIANGLES can consume up to 200% of the hardware TnL time compared with GL_TRIANGLE_STRIP.
According to tests I did three years ago on GeForce 3 and GeForce Quadro 3 hardware, GL_TRIANGLE_STRIP was 100% to 200% faster than GL_TRIANGLES.
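To make the vertex-count argument concrete, here is a minimal sketch (the ribbon geometry and heights[] data are illustrative, and heights[] is assumed to hold n + 1 values): a row of n quads drawn as one GL_TRIANGLE_STRIP needs only 2n + 2 vertices, where GL_TRIANGLES would need 6n.

    #include <GL/gl.h>

    void draw_ribbon_strip(const float *heights, int n)
    {
        glBegin(GL_TRIANGLE_STRIP);
        for (int i = 0; i <= n; ++i) {
            glVertex3f((float)i, 0.0f,       0.0f);   /* bottom edge */
            glVertex3f((float)i, heights[i], 0.0f);   /* top edge    */
        }
        glEnd();
    }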
Recommendation: Use GL_TRIANGLE_STRIP as much as possible to replace GL_TRIANGLES. A mature triangle-stripping package: http://www.cs.sunysb.edu/~stripe/

3.4 Under Lighting, Use glMaterial Instead of glColor
When lighting is enabled, if the program uses glMaterial, the driver only has to load the material attributes into the hardware once. Using glColor causes the driver to load color information for every vertex, which takes more CPU time and AGP bandwidth.
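A minimal sketch of the recommended pattern, with illustrative material values and caller-supplied vertex/normal arrays: one glMaterialfv call covers the whole batch, and no per-vertex glColor is sent.

    #include <GL/gl.h>

    void draw_lit_mesh(const GLfloat (*vertices)[3], const GLfloat (*normals)[3], int count)
    {
        static const GLfloat green[] = { 0.1f, 0.8f, 0.1f, 1.0f };

        /* preferred: one material call for the whole batch */
        glMaterialfv(GL_FRONT, GL_AMBIENT_AND_DIFFUSE, green);

        glBegin(GL_TRIANGLE_STRIP);
        for (int i = 0; i < count; ++i) {
            glNormal3fv(normals[i]);
            glVertex3fv(vertices[i]);   /* no glColor3f per vertex needed */
        }
        glEnd();
    }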
4 Texture Optimization
There is a lot to say on this topic, so I treat it as an independent section.
4.1 Optimize Texture Loading
A common performance problem for OpenGL beginners is resetting the texture parameters and calling glTexImage2D every time a texture is used. In fact, OpenGL has a naming mechanism for textures (and for display lists): glGenTextures, glBindTexture, and glDeleteTextures. Let's compare the two approaches.
Method 1: Call glTexImage2D and reset the texture parameters every time before using the texture. The driver will then repeatedly call something like IDirectDraw7::CreateSurface, copy the data from user memory into the driver's system memory, and then copy it from system memory into video memory.
Method 2: Bind a texture object with glBindTexture (say, texture number 5), then set its parameters and image data once with glTexParameter/glTexEnv and glTexImage2D. When the texture is needed again, just call glBindTexture: it makes texture 5 current, with the parameters last set for it, and you can decide whether to modify the parameters as needed.
The main advantage of Method 2 is that the application calls glTexImage2D only once, which saves a large amount of CPU and AGP time, because the repeated upload is where most of the CPU time goes; its overhead is very high.
Advice: When the application needs multiple textures, call glGenTextures once after wglMakeCurrent to generate the texture names, bind each texture with glBindTexture, and load its parameters and image data. When a texture is needed later, just call glBindTexture again.
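A minimal sketch of this texture-object pattern (the helper names and the RGBA pixel data are illustrative): the image is uploaded once at creation time, and later uses only rebind.

    #include <GL/gl.h>

    GLuint create_texture(const void *pixels, int w, int h)
    {
        GLuint tex;
        glGenTextures(1, &tex);                     /* name the texture once */
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);  /* upload once */
        return tex;
    }

    void use_texture(GLuint tex)
    {
        glBindTexture(GL_TEXTURE_2D, tex);          /* cheap: no re-upload */
        /* ... draw textured geometry ... */
    }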
Further reading:
OpenGL Spec & OpenGL Manual: http://www.opengl.org/developers/documentation/specs.html
GLUT Examples: http://www.opengl.org/developers/documentation/glut.html
4.2 Use Mipmapped Textures Whenever Possible
Graphics hardware generally supports mipmapping. If the application uses mipmaps, the hardware selects texels according to the texture LOD appropriate for the current fragment, which saves a lot of texel addressing time in video memory; and since the hardware caches texture data, the smaller mipmap levels (higher level numbers) are much friendlier to that cache. If the application only provides the largest, level-0 texture, the hardware has to sample from that one level, which not only wastes a lot of computing resources but also consumes a lot of graphics chip bandwidth.
Advice: 1. Do not use especially large textures (> 256 x 256). 2. Use mipmaps.
Tip: gluBuild*DMipmaps can convert non-power-of-two images into standard OpenGL textures with a full mipmap chain. However, gluBuild*DMipmaps does not support automatic mipmap generation for compressed textures.
Further reading: GLU Manual: ftp://ftp.sgi.com/opengl/doc/opengl1.2/glu1.3.pdf
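A minimal sketch of the mipmap advice, assuming a current GL context and illustrative RGB pixel data: gluBuild2DMipmaps builds the full chain (rescaling non-power-of-two input), and a mipmapped minification filter turns it on.

    #include <GL/gl.h>
    #include <GL/glu.h>

    GLuint create_mipmapped_texture(const void *pixels, int w, int h)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        gluBuild2DMipmaps(GL_TEXTURE_2D, GL_RGB, w, h,
                          GL_RGB, GL_UNSIGNED_BYTE, pixels);   /* builds all levels */
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                        GL_LINEAR_MIPMAP_LINEAR);              /* trilinear sampling */
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        return tex;
    }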
4.3 Texture Atlases (Combining Textures)
In games and visualization we constantly run into many very small textures. A better approach is to combine them into one relatively large texture, for example 256x256; then, when loading textures into video memory, the driver only needs to load once. This method is often seen in modeling software; for example, the human body modeling software Poser combines a person's hair, face, eyes, and so on into one texture.
Advice: Combine multiple small textures into one large texture, then either modify the texture coordinates of the triangles, or use glMatrixMode(GL_TEXTURE) and transform the texture coordinates with the texture matrix.
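A minimal sketch of the texture-matrix variant, with a hypothetical atlas layout: the original 0..1 texture coordinates are remapped into one cell of the atlas without touching the mesh data. (u0, v0) is the cell origin and (du, dv) its size in atlas space.

    #include <GL/gl.h>

    void select_atlas_cell(float u0, float v0, float du, float dv)
    {
        glMatrixMode(GL_TEXTURE);
        glLoadIdentity();
        glTranslatef(u0, v0, 0.0f);   /* move into the cell      */
        glScalef(du, dv, 1.0f);       /* shrink 0..1 to the cell */
        glMatrixMode(GL_MODELVIEW);
    }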
4.4 Use Multitexture to Replace Multi-Pass
OpenGL 1.2.1 extension: GL_ARB_multitexture
Graphics cards that support Direct3D7 (roughly OpenGL 1.2.1) and later all support multitexturing. We can take full advantage of this feature and replace multi-pass rendering with multitexturing.
For example, suppose we want to draw a cola bottle that carries two labels. With multi-pass we have to draw it three times:
// Draw the bottle material, e.g. green
glMaterial(...);
glDisable(GL_BLEND);
glDepthFunc(GL_LEQUAL);
glBegin(GL_TRIANGLE_STRIP);
    glNormal(...); glVertex(...);
    ...
glEnd();

// Draw the first (inner) label, tex0 = first label texture
glDepthFunc(GL_EQUAL);
glEnable(GL_BLEND);
glBindTexture(GL_TEXTURE_2D, tex0);
glBegin(GL_TRIANGLE_STRIP);
    glTexCoord(...); glVertex(...);
glEnd();

// Draw the second label, tex1 = second label texture
glDepthFunc(GL_EQUAL);
glEnable(GL_BLEND);
glBindTexture(GL_TEXTURE_2D, tex1);
glBegin(GL_TRIANGLE_STRIP);
    glTexCoord(...); glVertex(...);
glEnd();
If we use multitexture (the OpenGL 1.2.1 extension) instead, a single pass gets the job done:

glMaterial(...);
glDepthFunc(GL_LEQUAL);
glDisable(GL_BLEND);
glActiveTextureARB(GL_TEXTURE0_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);
glBindTexture(GL_TEXTURE_2D, tex0);
glActiveTextureARB(GL_TEXTURE1_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);
glBindTexture(GL_TEXTURE_2D, tex1);
glBegin(GL_TRIANGLE_STRIP);
    glNormal(...);
    glMultiTexCoord2fARB(GL_TEXTURE0_ARB, u0, v0);
    glMultiTexCoord2fARB(GL_TEXTURE1_ARB, u1, v1);
    glVertex(...);
glEnd();

Compared with the multi-pass method, the multitexture method saves four pipeline operations per extra pass: the depth test, the alpha test, alpha blending, and the write to the frame buffers.
Advice: Check the OpenGL extension string and use multitexture whenever it is supported.
Further reading: OpenGL Specs: http://www.opengl.org/developers/documentation/specs.html
OpenGL EXTENSION registry: http://oss.sgi.com/projects/ogl-sample/registry
4.5 Use Compressed Textures
OpenGL texture compression formats include:
GL_COMPRESSED_RGB_S3TC_DXT1_EXT
GL_COMPRESSED_RGBA_S3TC_DXT1_EXT
GL_COMPRESSED_RGBA_S3TC_DXT3_EXT
GL_COMPRESSED_RGBA_S3TC_DXT5_EXT
Compared with uncompressed textures, compressed textures need less CPU time to upload and less storage, and they fit the graphics hardware's texture cache much better. They can therefore significantly improve application performance, especially for applications with huge amounts of texture data.
Disadvantage: the color content of the texture needs to be fairly regular; otherwise compression causes serious color distortion.
Recommendation: Check the following three OpenGL extensions and use compressed textures whenever possible:
GL_ARB_texture_compression
GL_EXT_texture_compression_s3tc
GL_S3_s3tc
Further reading: OpenGL Specs: http://www.opengl.org/developers/documentation/specs.html
OpenGL EXTENSION registry: http://oss.sgi.com/projects/ogl-sample/registry
We can generate compressed textures with the DirectX SDK tool DXTex, or get tools and a tutorial from NVIDIA: http://developer.nvidia.com/object/nv_texture_tools.html
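A minimal sketch of uploading a pre-compressed DXT1 image (for example one produced offline by DXTex), assuming a texture object is already bound; the extension check and the wglGetProcAddress lookup are needed because glCompressedTexImage2DARB comes from GL_ARB_texture_compression, and the DXT1 token is defined by hand in case no glext.h is available.

    #include <windows.h>
    #include <GL/gl.h>
    #include <string.h>

    #ifndef GL_COMPRESSED_RGB_S3TC_DXT1_EXT
    #define GL_COMPRESSED_RGB_S3TC_DXT1_EXT 0x83F0
    #endif

    typedef void (APIENTRY *PFNGLCOMPRESSEDTEXIMAGE2DARB)
        (GLenum, GLint, GLenum, GLsizei, GLsizei, GLint, GLsizei, const void *);

    int upload_dxt1(const void *dxt1_data, int w, int h, int size)
    {
        const char *ext = (const char *)glGetString(GL_EXTENSIONS);
        if (!ext || !strstr(ext, "GL_EXT_texture_compression_s3tc"))
            return 0;                                  /* fall back to an RGBA path */

        PFNGLCOMPRESSEDTEXIMAGE2DARB glCompressedTexImage2DARB =
            (PFNGLCOMPRESSEDTEXIMAGE2DARB)wglGetProcAddress("glCompressedTexImage2DARB");
        if (!glCompressedTexImage2DARB)
            return 0;

        /* assumes glBindTexture(GL_TEXTURE_2D, tex) was already called */
        glCompressedTexImage2DARB(GL_TEXTURE_2D, 0,
                                  GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                                  w, h, 0, size, dxt1_data);
        return 1;
    }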
4.6 Reasonable Texture Sizes
Hardware texture caches generally work on tiles of 4x4, 8x8, up to 64x64 texels. If your texture content is relatively simple, use the smallest texture size that still satisfies the visual quality requirements.
5 Vertex Array
Compared with glBegin/glEnd, vertex arrays give the driver the best memory-copy efficiency, because the driver only needs one data move; glBegin/glEnd and display lists require roughly three data moves. Therefore, use glDrawArrays and glArrayElement as much as possible. For vertex arrays, OpenGL has the following extensions:
GL_EXT_vertex_array
GL_ATI_element_array
GL_EXT_draw_range_elements
GL_EXT_compiled_vertex_array
GL_SUN_mesh_array
GL_ATI_vertex_attrib_array_object
Several of these are commonly used OpenGL extensions; games such as Quake III, CS, and Half-Life rely on them.
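A minimal sketch of the core GL 1.1 vertex-array path, with caller-supplied arrays as illustrative data: a single glDrawElements call replaces an entire glBegin/glEnd loop, so the driver can move the data in one copy.

    #include <GL/gl.h>

    void draw_indexed_mesh(const GLfloat *xyz, const GLfloat *normals,
                           const GLushort *indices, GLsizei index_count)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_NORMAL_ARRAY);

        glVertexPointer(3, GL_FLOAT, 0, xyz);
        glNormalPointer(GL_FLOAT, 0, normals);

        glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);

        glDisableClientState(GL_NORMAL_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
    }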
Further reading: OpenGL SPECS: http://www.opengl.org/developers/documentation/specs.html
OpenGL EXTENSION registry: http://oss.sgi.com/projects/ogl-sample/registry
6 Buffer Object
In fact, everything discussed above is the traditional OpenGL way of defining primitives, essentially through glBegin and glEnd, where everything is drawn in immediate mode. Direct3D instead defines geometry through vertex buffers and index buffers and the vertex attributes they contain. Vertex buffers and index buffers can be kept in video memory, so the application does not have to send the same vertex data to the hardware every frame, which speeds up processing. To make up for this shortcoming, NVIDIA, ATI and the ARB introduced the following extension: GL_ARB_vertex_buffer_object
This extension also serves OpenGL's vertex programs (the equivalent of D3D9 vertex shaders); there is a lot more to say about it, which I will not expand on here. What matters here is that it is faster than all the immediate-mode ways of defining primitives, for the following reasons: (1) the OpenGL application only needs to copy the data once, after which the driver simply tells the graphics hardware its physical address; (2) for immediate-mode definitions, by contrast, the driver has to copy the data from system memory into AGP (non-local video) memory every time and then send it to the graphics processor over the AGP bus.
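A minimal sketch of using GL_ARB_vertex_buffer_object (tokens and entry points are fetched and defined by hand since they are extension additions; the helper name is illustrative): the vertex data is copied into a buffer object once, and later draws reference it by offset.

    #include <windows.h>
    #include <stddef.h>
    #include <GL/gl.h>

    #ifndef GL_ARRAY_BUFFER_ARB
    #define GL_ARRAY_BUFFER_ARB 0x8892
    #define GL_STATIC_DRAW_ARB  0x88E4
    #endif

    typedef ptrdiff_t GLsizeiptrARB;
    typedef void (APIENTRY *PFNGLGENBUFFERSARB)(GLsizei, GLuint *);
    typedef void (APIENTRY *PFNGLBINDBUFFERARB)(GLenum, GLuint);
    typedef void (APIENTRY *PFNGLBUFFERDATAARB)(GLenum, GLsizeiptrARB, const void *, GLenum);

    GLuint make_static_vbo(const GLfloat *xyz, GLsizei vertex_count)
    {
        PFNGLGENBUFFERSARB glGenBuffersARB =
            (PFNGLGENBUFFERSARB)wglGetProcAddress("glGenBuffersARB");
        PFNGLBINDBUFFERARB glBindBufferARB =
            (PFNGLBINDBUFFERARB)wglGetProcAddress("glBindBufferARB");
        PFNGLBUFFERDATAARB glBufferDataARB =
            (PFNGLBUFFERDATAARB)wglGetProcAddress("glBufferDataARB");
        GLuint vbo = 0;

        if (!glGenBuffersARB || !glBindBufferARB || !glBufferDataARB)
            return 0;                                  /* extension not present */

        glGenBuffersARB(1, &vbo);
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        glBufferDataARB(GL_ARRAY_BUFFER_ARB,
                        vertex_count * 3 * sizeof(GLfloat), xyz,
                        GL_STATIC_DRAW_ARB);           /* one-time copy */

        /* at draw time: glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
           glVertexPointer(3, GL_FLOAT, 0, (const void *)0); glDrawArrays(...); */
        return vbo;
    }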
Please refer to OpenGL Extension Registry: http://oss.sgi.com/projects/ogl-sample/registry
7 Advanced Tech: Vertex Program and Fragment Program (D3D Vertex Shader and Pixel Shader)
That content is too long, so I will cover it in a separate topic on D3D9.
8 Fewer Operations for Depth Test, Stencil Test and Alpha Test
In fact, the depth test, stencil test and alpha test can affect up to about 30% of OpenGL's pixel fill rate. In other words, optimizing them can recover up to roughly 30% performance.
I have tested the performance of Quake III and obtained the following results:
Disable depth test: 2% gain
Disable alpha test: 6% gain
Disable alpha blend: 2% gain
Disable depth clear: 15% gain
In fact, Quake III itself could be optimized further. Everyone knows Quake III is the classic game engine; it draws the BSP world with multi-pass rendering and alpha blending to get very good lighting, and its drawing order is from the farthest objects to the nearest ones. If Quake III were changed to draw from near to far, then, given the amount of triangle occlusion in its scenes, the depth test would reject 5% to 10% or even more of the fragments (pixels), and all the pipeline operations after the depth test would be skipped for those fragments. I believe this would bring a 5% to 15% performance improvement.
So for roaming outdoor scenes, I suggest drawing from near to far. It may bring a considerable performance improvement.
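A minimal sketch of this near-to-far strategy, using a hypothetical Object type and a commented-out draw_object helper: opaque objects are sorted by distance to the eye and drawn front to back, so the depth test can reject hidden fragments before the expensive texturing and blending stages.

    #include <GL/gl.h>
    #include <stdlib.h>

    typedef struct { float distance_to_eye; /* ... */ } Object;

    static int nearer_first(const void *a, const void *b)
    {
        float da = ((const Object *)a)->distance_to_eye;
        float db = ((const Object *)b)->distance_to_eye;
        return (da > db) - (da < db);
    }

    void draw_opaque_pass(Object *objects, size_t count)
    {
        qsort(objects, count, sizeof(Object), nearer_first);
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LEQUAL);
        for (size_t i = 0; i < count; ++i) {
            /* draw_object(&objects[i]);  -- near objects fill the depth buffer,
               far ones are largely rejected per fragment */
        }
    }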
9 Fast Shadow
I am doing related work on this; I would like to present it later as a separate topic to start the discussion.
10 Misc: LOD, Cull, SwapBuffers, wglMakeCurrent, etc.
The title of this last part looks odd; it is a bit of a hodgepodge.
10.1 LOD
Many people already know about LOD, so I will not say much: it means processing less geometric data (fewer vertices) and fewer texture operations (texture LOD: mipmaps).
10.2 Cull Face
Face culling removes back-facing triangles. If the back-facing triangles are not drawn, there is theoretically close to a 50% performance improvement, provided the bottleneck is in TnL or the vertex shader.
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
In my Quake III tests, even though Quake III is BSP-tree based and in principle should not be submitting many back-facing surfaces, I still measured a performance improvement (varying with CPU and bus speed).
10.3 SwapBuffers
In fact, a full-screen OpenGL program's SwapBuffers is implemented as something like IDirectDrawSurface7::Flip or IDirect3DDevice8::Present. Each flip then avoids the 1024x768x4-byte copy of display memory per frame that a windowed OpenGL program has to do at a resolution of 1024x768 with 32-bit color. Depending on the application, this can give a considerable performance improvement; you can work out the numbers yourself.
10.4 wglMakeCurrent
wglMakeCurrent is a very expensive operation. I tested a GeForce3 Ti500 in 2001: in the best case it could do about 5000 calls per second. The CPU speed at the time was, if I remember correctly, 800 MHz or 1.4 GHz. wglMakeCurrent can also have side effects, such as parts of the image being lost. One typical benchmark, Indy3D, uses this method; when I traced that program, I felt the code from Sense8 (the company behind WTK) was quite poor.
Advice: Avoid calling wglMakeCurrent as much as possible.
I have been writing for 3.5 hours and my fingers hurt; time for a break.
11 Avoid Pixel Operations
In many OpenGL implementations, the copy from system memory to video memory is done purely in software, and the rest of the graphics pipeline has to wait while the CPU completes it, which greatly reduces the program's performance. The operations concerned are:
glBitmap, glDrawPixels, glReadPixels, glCopyPixels
Workaround: use textures instead of pixel operations. For example, suppose you want to output a line of text on the screen, as "Quake III Arena" does. First produce a texture that contains all the letters and digits (I cannot post the BMP image here, so I sketch its layout): an RGBA 2D texture holding the character set in rows, A B C D ... Z, then 0 1 2 ... 9. Each letter or digit is then drawn with two textured triangles.
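A minimal sketch of this font-texture approach, assuming a hypothetical 16x16 ASCII-ordered glyph atlas and an illustrative texture id: each character becomes one textured quad (two triangles) instead of a glBitmap/glDrawPixels call.

    #include <GL/gl.h>

    void draw_char(GLuint font_texture, char c, float x, float y, float size)
    {
        const float cell = 1.0f / 16.0f;                 /* glyph size in UV space */
        float u = (float)(c % 16) * cell;
        float v = (float)(c / 16) * cell;

        glBindTexture(GL_TEXTURE_2D, font_texture);
        glBegin(GL_TRIANGLE_STRIP);                      /* quad = 2 triangles */
        glTexCoord2f(u,        v + cell); glVertex2f(x,        y);
        glTexCoord2f(u + cell, v + cell); glVertex2f(x + size, y);
        glTexCoord2f(u,        v);        glVertex2f(x,        y + size);
        glTexCoord2f(u + cell, v);        glVertex2f(x + size, y + size);
        glEnd();
    }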