ATI Radeon HD 2900 XT

ATI’s DirectX 10-capable GPU is finally here in the form of the Radeon HD 2900 XT. Does this $399 video card have what it takes to compete with NVIDIA’s 8800 series? We explore the architecture, image quality, and real-world gaming, which shows a different experience than canned benchmarks.

Process Technology

ATI is using a unique 80nm HS process for the HD 2900 XT. While this process does not have the power efficiency of the 65nm process that the HD 2600 and HD 2400 enjoy, it was designed to allow ATI to hit high clock speeds with the HD 2900 XT. Power draw is quite high on the Radeon HD 2900 XT, and we will show you full power testing toward the end of this evaluation, which you don’t want to miss. The Radeon HD 2600 and 2400 at 65nm are the most power efficient designs yet in the desktop GPU space.

Unified Architecture

As we would have expected, ATI is using a unified shader architecture that allows processors, called stream processors, to compute vertex, pixel and geometry data. If you read our DirectX 10 & the Future of Gaming article you will see what the Radeon HD 2000 series is all about and the benefits of DirectX 10 and a unified architecture. With that in mind we won’t waste time recapping that information, so if you are lost on what a unified architecture is, please read that article first.

Article Image Article Image

Here is the layout of the Radeon HD 2000 series architecture. We start with the command processor, which feeds the setup engine, and then the dispatcher, which issues threads (pixels) to the stream processing units and texture units as needed. Data is read from and written to memory along the way, travels to the ROPs, and finally goes out to memory. At the top you will also see something new called a Programmable Tessellator; we will talk more about that later.

Dispatch Processor

The key to the new Radeon HD 2000 series is the Ultra-Threaded dispatch processor. ATI is very proud of this dispatch processor, stating that it is very smart at determining where threads need to go. As games become more complex, utilizing a greater amount of vertex and pixel data and eventually geometry data as well, the GPU must keep all stream processors busy at all times and make sure the load is balanced as it should be. That, in a nutshell, is the purpose of the thread dispatcher.

There are literally thousands of threads in flight on the GPU, and the dispatcher must issue each one and prepare new threads to go through the pipeline. Keep in mind that only so many of those threads are being worked on by the stream processors at any one time.
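To make the load-balancing idea concrete, here is a toy sketch in Python. It is not ATI's dispatch logic, which this article does not describe in detail; the number of units, the per-thread costs and the function names are all made up for illustration. It simply compares handing threads out in fixed rotation against always handing the next thread to the least-busy unit.

```python
import heapq


def dispatch_round_robin(thread_costs, num_units):
    """Hand threads to units in fixed rotation, ignoring how busy each unit is."""
    loads = [0] * num_units
    for i, cost in enumerate(thread_costs):
        loads[i % num_units] += cost
    return loads


def dispatch_least_loaded(thread_costs, num_units):
    """Hand each thread to whichever unit currently has the least queued work."""
    # Heap entries are (queued_work, unit_index); ties go to the lower index.
    heap = [(0, unit) for unit in range(num_units)]
    heapq.heapify(heap)
    loads = [0] * num_units
    for cost in thread_costs:
        load, unit = heapq.heappop(heap)
        load += cost
        loads[unit] = load
        heapq.heappush(heap, (load, unit))
    return loads


# Hypothetical workload: one expensive thread followed by four cheap ones.
costs = [4, 1, 1, 1, 1]
round_robin = dispatch_round_robin(costs, 2)     # [6, 2]: one unit is overloaded
least_loaded = dispatch_least_loaded(costs, 2)   # [4, 4]: work is evenly spread
```

With the same five threads, the rotation leaves one unit with three times the work of the other, while the load-aware version finishes both at the same time. A real dispatcher also has to hide memory latency and juggle thread types, which this sketch ignores.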

Stream Processors

The Radeon HD 2000 series is based on a Superscalar VLIW (Very Long Instruction Word) design. This differs from the previous Radeon X1000 series, which was a combination of Vector + Scalar. This is why the dispatcher is extremely important in ATI’s Superscalar design; it is the key to making a Superscalar architecture efficient. If the dispatcher is inefficient at keeping all parts of the GPU fed with instructions, performance will suffer.

The GeForce 8 series is based on a Scalar architecture. Theoretically a Superscalar architecture is better, but it all comes down to the dispatcher. NVIDIA tried a VLIW architecture with the GeForce FX series, and we all know how that turned out. There were promises of performance improvements through the compiler, but it did not go so well for them with that architecture. It remains to be seen how well ATI’s Superscalar architecture will hold up in future games. The potential is there for great things; we will see.
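Why a wide design lives or dies by how well it is fed can be shown with a small Python sketch. This is a toy greedy packer, not ATI's compiler or hardware scheduler; the five-wide bundle, the operation names and the helper functions are assumptions made for illustration. Each bundle may only hold operations whose inputs were produced in an earlier bundle.

```python
def pack_bundles(ops, width=5):
    """Greedily pack operations into bundles of at most `width` slots.

    `ops` maps an operation name to the set of names it depends on.
    An operation may only issue after all of its dependencies have
    issued in an earlier bundle.
    """
    done = set()
    remaining = dict(ops)
    bundles = []
    while remaining:
        ready = [name for name, deps in remaining.items() if deps <= done]
        if not ready:
            raise ValueError("dependency cycle in ops")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for name in bundle:
            del remaining[name]
    return bundles


def utilization(bundles, width=5):
    """Fraction of issue slots that actually carry an operation."""
    return sum(len(b) for b in bundles) / (width * len(bundles))


# Ten independent operations fill every slot of two bundles.
independent = {f"op{i}": set() for i in range(10)}

# Five operations that each need the previous result fill one slot per bundle.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

full = utilization(pack_bundles(independent))   # 1.0
starved = utilization(pack_bundles(chain))      # 0.2
```

The same hardware width delivers five times the useful work on the independent stream as on the dependent chain, which is the gap that good scheduling has to close.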

Article Image

Here is a close up of one of the stream processor blocks. Each of the “yellow” cylinders inside this block is a stream processor. There are five per block and 64 blocks total, which gives us 320 stream processors. This sounds like a lot compared to the GeForce 8 series, but it all depends on how you count them, and ATI and NVIDIA look at this differently.

ATI states that these are 320 true stream processors. NVIDIA has been circulating a rather frank document to the press giving its view of how ATI is counting the processors. NVIDIA claims that ATI is counting both the standard ALUs and the special-function ALUs to reach the number of 320. NVIDIA states that if you count the units in its own GPUs the same way, an 8800 GTX has 128 standard ALUs and 128 special-function units for a total of 256 processors.

Whether this is the case or not, it is quite clear that the two architectures differ in a very interesting way. ATI does physically have more processors; they are arranged differently with the Superscalar architecture, and they run at the core clock frequency. The GeForce 8 series has fewer stream processors, but NVIDIA has created separate clock domains and clocked them at high speeds: 1.2 GHz on the GTS and 1.35 GHz on the GTX. Which one is better geared for the future of games is impossible to predict right now.
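The counting argument is easy to lay out as arithmetic. The Python sketch below uses only figures quoted above, plus the five-per-block figure implied by 320 divided by 64. The "units times clock" product is a deliberately crude yardstick that assumes one operation per unit per clock and perfect utilization, which neither design achieves in practice; the function names are made up for illustration. The HD 2900 XT core clock is not quoted in this section, so the sketch solves for a break-even clock instead of plugging one in.

```python
def total_units(blocks, units_per_block):
    """Total processing units for a design built from identical blocks."""
    return blocks * units_per_block


def unit_ghz(units, clock_ghz):
    """Crude peak yardstick: units multiplied by their clock in GHz."""
    return units * clock_ghz


def break_even_clock_ghz(units, other_units, other_clock_ghz):
    """Clock at which `units` matches another design's units * clock product."""
    return unit_ghz(other_units, other_clock_ghz) / units


# ATI's count: 64 blocks, five stream processors each.
ati_stream_processors = total_units(64, 5)    # 320

# NVIDIA's claim for the 8800 GTX counted "the ATI way":
# 128 standard ALUs plus 128 special-function units.
gtx_counted_ati_style = 128 + 128             # 256

# 8800 GTX stream processors at their 1.35 GHz shader clock.
gtx_yardstick = unit_ghz(128, 1.35)           # about 172.8

# Core clock at which 320 units would equal that product.
ati_break_even = break_even_clock_ghz(320, 128, 1.35)   # about 0.54 GHz
```

On this yardstick alone, 320 units would only need to run at roughly 0.54 GHz to match 128 units at 1.35 GHz. How much of that paper advantage survives depends on how well the wider design is kept busy.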

Tessellator

Included in the ATI Radeon HD 2000 series is a new unit called the Programmable Tessellator. This unit is contained within the setup engine, the very first part of the entire pipeline. It sits right above the vertex assembler and communicates directly with it. As the name implies, this unit is capable of doing tessellation. Understanding tessellation is very easy; in general it means covering a surface with a repeated shape, with no overlaps and no gaps. For gaming it means taking a shape made of triangles and breaking those triangles into many more triangles to create a smoother surface. The tessellator in the Radeon HD 2000 series is capable of performing different types of tessellation: integer subdivision, floating point subdivision level per primitive, and floating point subdivision level per edge.
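The "break triangles into more triangles" step can be sketched in a few lines of Python. This is plain uniform midpoint subdivision, chosen because it is the simplest scheme to show; it is not a description of how ATI's hardware tessellator works, and the function names are made up for illustration. Each pass splits every triangle into four, so the count grows by a factor of four per level.

```python
def midpoint(a, b):
    """Point halfway between two vertices of any dimension."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def subdivide(tri):
    """Split one triangle into four by connecting its edge midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def tessellate(tris, levels):
    """Apply midpoint subdivision `levels` times to a list of triangles."""
    for _ in range(levels):
        tris = [small for tri in tris for small in subdivide(tri)]
    return tris


# One flat 2D triangle, subdivided three times: 4 ** 3 = 64 triangles.
base = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
fine = tessellate([base], 3)
triangle_count = len(fine)   # 64
```

Subdivision on its own only adds triangles on the same flat surface. The smoother look comes from moving the new vertices afterwards, for example onto a curved surface or by a displacement map, which this sketch leaves out.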

Currently, DirectX 10 does not require a tessellator. However, in future versions of DirectX the tessellator is part of the pipeline and will be required, so eventually NVIDIA will have to add one as well. ATI is currently ahead of the game as far as this technology goes. However, it is one of those things that has to be utilized by the game developer. Developers have to specifically make their games use the tessellator; if they do not, it is simply a piece of hardware in the GPU that goes unused. This somewhat reminds us of Truform, a feature for which ATI never could get developer support. This is a different scenario, though, because the tessellator will be part of the DirectX spec in the future. It is helpful to game developers to have it in there now so they can start experimenting with it.

Texture Units

The ATI Radeon HD 2900 XT has 16 texture units and can perform 16 bilinear filtered FP16 pixels per clock. In comparison, the GeForce 8800 GTX has twice as many texture units (32) and does 32 FP16 pixels per clock, and the GTS does 50% more than the HD 2900 XT with 24 FP16 pixels per clock. It seems that ATI is focusing more on shader processing, as it did with the Radeon X1K architecture. On paper, the GeForce 8800 GTS and GTX have much higher per-clock texture filtering rates available.

Thankfully, ATI has now made its "High Quality" AF option the default filtering setting. There is no longer a check box to worry about, and it cannot be turned off. You will receive the angle-independent AF algorithm at all times.

ROPs

There are also 16 ROPs in the ATI Radeon HD 2900 XT. The GeForce 8800 GTS has 20 ROPs and the GTX has 24. The Radeon HD 2900 XT can perform 32 pixels per clock for Z, while the GeForce 8800 GTS can do 40 and the GTX 48. Some new antialiasing features have been added, as well as greater MRT support and a 128-bit floating point format. Z and stencil compression have been improved, along with an improved hierarchical Z buffer and 32-bit floating point Z-buffer support. It is an interesting stance to take again: improved capabilities and features in the ROPs, but the same configuration as the previous generation.
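The per-clock figures quoted in the texture unit and ROP sections can be lined up with a short Python sketch. The table holds only numbers stated in this article; the dictionary layout and function names are made up for illustration. These are per-clock rates, so they say nothing about absolute throughput, which also depends on each card's clock speed.

```python
PER_CLOCK = {
    "Radeon HD 2900 XT": {"fp16_texels": 16, "rops": 16, "z_pixels": 32},
    "GeForce 8800 GTS": {"fp16_texels": 24, "rops": 20, "z_pixels": 40},
    "GeForce 8800 GTX": {"fp16_texels": 32, "rops": 24, "z_pixels": 48},
}


def relative_to(baseline, metric):
    """Each card's per-clock figure for `metric` as a multiple of the baseline card."""
    base = PER_CLOCK[baseline][metric]
    return {card: spec[metric] / base for card, spec in PER_CLOCK.items()}


def z_per_rop(card):
    """Z pixels per clock handled by each ROP on the given card."""
    spec = PER_CLOCK[card]
    return spec["z_pixels"] / spec["rops"]


texture_ratio = relative_to("Radeon HD 2900 XT", "fp16_texels")
# GTS: 1.5x, GTX: 2.0x the HD 2900 XT's per-clock FP16 filtering rate

z_ratio = relative_to("Radeon HD 2900 XT", "z_pixels")
# GTS: 1.25x, GTX: 1.5x the HD 2900 XT's per-clock Z rate

z_rates = {card: z_per_rop(card) for card in PER_CLOCK}
# every card works out to 2.0 Z pixels per ROP per clock
```

Laid out this way, all three cards do two Z pixels per ROP per clock, so the Z gap comes entirely from ROP count, while the texture filtering gap is the larger of the two.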