Why a GPU mines faster than a CPU
Some Bitcoin users might wonder why there is a huge disparity between the mining output of a CPU versus a GPU.
First, just to clarify, the CPU, or central processing unit, is the part of the computer that performs the will of the software loaded on the computer. It's the main executive for the entire machine. It is the master that tells all the parts of the computer what to do - in accordance with the program code of the software, and, hopefully, the will of the user.
Some computers have multiple CPU's, and some CPU's have multiple cores (which is almost the same thing as having multiple CPU's in a single physical package).
The CPU is usually a removable component that plugs into the computer's main circuit board, or motherboard and sits underneath a large metallic heat sink or fan.
The GPU, or graphics processing unit, is a part of the video rendering system in a computer. In many cases, the GPU is located on a separate circuit board called a video card, which is in turn plugged into a slot on the motherboard. The typical function of a GPU is to assist with the rendering of 3D graphics and visual effects so that the CPU doesn't have to.
Many computers don't have GPU's at all. A GPU isn't a requirement, it's just that they are appearing as standard equipment in computers nowadays, since newer operating systems support enhanced visual effects that depend on the GPU. For example, the translucent windows in Windows 7, or the row of icons in Mac OS X that bulges near the mouse pointer - these are effects that are made feasible with a GPU.
A GPU is like a CPU, but there are important internal differences that make them suited toward their special tasks. These are the differences that make Bitcoin mining far more favorable on a GPU.
Short Answer
A typical CPU core can execute 4 32-bit instructions per clock (using a 128-bit SSE instruction), whereas a GPU like the Radeon HD 5970 can execute 3200 32-bit instructions per clock (using its 3200 ALUs or shaders). This is a difference of 800 times more instructions per clock. As of 2011, the fastest CPUs have up to 6, 8, or 12 cores and a somewhat higher frequency clock (2000-3000 MHz vs. 725 MHz for the Radeon HD 5970), but this is far from enough to compensate for this huge difference.
A CPU is an executive
A CPU is designed primarily to be an executive and make decisions, as directed by the software. For example, if you type a document and save it, it is the CPU's job to turn your document into the appropriate file type and direct the hard disk to write it as a file. CPU's can also do all kinds of math, as inside every CPU is one or more "Arithmetic/Logic Units" (ALU's). CPU's are also highly capable of following instructions of the "if this, do this, otherwise do that". A large bulk of the structures inside a CPU are concerned with making sure that the CPU is ready to deal with having to switch to a different task on a moment's notice when needed.
CPU's also have to deal with quite a few other things which add complexity, including:
- enforcing privilege levels and the boundaries between user programs and the operating system
- creating the illusion of "virtual memory" to programs
- for the most popular processors, being backwards compatible with legacy code
A GPU is a laborer
A GPU is very different. Yes, a GPU can do math, and can also do "this" and "that" based on specific conditions. However, GPU's have been designed so they are very good at doing video processing, and less executive work.
Video processing is a lot of repetitive work, since it is constantly being told to do the same thing to large groups of pixels on the screen. In order to make this run efficiency, video processors are far heavier on the ability to do repetitive work, than the ability to rapidly switch tasks.
GPU's have large numbers of ALU's, more so than CPU's. As a result, they can do large amounts of bulky mathematical labor in a greater quantity than CPU's.
Analogy
One way to visualize it is a CPU works like a small group of very smart people who can quickly do any task given to them. A GPU is a large group of relatively dumb people who aren't individually very fast or smart, but who can be trained to do repetitive tasks, and collectively can be more productive just due to the sheer number of people.
It's not that a CPU is fat, spoiled, or lazy. Both CPUs and GPUs are creations made from billions of microscopic transistors crammed on a small piece of silicon. On silicon chips, size is expensive. The structures that make CPUs good at what they do take up lots of space. When those structures are omitted, that leaves plenty of room for many "dumb" ALU's, which individually are very small.
The ALUs of a GPU are partitioned into groups, and each group of ALUs shares management, so members of the group cannot be made to work on separate tasks. They can either all work on nearly identical variations of one single task, in perfect sync with one another, or nothing at all. Trying different hashes repeatedly - the process behind Bitcoin mining - is a very repetitive task suitable for a GPU, with each attempt varying only by the changing of one number (called a "nonce") in the data being hashed.
The ATI Radeon 5970 is a popular video card for Bitcoin mining and, to date, offers the best known performance of any video card for this purpose.
This particular card has 3,200 "Stream Processors", which can be thought of as 3,200 very dumb execution units that can be trained to all do the same repetitive task, just so long as they don't have to make any decisions that interrupts their flow. Those execution units are contained in blocks. The 5970 uses a VLIW-5 architecture, which means the 3,200 Stream Processors are actually 640 "Cores," Each able to process 5 instruction per clock cycle. Nvidia would call these cores "Cuda Cores", but as mentioned in this article, they are not VLIW, meaning they cannot do as much work per cycle. This is why comparing graphics cards by core count alone is not an accurate method of determining performance, and this is also why nVidia lags so far behind ATI in SHA-256 hashing.
Since ALU's are what do all the work of Bitcoin mining, the number of available ALU's has a direct effect on the hash output. Compare that to a 4-core CPU that can switch tasks on a dime, but has ALU's in some small multiple of four, if not just four ALU's alone. Trying a single SHA256 hash in the context of Bitcoin mining requires around 1,000 simple mathematical steps that must be performed entirely by ALU's.
That, in a nutshell, is why GPU's can mine Bitcoins so much faster than CPU's. Bitcoin mining requires no decision making - it is repetitive mathematical work for a computer. The only decision making that must be made in Bitcoin mining is, "do I have a valid block" or "do I not". That's an excellent workload to run on a GPU.
Why are ATI GPUs faster than Nvidia GPUs?
Firstly, ATI designs GPUs with many simple ALUs/shaders (VLIW design) that run at a relatively low frequency clock (typically 1120-3200 ALUs at 625-900 MHz), whereas Nvidia's microarchitecture consists of fewer more complex ALUs and tries to compensate with a higher shader clock (typically 448-1024 ALUs at 1150-1544 MHz). Because of this VLIW vs. non-VLIW difference, Nvidia uses up more square millimeters of die space per ALU, hence can pack fewer of them per chip, and they hit the frequency wall sooner than ATI which prevents them from increasing the clock high enough to match or surpass ATI's performance. This translates to a raw ALU performance advantage for ATI:
- ATI Radeon HD 6990: 3072 ALUs x 830 MHz = 2550 32-bit instruction per second
- Nvidia GTX 590: 1024 ALUs x 1214 MHz = 1243 32-bit instruction per second
This approximate 2x-3x performance difference exists across the entire range of ATI and Nvidia GPUs. It is noticeably visible in all ALU-bound GPGPU workloads such as Bitcoin, password bruteforcers, etc.
Secondly, another difference favoring Bitcoin mining on ATI GPUs instead of Nvidia's is that the mining algorithm is based on SHA-256, which makes heavy use of the 32-bit integer right rotate operation. This operation can be implemented as a single hardware instruction on ATI GPUs, but requires three separate hardware instructions to be emulated on Nvidia GPUs (2 shifts + 1 add). This alone gives ATI another 1.7x performance advantage (~1900 instructions instead of ~3250 to execute the SHA-256 compression function).
Combined together, these 2 factors make ATI GPUs overall 3x-5x faster when mining Bitcoins.