This article comes from computer architecture expert Wang Wei. He believes that "a ten-thousand-fold performance increase after the end of Moore's Law" will not be science fiction, but a fact unfolding before our eyes.
In 2008, The Dark Forest (the second volume of the Three-Body trilogy) put it this way:
It was really hard. Not long after you went into hibernation, six large research projects for a new generation of supercomputers were launched at the same time. Three of them were traditional architectures, one was a non-von Neumann architecture, and the remaining two were quantum and biomolecular computing projects. Two years later, though, the chief scientists of all six projects told me that the computing power we demanded was simply impossible. The quantum computing project was terminated first; existing physical theory could not provide enough support, and the research ran straight into the wall built by the sophons. Next the biomolecular project was shut down; they said it had been nothing but a fantasy. The last to stop was the non-von Neumann machine, which was essentially a simulation of the human brain; they said the egg had not even formed yet, so there was no chicken. In the end only the three traditional-architecture projects were still running, but for a long time they made no progress at all.
Fortunately, the computer we wanted did appear in the end, and its performance is ten thousand times that of the most powerful machine of the era when you went into hibernation. A traditional architecture? A traditional architecture. That so much more juice could be squeezed out of Moore's Law astonished the computer science community. But this time, my dear, this time it really is over.
That was the last year of my Ph.D. in computer architecture, and I scoffed at this passage: how could Moore's Law possibly have that much juice left to squeeze? The process limits were already in sight; even without sophons making trouble, Moore's Law was going to die on its own. And "traditional architecture"? CPU architecture had been studied to death; there had been few genuinely new ideas since 2000.
So this "ten thousand times" read to me as nothing more than good science fiction.
Looking back over the nine years since The Dark Forest was published: process scaling has become a slog, microarchitecture has produced few highlights, and each CPU generation squeezes out a toothpaste-sized increment of performance. Everything seemed to confirm my pessimistic expectation that computer hardware performance was simply not going to improve much anymore.
Then, starting last year, the "science fiction" events began to arrive:
In March 2016, AlphaGo defeated Lee Sedol, using 1,202 CPUs and 176 GPUs.
In April 2016, NVIDIA released the Pascal architecture, with a peak performance of 11 TFLOPS. Jensen Huang said in an interview with Xinzhiyuan that semiconductor process iteration is slowing down, but the Pascal GPU delivered nearly ten times the performance of the previous generation within two years, so we could say we are living in an era of "super Moore's Law".
On May 11 this year, NVIDIA released the Volta architecture, with a peak performance of 120 TFLOPS.
Also on May 11, Google announced the second-generation TPU, with a peak performance of 180 TFLOPS, available through Google Cloud.
On May 23 this year, AlphaGo returned to the arena and defeated Ke Jie without any suspense. On the 24th, DeepMind CEO Demis Hassabis and AlphaGo project lead David Silver said at the press conference that AlphaGo actually runs on a single machine in Google Cloud, built on the second-generation TPU (reportedly four TPUs).
Today, with Moore's Law slowing dramatically or even effectively over, we are nonetheless seeing substantial growth in computing power, and this arms race in computing power continues!
My own pessimism has changed as well. I now believe that a ten-thousand-fold performance increase after the end of Moore's Law will not be science fiction, but a fact unfolding before our eyes.
Is that crazy? Why would the technical geeks who design computer hardware do such a thing, and how? The answer lies in the technical route the TPU represents, and in a new business model. Let me explain, slowly.
Why is the CPU inefficient?
Before explaining what can deliver a ten-thousand-fold improvement after the end of Moore's Law, let's first talk about why the CPU and the GPU cannot carry that load.
If you ask me what the CPU's defining feature is, I would say: it gives the programmer an illusion, the feeling that you can access any location in a very large memory with the same latency, a latency comparable to that of an addition, which is to say essentially zero.
Maintaining this illusion is hard work. Those in the know understand that the Logic process line used to fabricate CPUs and the Memory process line used to fabricate DRAM are very different. Put simply, because of underlying physical constraints, the Memory line cannot reach the speed the CPU needs, and the Logic line cannot reach the capacity that memory needs. Worse still, the Memory process has improved much more slowly than the Logic process: from 1986 to 2000, Logic sped up by 55% per year while Memory improved by only 10% per year (a quick calculation below shows how fast that gap compounds).
What do "fast" and "slow" mean here? In everyday speech, "fast" can mean low latency (the interval from start to finish is short) or high bandwidth (a large amount gets through per unit of time). When we say high-speed rail is fast, we mean the former; when we say a network connection is fast, we usually mean the latter. Memory bandwidth has in fact kept growing: in the 486 era the CPU ran at 100 MHz and SDRAM delivered 100 MT/s; today's CPUs run at 2 to 3 GHz and DDR4 delivers 3200 MT/s. Yet while bandwidth has grown by tens of times, the latency from issuing a read request to the DRAM bank returning the data has shrunk by little more than a factor of two over the past two decades.
It's not just laypeople; many junior programmers don't realize how painful memory latency is, and even experienced programmers can ignore it most of the time while coding. Why? The credit all goes to the CPU, which uses a battery of sophisticated techniques to hide memory latency (a rough illustration follows the list below), for example:
It devotes a very large amount of on-chip storage to caches, keeping frequently accessed data on the chip so the program does not have to reach out to memory at all.
It uses elaborate prediction techniques to guess which data the program is about to touch, and prefetches that data from memory onto the chip in advance.
When one part of a program stalls waiting for data from memory, the CPU executes later instructions out of order.
With hyper-threading, when one program stalls waiting for memory, the CPU switches to executing another program.
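Here is a minimal sketch of how fragile the "near-zero latency" illusion is (my own illustration, not from the article; absolute timings will vary by machine). Both runs below perform the same number of additions, but the random traversal defeats the caches and the prefetcher, so the true latency of main memory shows through:

```python
# Same work, different access order: the random gather defeats the CPU's
# caches and prefetcher, exposing the real latency of DRAM.
import time
import numpy as np

n = 1 << 24                              # about 16M elements, far larger than the caches
data = np.random.rand(n)
orders = {
    "sequential": np.arange(n),          # cache- and prefetch-friendly
    "random": np.random.permutation(n),  # mostly cache misses
}

for name, idx in orders.items():
    start = time.perf_counter()
    total = data[idx].sum()              # gather in the given order, then sum
    print(f"{name:>10}: {time.perf_counter() - start:.3f} s (sum = {total:.1f})")
```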
On a CPU die, most of the silicon area is spent maintaining the illusion that "memory access has near-zero latency"; the logic that actually performs computation takes up less than 1%. That is the root of the CPU's inefficiency.
The CPU was born in an era when Logic and Memory speeds were still reasonably matched, and programmers long ago got used to assuming that "memory access is essentially free". To stay compatible with existing software, the CPU has kept up this illusion at any cost for decades. There is no easy way back, and today software can no longer extract, through the CPU, the full capability that integrated-circuit manufacturing actually offers.
Why is the GPU inefficient?
The GPU's defining feature, in one sentence: it gives the programmer an illusion, the feeling that hundreds of thousands of small programs are running on the GPU at once and coexisting without getting in each other's way.
The GPU's architecture, in a nutshell, pushes the CPU's hyper-threading idea to the extreme in order to hide the long latency of memory access. A GPU contains thousands of small cores, each of which can be viewed as a little CPU, and it keeps as many as hundreds of thousands of small programs in flight at once. At any moment most of them are stalled waiting for memory, and only a few thousand are actually executing on the small cores.
Because thousands of small cores work simultaneously, the GPU gets far more computation done per unit time than the CPU. But it has a soft underbelly: those hundreds of thousands of small programs cannot possibly all get along. They fight over memory bandwidth, and fiercely. The cost of managing them is substantial:
It needs a complex cache so that a piece of data fetched from video memory can be reused by many small cores.
There are only eight memory interfaces, yet thousands of small cores can issue access requests; the hardware must analyze the outstanding requests and coalesce those touching adjacent addresses into a single access to video memory.
Memory bandwidth must be made far higher than a CPU's in order to feed thousands of small cores.
The small programs running on those thousands of cores may change every clock cycle, and each one's context must be preserved so it can be woken up later; the on-chip storage spent on these contexts is comparable in size to a CPU's huge caches.
Compared with the CPU, the GPU's power to sustain its illusion is weaker. Any moderately experienced GPU programmer knows that the hundreds of thousands of small programs should access video memory in regular patterns whenever possible; otherwise the GPU's efficiency drops sharply.
The GPU is positioned not merely as a graphics accelerator but as an engine for every application with massive data parallelism, so it must stay very general: it cannot impose hard restrictions on the hundreds of thousands of small programs running on it. In fact they are free to access any location in video memory, each one a different location, and the GPU still has to guarantee correct results, even if that means running slower. The silicon area and memory bandwidth spent managing and marshalling hundreds of thousands of unconstrained small programs is the root of the GPU's inefficiency. The sketch below gives a CPU-side analogy for why irregular access wastes memory bandwidth.
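A rough analogy of the coalescing problem, run on a CPU since the article contains no code (my own illustration; exact timings depend on the machine). The two sums below add the same number of elements, but the strided view touches a new cache line for every element, so most of every memory transfer is wasted, much as a GPU wastes bandwidth when neighbouring threads access non-adjacent addresses:

```python
# Same element count, very different memory traffic: the scattered view pulls
# in one full cache line per element, wasting most of each transfer.
import time
import numpy as np

n, stride = 1 << 22, 8                  # 8 doubles = 64 bytes, one cache line per element
backing = np.random.rand(n * stride)    # roughly 256 MB backing buffer

views = {
    "contiguous": backing[:n],          # n adjacent elements
    "scattered": backing[::stride],     # n elements, one per cache line
}

for name, view in views.items():
    start = time.perf_counter()
    s = view.sum()
    print(f"{name:>10}: {time.perf_counter() - start:.4f} s (sum = {s:.1f})")
```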
Why the FPGA is only a transitional solution
Both the CPU and the GPU carry a heavy historical burden, which shows up as:
They must stay highly general-purpose and cannot be optimized for a single domain.
They must stay backward compatible; programs written in the past have to keep running.
They each serve a stable, large population of programmers, and unless those programmers change the way they think, the "illusions" cannot be abandoned.
These burdens are also great, sweet burdens: thanks to them, CPU and GPU vendors prosper in their existing markets and keep competitors out.
If you throw these burdens away and design a new architecture from a clean sheet, you can:
Optimize for a single domain only
Ignore compatibility with past software
Program it in an entirely new way, unconstrained by the old habits of thought
An architecture designed this way will, in its target domain, far outperform the general-purpose CPU and GPU. The reason is easy to understand: generality and optimization cannot both be had. There are precedents: when computational chemistry and astrophysics could not get the performance they needed, scientists built the dedicated Anton and GRAPE-DR machines for them. They were simply too specialized to become widely known.
Now that CPU and GPU architectures can no longer meet the speed, power, and cost demands of artificial-intelligence applications, searching for a new architecture has become the common choice, and in that search the FPGA has played the role of pathfinder.
What is an FPGA? If the CPU and GPU are "general-purpose" at the architecture level, the FPGA is "general-purpose" one level lower, at the circuit level. Programmed with a hardware description language, an FPGA can emulate the architecture of almost any chip, including that of a CPU or GPU. In layman's terms, it is a programmable "universal chip", ideal for exploratory, low-volume products.
We have indeed seen many FPGA solutions beat the GPU on speed, power, or cost. But the FPGA still cannot escape the rule that "general-purpose means not optimal". It can show a sizable advantage only because, in a hardware-software system, the algorithm matters far more than the hardware architecture, and the overhead of being "general" at the circuit level is much smaller than the overhead of being "general" at the architecture level.
Once the FPGA has blazed the trail toward a dedicated architecture, it will step back behind the scenes and give way to the even more specialized ASIC.
TPU represents the future direction
In this match against Ke Jie, AlphaGo used Google's second-generation TPU. The TPU's characteristics are exactly those listed above:
Optimized only for linear algebra
Not compatible with CPU or GPU programs
Programmed in an entirely new way
Implemented as an ASIC rather than an FPGA
Most deep-learning algorithms can be mapped onto underlying linear-algebra operations; the "Tensor" in TPU (Tensor Processing Unit) is the basic data type of linear algebra. Linear-algebra operations have two key characteristics: the flow of tensors is regular and predictable, and the computational density is high, meaning each piece of data takes part in many operations. These two features make linear algebra exceptionally well suited to hardware acceleration: all the logic once spent maintaining "illusions" is no longer needed, and every transistor can be devoted to meaningful computation or storage.
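A back-of-the-envelope illustration of that computational density (my own numbers, not from the article): multiplying two n by n matrices performs about 2n^3 floating-point operations while moving only about 3n^2 values, so each value fetched from memory takes part in roughly n operations.

```python
# Arithmetic intensity of a dense n x n matrix multiply, C = A @ B.
n = 4096
flops = 2 * n**3                 # one multiply plus one add per inner-product term
values_moved = 3 * n**2          # read A and B, write C (ignoring cache re-reads)
bytes_moved = values_moved * 4   # fp32
print(f"{flops / 1e9:.0f} GFLOP over {bytes_moved / 1e9:.2f} GB")
print(f"arithmetic intensity: about {flops / bytes_moved:.0f} FLOP per byte")
```

With hundreds of operations available per byte of memory traffic, the bottleneck becomes raw compute, which is exactly what a chip full of multiply-accumulate units is good at.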
The TPU cannot run the Java or C++ programs that run on a CPU, nor the CUDA programs that run on a GPU. Although little has been made public, it is probably programmed roughly like this: TensorFlow expresses a neural network in an intermediate form, and a compiler then converts that form into the TPU's own program. This compilation path is called TensorFlow XLA, and it is also intended to let TensorFlow support other linear-algebra accelerators.
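For a feel of what that looks like from the programmer's side, here is a minimal sketch using today's TensorFlow API (which is not necessarily what Google used internally in 2017): the Python function is traced into a graph, handed to the XLA compiler, and XLA emits code for whatever backend is available, be it CPU, GPU, or TPU.

```python
# Sketch: asking XLA to compile a small piece of linear algebra.
import tensorflow as tf

@tf.function(jit_compile=True)               # request XLA compilation of this function
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)   # plain linear-algebra ops

x = tf.random.normal([128, 512])
w = tf.random.normal([512, 256])
b = tf.zeros([256])
y = dense_layer(x, w, b)                     # the first call triggers compilation
print(y.shape)                               # (128, 256)
```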
Google's choice of an ASIC rather than an FPGA says a lot about its vision and its nerve. Insiders know that an ASIC's performance far exceeds an FPGA's, yet many still do not dare choose the ASIC route. Why? Building your own ASIC is risky: the cycle is long, the investment is heavy, the bar is high, and if the chip comes back broken it is no better than a rock. When Apple decided to build its own chips, it did not assemble a team from scratch but acquired P.A. Semi, and even after years of impressive results it still has not dared to replace the Intel chips in Mac computers with its own CPUs. In just a few years, Google assembled a team, designed a sensible architecture, produced a working chip, and dared to deploy it in its own cloud for its own products. One can only say: respect.
Google is a great company. Before it published the MapReduce, GFS, and BigTable papers, those systems were generally considered impossible, and I suspect many people likewise thought the TPU impossible until they saw a TPU-powered AlphaGo defeat Ke Jie. History has shown that once Google has done something, others can usually imitate it to seventy or eighty percent. By now everyone should believe that in a sufficiently important application domain, it is possible, and worthwhile, to optimize and customize all the way down to the transistor level rather than stopping at assembling off-the-shelf chips. It is not only feasible but necessary, because if you don't do it, your competitors will.
The open-source era of hardware
The popular statement of Moore's Law is that the computer performance a dollar can buy more than doubles every 18 to 24 months. Over the past three decades, thanks to Moore's Law, we have witnessed a price-performance improvement of more than a million times. The ten-thousand-fold improvement we hope to see next should likewise be measured as "computer performance purchasable per unit cost".
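As a quick sanity check of that million-fold figure (my own arithmetic, not from the article): a doubling every 18 months over 30 years amounts to 20 doublings, which is about a factor of one million.

```python
# Doublings implied by Moore's Law over three decades.
months, period = 30 * 12, 18
doublings = months / period               # 20 doublings
print(f"{doublings:.0f} doublings -> about {2 ** doublings:,.0f}x")
```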
The historical burden of the general-purpose CPU and GPU not only makes them hard to optimize; it also brings, first, the excess profits of monopoly and, second, the R&D cost of excessive complexity. Both keep chip prices high.
In the future, when domain-specific custom chips become widespread, their prices will fall significantly, because: first, there will no longer be a monopoly; second, there will be no R&D cost imposed by historical baggage; and third, open source will lower development costs further.
Hardware open source has been attempted before without a major success, for a variety of reasons. In the long run, though, any infrastructure shared by most vendors eventually goes open source. If Intel's CPU is the ground (nothing can be optimized below it), then Linux, Python, and PHP are the lowest layers of infrastructure resting on that ground, and they are open source. If GPU plus CUDA is the ground, then the deep-learning frameworks are the lowest layer resting on it, and they are all open source. If, in the future, the transistor itself is the ground, there is no doubt that chip architectures will have a variety of open-source options as well.
This is only the beginning. This month NVIDIA did two interesting things: it sponsored the workshop of the open-source RISC-V CPU architecture in Shanghai, and it announced that DLA, the deep-learning hardware acceleration module in its Xavier autonomous-driving chip, will be open-sourced. When a big vendor backs open source it is not charity; it is a way to squeeze competitors and take control of de-facto industry standards. But the consequence of open source is inevitably a lower design threshold and lower R&D cost for the whole industry.
Our sea of stars: full-stack optimization from the application down to the transistor
For those of us who work on computer architecture, this is the best of times. Progress in leading-edge semiconductor manufacturing has slowed, yet new application demands from software keep appearing, the boundary between hardware and software is blurring, and the cost of mature process nodes keeps falling. To optimize a specific application, full-stack optimization that reaches all the way down to the transistor has become a realistic option. As long as the dedicated architecture is well designed, a chip built on a mature process can comfortably beat general-purpose GPUs and CPUs in its target domain, even when they use the most advanced manufacturing processes.
This is a brand-new world. The old patterns of interest and the old habits of design will be broken, and no one can predict what changes will come. But this is our sea of stars. Let's explore it together!
Wang Wei completed his bachelor's, master's, and doctoral studies at Peking University. He fell into the pit of computer architecture after reading the books of Hennessy and Patterson and has not climbed out since. He has spent some fourteen years working on CPUs, with hands-on experience ranging from system software and chip architecture to physical implementation. In 2016 he joined Bitmain to work on the design and implementation of artificial-intelligence acceleration chips.