The Prospect of A Promising Track: Decentralized Computing Power Market (Part 1)

Advanced · Jan 04, 2024
This article explores the potential and challenges of the decentralized computing power market, highlighting the difficulties it faces and introducing two typical projects - Gensyn and Together.AI.

Foreword

Since the birth of GPT-3, generative AI has reached an explosive turning point in the field of artificial intelligence with its astonishing performance and broad application scenarios, and tech giants have flocked to the AI track. However, this surge has brought numerous problems with it. Training and inference of large language models (LLMs) require enormous amounts of computing power, and with each iterative upgrade of these models, the demand for and cost of computing power increase exponentially. Taking GPT-2 and GPT-3 as examples, the difference in parameter count between them is 1,166 times (GPT-2 has 150 million parameters while GPT-3 has 175 billion). Calculated according to the pricing models of public GPU clouds at the time, a single training run of GPT-3 cost up to $12 million, roughly 200 times that of GPT-2. In practical use, every user query requires inference computation. Based on the roughly 13 million independent users at the beginning of the year, the corresponding chip demand would exceed 30,000 A100 GPUs, an initial investment of a staggering $800 million, with an estimated daily model inference cost of $700,000.

Insufficient computing power and high costs have become serious challenges for the entire AI industry. Remarkably, a similar issue appears to confront the blockchain industry. On one hand, the fourth Bitcoin halving and the approval of ETFs are imminent; as prices rise in the future, miners' demand for computing hardware will inevitably increase significantly. On the other hand, Zero-Knowledge Proof (ZKP) technology is booming, and Vitalik has emphasized multiple times that ZK's impact on the blockchain field over the next ten years will be as important as blockchain itself. While this technology holds promise for the blockchain industry's future, ZK, like AI, also consumes a great deal of computing power and time to generate proofs because of its complex calculation process.

In the foreseeable future, the shortage of computing power will become inevitable. So, will the decentralized computing power market be a profitable business venture?

Decentralized Computing Power Market Definition

The decentralized computing power market is essentially equivalent to the decentralized cloud computing track, but I personally think this term more aptly describes the new projects discussed later. The decentralized computing power market should be considered a subset of DePIN (Decentralized Physical Infrastructure Networks). Its goal is to create an open market where anyone with idle computing resources can offer them in exchange for token incentives, mainly serving B2B clients and developer communities. In terms of more familiar projects, networks such as Render Network, which is based on decentralized GPU rendering solutions, and Akash Network, a distributed peer-to-peer marketplace for cloud computing, both belong to this track.

The following text will start with the basic concepts and then discuss three emerging markets under this track: the AGI computing power market, the Bitcoin computing power market, and the ZK hardware acceleration market. The latter two will be discussed in “The Prospect of A Promising Track: Decentralized Computing Power Market (Part 2)”.

Overview of Computing Power

The concept of computing power can be traced back to the invention of the computer. The original computer used mechanical devices to complete computing tasks, and computing power referred to the computational ability of the mechanical device. With the development of computer technology, the concept of computing power has also evolved. Today’s computing power usually refers to the collaborative work of computer hardware (CPUs, GPUs, FPGAs, etc.) and software (operating systems, compilers, applications, etc.).

Definition

Computing power refers to the amount of data that a computer or other computing device can process within a certain period of time or the number of computing tasks it can complete. Computing power is usually used to describe the performance of a computer or other computing devices. It is an important metric of the processing capabilities of a computing device.

Metrics

Computing power can be measured in various ways, such as computing speed, energy consumption, computing accuracy, and parallelism. In the computing field, commonly used computing power metrics include FLOPS (floating point operations per second), IPS (instructions per second), TPS (transactions per second), etc.

FLOPS measures the computer’s ability to process floating-point operations (mathematical operations with decimal points that require consideration of precision issues and rounding errors). It measures how many floating-point operations a computer can complete per second. FLOPS is a measure of a computer’s high-performance computing capabilities and is commonly used to measure the computing capabilities of supercomputers, high-performance computing servers, graphics processing units (GPUs), etc. For example, if a computer system has 1 TFLOPS (one trillion floating-point operations per second), it means that it can complete 1 trillion floating-point operations per second.
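As a rough, back-of-the-envelope illustration (the figures below are hypothetical, not any specific product's specification), theoretical peak FLOPS can be estimated as cores × clock frequency × floating-point operations per core per cycle:

```python
# Rough sketch: estimating theoretical peak FLOPS of a processor.
# All figures below are hypothetical, for illustration only.

def peak_flops(num_cores: int, clock_hz: float, flops_per_core_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs issued per core per cycle."""
    return num_cores * clock_hz * flops_per_core_per_cycle

# A hypothetical GPU: 10,000 cores at 1.5 GHz, 2 FLOPs per core per cycle (fused multiply-add).
gpu = peak_flops(10_000, 1.5e9, 2)
# A hypothetical CPU: 16 cores at 4 GHz, 32 FLOPs per core per cycle (wide SIMD units).
cpu = peak_flops(16, 4.0e9, 32)

print(f"GPU peak: {gpu / 1e12:.1f} TFLOPS")   # ~30.0 TFLOPS
print(f"CPU peak: {cpu / 1e12:.1f} TFLOPS")   # ~2.0 TFLOPS
```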

IPS measures the speed at which a computer processes instructions. It gauges how many instructions a computer can execute per second and reflects single-instruction performance, typically used to measure the performance of a central processing unit (CPU). For example, a CPU running at 3 GHz that completes one instruction per cycle has an IPS of about 3 billion, meaning it can execute roughly 3 billion instructions per second.

TPS measures the ability of a computer to process transactions. It gauges how many transactions a computer can complete per second, typically used to measure database server performance. For example, a database server has a TPS of 1,000, meaning it can handle 1,000 database transactions per second.

In addition, there are some computing power metrics for specific application scenarios, such as inference speed, image processing speed, and voice recognition accuracy.

Type of Computing Power

GPU computing power refers to the computational capability of graphics processing units (GPUs). Unlike central processing units (CPUs), GPUs are hardware specifically designed to process graphical data such as images and videos. They have a large number of processing units and efficient parallel computing capabilities, and can perform a large number of floating-point operations simultaneously. Since GPUs were originally designed for gaming graphics processing, they typically have far more parallel execution units and much greater memory bandwidth than CPUs to support complex graphical computations.

Difference between CPUs and GPUs

Architecture: CPUs and GPUs have different computing architectures. CPUs typically have one or more cores, each of which is a general-purpose processor capable of performing a variety of different operations. GPUs, on the other hand, have a large number of Stream Processors and Shaders, which are specially used to execute computations related to image processing;

Parallel Computing: GPUs generally have much higher parallel computing capabilities. A CPU has a limited number of cores, each of which executes instructions largely sequentially, whereas a GPU can have thousands of stream processors that execute many instructions and operations simultaneously. Therefore, GPUs are generally better suited than CPUs for parallel computing tasks such as machine learning and deep learning, which require extensive parallel computation;

Programming Design: Programming for GPUs is relatively more complex compared to CPUs. It requires the use of specific programming languages (such as CUDA or OpenCL) and specific programming techniques to utilize the parallel computing capabilities of GPUs. In contrast, CPU programming is simpler and can use general-purpose programming languages and tools.
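A minimal sketch of what this difference looks like in practice, assuming PyTorch is installed: the same matrix multiplication can be dispatched to the CPU or, if one is available, to a CUDA GPU, with the framework hiding the CUDA-level details:

```python
# Minimal sketch (assumes PyTorch is installed): the same operation on CPU vs. GPU.
import time
import torch

def timed_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # make sure timing starts after data is on the GPU
    start = time.perf_counter()
    c = a @ b                             # dense matrix multiplication
    if device == "cuda":
        torch.cuda.synchronize()          # GPU kernels run asynchronously; wait for completion
    return time.perf_counter() - start

print(f"CPU: {timed_matmul('cpu'):.3f} s")
if torch.cuda.is_available():
    print(f"GPU: {timed_matmul('cuda'):.3f} s")  # typically orders of magnitude faster
```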

The Importance of Computing Power

In the era of the Industrial Revolution, oil was the lifeblood of the world and penetrated every industry. In the upcoming AI era, computing power will be the “digital oil” of the world. From major companies' frantic pursuit of AI chips and Nvidia's market capitalization exceeding one trillion dollars, to the United States' recent blockade of high-end chips against China, covering compute capacity, chip size, and even plans to restrict access to GPU clouds, the importance of computing power is self-evident. Computing power will be a key commodity of the next era.

Overview of Artificial General Intelligence

Artificial Intelligence (AI) is a technical science that studies, develops, and applies theories, methods, and technologies for simulating, extending, and expanding human intelligence. It originated in the 1950s and 1960s and, after more than half a century of evolution, has developed through three intertwined waves: symbolism, connectionism, and agent-based approaches. Today, as an emerging general-purpose technology, AI is driving profound changes in social life and across all industries. A more specific definition of today's generative AI is Artificial General Intelligence (AGI): an artificial intelligence system with broad understanding capabilities that can perform tasks across various domains with intelligence similar to or surpassing human levels. AGI essentially requires three elements: deep learning (DL), big data, and massive computing power.

Deep Learning

Deep learning is a subfield of machine learning (ML), and deep learning algorithms are neural networks modeled after the human brain. The human brain contains billions of interconnected neurons that work together to learn and process information. Likewise, deep learning neural networks (or artificial neural networks) are composed of multiple layers of artificial neurons working together within a computer. These artificial neurons, known as nodes, use mathematical computations to process data. Artificial neural networks are deep learning algorithms that use these nodes to solve complex problems.

Neural networks can be divided into the input layer, hidden layers and the output layer. The connections between these different layers are made up of parameters.

Input Layer: The input layer is the first layer of the neural network and is responsible for receiving external input data. Each neuron in the input layer corresponds to a feature of the input data. For example, in image processing, each neuron may correspond to the value of a pixel in the image.

Hidden Layers: The input layer processes data and passes it on to deeper layers within the network. These hidden layers process information at different levels, adjusting their behavior as they receive new information. Deep learning networks can have hundreds of hidden layers, which allow them to analyze a problem from many different perspectives. For example, if you are given an image of an unknown animal that needs to be classified, you can compare it to animals you already know: you might tell what kind of animal it is by the shape of its ears, the number of its legs, and the size of its pupils. Hidden layers in deep neural networks work in a similar way. If a deep learning algorithm is trying to classify an image of an animal, each hidden layer will process a different feature of the animal and try to classify it accurately.

Output Layer: The output layer is the last layer of the neural network and is responsible for generating the output of the network. Each neuron in the output layer represents a possible output category or value. For example, in a classification problem, each neuron in the output layer may correspond to a category, while in a regression problem, the output layer may have only one neuron whose value represents the prediction result;

Parameters: In neural networks, the connections between different layers are represented by weights and biases, which are optimized during the training process to enable the network to accurately identify patterns in the data and make predictions. The increase in parameters can improve the model capacity of the neural network, that is, the model’s ability to learn and represent complex patterns in the data. But correspondingly, the increase in parameters will increase the demand for computing power.
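To make the layer-and-parameter picture concrete, here is a minimal sketch (assuming PyTorch, with arbitrarily chosen layer sizes) of a small network with an input layer, two hidden layers, and an output layer, along with its parameter count:

```python
# Minimal sketch (assumes PyTorch): input layer -> hidden layers -> output layer,
# where the "parameters" are the weights and biases connecting the layers.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),   # input layer -> first hidden layer: 784*256 weights + 256 biases
    nn.ReLU(),
    nn.Linear(256, 128),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),    # last hidden layer -> output layer (e.g. 10 classes)
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total trainable parameters: {total_params:,}")  # 235,146
```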

Big Data

In order to be effectively trained, neural networks usually require large, diverse, and high-quality data from multiple sources. This data is the basis for machine learning model training and validation. By analyzing big data, machine learning models can learn patterns and relationships within the data, which allows them to make predictions or classifications.

Massive Computing Power

The demand for massive computing power arises from several aspects of neural networks: complex multi-layered structures, a large number of parameters, the need to process vast amounts of data, iterative training methods (during the training phase, the model must iterate repeatedly, performing forward and backward propagation for each layer, including computations for activation functions, loss functions, gradients, and weight updates), the need for high-precision calculations, parallel computing capabilities, optimization and regularization techniques, and model evaluation and validation processes. As deep learning progresses, the computing power required for AGI is increasing by about 10 times every year. The latest model so far, GPT-4, reportedly contains 1.8 trillion parameters, with a single training cost of more than $60 million and a compute requirement of 2.15e25 FLOPs (21.5 septillion floating-point operations). The demand for computing power for future model training keeps expanding, and new models are being developed at an increasing rate.
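A commonly cited rule of thumb, roughly 6 FLOPs per parameter per training token, gives a feel for where figures of this magnitude come from; the numbers below are illustrative, not an exact accounting of any particular model:

```python
# Back-of-the-envelope training-compute estimate using the common
# "~6 FLOPs per parameter per token" approximation. Figures are illustrative.

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# A hypothetical 175-billion-parameter model trained on 300 billion tokens:
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")          # ~3.15e+23 FLOPs

# Wall-clock time on a hypothetical cluster of 1,000 GPUs sustaining 100 TFLOPS each:
cluster_flops_per_s = 1_000 * 100e12
days = flops / cluster_flops_per_s / 86_400
print(f"~{days:.0f} days of training")  # ~36 days
```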

AI Computing Power Economics

Future Market Size

According to the “2022-2023 Global Computing Power Index Assessment Report”, jointly compiled by the International Data Corporation (IDC), Inspur Information, and the Global Industry Research Institute of Tsinghua University, the global AI computing market is expected to grow from $19.5 billion in 2022 to $34.66 billion in 2026. The generative AI computing market is projected to grow from $820 million in 2022 to $10.99 billion in 2026, with its share of the overall AI computing market rising from 4.2% to 31.7%.

Monopoly in Computing Power Economy

The production of AI GPUs has been all but monopolized by NVIDIA, and they are extremely expensive (the latest H100 has sold for $40,000 per unit). As soon as GPUs are released, they are snapped up by tech giants in Silicon Valley. Some of these devices are used to train their own new models; the rest are rented out to AI developers through cloud platforms, such as those owned by Google, Amazon, and Microsoft, which control vast amounts of computing resources such as servers, GPUs, and TPUs. Computing power has become a new resource monopolized by these giants. Many AI developers cannot even purchase a dedicated GPU without paying a markup, and in order to use the latest equipment, developers have to rent AWS or Microsoft cloud servers. Financial reports indicate that this business is extremely profitable: AWS's cloud services boast a gross margin of 61%, while Microsoft's is even higher at 72%.

So do we have to accept this centralized authority and control, and pay a 72% profit margin for computing resources? Will the giants who monopolized Web2 also dominate the next era?

Challenges of Decentralized AGI Computing Power

When it comes to antitrust, decentralization is usually seen as the optimal solution. Looking at existing projects, can we achieve the massive computing power required for AI through DePIN storage projects combined with protocols like RNDR for idle GPU utilization? The answer is no. The road to slaying the dragon is not that simple. Early projects were not specifically designed for AGI computing power and are not feasible for it. Bringing computing power onto the blockchain faces at least the following five challenges:

  1. Work verification: To build a truly trustless computing network that provides economic incentives to participants, the network must have a way to verify whether the deep learning computations were actually performed. The core issue here is the state dependency of deep learning models: in these models, the input of each layer depends on the output of the previous layer. This means you cannot validate a single layer of a model without taking into account all the layers before it, since the computation of each layer is based on the results of all previous layers. Therefore, to verify the work completed at a specific point (such as a specific layer), all the work from the beginning of the model up to that point must be executed (see the sketch after this list);

  2. Market: As an emerging market, the AI computing power market is subject to supply and demand dilemmas, such as the cold start problem. Supply and demand liquidity need to be roughly matched from the beginning so that the market can grow successfully. In order to capture potential supply of computing power, participants must be provided with clear incentives in exchange for their computing resources. The market needs a mechanism to track completed computations and pay providers accordingly in a timely manner. In traditional marketplaces, intermediaries handle tasks such as management and onboarding, while reducing operational costs by setting minimum payment thresholds. However, this approach is costly when expanding the market size. Only a small portion of the supply can be economically captured, leading to a threshold equilibrium state in which the market can only capture and maintain a limited supply without being able to grow further;

  3. Halting problem: The halting problem is a fundamental issue in computational theory, which involves determining whether a given computing task will finish in a finite amount of time or run indefinitely. This problem is undecidable, meaning there is no universal algorithm that can predict whether any given computation will halt in a finite time. For example, smart contract execution on Ethereum also faces a similar halting problem. It is impossible to determine in advance how much computing resources the execution of a smart contract will require, or whether it will complete within a reasonable time.

(In the context of deep learning, this problem will be more complex as models and frameworks will switch from static graph construction to dynamic building and execution.)

  4. Privacy: Privacy-conscious design and development is a must for project teams. Although a large amount of machine learning research can be conducted on public datasets, in order to improve model performance and adapt to specific applications, a model usually needs to be fine-tuned on proprietary user data. This fine-tuning process may involve processing personal data, so privacy protection requirements need to be considered.

  5. Parallelization: This is a key factor in the current projects' lack of feasibility. Deep learning models are usually trained in parallel on large hardware clusters with proprietary architectures and extremely low latency, whereas GPUs in distributed computing networks would incur latency from frequent data exchanges and would be limited by the performance of the slowest GPU. When computing resources are untrustworthy and unreliable, how to achieve heterogeneous parallelization is a problem that must be solved. The currently feasible approach is to achieve parallelization through transformer-based models such as Switch Transformers, which now have highly parallelizable characteristics.
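To illustrate the state dependency behind the work-verification challenge above, here is a toy numpy sketch (not any project's actual verification scheme): checking the output claimed for layer k requires recomputing every layer before it, because each layer's input is the previous layer's output.

```python
# Toy illustration of why verifying one layer requires re-running all earlier layers.
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(8)]  # an 8-layer toy network

def forward_up_to(x: np.ndarray, k: int) -> np.ndarray:
    """Output of layer k depends on the outputs of layers 0..k-1."""
    for w in weights[: k + 1]:
        x = np.tanh(x @ w)
    return x

x0 = rng.standard_normal(64)
claimed_layer5_output = forward_up_to(x0, 5)   # what a solver might submit

# A verifier cannot check layer 5 in isolation: reproducing it means
# recomputing layers 0-5 from the original input.
recomputed = forward_up_to(x0, 5)
print(np.allclose(claimed_layer5_output, recomputed))  # True
```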

Solutions: Although current attempts at a decentralized AGI computing power market are still at an early stage, there are two projects that have made initial progress on, respectively, the consensus design of decentralized networks and the implementation of decentralized computing power networks for model training and inference. The following uses Gensyn and Together as examples to analyze the design methods and issues of the decentralized AGI computing power market.

Gensyn

Gensyn is an AGI computing power market that is still under construction, aiming to solve the various challenges of decentralized deep learning computation and to reduce the costs of deep learning today. Gensyn is essentially a layer-one proof-of-stake protocol based on the Polkadot network that uses smart contracts to directly reward solvers (those who solve computational tasks) for contributing their idle GPU devices to perform machine learning tasks.

Returning to the previous question, the core of building a truly trustless computing network lies in verifying the completed machine learning work. This is a highly complex issue that requires finding a balance between the intersection of complexity theory, game theory, cryptography, and optimization.

Gensyn proposes a simple solution: solvers submit the results of the machine learning tasks they have completed, and to verify that these results are accurate, an independent verifier attempts to re-perform the same work. This approach can be called single replication, because only one verifier re-executes the task, meaning only one additional piece of work is needed to check the accuracy of the original. However, if the person verifying the work is not the original requester, the trust problem remains: the verifier may itself be dishonest, so its work also needs to be verified by yet another verifier, who in turn may not be trusted, and so on, creating an infinite replication chain. Here we need to introduce three key concepts and interweave them to build a participant system with four roles that solves the infinite chain problem.

Probabilistic proof of learning: uses metadata from the gradient-based optimization process to construct certificates of completed work. By replicating certain stages, these certificates can be verified quickly to ensure the work was completed as expected.

Graph-based pinpoint protocol: uses a multi-granularity, graph-based pinpointing protocol together with consistent execution across evaluators. This allows verification work to be re-run and compared for consistency, which is ultimately confirmed by the chain itself.

Truebit-style incentive game: uses staking and slashing to build an incentive game that ensures every economically rational participant acts honestly and performs their expected tasks.
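As a toy illustration of the stake-and-slash logic (with hypothetical numbers, not Gensyn's actual parameters): honesty is the rational strategy as long as the expected loss from being caught outweighs the compute cost saved by cheating.

```python
# Toy expected-payoff model for a stake-and-slash incentive game.
# All parameters are hypothetical and chosen only to illustrate the inequality.

def expected_payoff(reward: float, stake: float, compute_cost: float,
                    cheat: bool, p_caught: float) -> float:
    if not cheat:
        return reward - compute_cost                  # honest work always pays out
    # A cheater skips the compute cost but loses stake and reward if caught.
    return (1 - p_caught) * reward - p_caught * stake

reward, stake, compute_cost, p_caught = 10.0, 50.0, 6.0, 0.9

print("honest :", expected_payoff(reward, stake, compute_cost, cheat=False, p_caught=p_caught))  # 4.0
print("cheater:", expected_payoff(reward, stake, compute_cost, cheat=True,  p_caught=p_caught))  # -44.0
```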

The participant system consists of submitters, solvers, verifiers, and whistleblowers.

Submitters:

Submitters are the end-users of the system who provide tasks to be computed and pay for the units of work completed;

Solvers:

Solvers are the primary workers of the system, performing model training and generating proofs that are checked by the verifier;

Verifiers:

Verifiers are key to linking the non-deterministic training process with deterministic linear computations, replicating parts of the solver’s proof and comparing distances with expected thresholds;

Whistleblowers:

Whistleblowers are the last line of defense, checking the work of verifiers and raising challenges in hopes of receiving generous bounty payments.

System Operation

The game system designed by the protocol operates through eight stages, covering four main participant roles, to complete the whole process from task submission to final verification.

Task Submission: Tasks consist of three specific pieces of information:

Metadata describing the task and hyperparameters;

A model binary file (or basic architecture);

Publicly accessible, pre-processed training data.

To submit a task, the submitter specifies the details of the task in a machine-readable format and submits it to the chain along with the model binary file (or machine-readable architecture) and a publicly accessible location for the preprocessed training data. Public data can be stored in simple object storage such as AWS's S3, or in decentralized storage such as IPFS, Arweave, or Subspace.

Profiling: The profiling process establishes a baseline distance threshold for proof-of-learning verification. Verifiers periodically fetch profiling tasks and generate mutation thresholds for comparing proofs of learning. To generate the threshold, a verifier deterministically runs and reruns parts of the training with different random seeds, generating and checking its own proofs. Through this process, the verifier establishes an overall expected distance threshold for the solvers' non-deterministic work that can later be used for verification.
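A heavily simplified sketch of the profiling idea (illustrative only; Gensyn's actual procedure is more involved): re-run the same training step with different random seeds and derive a distance threshold from the spread of the resulting weights.

```python
# Simplified sketch of deriving a distance threshold from non-deterministic reruns.
# Illustrative only; not Gensyn's actual profiling procedure.
import numpy as np

def train_step(weights: np.ndarray, data: np.ndarray, seed: int) -> np.ndarray:
    """A stand-in for one noisy optimization step (e.g. GPU non-determinism, dropout)."""
    rng = np.random.default_rng(seed)
    grad = data.mean(axis=0) + rng.normal(scale=1e-3, size=weights.shape)
    return weights - 0.1 * grad

rng = np.random.default_rng(42)
w0 = rng.standard_normal(1000)
data = rng.standard_normal((256, 1000))

# Run the same step several times with different seeds and measure the spread.
results = [train_step(w0, data, seed) for seed in range(8)]
distances = [np.linalg.norm(a - b) for i, a in enumerate(results) for b in results[i + 1:]]

threshold = max(distances) * 1.5   # pad the observed spread with a safety margin
print(f"expected distance threshold: {threshold:.4f}")
```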

Training: After profiling, tasks enter the public task pool (similar to Ethereum's mempool). A solver is selected to execute the task, and the task is removed from the pool. The solver performs the task based on the metadata submitted by the submitter and the model and training data provided. While executing the training task, the solver also generates a proof of learning by regularly checkpointing and storing metadata (including parameters) during training, so that verifiers can replicate the optimization steps as accurately as possible.

Proof generation: Solvers periodically store model weights or updates, along with the corresponding indices into the training dataset, to identify the samples used to generate the weight updates. Checkpoint frequency can be adjusted to provide stronger guarantees or to save storage space. Proofs can be “stacked,” meaning they can start from a random distribution used to initialize the weights, or from pretrained weights generated with their own proofs. This enables the protocol to build a set of proven, pre-trained base models that can be fine-tuned for more specific tasks.

Verification of proof: After the task is completed, the solver registers its completion on chain and makes its proof of learning available at a publicly accessible location for verifiers. Verifiers pull verification tasks from the public task pool and perform computational work to rerun part of the proof and compute distances. The resulting distance is then compared, by the chain, against the threshold calculated during the profiling stage to determine whether the verification matches the proof.
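Continuing the toy profiling sketch above (again an illustration, not the real protocol), the verifier replays one checkpointed step from the solver's stored metadata and accepts the proof only if the distance to the claimed weights stays within the profiled threshold.

```python
# Simplified sketch of verifying one checkpointed optimization step.
# Reuses train_step(), w0, data, and threshold from the profiling sketch above; illustrative only.
import numpy as np

def verify_checkpoint(prev_weights: np.ndarray, claimed_weights: np.ndarray,
                      batch: np.ndarray, threshold: float) -> bool:
    """Replay the step from the solver's checkpoint and compare distances."""
    replayed = train_step(prev_weights, batch, seed=0)   # verifier's own rerun
    distance = np.linalg.norm(replayed - claimed_weights)
    return distance <= threshold

# Honest solver: its checkpointed update replays to within the threshold.
honest = train_step(w0, data, seed=1)
# Dishonest solver: claims an update it never actually computed.
fake = w0 + np.random.default_rng(7).standard_normal(w0.shape)

print(verify_checkpoint(w0, honest, data, threshold))  # True  -> proof accepted
print(verify_checkpoint(w0, fake,   data, threshold))  # False -> challengeable
```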

Graph-based pinpoint challenge: After verifying the proof of learning, whistleblowers can replicate the verifiers' work to check whether the verification itself was executed correctly. If a whistleblower believes the verification was performed incorrectly (maliciously or not), they can challenge it before contract arbitration for a reward. This reward can come from solver and verifier deposits (in the case of a true positive) or from a lottery pool bonus (in the case of a false positive), with arbitration performed by the chain itself. Whistleblowers (acting as verifiers in this case) will verify and subsequently challenge work only if they expect to receive appropriate compensation. In practice, this means whistleblowers are expected to join and leave the network based on the number of other active whistleblowers (i.e., those with live deposits and challenges). Therefore, the expected default strategy for any whistleblower is to join the network when there are few other whistleblowers, post a deposit, randomly select an active task, and begin their verification process. After finishing one task, they grab another random active task and repeat until the number of whistleblowers exceeds their determined payout threshold, at which point they leave the network (or, more likely, switch to another role in the network, verifier or solver, depending on their hardware capabilities) until the situation reverses again.

Contract arbitration: When verifiers are challenged by whistleblowers, they enter a process with the chain to find out the location of the disputed operation or input, and ultimately the chain will perform the final basic operation and determine whether the challenge is justified. To keep whistleblowers honest and to overcome the verifier’s dilemma, periodic forced errors and jackpot payments are introduced here.

Settlement: During the settlement process, participants are paid based on the conclusions of the probabilistic and deterministic checks. Different payment scenarios arise depending on the results of the previous verifications and challenges. If the work is deemed to have been performed correctly and all checks have passed, both solvers and verifiers are rewarded according to the operations performed.

Project Brief Review

Gensyn has designed a sophisticated game-theoretic system on the verification and incentive layers, which allows errors to be identified and rectified quickly by pinpointing divergences within the network. However, many details are still missing from the current system. For example, how should parameters be set so that rewards and penalties are reasonable without making the threshold too high? Have extreme scenarios and the differing computing power of solvers been considered in the game-theoretic design? The current version of the whitepaper also contains no detailed description of heterogeneous parallel execution. Gensyn still has a long way to go.

Together.ai

Together.ai is a company focused on open-source, decentralized AI computing solutions for large models, aiming to make AI accessible to anyone, anywhere. Strictly speaking, Together is not a blockchain project, but it has preliminarily solved the latency issues of decentralized AGI computing networks. Therefore, the following only analyzes Together's solutions and does not evaluate the project itself.

How to achieve training and inference of large models when decentralized networks are 100 times slower than data centers?

Let us imagine the distribution of GPUs participating in a decentralized network. These devices will be spread across different continents and cities and must be connected by links with varying latencies and bandwidths. As shown in the figure below, a simulated distributed scenario places devices in North America, Europe, and Asia, with differing bandwidths and latencies between them. What needs to be done to link them together effectively?

Distributed training computational modeling: The diagram below shows the situation of training a base model across multiple devices, featuring three types of communication: Forward Activation, Backward Gradient, and Lateral Communication.

Combining communication bandwidth and latency, two forms of parallelism need to be considered: pipeline parallelism and data parallelism, corresponding to the three types of communication in the multi-device scenario:

In pipeline parallelism, all layers of the model are divided into several stages, where each device processes one stage, which is a sequence of consecutive layers, such as multiple Transformer blocks. During forward propagation, activations are passed to the next stage, and during backward propagation, gradients of the activations are passed back to the previous stage.

In data parallelism, devices independently compute gradients for different micro-batches but need to synchronize these gradients through communication.
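As a rough back-of-the-envelope comparison of what each form of parallelism must ship over the network per training step (with hypothetical model dimensions; real systems overlap and compress this traffic):

```python
# Back-of-the-envelope communication volume per training step for the two
# parallelism forms described above. All model dimensions are hypothetical.

BYTES_FP16 = 2

def pipeline_traffic_gb(micro_batches: int, seq_len: int, hidden: int) -> float:
    """Activations forward + activation gradients backward across one stage boundary."""
    per_micro_batch = seq_len * hidden * BYTES_FP16
    return 2 * micro_batches * per_micro_batch / 1e9

def data_parallel_traffic_gb(params: float) -> float:
    """Gradient synchronization: roughly 2x parameter size for a ring all-reduce."""
    return 2 * params * BYTES_FP16 / 1e9

# Hypothetical 10B-parameter model, 64 micro-batches of 2048 tokens, hidden size 8192.
print(f"pipeline boundary  : {pipeline_traffic_gb(64, 2048, 8192):.1f} GB/step")   # ~4.3 GB
print(f"gradient all-reduce: {data_parallel_traffic_gb(10e9):.1f} GB/step")        # ~40.0 GB
```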

Scheduling optimization:

In a decentralized environment, the training process is often constrained by communication. Scheduling algorithms generally assign tasks that require extensive communication to devices with faster connections. Considering the dependencies between tasks and the heterogeneity of the network, it is first necessary to model the cost of specific scheduling strategies. In order to capture the complex communication cost of training base models, Together proposes a novel formulation and decomposes the cost model into two levels using graph theory:

Graph theory is a branch of mathematics that studies the properties and structures of graphs (networks). A graph consists of vertices (nodes) and edges (lines connecting nodes). The main purpose of graph theory is to study various properties of graphs, such as the connectivity, coloring, and the nature of paths and cycles in graphs.

The first level is a balanced graph partitioning problem (dividing the vertex set of a graph into several subsets of equal or nearly equal size while minimizing the number of edges between subsets). In this partitioning, each subset represents a partition, and communication costs are reduced by minimizing the edges between partitions, corresponding to the communication costs of data parallelism.

The second level involves a joint graph matching and traveling salesman problem (a combinatorial optimization problem that combines elements of graph matching and the traveling salesman problem). The graph matching problem involves finding a match in the graph that minimizes or maximizes some cost. The traveling salesman problem seeks the shortest path that visits all nodes in the graph, corresponding to the communication costs of pipeline parallelism.

The above diagram is a schematic of the process. Due to the complex calculations involved in the actual implementation, the process described in the diagram is simplified for easier understanding. For detailed implementation, one can refer to the documentation on Together’s official website.

Suppose there is a set of N devices, D, with uncertain communication delays (matrix A) and bandwidths (matrix B). Based on the device set D, we first generate a balanced graph partition. Each partition or group of devices contains approximately an equal number of devices, and they all handle the same pipeline stage. This ensures that during data parallelism, each device group performs a similar amount of work. According to the communication delays and bandwidths, a formula can calculate the “cost” of transferring data between device groups. Each balanced group is merged to create a fully connected coarse graph, where each node represents a pipeline stage, and the edges represent the communication cost between two stages. To minimize communication costs, a matching algorithm is used to determine which device groups should work together.

For further optimization, this problem can also be modeled as an open-loop traveling salesman problem (open-loop means that there is no need to return to the starting point of the path) to find an optimal path to transmit data across all devices. Finally, Together uses an innovative scheduling algorithm to find the optimal allocation strategy for the given cost model, thereby minimizing communication costs and maximizing training throughput. According to tests, even if the network is 100 times slower under this scheduling optimization, the end-to-end training throughput is only about 1.7 to 2.3 times slower.
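A heavily simplified sketch of the two-level idea (a toy stand-in for Together's actual formulation, using made-up latency and bandwidth matrices): split the devices into balanced groups for data parallelism, then search for an ordering of the groups into a pipeline that keeps the slowest links off the pipeline path.

```python
# Toy sketch of the two-level scheduling idea: balanced grouping (data parallelism)
# plus ordering the groups into a pipeline (open-loop TSP flavor). This is a
# simplified stand-in for Together's cost model, with made-up network matrices.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_devices, n_stages = 8, 4
latency   = rng.uniform(0.01, 0.3, (n_devices, n_devices))      # seconds, hypothetical
bandwidth = rng.uniform(0.1, 10.0, (n_devices, n_devices))      # GB/s, hypothetical

# Level 1: a naive balanced partition -- two devices per pipeline stage.
groups = [list(range(i, i + 2)) for i in range(0, n_devices, 2)]

def link_cost(i: int, j: int, gb: float = 1.0) -> float:
    """Cost of moving `gb` gigabytes between two devices: latency + transfer time."""
    return latency[i, j] + gb / bandwidth[i, j]

def group_cost(g1: list, g2: list) -> float:
    """Pessimistic cost between two stages: the worst cross-group link."""
    return max(link_cost(i, j) for i in g1 for j in g2)

# Level 2: brute-force the open-loop ordering of the stages (fine for 4 stages).
def pipeline_cost(order: tuple) -> float:
    return sum(group_cost(groups[a], groups[b]) for a, b in zip(order, order[1:]))

best = min(itertools.permutations(range(n_stages)), key=pipeline_cost)
print("best stage order:", best, f"cost ~ {pipeline_cost(best):.2f} s per exchange")
```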

Communication Compression Optimization:

For communication compression, Together introduced the AQ-SGD algorithm (for the detailed derivation, see the paper “Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees”). AQ-SGD is a novel activation compression technique designed to address the communication efficiency problems of pipeline-parallel training over slow networks. Unlike previous methods that directly compress activation values, AQ-SGD compresses the changes in the activation values of the same training sample across different training periods. This unique approach introduces an interesting “self-executing” dynamic: as training stabilizes, the algorithm's performance is expected to gradually improve. AQ-SGD has been rigorously analyzed and proven to have good convergence rates under certain technical conditions and bounded-error quantization functions. The algorithm can be implemented effectively without adding end-to-end runtime overhead, although it requires more memory and SSD space to store activation values.

Through extensive experiments on sequence classification and language modeling datasets, AQ-SGD has been shown to compress activations to 2-4 bits without sacrificing convergence performance. Furthermore, AQ-SGD can be integrated with state-of-the-art gradient compression algorithms to achieve “end-to-end communication compression”: data exchanges between all machines, including model gradients, forward activations, and backward gradients, are compressed to low precision, significantly improving the communication efficiency of distributed training. Compared with uncompressed end-to-end training performance in a centralized computing network (such as 10 Gbps), it is currently only 31% slower. Combined with the data on scheduling optimization, although there is still a certain gap with centralized computing networks, there is great hope of catching up in the future.
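The core idea can be sketched in a few lines (a simplified illustration of delta quantization, not the full AQ-SGD algorithm with its error-feedback and convergence machinery): instead of quantizing an activation directly, each stage boundary keeps a running buffer and only transmits a low-bit quantization of how the activation changed since the last epoch.

```python
# Simplified illustration of the delta-quantization idea behind AQ-SGD:
# quantize the *change* in a sample's activation across epochs, not the activation itself.
# This omits the paper's error-feedback details and convergence machinery.
import numpy as np

def quantize(x: np.ndarray, bits: int = 3) -> np.ndarray:
    """Uniform quantization of x to 2**bits levels over its own range."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

class ActivationCompressor:
    def __init__(self, shape):
        self.buffer = np.zeros(shape)       # receiver-side reconstruction of the last activation

    def send(self, activation: np.ndarray, bits: int = 3) -> np.ndarray:
        delta_q = quantize(activation - self.buffer, bits)  # only the change goes on the wire
        self.buffer += delta_q                              # both sides keep the same buffer in sync
        return self.buffer                                  # what the next pipeline stage sees

rng = np.random.default_rng(0)
comp = ActivationCompressor((1024,))
act = rng.standard_normal(1024)
for epoch in range(5):
    act = act + 0.05 * rng.standard_normal(1024)            # activations drift less as training stabilizes
    recon = comp.send(act)
    print(f"epoch {epoch}: reconstruction error {np.linalg.norm(recon - act):.3f}")
```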

Conclusion

In the dividend period brought by the AI wave, the AGI computing power market is undoubtedly the one with the greatest potential and the most demand among the various computing power markets. However, it also faces the greatest development difficulty, the highest hardware requirements, and the largest capital demands. Judging from the two projects introduced above, the launch of a working AGI computing power market is still some time away. Real decentralized networks are also much more complicated than the ideal scenario, and they are currently not yet able to compete with the cloud giants.

At the time of writing, I have also observed that some small-scale projects still in their infancy (i.e., at the pitch-deck stage) have begun to explore new entry points, such as focusing on the less challenging AGI inference stage rather than the training stage. In the long run, however, the significance of decentralization and permissionless systems is profound. The right to access and train AGI computing power should not be concentrated in the hands of a few centralized giants. Humanity does not need a new “theocracy” or a new “pope,” nor should it pay expensive membership fees.

Disclaimer:

  1. This article is reprinted from [YBB Capital]. All copyrights belong to the original author [Zeke]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
  3. Translations of the article into other languages are done by the Gate Learn team. Unless mentioned, copying, distributing, or plagiarizing the translated articles is prohibited.