New ‘DASH’ Framework Significantly Boosts Speed of Reproducible AI Training
A novel scheduling framework called DASH (Deterministic Attention Scheduling for High-Throughput) is poised to accelerate the development of large language models (LLMs) by dramatically improving the efficiency of deterministic training, a process crucial for reliable and reproducible results. Researchers from Shanghai Jiao Tong University and ByteDance Seed, together with colleagues, have demonstrated that DASH improves throughput by up to 1.28x, addressing a critical bottleneck in the field of artificial intelligence.
Deterministic training, which guarantees bitwise identical results across multiple runs, is essential for scientific rigor and for the practical deployment of LLMs. However, achieving this determinism often comes at a significant performance cost. As one analyst noted, “The trade-off between reproducibility and speed has been a major headache for LLM developers.” Existing attention implementations, such as FlashAttention-3, can suffer throughput reductions of up to 37.9% when deterministic backward passes are enabled, because gradient accumulation must be serialised.
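The cost stems from a basic numerical fact: floating-point addition is not associative, so partial gradients summed in a different order can differ in their last bits. Bitwise reproducibility therefore requires pinning down the accumulation order, which is what serialises the reduction. The short Python sketch below is illustrative only and is not taken from the DASH or FlashAttention-3 code:

```python
import random

# Floating-point addition is not associative: regrouping or reordering the
# same additions can change the low-order bits of the result. This is why a
# bitwise-deterministic backward pass must fix the gradient accumulation
# order, serialising the reduction.

# A concrete case, exact in IEEE-754 double precision:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: 0.6000000000000001 vs 0.6

# The same effect on a long reduction: summing identical values forward and
# in reverse typically disagrees in the last bits.
random.seed(0)
vals = [random.gauss(0.0, 1.0) for _ in range(100_000)]
print(sum(vals) == sum(reversed(vals)))
```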
The core of the problem, researchers discovered, lies in inefficient scheduling of compute and gradient-reduction phases, leading to underutilization of hardware resources. To overcome this, the team formulated the deterministic attention backward pass as a scheduling problem on a Directed Acyclic Graph (DAG), a method of representing complex workflows. This allowed them to derive schedules designed to minimise the “critical path length” – the longest sequence of operations that determines the overall execution time.
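The paper’s exact DAG formulation is not reproduced here, but the quantity being minimised is standard: the critical path is the longest chain of dependent operations, computable with a topological sort. A minimal sketch follows, with invented node names and weights standing in for per-tile compute and reduction costs:

```python
from graphlib import TopologicalSorter

# Toy DAG for a deterministic backward pass: c_i = compute for tile i,
# r_i = its serialised gradient reduction. Weights are illustrative only.
weights = {"c0": 3, "c1": 3, "c2": 3, "r0": 1, "r1": 1, "r2": 1}
deps = {                     # node -> set of prerequisites
    "r0": {"c0"},
    "r1": {"c1", "r0"},      # fixed accumulation order: r0 before r1 before r2
    "r2": {"c2", "r1"},
}

def critical_path_length(deps, weights):
    """Longest weighted path through the DAG (a lower bound on any schedule)."""
    finish = {}
    for node in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps.get(node, ())), default=0)
        finish[node] = start + weights[node]
    return max(finish.values())

print(critical_path_length(deps, weights))  # 3 + 1 + 1 + 1 = 6 with these toy numbers
```

In this toy graph the serial reduction chain r0 → r1 → r2 sits on the critical path; schedules like DASH’s aim to keep that chain as well overlapped with compute as possible.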
DASH incorporates two complementary scheduling strategies. Descending Q-Tile Iteration employs a reversed query-block traversal to reduce pipeline stalls specifically in causal attention, while Shift Scheduling provides a schedule that is theoretically optimal within the DAG model and reduces pipeline stalls for both full and causal attention masks. Both strategies target the same core issue: the misalignment between tile execution order and the fixed accumulation order, which the researchers identified as the primary cause of the performance degradation.
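The released kernels implement these strategies inside the FlashAttention-3 backward pass; the deliberately simplified sketch below only illustrates the intuition behind the reversed traversal, using a hypothetical tile count and counting KV tiles touched per query tile under a causal mask:

```python
# Under a causal mask, query tile q attends only to KV tiles 0..q, so later
# query tiles carry more work. Iterating query tiles in descending order
# (the intuition behind Descending Q-Tile Iteration, heavily simplified here)
# issues the heaviest tiles first, leaving the light ones to fill the tail of
# the pipeline rather than the pipeline stalling behind the heavy ones.
NUM_Q_TILES = 8  # hypothetical tile count, for illustration only

def causal_work(order):
    """(query tile, number of KV tiles it touches) in the given issue order."""
    return [(q, q + 1) for q in order]

print("ascending :", causal_work(range(NUM_Q_TILES)))
print("descending:", causal_work(reversed(range(NUM_Q_TILES))))
```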
Experiments conducted on NVIDIA H800 GPUs using CUDA 12.6 and Triton 3.4 demonstrated DASH’s effectiveness. The team benchmarked performance with a fixed total of 16,384 tokens, varying sequence lengths from 512 to 16,384, and tested a hidden dimension of 2,048 with head dimensions of 64 and 128, all using BF16-precision random inputs. The results showed a consistent improvement in throughput, with DASH substantially narrowing the performance gap between deterministic and non-deterministic attention.
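The benchmark harness itself is not shown in the article; a minimal sketch of the configuration grid it describes might look like the following, where the power-of-two sequence-length steps and the batch size of total_tokens / seq_len are assumptions for illustration, not details taken from the paper:

```python
import torch

# Benchmark grid as described: a fixed budget of 16,384 tokens, sequence
# lengths from 512 to 16,384 (power-of-two steps assumed), hidden dimension
# 2,048, head dimensions 64 and 128, BF16 random inputs. Splitting the token
# budget as batch = TOTAL_TOKENS // seq_len is an assumption.
TOTAL_TOKENS = 16_384
HIDDEN_DIM = 2_048

for seq_len in (512, 1024, 2048, 4096, 8192, 16384):
    for head_dim in (64, 128):
        batch = TOTAL_TOKENS // seq_len
        num_heads = HIDDEN_DIM // head_dim
        q = torch.randn(batch, num_heads, seq_len, head_dim, dtype=torch.bfloat16)
        k = torch.randn_like(q)
        v = torch.randn_like(q)
        # ...invoke the deterministic attention forward/backward under test...
        print(f"seq_len={seq_len:>5}  head_dim={head_dim:>3}  "
              f"batch={batch:>2}  heads={num_heads}")
```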
Detailed analysis revealed that at a sequence length of 16,384 and a KV block size of 128, computation was distributed across 128 Streaming Multiprocessors (SMs), since 16,384 tokens divided into KV blocks of 128 yields 128 blocks, one per SM. However, the researchers also observed that inter-SM communication latency became a limiting factor: accesses to remote L2 cache segments cost from roughly 200 to more than 500 cycles. While Shift Scheduling offered computational benefits, it proved more sensitive to this communication overhead at such extreme parallelism.
“The key insight was realizing that the performance gap wasn’t inherent to serialisation itself, but rather a result of suboptimal tile scheduling and a rigid accumulation order,” explained a senior researcher involved in the work. By modelling the deterministic backward pass as a DAG, the team was able to design strategies that shorten the critical path, balance the workload, and reduce contention during the serial reduction operations.
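That insight can be made concrete with a toy cost model (not the paper’s): compute for each tile runs in parallel across a few workers, while the per-tile gradient reductions must run serially in a fixed order. When the launch order of the compute matches the accumulation order, the serial chain overlaps with ongoing compute; when it does not, the chain stalls behind the last tile to finish:

```python
def makespan(launch_order, num_workers=2, compute_time=3, reduce_time=1):
    """Toy model: each tile needs a compute step (parallel across workers)
    followed by a reduction step that must run in fixed tile order
    (tile 0's reduction, then tile 1's, ...). Returns the total finish time."""
    n = len(launch_order)
    workers = [0.0] * num_workers            # greedy list scheduling of compute
    compute_done = {}
    for tile in launch_order:
        w = min(range(num_workers), key=lambda i: workers[i])
        workers[w] += compute_time
        compute_done[tile] = workers[w]
    t = 0.0                                  # reductions run serially, in tile order
    for tile in range(n):
        t = max(t, compute_done[tile]) + reduce_time
    return t

tiles = list(range(8))
# Launching compute in the same order as the reduction lets the serial chain
# overlap with ongoing compute; the misaligned (reversed) launch stalls the
# whole chain behind the last tile to finish, lengthening the makespan.
print("aligned launch   :", makespan(tiles))
print("misaligned launch:", makespan(list(reversed(tiles))))
```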
The open-sourcing of the DASH code at https://github.com/SJTU-Liquid/deterministic-FA3 facilitates further research and adoption within the LLM community. This move is expected to accelerate innovation and collaboration in the field.
The work establishes a new benchmark for deterministic LLM training, offering a pathway to more efficient and reliable large-scale model development. The findings suggest that a nuanced approach, balancing theoretical optimality with practical hardware constraints, is essential for maximising performance in this domain. By providing a suite of scheduling strategies, DASH empowers practitioners to achieve high-throughput attention while maintaining reproducibility in LLM training.
