Turbo Management System
Turbo
Turbo serves as a computational cluster management platform enabling users to orchestrate and schedule tasks across multiple machines. ByteCompute GPU Clusters come with Turbo pre-installed for immediate distributed training capabilities, while also permitting custom scheduler integration. Computing tasks are submitted to Turbo's central control node, where the scheduler dynamically allocates resources to available GPU nodes according to current utilization.
Turbo Basic Concepts
- Jobs: Computational workloads submitted to the cluster, encompassing scripts, executable programs, or other task-based operations.
- Nodes: Individual computing units within the cluster that execute jobs, which may be implemented as physical hardware or virtualized environments.
- Head Node: The central access point preconfigured in every ByteCompute GPU Cluster, where users authenticate to create, submit, and retrieve computational tasks.
- Partitions: Logical groupings of nodes designated for job execution, configurable with distinct attributes including node quantity and memory allocation (see the sketch after this list).
- Priorities: Ranking mechanisms that establish job execution order, granting precedence to higher-priority tasks over lower-priority ones.
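These concepts can be inspected directly from the head node. The sketch below assumes Turbo ships the Slurm-compatible sinfo utility alongside sbatch and srun (sinfo is not mentioned above, so treat it as an assumption), and the partition name gpu is a placeholder.

```bash
# Show the partitions defined on the cluster and the state of their nodes
# (assumes Turbo includes a Slurm-compatible sinfo utility).
sinfo

# Restrict the view to a single (hypothetical) partition named gpu.
sinfo --partition=gpu
```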
Using Turbo
- Job Submission: Jobs can be submitted to the cluster in batch mode using the sbatch command, or run interactively using the srun command (see the sketch after this list).
- Job Monitoring: Jobs can be monitored using the squeue command, which displays information about the jobs that are currently running or waiting to run.
- Job Control: Jobs can be controlled using the scancel command, which allows users to cancel or interrupt jobs that are running (an example follows the submission sketch).
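Because jobs enter the cluster through sbatch and srun, a batch submission can be written as a small shell script. The sketch below assumes Turbo accepts Slurm-style #SBATCH directives; the job name, the gpu partition, the node count, and the script names train_demo.sh and train.py are placeholders, not values defined by Turbo.

```bash
#!/bin/bash
#SBATCH --job-name=train-demo        # name shown in squeue output
#SBATCH --partition=gpu              # placeholder partition name
#SBATCH --nodes=2                    # number of nodes to allocate
#SBATCH --output=train-demo.%j.log   # %j expands to the job ID

# Launch the (hypothetical) training script on the allocated nodes.
srun python train.py
```

The script would be submitted with sbatch train_demo.sh, and sbatch prints the assigned job ID. For short interactive sessions, srun --partition=gpu --pty bash requests a shell on an allocated node, again assuming Slurm-compatible flags.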
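For monitoring and control, the squeue and scancel commands can be used directly from the head node. A minimal sketch, assuming Slurm-compatible flags (-u to filter by user) and using 12345 as a placeholder job ID:

```bash
# List every job in the queue, then only your own jobs.
squeue
squeue -u $USER

# Cancel a single job by ID, or all of your queued and running jobs at once.
scancel 12345
scancel -u $USER
```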
Troubleshooting Turbo
- Error Messages: Diagnostic messages generated by Turbo to help identify and resolve problems.
- Log Files: Records written by the cluster that can be used to monitor its status and diagnose issues from its operational history (see the sketch after this list).
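When a job fails or sits in the queue, the commands introduced above can be combined with the job's own output file as a first diagnostic pass. A minimal sketch, assuming Slurm-compatible squeue and scontrol utilities and using 12345 and train-demo.12345.log as placeholders:

```bash
# Show the state of a specific job and, if it is pending, the reason.
squeue -j 12345 --long

# Dump the full job record, including the node list and exit code.
scontrol show job 12345

# Follow the job's output file (the path set via --output in the batch script).
tail -f train-demo.12345.log
```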
