Job Description
Qualifications:
- Minimum 5 years of relevant experience in performance testing, system optimization, and HPC environments.
- Proficiency in Linux system administration, including cluster setup and management.
- Hands-on experience with Kubernetes (K8S) for container orchestration in AI/ML workloads.
- Familiarity with CUDA and GPU configurations for AI/ML performance optimization.
- In-depth knowledge of high-speed networking (e.g., InfiniBand, Ethernet) and related technologies.
- Understanding of AI/ML frameworks such as PyTorch and TensorFlow, and of deployment requirements for large language models (LLMs).
- Ability to conduct performance testing and benchmarking for servers, GPUs, and HPC systems.
- Capability to design, configure, and troubleshoot network topologies and components.
- Experience with server troubleshooting and monitoring.
- Familiarity with virtualization (KVM, etc.), network file servers, Linux commands and OS maintenance, and building and maintaining Docker services and K8S platforms.
