aiacademy: AI environment, clkao
Tags: aiacademy
-
What you will learn today
Problems
-
Setting up a machine learning environment is hard
-
Reproducible research requires a lot of tooling
-
It may take a long time (months) to apply AI/ML research results in production
ModelOps
Paradigm: Interactivity & Agile
Jupyter Architecture
-
Wish list
- A colleague of clkao's picks up a new language every year
JupyterHub (on K8s) Architecture
Container & Orchestration
- Containers: Isolated process space, filesystem
- Orchestrator: Decide where to run things
- Kubernetes: declare desired state
- Container Networking: connect to (or isolate from) other pods
- Persistent Storage: when a container requires a persistent filesystem
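The "declare desired state" idea can be illustrated with a toy reconcile loop. This is a minimal sketch, not how Kubernetes is implemented: `Cluster`, `reconcile`, and the app names are hypothetical, and a real controller would talk to an API server instead of mutating a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster: maps an app name to its running replica count."""
    running: dict = field(default_factory=dict)


def reconcile(desired: dict, cluster: Cluster) -> list:
    """Compare desired state with actual state and return the actions taken."""
    actions = []
    for app, want in desired.items():
        have = cluster.running.get(app, 0)
        if have < want:
            actions.append(f"start {want - have} x {app}")
        elif have > want:
            actions.append(f"stop {have - want} x {app}")
        cluster.running[app] = want
    # Anything still running that is no longer declared gets removed.
    for app in list(cluster.running):
        if app not in desired:
            actions.append(f"stop {cluster.running[app]} x {app}")
            del cluster.running[app]
    return actions


# Hypothetical example: we declare what we want, not the steps to get there.
desired_state = {"notebook": 3, "hub": 1}
cluster = Cluster(running={"notebook": 1, "old-job": 2})
first_pass = reconcile(desired_state, cluster)
second_pass = reconcile(desired_state, cluster)  # already converged, nothing to do
```

The second pass returns no actions, which is the point of the declarative model: running the loop again on a converged cluster is harmless.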
-
docker
- Built on top of Linux's LXC
-
Docker and Kubernetes Orchestration
kubernetes
GPU
- OpenCL: open standard
- NVIDIA: proprietary
aiacademy: JupyterHub
- Keycloak
-
Spawner
- JupyterHub Routing
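The routing step can be sketched as a longest-prefix lookup: requests under `/user/<name>/` go to that user's spawned server, and everything else falls through to the Hub. This is a simplified sketch under that assumption; the `route` function and the backend addresses are hypothetical and are not taken from the aiacademy deployment.

```python
def route(path: str, routes: dict) -> str:
    """Return the backend for the longest route prefix that matches `path`."""
    best = max(
        (prefix for prefix in routes if path.startswith(prefix)),
        key=len,
        default=None,
    )
    if best is None:
        raise LookupError(f"no route for {path!r}")
    return routes[best]


# Hypothetical routing table: "/" is the default route to the Hub,
# and one entry is added per running single-user server.
routes = {
    "/": "http://hub:8081",
    "/user/alice/": "http://jupyter-alice:8888",
    "/user/bob/": "http://jupyter-bob:8888",
}

alice_target = route("/user/alice/lab", routes)
login_target = route("/hub/login", routes)
# carol has no running server, so the request falls back to the Hub,
# which can then decide to spawn one.
carol_target = route("/user/carol/tree", routes)
```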
Components
-
JupyterHub + JupyterLab
- kubespawner
- per-project storage, GPU quota
-
Keycloak: SSO
-
GitLab: course material and data management
- Image Building & Registry
-
Custom DaemonSet
- git-sync
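To make the kubespawner bullet above (per-project storage, GPU quota) more concrete, here is a rough sketch of the kind of Pod manifest a spawner could produce for one user. It is not kubespawner's actual code: `notebook_pod`, the image name, the claim name, and the mount path are hypothetical. `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin for Kubernetes.

```python
def notebook_pod(user: str, project: str, image: str, gpus: int, storage_claim: str) -> dict:
    """Build a Pod manifest (as a plain dict) for one user's notebook server."""
    limits = {}
    if gpus > 0:
        # GPU quota: the scheduler only places the pod on a node with free GPUs.
        limits["nvidia.com/gpu"] = gpus
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"jupyter-{user}", "labels": {"project": project}},
        "spec": {
            "containers": [{
                "name": "notebook",
                "image": image,
                "resources": {"limits": limits},
                # Per-project storage is mounted into the user's work directory.
                "volumeMounts": [{"name": "project-data", "mountPath": "/home/jovyan/work"}],
            }],
            "volumes": [{
                "name": "project-data",
                "persistentVolumeClaim": {"claimName": storage_claim},
            }],
        },
    }


# Hypothetical users and projects, for illustration only.
gpu_pod = notebook_pod("alice", "vision", "example/notebook:latest", 1, "pvc-vision")
cpu_pod = notebook_pod("bob", "intro", "example/notebook:latest", 0, "pvc-intro")
```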
What can possibly go wrong?
-
What can cause cascade failures?
- SPoF!
- storage: needs to be distributed & HA
- e.g. Ceph
- hub failures
- issues that never fully die
- storage: needs to be distributed & HA
- hardware failure: memory, GPU card, power
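One way to reason about cascade failures and SPoFs is to walk a dependency graph and see what goes down with each component. The sketch below does that with a breadth-first search. The `depends_on` edges are invented for illustration and do not describe the real aiacademy setup.

```python
from collections import deque


def cascade(failed: str, depends_on: dict) -> set:
    """Return every component that goes down, directly or transitively, when `failed` fails."""
    # Invert the edges: for each component, record who depends on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in down:
                down.add(nxt)
                queue.append(nxt)
    return down


# Hypothetical dependency graph: component -> things it needs in order to work.
depends_on = {
    "hub": ["storage", "keycloak"],
    "notebook": ["hub", "storage"],
    "gitlab": ["storage"],
    "keycloak": [],
    "storage": [],
}

storage_outage = cascade("storage", depends_on)
gitlab_outage = cascade("gitlab", depends_on)
```

In this toy graph a storage outage takes down everything that depends on it, which is why the notes insist that storage be distributed and HA, while a GitLab outage stays contained.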
-
Topics clkao recommends learning
The Littlest JupyterHub
- TLJH (The Littlest JupyterHub) vs. Z2JH (Zero to JupyterHub with Kubernetes)