Survey on Issues During AI Training and Development
The goal of this survey is to gain a deeper understanding of the issues that solo developers, small teams, and researchers currently experience during AI training and development.
What best describes you?
*
Solo Developer
Researcher
A member of a small development team or startup
A member of a medium development team (50-199 employees)
A member of a large development team (200+ employees)
Other
Where do you run most of your training?
*
AWS
GCP
Azure
RunPod
Lambda
Paperspace
Vast
Other
What's your typical setup?
*
Single GPU
Multi-GPU single node
Multi-node
Other
Typical GPUs you rent/use
*
A100
H100
L40S
4090
Other
In the last 30 days, how many training runs were disrupted by infrastructure issues?
0
1
2-3
4-5
6-10
10+
If you've had a failed run, what caused it? (Please check all that apply.)
*
I've had only successful runs this month.
Driver/CUDA mismatch
Random crash mid-training
NCCL hang / multi-GPU communication issues
VM wouldn’t see GPU / nvidia-smi broken
Training was much slower than expected
OOM that “shouldn’t” happen
Spot/preempt termination
Disk/IO bottleneck killed throughput
Network bottleneck (multi-node or dataset)
Other
How long did it take you to determine that the infrastructure, not your code, caused the issue?
*
Immediately (<5 min)
5-15 min
15-60 min
1-4 hours
Other
How much time does a typical incident cost you (on average, per incident)?
*
<15 min
15-60 min
1-2 hours
2-4 hours
4-8 hours
1 day
Other
How often do you abandon a VM and start fresh because debugging feels too slow?
*
Never
Rarely
Sometimes
Often
Almost Always
Have you experienced any of the following silent problems?
*
GPU was throttling (thermal/power)
PCIe/NVLink/NCCL bandwidth way below expected
One GPU in a multi-GPU box underperformed
Intermittent device disconnect (“GPU disappeared without cause”)
Random NaNs/instability that felt like hardware
Performance varied wildly across the “same” instance type
Other
Have you ever been in a situation where comparing models and model versions was a major bottleneck?
*
Yes
No
I don't know
Have you ever been in a situation where analyzing data during or after training was a major bottleneck?
*
Yes
No
I don't know
What do you use today to track/compare runs?
*
Local Logs
TensorBoard
W&B
MLflow
Aim
Other
What's the biggest pain point with your current approach?
*
Too much manual work
Hard to compare runs
Setup is heavy
Too slow to query
Data is scattered
Privacy
SaaS constraints
Cost
Other
Which of these questions would you most want answered automatically?
*
Why is this run slower?
Is it cheaper per 1M tokens/per epoch?
Regression vs baseline
GPU underutilized
Memory increased/OOM risk
NCCL/multi-GPU comm degraded
Instability increased (e.g., NaNs/crashes)
Other
If you've had any other issues or pain points regarding infrastructure or initial training setup (VMs, GPUs, etc.) that you wish were solved, please enter them below.
Not required
If you've had any other issues or pain points regarding performance metrics, analytics, or logging during training, please enter them below.
Not required
We are incredibly grateful for your time. If you would like to know when the project is ready so that your development becomes easier, please enter your email and we will give you free early access.
Not required
Submit