Survey on Issues During AI Training and Development
The goal of this survey is to gain a deeper understanding of the issues that solo developers, small teams, and researchers currently experience during AI training and development.
What best describes you?
*
Solo Developer
Researcher
A member of a small development team or startup
A member of a medium development team (50-199 employees)
A member of a large development team (200+ employees)
Other
Where do you run most of your training?
*
AWS
GCP
Azure
RunPod
Lambda
Paperspace
Vast
Other
What's your typical setup?
*
Single GPU
Multi-GPU single node
Multi-node
Other
Typical GPUs you rent/use
*
A100
H100
L40S
4090
Other
In the last 30 days, how many training runs were disrupted by infrastructure issues?
0
1
2-3
4-5
6-10
10+
If you've had a failed run, what caused it? (Please check all that apply.)
*
I've had only successful runs this month.
Driver/CUDA mismatch
Random crash mid-training
NCCL hang / multi-GPU communication issues
VM wouldn’t see GPU / nvidia-smi broken
Training was much slower than expected
OOM that “shouldn’t” happen
Spot/preempt termination
Disk/IO bottleneck killed throughput
Network bottleneck (multi-node or dataset)
Other
How long did it take you to determine that the infrastructure, not your code, caused the issue?
*
Immediately (<5 min)
5-15 min
15-60 min
1-4 hours
Other
How much time does a typical incident cost you (on average, per incident)?
*
<15 min
15-60 min
1-2 hours
2-4 hours
4-8 hours
1 day
Other
How often do you abandon a VM and start fresh because debugging feels too slow?
*
Never
Rarely
Sometimes
Often
Almost Always
Have you experienced any of the following silent problems?
*
GPU was throttling (thermal/power)
PCIe/NVLink/NCCL bandwidth way below expected
One GPU in a multi-GPU box underperformed
Intermittent device disconnect (“GPU disappeared without cause”)
Random NaNs/instability that felt like hardware
Performance varied wildly across the “same” instance type
Other
Have you ever been in a situation where comparing models and model versions was a major bottleneck?
*
Yes
No
I don't know
Have you ever been in a situation where analyzing data during or after training was a major bottleneck?
*
Yes
No
I don't know
What do you use today to track/compare runs?
*
Local Logs
TensorBoard
W&B
MLflow
Aim
Other
What's the biggest pain point with your current approach?
*
Too much manual work
Hard to compare runs
Setup is heavy
Too slow to query
Data is scattered
Privacy
SaaS constraints
Cost
Other
Which of these questions would you most want answered automatically?
*
Why is this run slower?
Is it cheaper per 1M tokens/per epoch?
Regression vs baseline
GPU underutilized
Memory increased/OOM risk
NCCL/multi-GPU comm degraded
Instability increased (e.g., NaNs/crashes)
Other
If you've had any other issues or pain points regarding infrastructure or initial training setup (VMs, GPUs, etc.) that you wish were solved, please enter them below.
Not required
If you've had any other issues or pain points regarding performance metrics, analytics, or logging during training, please enter them below.
Not required
We are incredibly grateful for your time. If you would like to know when the project is ready so that your development becomes easier, please enter your email and we will give you free early access.
Not required
Submit