This is an exciting topic for me to cover, so I hope you enjoy it! One area many of us practitioners don't have direct experience with is the impact of our operations on Machine Learning and Artificial Intelligence workloads. Perhaps we provide the infrastructure the data scientists are using, but we're then kept largely separate from what goes on under the surface and behind the scenes.
Anyone who has dealt with the "Virtualization" problem of the past 20+ years knows it all stemmed from under-utilized or wholly unused resources sitting within the infrastructure; resources that, if collected and pooled together, could deliver better performance, resilience, and real financial savings. What makes this especially fun is that we can now DO the same thing with GPUs for ML and AI!
We're often shielded from the infrastructure that runs these workloads because those teams "do their own thing" and "know what they're doing", a claim that has been disproven time and time again. (Interestingly, not all data scientists are infrastructure or operations savvy…) That's part of what makes this particularly compelling. Compound that with the fact that these teams are frequently silo'd amongst themselves. (Remind me sometime to tell you about the 100+ machine learning engineers at one company who had never met each other and ran separate infrastructure for their workloads…)
More often than not, that is the reality of how GPUs (and frankly TPUs, VPUs, and more) are operated: dedicated and sitting under-utilized, something we pretty much "took care of" in the CPU space thanks to virtualization and countless other efficiencies over the decades.
So what is very cool is that the technology originally announced and showcased at VMworld 2019 through VMware's acquisition of Bitfusion is NOW shipping: it arrives in July as part of vSphere 7!
How this works is that instead of the 1 User : 1 Node relationship that is very much the common model today, whether on-premises or on cloud GPU instances, we pool the resources and carve up the GPUs so researchers, scientists, and teams collectively have access to more resources and can accomplish much more with the same or an even smaller footprint.
The way this works today (while things are in beta) is that the researcher, scientist, engineer, whoever, uses the bitfusion command to request however many GPU resources they want or need, as sketched below.
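Here is a minimal sketch of what that looks like from the researcher's side. The flag names and the tiny script are my illustrative assumptions based on how the Bitfusion CLI has been demonstrated, not an exact transcript of the shipping syntax; the key point is that the training code itself doesn't change, only how it is launched.

```python
# Hypothetical sketch: a plain PyTorch script the researcher already has.
# With Bitfusion, the script is simply launched through the bitfusion CLI,
# which attaches pooled (remote) GPUs to this client for the duration of
# the run, for example (flags illustrative):
#
#   bitfusion run -n 2 -- python train.py          # ask for 2 full GPUs
#   bitfusion run -n 1 -p 0.5 -- python train.py   # ask for half of one GPU
#
import torch

def main():
    # The script just sees whatever GPUs Bitfusion attached for this session.
    count = torch.cuda.device_count()
    print(f"Visible GPUs for this session: {count}")
    for i in range(count):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")

if __name__ == "__main__":
    main()
```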
It is my understanding that, just like traditional virtualization, resources are shared to give you more flexibility, so when a single researcher decides to request ALL of the resources it won't starve out the rest of the cluster.
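To make that intuition concrete, here is a toy model of my own (not Bitfusion's actual scheduler) of a shared pool where requests are expressed as GPU fractions; an oversized request simply waits its turn rather than starving everyone else.

```python
# Toy illustration of shared, fractional GPU allocation -- NOT Bitfusion's
# real scheduler, just the intuition behind "one giant request can't starve
# the rest of the cluster."
from dataclasses import dataclass, field

@dataclass
class GpuPool:
    total: float                 # total capacity in "whole GPU" units
    used: float = 0.0
    waiting: list = field(default_factory=list)

    def request(self, who: str, amount: float) -> bool:
        """Grant the request if capacity allows; otherwise queue it."""
        if self.used + amount <= self.total:
            self.used += amount
            print(f"{who}: granted {amount} GPU(s), pool at {self.used}/{self.total}")
            return True
        self.waiting.append((who, amount))
        print(f"{who}: queued (asked for {amount}, only {self.total - self.used} free)")
        return False

pool = GpuPool(total=4.0)          # e.g. a host with 4 pooled GPUs
pool.request("researcher-a", 0.5)  # partial GPU
pool.request("researcher-b", 1.0)
pool.request("greedy-user", 4.0)   # wants everything -> waits, others keep working
pool.request("researcher-c", 0.5)
```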
And as the benchmarks show, the performance degradation is relatively negligible whether operating with a full GPU or a partial GPU with multiple workers. The benefits scale profoundly and will let teams accomplish far more, versus letting infrastructure sit idle, collecting depreciation.
And this is just the beginning: there's integration with Jupyter Notebooks out of the gate, plus a number of other 'recipes' available in the Bitfusion community today. Here are some videos to get you started and expose you to some of the VERY cool capabilities here, so you can bring value and benefit to a part of the organization that has been building silos within silos!
I know this is a lot to ingest, so take it in byte-sized pieces of course! This opens the door to applying virtualization's knack for collapsing waste to another realm that has been ripe for it for some time! We'll dig deeper into this and other technologies in the future. Did you find this useful? Want to know more and dig deeper? Let us know in the comments below! Thanks!