
SE Radio 677: Jacob Visovatti and Conner Goodrum on Testing ML Models for Enterprise Products

Jacob Visovatti and Conner Goodrum of Deepgram speak with host Kanchan Shringi about testing ML models for enterprise use and why it's critical for product reliability and quality. They discuss the challenges of testing machine learning models in enterprise environments, especially in foundational AI contexts. The conversation particularly highlights the differences in testing needs between companies that build ML models from scratch and those that rely on existing infrastructure. Jacob and Conner describe how testing is more complex in ML systems due to unstructured inputs, varied data distribution, and real-time use cases, in contrast to traditional software testing frameworks such as the testing pyramid.
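The contrast with the testing pyramid can be made concrete: a traditional unit test checks a fixed input against a fixed output, while evaluating an ML model typically means scoring aggregate quality across slices of the data distribution. The sketch below is not from the episode — the metric, slice names, and evaluation set are invented for illustration — but it shows the shape of such a check for a hypothetical speech-to-text model, using a rough word-error-rate score per slice:

```python
import difflib
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Approximate WER via difflib's matching blocks.

    A production metric would use true word-level edit distance
    (e.g., Levenshtein); this sketch only needs a comparable
    per-slice score.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)

# Hypothetical evaluation set: each item is tagged with a
# distribution slice (domain, accent, audio condition, ...).
eval_set = [
    {"slice": "call-center", "ref": "thank you for calling",
     "hyp": "thank you for calling"},
    {"slice": "call-center", "ref": "please hold the line",
     "hyp": "please hold line"},
    {"slice": "medical", "ref": "patient reports chest pain",
     "hyp": "patient report chess pain"},
]

# Aggregate per slice rather than pass/fail per input.
per_slice = defaultdict(list)
for item in eval_set:
    per_slice[item["slice"]].append(word_error_rate(item["ref"], item["hyp"]))

for name, scores in per_slice.items():
    print(f"{name}: mean WER {sum(scores) / len(scores):.2f}")
```

The point of the sketch is the structure: quality is a statistic over a labeled distribution, not a single assertion, so regressions show up as a worsening slice average rather than a failing test case.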

To address the difficulty of ensuring LLM quality, they advocate for iterative feedback loops, robust observability, and production-like testing environments. Both guests underscore that testing and quality assurance are interdisciplinary efforts that involve data scientists, ML engineers, software engineers, and product managers. Finally, this episode touches on the importance of synthetic data generation, fuzz testing, automated retraining pipelines, and responsible model deployment—especially when handling sensitive or regulated enterprise data.
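As one illustration of the fuzz-testing idea mentioned above, here is a minimal sketch. Everything in it is hypothetical — `transcribe` is a stand-in, not Deepgram's API — and the harness asserts only the weakest useful property: random malformed byte blobs never crash the system and always yield a well-formed response:

```python
import random

def transcribe(audio_bytes: bytes) -> dict:
    """Stand-in for a real speech-to-text call (hypothetical).

    A real harness would invoke the model service; here we only
    validate input and return a structured result, which is all
    the fuzz loop asserts on.
    """
    if not audio_bytes:
        return {"error": "empty input"}
    # Pretend decoding: any bytes produce some (possibly empty) transcript.
    return {"transcript": "", "confidence": 0.0}

def fuzz_transcribe(trials: int = 1000, seed: int = 0) -> None:
    """Feed random byte blobs; assert no crash and a well-formed reply."""
    rng = random.Random(seed)
    for _ in range(trials):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        result = transcribe(blob)
        assert isinstance(result, dict)
        assert "transcript" in result or "error" in result

fuzz_transcribe()
print("fuzz run passed")
```

A seeded generator keeps failures reproducible, which matters more in ML systems than elsewhere because the interesting failures are often input-distribution edge cases rather than logic bugs.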

Brought to you by IEEE Computer Society and IEEE Software magazine.


Show Notes

Related Episodes

  • SE Radio 534: Andy Dang on AI / ML Observability
  • SE Radio 610: Phillip Carter on Observability for Large Language Models

Other References


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Kanchan Shringi 00:00:19 Hello all. Welcome to this episode of Software Engineering Radio. Our guests today are Conner Goodrum and Jacob Visovatti. Conner is a Senior Data Scientist and Jacob is a Senior Engineering Manager at Deepgram. Deepgram is a foundational AI company specializing in voice technology and enabling advanced voice applications across many businesses and sectors, including healthcare and customer service. Deepgram solutions include conversational AI agents. Welcome to the show, Conner and Jacob. Before we get started, is there anything you’d like to add to your bio, Conner?

Conner Goodrum 00:00:55 No, that about sums it up. Thanks very much for having me. Excited to talk today.

Kanchan Shringi 00:00:59 Jacob?

Jacob Visovatti 00:01:00 No, thank you. Likewise. Very excited to be here. Glad I’ve got my man Conner right alongside me.

Kanchan Shringi 00:01:05 Thank you. So our topic and our focus today is testing ML models for enterprise use cases, enterprise products. Just to set context, could you explain the relationship between a data science model, an ML model, and an LLM?

Conner Goodrum 00:01:26 Well, I would say that everybody’s got their own vernacular about how all these things fit together. Largely the way that I consider them, an LLM is just one type of ML model and similarly we use data science approaches to train various types of models, one of which could be an LLM, but they all have their sort of specific use cases and applications.

Jacob Visovatti 00:01:47 Yeah, maybe just to build on that Conner, when we think about the field of data science, I guess I could say traditionally, even though it’s a relatively new discipline, I think we see a lot of initial applications that maybe grew almost out of the big data movement that was the key buzzword about 10, 15, 20 years ago, right? And we see things like teams of analysts inside a larger enterprise that are developing models maybe to forecast revenue growth across market segments. And we have generally well-structured inputs applied to a narrow range of questions and mostly for an internal audience. And of course there’s a lot of people doing great work there. And I don’t mean to oversimplify how complex that kind of work can be, it’s extremely hard stuff and forecasting revenues is pretty darn important for any company to get right. And I think what’s really interesting now and what I think provokes this kind of conversation is that we now see the intense productization of those techniques at a greater scale, especially insofar as they more and more approximate human intelligence and therefore are justifiably called AI. So when we think about machine learning models in this context, we’re thinking about things like accepting unstructured data, and the model’s output is no longer a limited set of results that are going to be curated and delivered in human time to a known audience, but it’s going to be delivered in real time to wide audiences with consumer focuses without any human in the loop checking on those results in the meantime, which of course exposes a whole host of concerns on the quality front.

Kanchan Shringi 00:03:23 Thanks for that Jacob. So I think that leads me to my next question. Given this expanded focus, is that what leads companies to think of themselves as an AI-first company or a foundational AI company and what is the relation between these two terms?

Jacob Visovatti 00:03:41 I think justifiably AI-first companies are those whose product really revolves around delivering value to their end customer through some kind of AI tooling. I think that the really useful designation or distinction that you brought up there is foundational versus not. So, there are a lot of “AI-first” companies that are delivering really cool products that are built on top of other more foundational technologies. And the difference between some of those companies that are doing really neat things and a company like Deepgram or other big players in the space, like OpenAI and Anthropic, is we’re developing new models from scratch — maybe influenced by what’s going on across the industry, informed by the latest developments in the research world, the academic world, but we’re essentially developing new things from scratch, empowering other people to build all sorts of applications on top of almost infrastructural AI pieces.

Kanchan Shringi 00:04:36 The kind of testing that a foundational AI company has to do is also different from what an AI-first company that uses AI infrastructure would do, and the latter would probably build upon the testing that a foundational AI company has in place. Is that a fair summarization?

Conner Goodrum 00:04:56 Absolutely. I would say in building upon other people’s models, it’s easy to sort of point the finger when something goes wrong and be able to say, oh well, we’re using this provider’s model to do this part of our software stack and therefore we can really only test inputs and outputs. Being on the foundational side, we really have the control to be able to go in and tweak parameters or adjust the model itself in an attempt to design problems out rather than working around them. And that’s a huge, huge advantage.
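Conner’s point about only being able to test inputs and outputs can be sketched as a black-box contract test. Everything below is hypothetical — `third_party_model` is a stub standing in for a vendor API — and the test pins down only the response shape and value ranges the downstream product depends on, since the model’s internals are out of reach:

```python
def third_party_model(text: str) -> dict:
    """Stub standing in for an external provider's model endpoint
    (hypothetical; a real test would call the vendor API)."""
    return {"label": "positive" if "good" in text else "negative",
            "score": 0.9}

def test_output_contract() -> None:
    """Black-box check: we cannot tweak the vendor's weights, so we
    assert only the response schema and value ranges we rely on."""
    for text in ["good service", "terrible wait times", ""]:
        out = third_party_model(text)
        assert set(out) == {"label", "score"}
        assert out["label"] in {"positive", "negative"}
        assert 0.0 <= out["score"] <= 1.0

test_output_contract()
print("contract test passed")
```

A foundational provider, by contrast, can go beneath this boundary — retraining, adjusting parameters, or changing the model architecture — which is exactly the distinction the guests are drawing.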

[...]

