Alex Ratner: Accelerating AI with Programmatic Data Labeling

June 25, 2024 00:37:44


Show Notes

Ninad Singh, Director at Ayna.AI, interviews Alex Ratner, CEO of Snorkel AI. They discuss programmatic data labeling, automated annotation, and AI customization. Alex highlights Snorkel AI's transformative role in AI development, scaling startups, and improving performance through specialized data solutions. Snorkel AI aims to revolutionize AI training and development, making it more efficient and tailored to business needs.

Alex Ratner is the CEO of Snorkel AI, a pioneering company that transforms AI development by programmatically labeling and curating data. With a PhD from Stanford, where the Snorkel project began, and a physics degree from Harvard, Alex blends deep technical expertise with practical experience in finance and consulting. Under his leadership, Snorkel AI has raised $135 million, serving diverse industries like banking and government. His unique skills and strategic insights make him a key figure in the AI space.


Ayna Insights is brought to you by Ayna.AI—a managed service provider that combines domain expertise and transformation capabilities to create alpha—performance superior to market indices—in the industrial and industrial technology sector. The host of this episode, Ninad Singh, is a Director at Ayna.AI.


For More Information

Alex Ratner LinkedIn

Snorkel AI

Ayna.AI Website

Ninad Singh LinkedIn

Episode Transcript

[00:00:03] Speaker A: Welcome to Ayna Insights, where prominent leaders and influencers shaping the industrial and industrial technology sector discuss topics that are critical for executives, boards and investors. Ayna Insights is brought to you by Ayna.AI, a firm focused on working with industrial companies to make them unrivaled segment-of-one leaders. To learn more about Ayna.AI, please visit our website at www.ayna.ai.
[00:00:40] Speaker B: Good morning everyone. Welcome to another episode of our AI startup podcast hosted by Ayna.AI. Our guest today is Mr. Alex Ratner, who's CEO of Snorkel AI. Snorkel AI makes AI development fast and practical. By transforming manual AI development processes into programmatic solutions, Snorkel AI enables enterprises to develop AI that works for their unique workloads using their proprietary data and knowledge 10 to 100 times faster. Five of the top ten U.S. banks, several government agencies, and Fortune 500 companies use Snorkel today. Snorkel has raised $135 million so far and currently employs over 100 people. Prior to Snorkel, Alex completed his PhD at Stanford and was involved in both the InfoLab and the AI Lab, which is where the Snorkel project really began. He also has a degree in physics from Harvard. Alex, welcome to our podcast. We're super excited to have you and are looking forward to talking about your journey so far and getting your thoughts on what's next.
[00:01:48] Speaker C: Yeah, well, thanks so much for having me on the podcast. Really excited to be here.
[00:01:53] Speaker B: Alex, for those not familiar, could you just explain what Snorkel does?
[00:01:58] Speaker C: Yeah, so at Snorkel we build what we call a data development platform for AI. So we help companies label and curate their data and use that to teach and adapt and actually also evaluate AI models and AI applications for their specific domain and use case. I'll take a step back and just give a little bit more detail there.
AI basically has three main ingredients, and this is true in the large language model, ChatGPT era, so to speak. It was true in the deep learning era and before that. It's a basic fact of how AI works, or rather the subset called machine learning, which is really what everyone means when they say AI. So, three ingredients. There's the models and algorithms, which is the actual AI black box that takes in a question and outputs an answer, for example. There's the infrastructure and compute, which powers the U.S. economy through the Nvidia stock rocket ship. And then there's the data. The data is often the messiest, but the most critical and unique part of all of this. Actually, the big tech companies have spent billions of dollars in the last couple of years alone labeling and curating data to teach AI, basically, what to do. And so if the models and algorithms are kind of like the brain, and the infrastructure is, I don't know, the body that gives all that blood and oxygen and fuel, the data is, well, it's all the information coming in. If you think about an AI model, like a GPT-style model, it's kind of like a really smart college student today. Data is how you actually onboard them, how you teach them and train them to be an underwriter, a lawyer, a doctor, a specialist, whoever is actually going to power critical use cases in your [business]. And we help curate that curriculum, and label and develop that data that goes in to teach and train AI.
[00:03:53] Speaker B: That's very interesting. So our audience is predominantly industrial companies. What's interesting to me is that you work not only with pure-play AI development firms, but also with larger Fortune 500 companies that are not necessarily tech or AI focused. What are those companies doing with their data?
[00:04:12] Speaker C: It's a great question. So there are two pieces to what companies do with their data. Just taking a step back.
We work with a whole range of verticals, actually less so in industrials right now, which is part of, obviously, the interest in getting some contact with a new area. And we're always very excited to learn and partner with our first couple of customers in a new vertical, in a really mutually beneficial way, to explore and dive deep there. But worth noting, we're in about ten different verticals, so banking is one of our biggest, with six of the top ten U.S. banks. Now, these are a lot of areas where the data and the use cases are (a) a little bit more domain specific, so they're not out there on the Internet, where today's big ChatGPT-style models are trained, and (b) really important to get right. Or at least they have use cases where accuracy, reliability, and trustworthiness are really important, meaning it's not okay to have a really impressive-sounding chatbot that makes a crazy hallucination 15% of the time. It needs to be very accurate. And again, those kinds of settings are where the training and the tuning (think about onboarding the college student to become a specialist) is really important. So that's a little bit of intuition. Okay, so now back to your question. What are our customers, and the data science teams that we primarily interact with within them, and AI and data science teams more broadly, doing with enterprise data? I'd say it goes two ways. One is they're using data to teach AI to work well in their settings, and then two, they're using AI to help leverage and extract and parse value from their data. So it goes both ways.
The first way is what I was talking about before. If you brought on a really smart new grad hire and you were trying to onboard them or train them to become a specialist in a critical use case, say in the industrial setting, you'd have to teach them, you'd have to give them training manuals, you'd have to show them examples of how to correctly perform the specialist functions that you're trying to train them for. That's what we call AI training data or tuning data. And it has to be carefully labeled and curated to teach AI to do the right thing. It's also one of the biggest areas of real non-commodity value in the space. Look at how AI has progressed, and I mean this all in the most positive and excited way, because it's an incredible time for the field. But the models and the algorithms and the infrastructure that they run on have largely standardized and commoditized. The models that people have been using to power GPT-4 and all the other models out there are something called a transformer model, if you want to get in the weeds: a fairly standard architecture that's now been used for a number of years. The infrastructure is pretty readily available. There are still bottlenecks around GPUs and all of that, but that'll ease, and more or less it's standardized and available. The real delta in the space is your unique data as an enterprise and your unique subject matter expertise. And short of every company in the world deciding to open source their data and their knowledge and give it away for free, give away their crown jewels, that's not going to be commoditized, that's not going to be standardized, that's not going to be one size fits all. We are a development platform that helps enterprises curate and prepare all of this data, all of their crown jewels, for training and customizing AI to work really well and uniquely well in their type of environment. That's direction one. Then direction two.
I'll mention more briefly: one of the things that AI in general has gotten really good at is parsing data. So a lot of the use cases for AI are around extracting certain things from unstructured data like text or image data: tagging, classifying, answering questions, summarizing, et cetera. So it might be a little bit confusing, but really the two go together, because if you want to use AI to do all of this parsing and leveraging of your unique data, you need to teach it to work well on that data. So it goes both ways. But that's definitely one of the biggest areas of use cases, both for our customers and for users of AI in general.
[00:08:35] Speaker B: That's really interesting. That's not something I thought about earlier, about it working in your environment. Alex, you have the unenviable position of being our first infrastructure-related guest. We've had a lot of folks who developed applications or different wrappers on our podcast before. Could you just spend a couple of minutes explaining the entire value chain in AI/ML and how you fit in? I know you mentioned the college kid example. I think that was a great example to use.
[00:09:07] Speaker C: Yeah. So, I mean, I'll give a simplified view of the stack. Obviously it's a little Snorkel-centric and data-centric, given the perspective it's coming from. But if we take a step back, most of this does come down to those three key ingredients of the infrastructure and compute, the models and algorithms, and the data, just in different flavors. So if you think about the AI stack today, let's start with what's usually pictured at the bottom. If you think about the stack, the bottom is usually the hardware, where the stuff actually trains and runs, and the top is kind of the application layer, where it actually gets turned into something that is usable in a business process or by a consumer. So, the bottom layer in the kind of typical AI stack. Well, I say typical, but there's a lot in flight these days.
And so I think people are still kind of figuring things out. But in our view, and I think I'd say the majority view, at the bottom of the stack you think about the actual hardware, and there's a ton of activity going on, very disruptive activity in exciting ways, around hardware to run AI, both for training it or tuning it and for running it in production. As a quick double click there (you cut me off when I'm going into uninteresting territory), just as a step back, AI models today generally go through a couple of distinct stages. There's pre-production: there's something called pre-training, and with large language models like ChatGPT, this is actually a decades-old approach, but it's been scaled up incredibly over the last couple of years and yielded benefits to said scaling beyond what any of us, including us stodgy academics, ever thought possible. So, hugely exciting. But pre-training is basically sending one of these models out on the Internet and just soaking up data and really just trying to learn to predict patterns in text or image data on the Internet, say. And it turns out if you do that at sufficient scale, these models actually learn lots of generalizable skills and knowledge. I think people attribute more to them than they actually learn, which gets to the hype cycle around all this, but they still learn an incredible amount. So that's often called pre-training. Then there's usually a post-training stage. This, in disclosure, is where we usually operate. That's when you're now taking that college student, let's call them, and you're turning them into a specialist. You're training them to be an underwriter at X Insurance company who knows exactly the right protocols and is really, really accurate and specialized at their job. Terms like fine-tuning, instruction tuning, alignment, RLHF, these are kind of buzzwords that you may have heard about, even prompting as well. All of that I'd bucket into the post-training stage.
You've got this generic generalist base; now you've got to adapt it and tune it for your specific setting. That's where we fit in. That's where most enterprises who are a step ahead are customizing AI. Most, other than maybe a handful of the biggest ones, are not going in and retraining their own giant models. They're not doing pre-training. They're doing post-training on top of either closed or open source base models. Think about it like buying a suit and then tailoring it. Post-training is like the tailoring stage. And then there's serving, or people often use the phrase inference time, when you have that model that works, or at least it works for a little while before things change in the world, and you're now serving it. So pre-training, post-training and inference all have different profiles, and they will probably start to run (we're seeing this happen) on different types of hardware, but let's just blur that all together. They all require lots and lots of compute. So there's the hardware layer, where there's a ton of innovation. You've got GPUs obviously dominating, but you have a lot of new innovative chip architectures that were basically custom built for today's AI model architecture. You can see there's a bunch of head-to-head between companies like Groq and SambaNova in the AI news these days, racing to see how fast they can run AI models at inference time and how cheap they can make the training. So you've got all that hardware innovation. Then you've got the compute platforms that actually enable you to run the models and the algorithms, whether that be the pre-training, the post-training, or just serving and using them in production. There are some specialized vendors that do this. A lot of the LLM providers, like OpenAI, are opening up their own APIs for doing these operations. And then you've got the model platforms like a Vertex or a SageMaker or an Azure. We've got all that.
Then there's obviously the models themselves. That's one big part of the stack. So there's a ton of activity. You've got OpenAI and Anthropic and folks who are building closed proprietary models, and you've got a truly massive amount of activity in open source around open source models. We can get back into that, but you've got that chunk of activity. And then you get to the data stuff, and I bucket that into two pieces. One is connecting the models to the data that you need. In this category are things like vector databases, if you've heard that kind of phrase, or RAG is a term there. This is basically all of the piping that lets the model go and retrieve some data from somewhere. Think about this like the step where you give that new college grad access to all the company wikis, databases, et cetera, right? It's pretty hard for them to do their job, pretty hard for even an expert underwriter or doctor to do their job, if they don't have access to the basic data, the patient's chart, all the databases. So all of that piping and those systems, that's a big chunk of activity. Then the next layer is where we would put ourselves, which is: okay, great. You've got your model, your generic uncustomized model. You've got it connected to all the relevant data it needs to pull from, vector databases, traditional databases, again, often referred to as RAG. You've got all these information retrieval systems and data infrastructure pipes in place that get the right context to the model. Now you've got to train it to do the job it's supposed to do accurately, and you've got to evaluate it and then go back and retrain it when something changes in the world that needs to be updated. And that's where data and data labeling play such a big part. That's where we play. And then I put the last layer as: okay, now you've got this AI model that's been evaluated and it's up to spec, it's been tuned to your spec.
Now you've got to plug it into some kind of interface, some kind of business process, some kind of workflow. That's what we think of as the application layer. And I'd note, I think that is one of the most critical areas for innovation, because you can have a model with so-so accuracy, but if you use it in a creative way that's not going to mess up anything if it makes a mistake, you can get a lot of value out of it, like a copilot, for example, that just offers hints for helping you with doing software development. Conversely, you can have a really, really accurate model, but if you don't put it in a usable interface or plug it into the right business process or workflow, it's going to sit on the shelf and be useless. So I think the application layer is also where a lot of the vertical-specific knowledge comes in. So if we look at that stack (and I'll wrap up here) and think about it from the perspective of an enterprise, maybe getting into AI and asking what should I buy versus where do I have some unique value, a lot of this stuff is really standardizing. It's probably the data development layer, where you're teaching the model about your specific business, your data, your knowledge, and the application layer, where you're plugging it into something valuable for your business, where most of the specialization, and therefore the differentiated value for companies looking to invest in AI, is going to be reaped. The rest is really standardizing in a really powerful way.
[00:17:12] Speaker B: That's really interesting. I love how you kept the common theme of our college grad learning how to be a useful part of society, but I think you're completely right. I think most people only see the application or wrapper layer, in a slightly more selfish way: how is this new technology helpful to me? But understanding the different components is just so important in actually making it usable.
If you're not willing to invest in the right kind of data labeling and the right kind of tuning, or post-training as you put it, to make it usable and build that accuracy, the application will never get there.
[00:17:49] Speaker C: Yeah, and it depends, use case to use case. This is one of the most important things right now. Most of the customers we work with are, again, often central data science teams. We interact a lot with the lines of business that they serve in these organizations. They have a massive amount of interest from the line of business in all kinds of applications, and they're trying to triage. And so one important thing is understanding what tool to use for what job. There are some use cases, and the intuition I would give is to think about things that are not so bespoke or specific to your domain or your industry, and that are a little bit softer in terms of what happens if there's an error (no one dies, no one gets fined), where an off-the-shelf model and a wrapper around it is probably more than good enough. And then there are these really bespoke, critical use cases where that's not the case, and where a lot of this careful evaluation, tuning, and customization is probably going to be critical to go from flashy demo to production. As you can guess, we live in that latter category, but again, they're different tools for different jobs.
[00:18:58] Speaker B: Absolutely. So in all of this, it seems like data labeling is critical, just making it more usable in very specific domains. How are you helping automate the space at Snorkel?
[00:19:11] Speaker C: Yeah, so we say that our mission is to make data labeling and data development for AI programmatic, like software development. So our view is that, I mean, if you take a step back, and it's not really just our view anymore.
We've been working on this, first at Stanford, at the University of Washington, where I'm on faculty, and at the company, obviously, for about a decade: this idea of data being the key interface to getting AI to learn something. That's kind of come to a head in the LLM era, because LLMs, if you just take a step back (you don't need to understand the details), are just massively complex. Whether they have 7 billion or 700 billion parameters, think of those as knobs to tune them. They're well beyond the point where you can just go and tweak some of the parameters to get the model to do something differently. The metaphor is that a lot of the old tools we used to use in data science were like performing brain surgery to teach that college student to become an underwriter. The models are fairly black box right now, and there are lots of pros to that. But the way that you get AI to work is not really about tweaking the model or the algorithm these days. And it's not just blindly scaling up more compute on more random public data. It's really all about the data that you feed in. So if we think about data in that way (I'm arcing back to your question, I promise), data is like the programming language for AI. It's basically the interface. It's how you get it to do something. And if you take that perspective, then you come away with two conclusions. And this is what has motivated us for the last decade of work. Number one, it can't just be automagicked away. In other words, we're not going for full automation, and we view those kinds of claims as somewhat nonsensical. We're trying to make it as efficient as possible for someone, say a subject matter expert in industrials who knows what the AI model needs to do to actually be useful, to tell the model what to do, with the right data going in to program it.
So for us, it's about acceleration of the human loop of the data development to teach the model what to do, not totally automating it away, given that data is the programming interface for AI, in our view. The second thing, though, is: okay, if it's the programming interface, it's kind of like we're all coding in binary right now. Right now, data labeling is done very manually. There's one massive tranche of effort where basically the big tech and LLM training companies are spending literal billions of dollars on outsourced data labeling, where someone is sitting across the globe in a room saying thumbs up, thumbs down on what some chatbot says, and I'm not exaggerating the aggregate spends there. Now, conversely, most of our enterprise customers can't just ship their data out to some untrained experts. So instead, they're begging their internal subject matter experts to do that data labeling, and it's just too slow and too clunky. Making it look more like software development, so it can be 10 to 100 times faster, with some degree of automation but still with a human in the loop, is what we do. So we have these subject matter experts basically codify their knowledge: say, if I see this kind of pattern in this response, then I think it's a bad response, and if I see this, I think it's a good response. And again, I could go much deeper into how Snorkel's approach works. But think of it as trying to make this data labeling look like software development: fast, iterative, and also auditable and adaptable, rather than one click at a time, which is what it looks like today.
[00:22:38] Speaker B: Got it. Hey, we've all been there, or at least I have, sitting in a manual Excel sheet and labeling something. So I think what you're talking about sounds very exciting. Also, I want to take a step back. We've spoken about the AI space for a little bit.
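The "if I see this kind of pattern in this response, then I think it's a bad response" idea described in this part of the conversation can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the rule names, the patterns they look for, and the simple majority vote that combines them are hypothetical and are not Snorkel's actual product or API.

```python
from collections import Counter

# Label values; ABSTAIN means "this rule has no opinion on this response".
GOOD, BAD, ABSTAIN = 1, 0, -1


def lf_contains_apology(response: str) -> int:
    """Hypothetical expert rule: an apology or refusal suggests a bad response."""
    return BAD if "i'm sorry" in response.lower() else ABSTAIN


def lf_cites_policy(response: str) -> int:
    """Hypothetical expert rule: citing a policy section suggests a good response."""
    return GOOD if "policy section" in response.lower() else ABSTAIN


def lf_too_short(response: str) -> int:
    """Hypothetical expert rule: answers under five words are probably unhelpful."""
    return BAD if len(response.split()) < 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_apology, lf_cites_policy, lf_too_short]


def label(response: str) -> int:
    """Combine the rules' votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(response) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[GOOD] == counts[BAD]:
        return ABSTAIN
    return GOOD if counts[GOOD] > counts[BAD] else BAD


if __name__ == "__main__":
    for text in [
        "I'm sorry, I can't help.",
        "Per policy section 4.2, the claim is covered up to the stated limit.",
        "See policy section 2.",
    ]:
        print(label(text), text)
```

Because each rule is ordinary code, an expert can edit one rule and relabel an entire dataset in one pass, which is the "fast, iterative, auditable" property the conversation points to. A production system would likely weigh the rules by estimated reliability rather than take a plain majority vote.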
As you help other people scale their businesses and their models, I want to know what have been the biggest challenges you've faced as you scale your own business. Because every vertical, I'm sure, is completely different.
[00:23:07] Speaker C: Oh, man. It's a great question, and obviously we've been capitalized in order to be on a certain kind of growth trajectory that's been very fast, and we've been fortunate enough to get to hang on and perform well against that growth trajectory. So I think for us as a business, I don't know. I think the biggest challenge (this is meta) is that the challenges change nearly every month at the rate that we're growing. You mentioned 100 people; it's 150 people now. We haven't even updated our bio blurb, apparently, because things move pretty quickly when you're a startup in a space like this. So if I interpreted your question correctly in terms of challenges for us, just as a business, as a startup, I'd say the meta one is that the challenges change so frequently. It's going from prototyping and building to scaling out on a pretty rapid curve, right? Take the cycle time for a startup in go-to-market, for example. It goes from, hey, let's get the basic messaging for this new product or feature or vertical, for how we sell our platform to folks in the industrial vertical, let's get this worked out, where it's just a couple of people sitting in a room. And then a couple of months later it's, how do we scale this out as an enablement and training program for all of our sales reps? It's that kind of quick turn from really rapid prototyping to scale-out. That's, I think, one of the fun challenges of being a startup. I may have misinterpreted the question, though, so let me pause there and see if that's kind of what you were angling for.
[00:24:44] Speaker B: No, absolutely. I think that's exactly what we're looking for.
I think it's just always interesting to hear about different industries and what kind of challenges they're facing. So I love that little tidbit. Alex, Snorkel's been around since 2019, but I think you mentioned in the blog that this project has truly been around since about 2015. That's a long time in the AI world. How has the programmatic data labeling space changed since then?
[00:25:15] Speaker C: It's a true point. The characteristic timescale of the AI space these days, at least from what you can follow on Twitter, is two minutes. By that measure, we're pretty ancient. I'd say that a lot has changed, but less has changed than people think, by quite a bit. And so, actually, to go back to company challenges, one of the ones that we've faced over the last year or two, and done some things right and some things wrong, moving regardless, is how much to lean into that feeling of change versus how much to stay the course, given that a lot of the underlying fundamentals, in our view, are not changing as much as people think. So what do I mean by that? I'll give you an example. The term large language model, LLM, is one of the most common terms. We also like the term, out of Stanford, and we like it at Snorkel too, of a foundation model, the metaphor being that these GPT-style models are kind of the foundations that you build the house on top of, but you still have some house building to do. There's another metaphor for you. I'm using just all the metaphors; I'll throw in the kitchen sink here. But these LLMs, that term has existed for 20 years, and we've had a tremendous scale-up. We have gone from spending a couple thousand bucks on a training run to companies spending billions of dollars training these models. So, an incredible scale-up and an immense amount of innovation around the techniques and everything around it. And that scale-up has yielded bigger leaps in performance than any of us expected.
We didn't expect this degree of mass hype, for better and worse both, that ChatGPT spurred. So that's the kind of change in the space. But at the same time, those of us who have been in AI for more than two minutes know that these techniques have been scaled up, but the fundamental changes are a lot more minimal than people think. I'll give you an example. We've been helping customers label data to tune and evaluate large language models for the entirety of the company, and before that at Stanford. But I'll just talk about the company, the last four and a half years or so. Now, the large language models from three years ago would now be called small language models because of the scale-up. But a lot of the techniques don't change. I think the biggest change in the space has really been new use cases where the performance just was not good enough before, and where now the performance is approaching good enough; they have been unlocked by the scale-up. So, a lot of the generative use cases like summarization and multi-turn dialogue and question answering with long-form answers. But still, a lot of the techniques for getting these things to work, and specifically how you use labeled data to tune and evaluate them, have not fundamentally changed as much as people think. So in many ways, we've been building a flurry of new stuff, but a lot of the fundamentals of the role that data plays have not changed a ton. I think that we definitely see the market and enterprises in general arcing back to that realization. We're going through a hype cycle. Last year we were on the ascent up the first peak of the hype cycle, and there was a lot of: are we going to need data labeling? Are we going to need data scientists? Are we going to need humans? Or is this the singularity and we're all obsolete? And we held our course because we knew that this was a bit of overhype.
It's good that it brings attention to the space, and bad that it missets expectations, which most of our customers are now struggling with. Basically everyone had a chatbot demo that floated up to usually their C-suite, even at these major companies. It set expectations sky high and got them budget, but now there's incredible pressure to perform. And of course, folks are now running into the last mile, where these models need to be more carefully evaluated, tuned, and adapted, and labeled data is one of the key ingredients there. This has been a curve that we've partly reacted to quite a bit by adding tons of new features that really just integrate the latest LLM models and techniques, and also adapt our platform for the use cases that are now more popular because they're more feasible as the models have gotten more capable. But the core technology and the core positioning have really not changed, and it's more been waiting for folks to ride the hype cycle back to where they are now, which is seeing what actually needs to go into getting AI to go to production, which is largely about data.
[00:30:03] Speaker B: Sounds to me like the changes over the last ten years have really reinforced the need for what Snorkel does. You spoke about the high expectations. It's really the last mile now that a lot of these different models and applications require to become truly useful and scaled in an enterprise fashion.
[00:30:24] Speaker C: Yeah, and let's take a step back from data labeling specifically. I think this goes for a lot of AI infrastructure companies. A lot of the difficulty of getting AI models to production has eased, which makes it faster to get them to do useful things. Maybe we were starting from a model that was at 40% out of the box, or 0% out of the box before, and now it's starting at 70% out of the box.
But if you look at the space, even if you say that, hypothetically, the net reduction in difficulty of getting models to production reduces the workload for infrastructure companies in AI, the net expansion in the number of use cases that have been unlocked, both by the excitement and by the increased capabilities of AI, is a hundredfold bigger, in our view, than the fact that it's a little bit faster and maybe a little bit less data is needed to get a model to production. So most of us in the space, I mean, look, the AI space is going to get a little shaken up and bloody, and it's already happening. There are a lot of very non-robust, I mean, look, bluntly, there have been hundreds of millions, actually billions, of investment dollars poured into companies that are going to be obsolete next year. But those of us that are anchored on some of the fundamental areas of AI, some of those basic ingredients that we went back to at the beginning of the session, are very, very excited, as you could guess, about these shifts. And I think it's a win-win. It has gotten in some ways easier and faster for customers, which means they can move to production more quickly. But there's still lots of development and infrastructure building that's needed, and there's 100x as much stuff to do out there, business for us and use cases for customers, than there was before. So it's a rising tide for infra companies and for customer data science teams in the space, in our view, absolutely. [00:32:28] Speaker B: Alex, I think that's been a very interesting deep dive into AI infrastructure and the entire ecosystem. I'd love to switch tracks and focus on your journey so far. Could you tell me a little bit more about your background and your career before Snorkel?
[00:32:45] Speaker C: Yeah, I mean, I'll give a truncated version because I know we're running close to time, but yeah, I guess you mentioned the physics degree, which I'm horrified whenever anyone brings it up because I'm worried I'm going to get asked a physics question and reveal how much I remember of that phase of my education. I worked in finance and consulting for a bit after college. I've coded since I was a kid, so I always loved software development. It was kind of a random thing that got me back onto the current course. I was doing some stuff looking into the patent corpus, and it's a fascinating area. I mean, lots of ugliness around it in terms of litigation and the like, but fascinating that literally on a thumb drive you can fit everything that anyone has ever thought worth patenting. At least I think on a thumb drive, or at least if you discard the images. I'm not tracking thumb drives these days and what their storage capacity is, but fascinating that you could have this much knowledge that is accessible but not really usable because it's really, really difficult. I remember this example sticks out in my mind: even just normalizing all the different ways that IBM is referred to in the patent corpus, which sounds trivial, but is actually really difficult. Even that's nontrivial. Everything with natural language and with complex technical concepts like that embedded in text and natural language is just super, super difficult. So that fascinated me. The fact that you have all this information, this pile of riches, so close, but it's so difficult to pull anything out of it. That led me into an area called natural language processing, and at the time, this was around 2013, people were just starting to apply statistical or machine learning, or now what we would call AI techniques, including smaller language models.
They were just starting back in maybe 2012, 2013, or it was just starting to become more popularized. That fascinated me. So that was my inroad back in. I started a company that was a good learning experience, let's put it at that. When that wound down, I clawed my way back to academia, kind of continued this course at Stanford, and that's when we started the Snorkel project. And I'll wrap up by just saying I think the big epiphany there that really put us on the current course was just seeing all of the pain around data and data labeling that no one in the AI space was paying attention to. Data was, and to some degree still is, viewed as janitorial. It's viewed by data science and AI teams as not what we do. And I mean, that's kind of the core social arbitrage at the center of our, you know, business hypothesis: relative to the incredible importance of data and data curation and labeling and quality on AI performance, we're still not paying enough attention to it and still viewing it as not our problem in the AI space. And it's been our mission over the last decade to shift the center of gravity of AI development back to the data and build good tooling for that. So that was kind of the last step of the current Snorkel journey. [00:35:51] Speaker B: Alex, that's been super interesting. The AI, and specifically the Gen AI, space is really having a big moment. You've spoken about this, but as someone in the space, some of these advancements might not be completely new to you. Is there an application or use case that you're most excited by right now and look forward to in the future? [00:36:11] Speaker C: Well, there's so many. And again, we're a horizontal dev platform, so we support so many in so many different verticals. I would just say, I think using AI to get value out of unstructured data. This is a longstanding focus of AI, but it's really been accelerated by the recent advances.
Still, you need to train and teach and tune and develop these models, usually to get them to work on your most unique and complex data. But most enterprises we work with, most industries, have this iceberg under the surface, all this unstructured data, and being able to unlock that and use it and access it with AI is the one that most excites me. [00:36:51] Speaker B: Thanks for your time, Alex. I think everyone had something to take away from your conversation. This is the first time we've actually dug deeper into AI infrastructure. So thank you for your patience and time today. [00:37:02] Speaker C: No, thank you so much for humoring my AI rambles and for inviting me onto the podcast. It was awesome to get to meet and get to chat. [00:37:09] Speaker B: Perfect. [00:37:09] Speaker C: Thanks so much. [00:37:16] Speaker A: Thanks for listening to Ayna Insights. Please visit Ayna.AI for more podcasts, publications, and events on developments shaping the industrial and industrial technology sector.
