Latent Space: Founders, Engineers, and News on Software 3.0, DevTools, Computer Vision, Data Science, AI UX

Alessio + swyx
The podcast by and for AI Engineers! We are the first place over 50k developers hear news and interviews about Software 3.0 - Foundation Models changing every d...

Available episodes
  • Building the AI × UX Scenius — with Linus Lee of Notion AI
Read: https://www.latent.space/p/ai-interfaces-and-notion

Show Notes
* Linus on Twitter
* Linus' personal blog
* Notion
* Notion AI
* Notion Projects
* AI UX Meetup Recap

Timestamps
* [00:03:30] Starting the AI / UX community
* [00:10:01] Most knowledge work is not text generation
* [00:16:21] Finding the right constraints and interface for AI
* [00:19:06] Linus' journey to working at Notion
* [00:23:29] The importance of notations and interfaces
* [00:26:07] Setting interface defaults and standards
* [00:32:36] The challenges of designing AI agents
* [00:39:43] Notion deep dive: "Blocks", AI, and more
* [00:51:00] Prompt engineering at Notion
* [01:02:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: And today we're not in our regular studio. We're actually at the Notion New York headquarters. Thanks to Linus. Welcome. [00:00:28]Linus: Thank you. Thanks for having me. [00:00:29]Swyx: Thanks for having us in your beautiful office. It is actually very startling how gorgeous the Notion offices are. And it's basically the same aesthetic. [00:00:38]Linus: It's a very consistent aesthetic. It's the same aesthetic in San Francisco and the other offices. It's been for many, many years. [00:00:46]Swyx: You take a lot of craft in everything that you guys do. Yeah. [00:00:50]Linus: I think we can, I'm sure, talk about this more later, but there is a consistent kind of focus on taste that I think flows down from Ivan and the founders into the product. [00:00:59]Swyx: So I'll introduce you a little bit, but also there's just, you're a very hard person to introduce because you do a lot of things. You got your BA in computer science at Berkeley. Even while you're at Berkeley, you're involved in a bunch of interesting things at Replit, CatalystX, Hack Club and Dorm Room Fund. I always love seeing people come out of Dorm Room Fund because they tend to be very entrepreneurial. You were a product engineer at IdeaFlow, a resident at Betaworks. You took a year off to do independent research and then you've finally found your home at Notion. What's one thing that people should know about you that's not on your typical LinkedIn profile? [00:01:39]Linus: Putting me on the spot. I think, I mean, just because I have so much work kind of out there, I feel like professionally, at least, anything that you would want to know about me, you can probably dig up, but I'm a big city person, but I don't come from the city. I went to school, I grew up in Indiana, in the middle of nowhere, near Purdue University, a little suburb. I only came out to the Bay for school and then I moved to New York afterwards, which is where I'm currently. I'm in Notion, New York. But I still carry within me a kind of love and affection for small town, Indiana, small town, flyover country. [00:02:10]Swyx: We do have a bit of indulgence in this. I'm from a small country and I think Alessio, you also kind of identified with this a little bit. Is there anything that people should know about Purdue, apart from the chickens? [00:02:24]Linus: Purdue has one of the largest international student populations in the country, which I don't know. I don't know exactly why, but because it's a state school, the focus is a lot on STEM topics. 
Purdue is well known for engineering and so we tend to have a lot of folks from abroad, which is particularly rare for a university in, I don't know, that's kind of like predominantly white American and kind of Midwestern state. That makes Purdue and the surrounding sort of area kind of like a younger, more diverse international island within the, I guess, broader world that is Indiana. [00:02:58]Swyx: Fair enough. We can always dive into sort of flyover country or, you know, small town insights later, but you and I, all three of us actually recently connected at AIUX SF, which is the first AIUX meetup, essentially which just came out of like a Twitter conversation. You and I have been involved in HCI Twitter is kind of how I think about it for a little bit and when I saw that you were in town, Geoffrey Litt was in town, Maggie Appleton in town, all on the same date, I was like, we have to have a meetup and that's how this thing was born. Well, what did it look like from your end? [00:03:30]Linus: From my end, it looked like you did all of the work and I... [00:03:33]Swyx: Well, you got us the Notion. Yeah, yeah. [00:03:36]Linus: It was also in the Notion office, it was in the San Francisco one and then thereafter there was a New York one that I decided I couldn't make. But yeah, from my end it was, and I'm sure you were too, but I was really surprised by both the mixture of people that we ended up getting and the number of people that we ended up getting. There was just a lot of attention on, obviously there was a lot of attention on the technology itself of GPT and language models and so on, but I was surprised by the interest specifically on trying to come up with interfaces that were outside of the box and the people that were interested in that topic. And so we ended up having a packed house and lots of interesting demos. I've heard multiple people comment on the event afterwards that they were positively surprised by the mixture of both the ML, AI-focused people at the event as well as the interface HCI-focused people. [00:04:24]Swyx: Yeah. I kind of see you as one of the leading, I guess, AI UX people, so I hope that we are maybe starting a new discipline, maybe. [00:04:33]Linus: Yeah, I mean, there is this kind of growing contingency of people interested in exploring the intersection of those things, so I'm excited for where that's going to go. [00:04:41]Swyx: I don't know if it's worth going through favorite demos. It was a little while ago, so I don't know if... [00:04:48]Alessio: There was, I forget who made it, but there was this new document writing tool where you could apply brushes to different paragraphs. [00:04:56]Linus: Oh, this was Amelia's. Yeah, yeah, yeah. [00:04:58]Alessio: You could set a tone, both in terms of writer inspiration and then a tone that you wanted, and then you could drag and drop different tones into paragraphs and have the model rewrite them. It was the first time that it's not just auto-complete, there's more to it. And it's not asked in a prompt, it's this funny drag-an-emoji over it. [00:05:20]Linus: Right. [00:05:21]Swyx: I actually thought that you had done some kind of demo where you could select text and then augment it in different moods, but maybe it wasn't you, maybe it was just someone else [00:05:28]Linus: I had done something similar, with slightly different building blocks. I think Amelia's demo was, there was sort of a preset palette of brushes and you apply them to text. 
I had built something related last year, I prototyped a way to give people sliders for different semantic attributes of text. And so you could start with a sentence, and you had a slider for length and a slider for how philosophical the text is, and a slider for how positive or negative the sentiment in the text is, and you could adjust any of them and have the language model reproduce the text. Yeah, similar, but continuous control versus distinct brushes, I think is an interesting distinction there. [00:06:03]Swyx: I should add it for listeners, if you missed the meetup, which most people will have not seen it, we actually did a separate post with timestamps of each video, so you can look at that. [00:06:13]Alessio: Sorry, Linus, this is unrelated, but I think you built over a hundred side projects or something like that. A hundred? [00:06:20]Swyx: I think there's a lot of people... I know it's a hundred. [00:06:22]Alessio: I think it's a lot of them. [00:06:23]Swyx: A lot of them are kind of small. [00:06:25]Alessio: Yeah, well, I mean, it still counts. I think there's a lot of people that are excited about the technology and want to hack on things. Do you have any tips on how to box, what you want to build, how do you decide what goes into it? Because all of these things, you could build so many more things on top of it. Where do you decide when you're done? [00:06:44]Linus: So my projects actually tend to be... I think especially when people approach project building with a goal of learning, I think a common mistake is to be over-ambitious and sort of not scope things very tightly. And so a classic kind of failure mode is, you say, I'm really interested in learning how to use the GPT-4 API, and I'm also interested in vector databases, and I'm also interested in Next.js. And then you devise a project that's going to take many weeks, and you glue all these things together. And it could be a really cool idea, but then especially if you have a day job and other things that life throws your way, it's hard to actually get to a point where you can ship something. And so one of the things that I got really good at was saying, one, knowing exactly how quickly I could work, at least on the technologies that I knew well, and then only adding one new unknown thing to learn per project. So it may be that for this project, I'm going to learn how the embedding API works. Or for this project, I'm going to learn how to do vector stuff with PyTorch or something. And then I would scope things so that it fit in one chunk of time, like Friday night to Sunday night or something like that. And then I would scope the project so that I could ship something, as much work as I could fit into a two-day period, so that at the end of that weekend, I could ship something. And then afterwards, if I want to add something, I have time to do it and a chance to do that. But it's already shipped, so there's already momentum, and people are using it, or I'm using it, and so there's a reason to continue building. So only adding one new unknown per project, I think, is a good trick. [00:08:14]Swyx: I first came across you, I think, because of Monocle, which is your personal search engine. And I got very excited about it, because I always wanted a personal search engine, until I found that it was in a language that I've never seen before. [00:08:25]Linus: Yeah, there's a whole tower of little tools and technologies that I built for myself. 
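To make the slider idea above concrete: a minimal sketch of continuous semantic controls feeding a rewrite prompt, assuming the OpenAI chat completions API and a placeholder model name rather than whatever Linus actually built.

```python
# Minimal sketch of "sliders for semantic attributes of text": map continuous
# slider values into a rewrite instruction for a language model.
# The client and model below are assumptions for illustration, not Linus' stack.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def rewrite_with_sliders(text: str, length: float, philosophical: float, sentiment: float) -> str:
    """Each slider is a value in [0, 1]; values near 0.5 leave that attribute alone."""
    def level(v: float, low: str, high: str) -> str:
        if v < 0.4:
            return f"noticeably {low}"
        if v > 0.6:
            return f"noticeably {high}"
        return "about the same"

    instruction = (
        "Rewrite the text, keeping its core meaning, but make it:\n"
        f"- length: {level(length, 'shorter', 'longer')}\n"
        f"- tone: {level(philosophical, 'more concrete', 'more philosophical')}\n"
        f"- sentiment: {level(sentiment, 'more negative', 'more positive')}\n"
        "Return only the rewritten text."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

# Example: nudge a sentence to be longer, more philosophical, slightly more positive.
# print(rewrite_with_sliders("The meeting went fine.", length=0.8, philosophical=0.7, sentiment=0.6))
```

The contrast with the brush demo is that the controls are continuous rather than a preset palette; the model is simply re-asked to reproduce the text at the new settings.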
One of the other tricks to being really productive when you're building side projects is just to use a consistent set of tools that you know really, really well. For me, that's Go, and my language, and a couple other libraries that I've written that I know all the way down to the bottom of the stack. And then I barely have to look anything up, because I've just debugged every possible issue that could come up. And so I could get from start to finish without getting stuck in a weird bug that I've never seen before. But yeah, it's a weird stack. [00:08:58]Swyx: It also means that you probably are not aiming for, let's say, open source glory, or whatever. Because you're not publishing in the JavaScript ecosystem. Right, right. [00:09:06]Linus: I mean, I've written some libraries before, but a lot of my projects tend to be like, the way that I approach it is less about building something that other people are going to use en masse. And make yourself happy. Yeah, more about like, here's the thing that I built, if you want to, and often I learn something in the process of building that thing. So like with Monocle, I wrote a custom sort of full text search index. And I thought a lot of the parts of what I built was interesting. And so I just wanted other people to be able to look at it and see how it works and understand it. But the goal isn't necessarily for you to be able to replicate it and run it on your own. [00:09:36]Swyx: Well, we can kind of dive into your other AIUX thoughts. As you've been diving in, you tend to share a lot on Twitter. And I just kind of took out some of your greatest hits. This is relevant to the demo that you picked out, Alessio. And what we're talking about, which is, most knowledge work is not a text generation task. That's funny, because a lot of what Notion AI is, is text generation right now. Maybe you want to elaborate a little bit. Yeah. [00:10:01]Linus: I think the first time you look at something like GPT, the shape of the thing you see is like, oh, it's a thing that takes some input text and generates some output text. And so the easiest thing to build on top of that is a content generation tool. But I think there's a couple of other categories of things that you could build that are sort of progressively more useful and more interesting. And so besides content generation, which requires the minimum amount of wrapping around ChatGPT, the second tier up from that is things around knowledge, I think. So if you have, I mean, this is the hot thing with all these vector databases things going around. But if you have a lot of existing context around some knowledge about your company or about a field or all of the internet, you can use a language model as a way to search and understand things in it and combine and synthesize them. And that synthesis, I think, is useful. And at that point, I think the value that that unlocks, I think, is much greater than the value of content generation. Because most knowledge work, the artifact that you produce isn't actually about writing more words. Most knowledge work, the goal is to understand something, synthesize new things, or propose actions or other kinds of knowledge-to-knowledge tasks. And then the third category, I think, is automation. Which I think is sort of the thing that people are looking at most actively today, at least from my vantage point in the ecosystem. Things like the React prompting technique, and just in general, letting models propose actions or write code to accomplish tasks. 
That's also moving far beyond generating text to doing something more interesting. So much of the value of what humans sit down and do at work isn't actually in the words that they write. It's all the thinking that goes on before you write those words. So how can you get language models to contribute to those parts of work? [00:11:43]Alessio: I think when you first tweeted about this, I don't know if you already accepted the job, but you tweeted about this, and then the next one was like, this is a NotionAI subtweet. [00:11:53]Swyx: So I didn't realize that. [00:11:56]Alessio: The best thing that I see is when people complain, and then they're like, okay, I'm going to go and help make the thing better. So what are some of the things that you've been thinking about? I know you talked a lot about some of the flexibility versus intuitiveness of the product. The language is really flexible, because you can say anything. And it's funny, the models never ignore you. They always respond with something. So no matter what you write, something is going to come back. Sometimes you don't know how big the space of action is, how many things you can do. So as a product builder, how do you think about the trade-offs that you're willing to take for your users? Where like, okay, I'm not going to let you be as flexible, but I'm going to create this guardrails for you. What's the process to think about the guardrails, and how you want to funnel them to the right action? [00:12:46]Linus: Yeah, I think what this trade-off you mentioned around flexibility versus intuitiveness, I think, gets at one of the core design challenges for building products on top of language models. A lot of good interface design comes from tastefully adding the right constraints in place to guide the user towards actions that you want to take. As you add more guardrails, the obvious actions become more obvious. And one common way to make an interface more intuitive is to narrow the space of choices that the users have to make, and the number of choices that they have to make. And that intuitiveness, that source of intuitiveness from adding constraints, is kind of directly at odds with the reason that language models are so powerful and interesting, which is that they're so flexible and so general, and you can ask them to do literally anything, and they will always give you something. But most of the time, the answer isn't that high quality. And so there's kind of a distribution of, like, there are clumps of things in the action space of what a language model can do that the model's good at, and there's parts of the space where it's bad at. And so one sort of high-level framework that I have for thinking about designing with language models is, there are actions that the language model's good at, and actions that it's bad at. How do you add the right constraints carefully to guide the user and the system towards the things that the language model's good at? And then at the same time, how do you use those constraints to set the user expectations for what it's going to be good at and bad at? One way to do this is just literally to add those constraints and to set expectations. So a common example I use all the time is, if you have some AI system to answer questions from a knowledge base, there are a couple of different ways to surface that in a kind of a hypothetical product. 
One is, you could have a thing that looks like a chat window in a messaging app, and then you could tell the user, hey, this is for looking things up from a database. You can ask a question, then it'll look things up and give you an answer. But if something looks like a chat, and this is a lesson that's been learned over and over for anyone building chat interfaces since, like, 2014, 15, if you have anything that looks like a chat interface or a messaging app, people are going to put some, like, weird stuff in there that just don't look like the thing that you want the model to take in, because the expectation is, hey, I can use this like a messaging app, and people will send in, like, hi, hello, you know, weird questions, weird comments. Whereas if you take that same, literally the same input box, and put it in, like, a thing that looks like a search bar with, like, a search button, people are going to treat it more like a search window. And at that point, inputs look a lot more like keywords or a list of keywords or maybe questions. So the simple act of, like, contextualizing that input in different parts of an interface reset the user's expectations, which constrain the space of things that the model has to handle. And that you're kind of adding constraints, because you're really restricting your input to mostly things that look like keyword search. But because of that constraint, you can have the model fit the expectations better. You can tune the model to perform better in those settings. And it's also less confusing and perhaps more intuitive, because the user isn't stuck with this blank page syndrome problem of, okay, here's an input. What do I actually do with it? When we initially launched Notion AI, one of my common takeaways, personally, from talking to a lot of my friends who had tried it, obviously, there were a lot of people who were getting lots of value out of using it to automate writing emails or writing marketing copy. There were a ton of people who were using it to, like, write Instagram ads and then sort of paste it into the Instagram tool. But some of my friends who had tried it and did not use it as much, a frequently cited reason was, I tried it. It was cool. It was cool for the things that Notion AI was marketed for. But for my particular use case, I had a hard time figuring out exactly the way it was useful for my workflow. And I think that gets back at the problem of, it's such a general tool that just presented with a blank prompt box, it's hard to know exactly the way it could be useful to your particular use case. [00:16:21]Alessio: What do you think is the relationship between novelty and flexibility? I feel like we're in kind of like a prompting honeymoon phase where the tools are new and then everybody just wants to do whatever they want to do. And so it's good to give these interfaces because people can explore. But if I go forward in three years, ideally, I'm not prompting anything. The UX has been built for most products to already have the intuitive, kind of like a happy path built into it. Do you think there's merit in a way? If you think about ChatGPT, if it was limited, the reason why it got so viral is people were doing things that they didn't think a computer could do, like write poems and solve riddles and all these different things. How do you think about that, especially in Notion, where Notion AI is kind of like a new product in an existing thing? How much of it for you is letting that happen and seeing how people use it? 
And then at some point be like, okay, we know what people want to do. The flexibility is not, it was cool before, but now we just want you to do the right things with the right UX. [00:17:27]Linus: I think there's value in always having the most general input as an escape hatch for people who want to take advantage of that power. At this point, Notion AI has a couple of different manifestations in the product. There's the writer. There's a thing we called an AI block, which is a thing that you can always sort of re-update as a part of document. It's like a live, a little portal inside the document that an AI can write. We also have a relatively new thing called AI autofill, which lets an AI fill an entire column in a Notion database. In all of these things, speaking of adding constraints, we have a lot of suggested prompts that we've worked on and we've curated and we think work pretty well for things like summarization and writing drafts to blog posts and things. But we always leave a fully custom prompt for a few reasons. One is if you are actually a power user and you know how language models work, you can go in and write your custom prompt and if you're a power user, you want access to the power. The other is for us to be able to discover new use cases. And so one of the lovely things about working on a product like Notion is that there's such an enthusiastic and lively kind of community of ambassadors and people that are excited about trying different things and coming up with all these templates and new use cases. And having a fully custom action or prompt whenever we launch something new in AI lets those people really experiment and help us discover new ways to take advantage of AI. I think it's good in that way. There's also a sort of complement to that, which is if we wanted to use feedback data or learn from those things and help improve the way that we are prompting the model or the models that we're building, having access to that like fully diverse, fully general range of use cases helps us make sure that our models can handle the full generality of what people want to do. [00:19:06]Swyx: I feel like we've segway’d a lot into our Notion conversation and maybe I just wanted to bridge that a little bit with your personal journey into Notion before we go into Notion proper. You spent a year kind of on a sabbatical, kind of on your own self-guided research journey and then deciding to join Notion. I think a lot of engineers out there thinking about doing this maybe don't have the internal compass that you have or don't have the guts to basically make no money for a year. Maybe just share with people how you decided to basically go on your own independent journey and what got you to join Notion in the end. [00:19:42]Linus: Yeah, what happened? Um, yeah, so for a little bit of context for people who don't know me, I was working mostly at sort of seed stage startups as a web engineer. I actually didn't really do much AI at all for prior to my year off. And then I took all of 2022 off with less of a focus on it ended up sort of in retrospect becoming like a Linus Pivots to AI year, which was like beautifully well timed. But in the beginning of the year, there was kind of a one key motivation and then one key kind of question that I had. 
The motivation was that I think I was at a sort of a privileged and fortunate enough place where I felt like I had some money saved up that I had saved up explicitly to be able to take some time off and investigate my own kind of questions because I was already working on lots of side projects and I wanted to spend more time on it. I think I also at that point felt like I had enough security in the companies and folks that I knew that if I really needed a job on a short notice, I could go and I could find some work to do. So I wouldn't be completely on the streets. And so that security, I think, gave me the confidence to say, OK, let's try this kind of experiment.[00:20:52]Maybe it'll only be for six months. Maybe it'll be for a year. I had enough money saved up to last like a year and change. And so I had planned for a year off and I had one sort of big question that I wanted to explore. Having that single question, I think, actually was really helpful for focusing the effort instead of just being like, I'm going to side project for a year, which I think would have been less productive. And that big question was, how do we evolve text interfaces forward? So, so much of knowledge work is consuming walls of text and then producing more walls of text. And text is so ubiquitous, not just in software, but just in general in the world. They're like signages and menus and books. And it's ubiquitous, but it's not very ergonomic. There's a lot of things about text interfaces that could be better. And so I wanted to explore how we could make that better. A key part of that ended up being, as I discovered, taking advantage of this new technologies that let computers make sense of text information. And so that's how I ended up sort of sliding into AI. But the motivation in the beginning was less focused on learning a new technology and more just on exploring this general question space. [00:21:53]Swyx: Yeah. You have the quote, text is the lowest denominator, not the end game. Right, right. [00:21:58]Linus: I mean, I think if you look at any specific domain or discipline, whether it's medicine or mathematics or software engineering, in any specific discipline where there's a narrower set of abstractions for people to work with, there are custom notations. One of the first things that I wrote in this exploration year was this piece called Notational Intelligence, where I talk about this idea that so much of, as a total sidebar, there's a whole other fascinating conversation that I would love to have at some point, maybe today, maybe later, about how to evolve a budding scene of research into a fully-fledged field. So I think AI UX is kind of in this weird stage where there's a group of interesting people that are interested in exploring this space of how do you design for this newfangled technology, and how do you take that and go and build best practices and powerful methods and tools [00:22:48]Swyx: We should talk about that at some point. [00:22:49]Linus: OK. But in a lot of established fields, there are notations that people use that really help them work at a slightly higher level than just raw words. So notations for describing chemicals and notations for different areas of mathematics that let people work with higher-level concepts more easily. Logic, linguistics. [00:23:07]Swyx: Yeah. 
[00:23:07]Linus: And I think it's fair to say that some large part of human intelligence, especially in these more technical domains, comes from our ability to work with notations instead of work with just the raw ideas in our heads. And text is a kind of notation. It's the most general kind of notation, but it's also, because of its generality, not super high leverage if you want to go into these specific domains. And so I wanted to try to improve on that frontier. [00:23:29]Swyx: Yeah. You said in our show notes, one of my goals over the next few years is to ensure that we end up with interface metaphors and technical conventions that set us up for the best possible timeline for creativity and inventions ahead. So part of that is constraints. But I feel like that is one part of the equation, right? What's the other part that is more engenders creativity? [00:23:47]Linus: Tell me a little bit about that and what you're thinking there. [00:23:51]Swyx: It's just, I feel like, you know, we talked a little bit about how you do want to constrain, for example, the user interface to guide people towards things that language models are good at. And creative solutions do arise out of constraints. But I feel like that alone is not sufficient for people to invent things. [00:24:10]Linus: I mean, there's a lot of directions, I think, that could go from that. The origin of that thing that you're quoting is when I decided to come help work on AI at Notion, a bunch of my friends were actually quite surprised, I think, because they had expected that I would have gone and worked… [00:24:29]Swyx: You did switch. I was eyeing that for you. [00:24:31]Linus: I mean, I worked at a lab or at my own company or something like that. But one of the core motivations for me joining an existing company and one that has lots of users already is this exact thing where in the aftermath of a new foundational technology emerging, there's kind of a period of a few years where the winners in the market get to decide what the default interface paradigm for the technology is. So, like, mini computers, personal computers, the winners of that market got to decide Windows are and how scrolling works and what a mouse cursor is and how text is edited. Similar with mobile, the concept of a home screen and apps and things like that, the winners of the market got to decide. And that has profound, like, I think it's difficult to understate the importance of, in those few critical years, the winning companies in the market choosing the right abstractions and the right metaphors. And AI, to me, seemed like it's at that pivotal moment where it's a technology that lots of companies are adopting. There is this well-recognized need for interface best practices. And Notion seemed like a company that had this interesting balance of it could still move quickly enough and ship and prototype quickly enough to try interesting interface ideas. But it also had enough presence in the ecosystem that if we came up with the right solution or one that we felt was right, we could push it out and learn from real users and iterate and hopefully be a part of that story of setting the defaults and setting what the dominant patterns are. [00:26:07]Swyx: Yeah, it's a special opportunity. One of my favorite stories or facts is it was like a team of 10 people that designed the original iPhone. 
And so all the UX that was created there is essentially what we use as smartphones today, including predictive text, because people were finding that people were kind of missing the right letters. So they just enhanced the hit area for certain letters based on what you're typing. [00:26:28]Linus: I mean, even just the idea of like, we should use QWERTY keyboards on tiny smartphone screens. Like that's a weird idea, right? [00:26:36]Swyx: Yeah, QWERTY is another one. So I have RSI. So this actually affects me. QWERTY was specifically chosen to maximize travel distance, right? Like it's actually not ergonomic by design because you wanted the keyboard, the key type writers to not stick. But we don't have that anymore. We're still sticking to QWERTY. I'm still sticking to QWERTY. I could switch to the other ones. I forget. QORAC or QOMAC anytime, but I don't just because of inertia. I have another thing like this. [00:27:02]Linus: So going even farther back, people don't really think enough about where this concept of buttons come from, right? So the concept of a push button as a thing where you press it and it activates some binary switch. I mean, buttons have existed for, like mechanical buttons have existed for a long time. But really, like this modern concept of a button that activates a binary switch really gets like popularized by the popular advent of electricity. Before the electricity, if you had a button that did something, you would have to construct a mechanical system where if you press down on a thing, it affects some other lever system that affects as like the final action. And this modern idea of a button that is just a binary switch gets popularized electricity. And at that point, a button has to work in the way that it does in like an alarm clock, because when you press down on it, there's like a spring that makes sure that the button comes back up and that it completes the circuit. And so that's the way the button works. And then when we started writing graphical interfaces, we just took that idea of a thing that could be depressed to activate a switch. All the modern buttons that we have today in software interfaces are like simulating electronic push buttons where you like press down to complete a circuit, except there's actually no circuit being completed. It's just like a square on a screen. [00:28:11]Swyx: It's all virtualized. Right. [00:28:12]Linus: And then you control the simulation of a button by clicking a physical button on a mouse. Except if you're on a trackpad, it's not even a physical button anymore. It's like a simulated button hardware that controls a simulated button in software. And it's also just this cascade of like conceptual backwards compatibility that gets us here. I think buttons are interesting. [00:28:32]Alessio: Where are you on the skeuomorphic design love-hate spectrum? There's people that have like high nostalgia for like the original, you know, the YouTube icon on the iPhone with like the knobs on the TV. [00:28:42]Linus: I think a big part of that is at least the aesthetic part of it is fashion. Like fashion taken very literally, like in the same way that like the like early like Y2K 90s aesthetic comes and goes. I think skeuomorphism as expressed in like the early iPhone or like Windows XP comes and goes. 
There's another aspect of this, which is the part of skeuomorphism that helps people understand and intuit software, which has less to do with skeuomorphism making things easier to understand per se and more about like, like a slightly more general version of skeuomorphism is like, there should be a consistent mental model behind an interface that is easy to grok. And then once the user has the mental model, even if it's not the full model of exactly how that system works, there should be a simplified model that the user can easily understand and then sort of like adopt and use. One of my favorite examples of this is how volume controls that are designed well often work. Like on an iPhone, when you make your iPhone volume twice as loud, the sound that comes out isn't actually like at a physical level twice as loud. It's on a log scale. When you push the volume slider up on an iPhone, the speaker uses like four times more energy, but humans perceive it as twice as loud. And so the mental model that we're working with is, okay, if I make this, this volume control slider have two times more value, it's going to sound two times louder, even though actually the underlying physics is like on a log scale. But what actually happens physically is not actually what matters. What matters is how humans perceive it in the model that I have in my head. And there, I think there are a lot of other instances where the skeuomorphism isn't actually the thing. The thing is just that there should be a consistent mental model. And often the easy, consistent mental model to reach for is the models that already exist in reality, but not always. [00:30:23]Alessio: I think the other big topic, maybe before we dive into Notion is agents. I think that's one of the toughest interfaces to crack, mostly because, you know, the text box, everybody understands that the agent is kind of like, it's like human-like feeling, you know, where it's like, okay, I'm kind of delegating something to a human, right? I think, like, Sean, you made the example of like a Calendly, like a savvy Cal, it's like an agent, because it's scheduling on your behalf for something. [00:30:51]Linus: That's actually a really interesting example, because it's a kind of a, it's a pretty deterministic, like there's no real AI to it, but it is agent in the sense that you're like delegating it and automate something. [00:31:01]Swyx: Yeah, it does work without me. It's great. [00:31:03]Alessio: So that one, we figured out. Like, we know what the scheduling interface is like. [00:31:07]Swyx: Well, that's the state of the art now. But, you know, for example, the person I'm corresponding with still has to pick a time from my calendar, which some people dislike. Sam Lesson famously says it's a sign of disrespect. I disagree with him, but, you know, it's a point of view. There could be some intermediate AI agents that would send emails back and forth like a human person to give the other person who feels slighted that sense of respect or a personalized touch that they want. So there's always ways to push it. [00:31:39]Alessio: Yeah, I think for me, you know, other stuff that I think about, so we were doing prep for another episode and had an agent and asked it to do like a, you know, background prep on like the background of the person. And it just couldn't quite get the format that I wanted it to be, you know, but I kept to have the only way to prompt that it's like, give it text, give a text example, give a text example. 
What do you think, like the interface between human and agents in the future will be like, do you still think agents are like this open ended thing that are like objective driven where you say, Hey, this is what I want to achieve versus I only trust this agent to do X. And like, this is how X is done. I'm curious because that kind of seems like a lot of mental overhead, you know, to remember each agent for each task versus like if you have an executive assistant, like they'll do a random set of tasks and you can trust them because they're a human. But I feel like with agents, we're not quite there. [00:32:36]Swyx: Agents are hard. [00:32:36]Linus: The design space is just so vast. Since all of the like early agent stuff came out around auto GPT, I've tried to develop some kind of a thesis around it. And I think it's just difficult because there's so many variables. One framework that I usually apply to sort of like existing chat based prompting kind of things that I think also applies just as well to agents is this duality between what you might call like trust and control. So you just now you brought up this example of you had an agent try to write some write up some prep document for an episode and it couldn't quite get the format right. And one way you could describe that is you could say, Oh, the, the agent didn't exactly do what I meant and what I had in my head. So I can't trust it to do the right job. But a different way to describe it is I have a hard time controlling exactly the output of the model and I have a hard time communicating exactly what's in my head to the model. And they're kind of two sides of the same coin. I think if you, if you can somehow provide a way to with less effort, communicate and control and constrain the model output a little bit more and constrain the behavior a little bit more, I think that would alleviate the pressure for the model to be this like fully trusted thing because there's no need for trust anymore. There's just kind of guardrails that ensure that the model does the right thing. So developing ways and interfaces for these agents to be a little more constrained in its output or maybe for the human to control its output a little bit more or behavior a little bit more, I think is a productive path. Another sort of more, more recent revelation that I had while working on this and autofill thing inside notion is the importance of zones of influence for AI agents, especially in collaborative settings. So having worked on lots of interfaces for independent work on my year off, one of the surprising lessons that I learned early on when I joined notion was that if you build a collaboration permeates everything, which is great for notion because collaborating with an AI, you reuse a lot of the same metaphors for collaborating with humans. So one nice thing about this autofill thing that also kind of applies to AI blocks, which is another thing that we have, is that you don't alleviate this problem of having to ask questions like, oh, is this document written by an AI or is this written by a human? Like this need for auditability, because the part that's written by the AI is just in like the autofilled cell or in the AI block. And you can, you can tell that's written by the AI and things outside of it, you can kind of reasonably assume that it was written by you. 
I think anytime you have sort of an unbounded action space for, for models like agents, it's especially important to be able to answer those questions easily and to have some sense of security that in the same way that you want to know whether your like coworker or collaborator has access to a document or has modified a document, you want to know whether an AI has permissions to access something. And if it's modified something or made some edit, you want to know that it did it. And so as a compliment to constraining the model's action space proactively, I think it's also important to communicate, have the user have an easy understanding of like, what exactly did the model do here? And I think that helps build trust as well. [00:35:39]Swyx: Yeah. I think for auto GPT and those kinds of agents in particular, anything that is destructive, you need to prompt for, I guess, or like check with, check in with the user. I know it's overloaded now. I can't say that. You have to confirm with the user. You confirm to the user. Yeah, exactly. Yeah. Yeah. [00:35:56]Linus: That's tough too though, because you, you don't want to stop. [00:35:59]Swyx: Yeah. [00:35:59]Linus: One of the, one of the benefits of automating these things that you can sort of like, in theory, you can scale them out arbitrarily. I can have like a hundred different agents working for me, but if that means I'm just spending my entire day in a deluge of notifications, that's not ideal either. [00:36:12]Swyx: Yeah. So then it could be like a reversible, destructive thing with some kind of timeouts, a time limit. So you could reverse it within some window. I don't know. Yeah. I've been thinking about this a little bit because I've been working on a small developer agent. Right. Right. [00:36:27]Linus: Or maybe you could like batch a group of changes and can sort of like summarize them with another AI and improve them in bulk or something. [00:36:33]Swyx: Which is surprisingly similar to the collaboration problem. Yeah. Yeah. Yeah. Exactly. Yeah. [00:36:39]Linus: I'm telling you, the collaboration, a lot of the problems with collaborating with humans also apply to collaborating with AI. There's a potential pitfall to that as well, which is that there are a lot of things that some of the core advantages of AI end up missing out on if you just fully anthropomorphize them into like human-like collaborators. [00:36:56]Swyx: But yeah. Do you have a strong opinion on that? Like, do you refer to it as it? Oh yeah. [00:37:00]Linus: I'm an it person, at least for now, in 2023. Yeah. [00:37:05]Swyx: So that leads us nicely into introducing what Notion and Notion AI is today. Do you have a pet answer as to what is Notion? I've heard it introduced as a database, a WordPress killer, a knowledge base, a collaboration tool. What is it? Yeah. [00:37:19]Linus: I mean, the official answer is that a Notion is a connected workspace. It has a space for your company docs, meeting notes, a wiki for all of your company notes. You can also use it to orchestrate your workflows if you're managing a project, if you have an engineering team, if you have a sales team. You can put all of those in a single Notion database. And the benefit of Notion is that all of them live in a single space where you can link to your wiki pages from your, I don't know, like onboarding docs. Or you can link to a GitHub issue through a task from your documentation on your engineering system. 
And all of this existing in a single place in this kind of like unified, yeah, like single workspace, I think has lots of benefits. [00:37:58]Swyx: That's the official line. [00:37:59]Linus: There's an asterisk that I usually enjoy diving deeper into, which is that the whole reason that this connected workspace is possible is because underlying all of this is this really cool abstraction of blocks. In Notion, everything is a block. A paragraph is a block. A bullet point is a block. But also a page is a block. And the way that Notion databases work is that a database is just a collection of pages, which are really blocks. And you can like take a paragraph and drag it into a database and it'll become a page. You can take a page inside a database and pull it out and it'll just become a link to that page. And so this core abstraction of a block that can also be a page, that can also be a row in a database, like an Excel sheet, that fluidity and this like shared abstraction across all these different areas inside Notion, I think is what really makes Notion powerful. This Lego theme, this like Lego building block theme permeates a lot of different parts of Notion. Some fans of Notion might know that when you, or when you join Notion, you get a little Lego minifigure, which has Lego building blocks for workflows. And then every year you're at Notion, you get a new block that says like you've been here for a year, you've been here for two years. And then Simon, our co-founder and CTO, has a whole crate of Lego blocks on his desk that he just likes to mess with because, you know, he's been around for a long time. But this Lego building block thing, this like shared sort of all-encompassing single abstraction that you can combine to build various different kinds of workflows, I think is really what makes Notion powerful. And one of the sort of background questions that I have for Notion AI is like, what is that kind of building block for AI? [00:39:30]Swyx: Well, we can dive into that. So what is Notion AI? Like, so I kind of view it as like a startup within the startup. Could you describe the Notion AI team? Is this like, how seriously is Notion taking the AI wave? [00:39:43]Linus: The most seriously? The way that Notion AI came about, as I understand it, because I joined a bit later, I think it was around October last year, all of Notion team had a little offsite. And as a part of that, Ivan and Simon kind of went into a little kind of hack weekend. And the thing that they ended up hacking on inside Notion was the very, very early prototype of Notion AI. They saw this GPT-3 thing. The early, early motivation for starting Notion, building Notion in the first place for them, was sort of grounded in this utopian end-user programming vision where software is so powerful, but there are only so many people in the world that can write programs. But everyone can benefit from having a little workspace or a little program or a little workflow tool that's programmed to just fit their use case. And so how can we build a tool that lets people customize their software tools that they use every day for their use case? And I think to them, seemed like such a critical part of facilitating that, bridging the gap between people who can code and people who need software. And so they saw that, they tried to build an initial prototype that ended up becoming the first version of Notion AI. 
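A rough way to picture the "everything is a block" abstraction described above, as a toy data model rather than Notion's actual schema:

```python
# Illustrative sketch only: everything is a block, a page is a block whose
# children are blocks, and a database is a block whose children are pages.
from dataclasses import dataclass, field

@dataclass
class Block:
    type: str                                        # "paragraph", "page", "database", ...
    content: str = ""                                # text for leaf blocks, title for pages
    properties: dict = field(default_factory=dict)   # database columns live here
    children: list["Block"] = field(default_factory=list)

def paragraph_to_page(paragraph: Block) -> Block:
    # Dragging a paragraph into a database turns it into a page (a database row).
    return Block(type="page", content=paragraph.content, children=[paragraph])

tasks = Block(type="database", content="Tasks")
note = Block(type="paragraph", content="Ship AI autofill")
tasks.children.append(paragraph_to_page(note))

# Pulling a page out of a database leaves a link behind, which is itself a block.
link = Block(type="link_to_page", content=tasks.children[0].content)
```

The point of the sketch is the fluidity: the same recursive structure serves as paragraph, page, and row, which is what makes the "Lego block" composition possible.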
They had a prototype in, I think, late October, early November, before Chachapiti came out and sort of evolved it over the few months. But what ended up launching was sort of in line with the initial vision, I think, of what they ended up building. And then once they had it, I think they wanted to keep pushing it. And so at this point, AI is a really key part of Notion strategy. And what we see Notion becoming going forward, in the same way that blocks and databases are a core part of Notion that helps enable workflow automation and all these important parts of running a team or collaborating with people or running your life, we think that AI is going to become an equally critical part of what Notion is. And it won't be, Notion is a cool connected workspace app, and it also has AI. It'll be that what Notion is, is databases, it has pages, it has space for your docs, and it also has this sort of comprehensive suite of AI tools that permeate everything. And one of the challenges of the AI team, which is, as you said, kind of a startup within a startup right now, is to figure out exactly what that all-permeating kind of abstraction means, which is a fascinating and difficult open problem. [00:41:57]Alessio: How do you think about what people expect of Notion versus what you want to build in Notion? A lot of this AI technology kind of changes, you know, we talked about the relationship between text and human and how human collaborates. Do you put any constraints on yourself when it's like, okay, people expect Notion to work this way with these blocks. So maybe I have this crazy idea and I cannot really pursue it because it's there. I think it's a classic innovator's dilemma kind of thing. And I think a lot of founders out there that are in a similar position where it's like, you know, series C, series D company, it's like, you're not quite yet the super established one, you're still moving forward, but you have an existing kind of following and something that Notion stands for. How do you kind of wrangle with that? [00:42:43]Linus: Yeah, that is in some ways a challenge and that Notion already is a kind of a thing. And so we can't just scrap everything and start over. But I think it's also, there's a blessing side of it too, in that because there are so many people using Notion in so many different ways, we understand all of the things that people want to use Notion for very well. And then so we already have a really well-defined space of problems that we want to help people solve. And that helps us. We have it with the existing Notion product and we also have it by sort of rolling out these AI things early and then watching, learning from the community what people want to do [00:43:17]Swyx: with them. [00:43:17]Linus: And so based on those learnings, I think it actually sort of helps us constrain the space of things we think we need to build because otherwise the design space is just so large with whatever we can do with AI and knowledge work. And so watching what people have been using Notion for and what they want to use Notion for, I think helps us constrain that space a little bit and make the problem of building AI things inside Notion a little more tractable. 
[00:43:36]Swyx: I think also just observing what they naturally use things for, and it sounds like you do a bunch of user interviews where you hear people running into issues and, or describe them as, the way that I describe myself actually is, I feel like the problem is with me, that I'm not creative enough to come up with use cases to use Notion AI or any other AI. [00:43:57]Linus: Which isn't necessarily on you, right? [00:43:59]Swyx: Exactly. [00:43:59]Linus: Again, like it goes way back to the early, the thing we touched on early in the conversation around like, if you have too much generality, there's not enough, there are not enough guardrails to obviously point to use cases. Blank piece of paper. [00:44:10]Swyx: I don't know what to do with this. So I think a lot of people judge Notion AI based on what they originally saw, which is write me a blog post or do a summary or do action items. Which, fun fact, for latent space, my very, very first Hacker News hit was reverse engineering Notion AI. I actually don't know if I got it exactly right. I think I got the easy ones right. And then apparently I got the action items one really wrong. So there's some art into doing that. But also you've since launched a bunch of other products and maybe you've already hinted at AI Autofill. Maybe we can just talk a little bit about what does the scope or suite of Notion AI products have been so far and what you're launching this week? Yeah. [00:44:53]Linus: So we have, I think, three main facets of Notion AI and Notion at the moment. We have sort of the first thing that ever launched with Notion AI, which I think that helps you write. It's, going back to earlier in the conversation, it's kind of a writing, kind of a content generation tool. If you have a document and you want to generate a summary, it helps you generate a summary, pull out action items, you can draft a blog post, you can help it improve, it's helped to improve your writings, it can help fix grammar and spelling mistakes. But under the hood, it's a fairly lightweight, a thick layer of prompts. But otherwise, it's a pretty straightforward use case of language models, right? And so there's that, a tool that helps you write documents. There's a thing called an AI block, which is a slightly more constrained version of that where one common way that we use it inside Notion is we take all of our meeting notes inside Notion. And frequently when you have a meeting and you want other people to be able to go back to it and reference it, it's nice to have a summary of that meeting. So all of our meeting notes templates, at least on the AI team, have an AI block at the top that automatically summarizes the contents of that page. And so whenever we're done with a meeting, we just press a button and it'll re-summarize that, including things like what are the core action items for every person in the meeting. And so that block, as I said before, is nice because it's a constrained space for the AI to work in, and we don't have to prompt it every single time. And then the newest member of this AI collection of features is AI autofill, which brings Notion AI to databases. So if you have a whole database of user interviews and you want to pull out what are the companies, core pain points, what are their core features, maybe what are their competitor products they use, you can just make columns. 
And in the same way that you write Excel formulas, you can write a little AI formula, basically, where the AI will look at the contents of the page and pull out each of these key pieces of information. The slightly new thing that autofill introduces is this idea of a more automated background [00:46:43]Swyx: AI thing. [00:46:44]Linus: So with Writer, the AI in your document product and the AI block, you have to always ask it to update. You have to always ask it to rewrite. But if you have a column in a database, in a Notion database, or a property in a Notion database, it would be nice if you, whenever someone went back and changed the contents of the meeting node or something updated about the page, or maybe it's a list of tasks that you have to do and the status of the task changes, you might want the summary of that task or detail of the task to update. And so anytime that you can set up an autofilled Notion property so that anytime something on that database row or page changes, the AI will go back and sort of auto-update the autofilled value. And that, I think, is a really interesting part that we might continue leading into of like, even though there's AI now tied to this particular page, it's sort of doing its own thing in the background to help automate and alleviate some of that pain of automating these things. But yeah, Writer, Blocks, and Autofill are the three sort of cornerstones we have today. [00:47:42]Alessio: You know, there used to be this glorious time where like, Roam Research was like the hottest knowledge company out there, and then Notion built Backlinks. I don't know if we are to blame for that. No, no, but how do Backlinks play into some of this? You know, I think most AI use cases today are kind of like a single page, right? Kind of like this document. I'm helping with this. Do you see some of these tools expanding to do changes across things? So we just had Itamar from Codium on the podcast, and he talked about how agents can tie in specs for features, tests for features, and the code for the feature. So like the three entities are tied together. Like, do you see some Backlinks help AI navigate through knowledge basis of companies where like, you might have the document the product uses, but you also have the document that marketing uses to then announce it? And as you make changes, the AI can work through different pieces of it? [00:48:41]Swyx: Definitely. [00:48:41]Linus: If I may get a little theoretical from that. One of my favorite ideas from my last year of hacking around building text augmentations with AI for documents is this realization that, you know, when you look at code in a code editor, what it is at a very lowest level is just text files. A code file is a text file, and there are maybe functions inside of it, and it's a list of functions, but it's a text file. But the way that you understand it is not as a file, like a Word document, it's a kind of a graph.[00:49:10]Linus: Like you have a function, you have call sites to that function, there are places where you call that function, there's a place where that function is tested, many different definitions for that function. Maybe there's a type definition that's tied to that function. So it's a kind of a graph. And if you want to understand that function, there's advantages to be able to traverse that whole graph and fully contextualize where that function is used. Same with types and same with variables. And so even though its code is represented as text files, it's actually kind of a graph. 
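A minimal sketch of the AI autofill loop described above: for each row whose source page has changed, a model extracts the configured columns from the page contents. The column names, prompt, JSON-mode call, and model are illustrative assumptions, not Notion's implementation.

```python
# Sketch of an "AI autofill" pass over a database of pages.
import json
from openai import OpenAI

client = OpenAI()

AUTOFILL_COLUMNS = {
    "pain_points": "the interviewee's main pain points, as a short phrase",
    "competitors": "competitor products mentioned, comma-separated",
}

def autofill_row(page_text: str) -> dict:
    prompt = (
        "Read the page and return JSON with exactly these keys:\n"
        + "\n".join(f"- {k}: {desc}" for k, desc in AUTOFILL_COLUMNS.items())
        + f"\n\nPage:\n{page_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def refresh(database: list[dict]) -> None:
    # In the real feature this runs in the background whenever a page changes;
    # here we just recompute rows flagged as stale.
    for row in database:
        if row.get("stale"):
            row.update(autofill_row(row["page_text"]))
            row["stale"] = False
```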
And a lot of the, of what, all of the key interfaces, interface innovations behind IDEs is helping surface that graph structure in the context of a text file. So like things like go to definition or VS Code's little window view when you like look at references. And interesting idea that I explored last year was what if you bring that to text documents? So text documents are a little more unstructured, so there's a less, there's a more fuzzy kind of graph idea. But if you're reading a textbook, if there's a new term, there's actually other places where the term is mentioned. There's probably a few places where that's defined. Maybe there's some figures that reference that term. If you have an idea, there are other parts of the document where the document might disagree with that idea or cite that idea. So there's still kind of a graph structure. It's a little more fuzzy, but there's a graph structure that ties together like a body of knowledge. And it would be cool if you had some kind of a text editor or some kind of knowledge tool that let you explore that whole graph. Or maybe if an AI could explore that whole graph. And so back to your point, I think taking advantage of not just the backlinks. Backlinks is a part of it. But the fact that all of these inside Notion, all of these pages exist in a single workspace and it's a shared context. It's a connected workspace. And you can take any idea and look up anywhere to fully contextualize what a part of your engineering system design means. Or what we know about our pitching their customer at a company. Or if I wrote down a book, what are other places where that book has been mentioned? All these graph following things, I think, are really important for contextualizing knowledge. [00:51:02]Swyx: Part of your job at Notion is prompt engineering. You are maybe one of the more advanced prompt engineers that I know out there. And you've always commented on the state of prompt ops tooling. What is your process today? What do you wish for? There's a lot here. [00:51:19]Linus: I mean, the prompts that are inside Notion right now, they're not complex in the sense that agent prompts are complex. But they're complex in the sense that there is even a problem as simple as summarize a [00:51:31]Swyx: page. [00:51:31]Linus: A page could contain anything from no information, if it's a fresh document, to a fully fledged news article. Maybe it's a meeting note. Maybe it's a bug filed by somebody at a company. The range of possible documents is huge. And then you have to distill all of it down to always generate a summary. And so describing that task to AI comprehensively is pretty hard. There are a few things that I think I ended up leaning on, as a team we ended up leaning on, for the prompt engineering part of it. I think one of the early transitions that we made was that the initial prototype for Notion AI was built on instruction following, the sort of classic instruction following models, TextWG003, and so on. And then at some point, we all switched to chat-based models, like Claude and the new ChatGPT Turbo and these models. And so that was an interesting transition. It actually kind of made few-shot prompting a little bit easier, I think, in that you could give the few-shot examples as sort of previous turns in a conversation. And then you could ask the real question as the next follow-up turn. 
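As a rough sketch of the few-shot-as-chat-turns pattern Linus describes (the model name, prompt wording, and examples here are placeholders, not Notion's actual prompts): each example becomes a fabricated user/assistant exchange, and the real document goes in as the final user turn.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = "Summarize the user's document in one sentence, in the document's own language."

# Few-shot examples expressed as prior conversation turns.
EXAMPLES = [
    ("Meeting notes: we agreed to ship the beta on Friday and Dana owns QA.",
     "The team agreed to ship the beta on Friday, with Dana owning QA."),
    ("Compte rendu : le lancement est repoussé d'une semaine.",
     "Le lancement est repoussé d'une semaine."),
]

def summarize(document: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for doc, summary in EXAMPLES:
        messages.append({"role": "user", "content": doc})
        messages.append({"role": "assistant", "content": summary})
    # The real question goes in as the next follow-up turn.
    messages.append({"role": "user", "content": document})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content

print(summarize("Bug report: clicking Export on Safari downloads an empty file."))
```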
I've come to appreciate few-shot prompting a lot more because it's difficult to fully and comprehensively explain a particular task in words, but it's pretty easy to demonstrate like four or five different edge cases that you want the model to handle. And a lot of times, if there's an edge case that you want a model to handle, I think few-shot prompting is just the easiest, most reliable tool to reach for. One challenge in prompt engineering that Notion has to contend with often is we want to support all the different languages that Notion supports. And so all of our prompts have to be multilingual-compatible, which is kind of tricky because our prompts, our instructions, are written in English. And so if you just have a naive approach, then the model tends to output in English, even when the document that you want to translate or summarize is in French. And so one way you could try to attack that problem is to tell the model to answer in the language of the user's query. But it's actually a lot more effective to just give it examples of not just English documents, but maybe summarizing an English document, summarizing a ticket filed in French, summarizing an empty document where the document's supposed to be in Korean. And so a lot of the few-shot examples included in our Notion AI prompts tend to be very multilingual, and that helps support our non-English-speaking users. The other big part of prompt engineering is evaluation. The prompts that you exfiltrated out of Notion AI many weeks ago were surprisingly pretty spot-on, at least for the prompts that we had then, especially things like summary. But they're also outdated because we've evolved them a lot more, and we have a lot more examples. And some of our prompts are just really, really long. They're like thousands of tokens long. And so every time we go back and add an example or modify the instruction, we want to make sure that we don't regress any of the previous use cases that we've supported. And so we put a lot of effort, and we're increasingly building out internal tooling infrastructure, for things like what you might call unit tests and regression tests for prompts, with handwritten test cases, as well as tests that are driven more by feedback from Notion users that have chosen to share their feedback with us. [00:54:31]Swyx: You just have a hand-rolled testing framework, or use Jest or whatever, and nothing custom out there? You basically said you've looked at so many prompt ops tools and you're sold on none of them. [00:54:42]Linus: So that tweet was from a while ago. I think there are a couple of interesting tools these days. But I think at the moment, Notion uses pretty hand-rolled tools. Nothing too heavy, but it's basically a for loop over a list of test cases. We do do quite a bit of using language models to evaluate language models. So our unit test descriptions are kind of funny, because the test is literally just an input document and a query, and then we expect the model to say something. And then our qualification for whether that test passes or not is to just ask the language model again whether it looks like a reasonable summary or whether it's in the right language. [00:55:19]Swyx: Do you use the same model? Do you have Anthropic critiquing OpenAI, or OpenAI critiquing Anthropic? That's a good question. Do you worry about models being biased towards themselves? [00:55:29]Linus: Oh, no, that's not a worry that we have. I actually don't know exactly if we use different models. 
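Notion's internal tooling isn't public, but the shape Linus describes is roughly a for loop over handwritten cases where the pass check is itself a model call. A minimal sketch under those assumptions (model names and prompts are placeholders; the judge can be a pricier model, which also ties into the next point):

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Thin wrapper; swap in whatever provider and model you actually use.
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

# Handwritten regression cases: an input document, the feature's query,
# and a judge question describing what a passing output looks like.
CASES = [
    {"doc": "Compte rendu : le lancement est repoussé d'une semaine.",
     "query": "Summarize this page.",
     "judge": "Is this a reasonable one-sentence summary written in French? Answer YES or NO."},
    {"doc": "",
     "query": "Summarize this page.",
     "judge": "Does this note that the page is empty without inventing content? Answer YES or NO."},
]

passed = 0
for case in CASES:
    output = llm(f"{case['query']}\n\n{case['doc']}")
    verdict = llm(f"{case['judge']}\n\nOutput to grade:\n{output}", model="gpt-4o")  # pricier judge
    if verdict.strip().upper().startswith("YES"):
        passed += 1
    else:
        print("FAIL:", case["query"], "->", output[:80])
print(f"{passed}/{len(CASES)} cases passed")
```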
If you have a fixed budget for running these tests, I think it would make sense to use more expensive models for evaluation rather than generation. But yeah, I don't remember exactly what we do there. [00:55:44]Swyx: And then one more follow-up on, you mentioned some of your prompts are thousands of tokens. That takes away from my budget as a user. Isn't that a trade-off that's a concern? So there's a limited context window, right? Some of that is taken by you as the app designer, product designer, deciding what system prompt to provide. And then the remainder is what I as a user can give you to actually summarize as my content. In theory. [00:56:10]Linus: I think in practice there are a couple of trends that make that an issue. So for things like generating summaries, a summary is only going to be so many tokens long. If our prompts are generating you 3,000 token summaries, the prompt is not doing its job anyway. [00:56:25]Swyx: Yeah, but the source doc is. [00:56:27]Linus: The source doc could be longer. So if you wanted to translate a 5,000 token document, you do have to truncate it. And there is a limitation. It's not something that we are super focused on at the moment for a couple of reasons. I think there are techniques that, if we need to, help us compress those prompts. Things like parameter-efficient fine-tuning. And also the context lengths. It seems like the dominant trend is that context lengths are getting cheaper and longer constantly. Anthropic recently announced their 100,000 token context model recently. And so I think in the longer term that's going to be taken care of anyway by the models becoming more accommodating of longer contexts. And it's more of a temporary limitation. Cool. [00:57:04]Swyx: Shall we talk about the professionalizing of a scene? [00:57:07]Linus: Yeah, I think one of the things that is a helpful bit of context when thinking about HCI and AI in particular is, historically, HCI and AI have been sort of competing disciplines. Competing very specifically in the sense that they often fought for the same sources of funding and the same kinds of people and attention throughout the history of computer science. HCI and AI both used to come from the same or very aligned, similar, parallel motivations of, we have computers. How do we make computers work better with humans? And one way to do it was to make the machine smarter. Another way to do it was to design better interfaces. And through the AI booms and busts, when the AI boom was happening, HCI would get less funding. And when AIs had winters, HCI would get a lot more attention because it was sort of the alternative solution. And now that we have this sort of renewed attention on how to build better interfaces for AI, I think it's interesting that it's kind of a scene now. There are podcasts like this where I get to talk about interfaces and AI. But it's definitely not a fully-fledged field. My favorite definition of sort of what distinguishes the two apart comes from Andy Matuszak, where he, I'm going to butcher the quote, but he said something to the effect of, a field has at their disposal a powerful set of established tools and methods and standards and a shared set of core questions they want to answer. 
And so if you look at machine learning, which is obviously a really dominant established field, if you want to answer, if you want to evaluate a model, if you want to answer, if you want to solve a particular task or build a model that solves a particular task, there are powerful methods that we have, like gradient descent and specific benchmarks, for building solutions and then re-evaluating how to do the solutions. Or if you have an even more expensive problem, there are surely attempts that have been made before and then attempts that people are making now for how to attack that problem and frameworks to think about these things. In AI and UX, I think, we're very early in the evolution of that space and that community, and there's a lot of people excited, a lot of people building, but we have yet to come up with a set of best practices and tools and methods and frameworks for thinking about these things. And those will surely arise, and as they do, I think we'll see the evolution of the field. In prompt engineering and using language models in products at large, I think that community is a little farther along. It's still very fast moving because it's really young, but there are established prompting techniques like React and distillation of larger instruction following models. And these techniques, I think, are the beginnings of best practices and powerful tools at the disposal of this language model using field. [00:59:43]Swyx: Yeah, and mostly it's just following Riley Goodside. It's how I learn about prompting techniques. Right, right. Yeah, pioneers. But yeah, I am actually interested in this. We've recently kind of rebranded the podcast or the newsletter somewhat in towards being for this term AI engineer, which I kind of view as somewhere between machine learning researcher and software engineer, some kind of in-between mix. And I think creating the media, creating meetups, creating a de facto conference for it, creating job titles, and then I think that core set of questions that everyone wants to get better at, I think that is essentially how this starts. Yeah, yeah. Pretty excited of. [01:00:25]Linus: Creating a space for the people that are interested to come together, I think, is a really, really key important part of it. I'm always, whenever I come back to it, I'm always amazed by how if you look at the sort of golden era of theoretical physics in the early 20th century, or the golden era of early personal computing, there are maybe like two dozen people that have contributed all of the significant ideas to that field. They all kind of know each other. I always found that really fascinating. And I think the causal relationship actually goes the other way. It's not that all those people happen to know each other. It's that because there was that core set of people that always, that were very close to each other and shared ideas often, and they were co-located, that that field is able to blossom. And so I think creating that space is really critical. [01:01:08]Swyx: Yeah, there's a very famous photo of the Solvay conference in 1927, where Albert Einstein, Niels Bohr, Marie Curie, all these top physics names. And how many Nobel laureates are in the photo, right? Yeah, and when I tweeted it out once, people were like, I didn't know these all lived together, and they all knew each other, and they must have exchanged so many ideas. [01:01:28]Linus: I mean, similar with artists and writers that help a new kind of period blossom. 
[01:01:34]Swyx: Now, is it going to be San Francisco or New York, though? [01:01:36]Alessio: That's a spicy question. [01:01:39]Swyx: I don't know, we'll see. Well, we're glad to at least be a part of your world, whether it is on either coast. But it's also virtual, right? Like, we have a Discord, it's happening online as well, even if you're in a small town in Indiana. [01:01:54]Swyx: Cool, lightning round? Awesome, yeah, let's do it. [01:01:59]Alessio: We only got three questions for you. One is acceleration, one exploration, then a final takeaway. So the first one we always like to ask is, what is something that happened in AI that you thought would take much longer than it has? [01:02:13]Swyx: Price is coming down. [01:02:14]Linus: Price is coming down, and/or being able to get a lot more bang for your buck. So things like GPT-3.5 Turbo being, I don't know exactly the figure, like 10 times, 20 times cheaper. [01:02:25]Swyx: And then having GPT, than DaVinci-003. [01:02:27]Linus: Than DaVinci-003 per token, or the super long context Claude, or MPT StoryWriter, these long context models that theoretically would take a lot of compute to run, but they're sort of accessible to us now. I think they're surprising because I would have thought, before these things came out, that cost per token and scaling context length were sort of core constraints that you would have to design your AI systems around. And it ends up being like, if you just wait a few months, OpenAI will figure out how to make these models 10 times cheaper. Or Anthropic will figure out how to make the models able to take a million tokens. And the speed at which that's happened has been surprising and a little bit frightening, because it invalidates a lot of the assumptions that I was operating with, and I have to recalibrate. [01:03:11]Swyx: Yeah, there's this very famous law called Wirth's Law, also known as Gates's Law, that basically says software engineers will take up whatever hardware engineers give them. And I feel like there's a parallel law right now where AI UX people are going to take up all the improvements that the language model people give them. So, you know, while the language model people are improving the costs by a single order of magnitude, you, with your Notion AI autofill, are increasing by orders of magnitude the amount of consumption that's being used. [01:03:39]Linus: Yeah, exactly. Before the show started, we were just talking about how when I was prototyping autofill, just to make sure that things sort of scaled up okay, I ended up running autofill on a database with like 6,000 pages, just doing summaries. And usually these are fairly long pages. I ended up running through something like two or three million tokens in a matter of like 20 minutes. [01:03:58]Swyx: Yeah. [01:03:58]Linus: Which is not too expensive, luckily, because the models are getting cheaper. It's going to be fine. But it is like $5 or $6, which, the concept of running a test on my computer and it spending the price of a nice coffee is kind of a weird thing still that I'm getting used to. [01:04:13]Swyx: And Notion AI currently is $10 a month, something like that. So there's ways to make Notion lose money. [01:04:20]Alessio: You just get negative gross margins on that test. [01:04:24]Linus: Not sanctioned by Notion. 
I mean, obviously, you should use it to, you know, improve your life and support your workflows in whatever ways are useful. [01:04:33]Swyx: Okay, second question is about exploration. What do you think is the most interesting unsolved question in AI? [01:04:39]Linus: Predictability, reliability. Well, in AI broadly, I think it's much harder. But with language models specifically, I think how to build dependable systems is really important. If you ask Notion AI, or if you ask ChatGPT or Claude, for maybe a bullet list of X, Y, Z, sometimes it'll make those bullets with like the Unicode center dot. Sometimes it'll make them with a dash. Sometimes it'll add a title. Sometimes it'll bold random things. And all of those things are fine. But it's a little jarring if every time the answer is a little stochastic. I think this is much more of a concern for when you're automating tasks or having the model make decisions by itself. Predictability, dependability: so much of the software that runs the world is sort of behind-the-scenes decision-making programs that run inside enterprises and automate systems and make decisions for people. And auditability, dependability is just so critical to all of them. One avenue of work that I'm really intrigued by is, in these decision-making systems, not having the model internally, as a black box, make decisions, but having the model synthesize code that makes decisions. So for things like summarization, like natural language tasks, you have to ask the model. But if you wanted to, I don't know, let's say you have a document and you want to filter out all the dates. Instead of asking the model, hey, can you grab all the dates? You can ask the model to write a regular expression that captures a particular set of date formats that you really care about. And at that point, the output of the model is a program. And the nice thing about a program is you can kind of check it. There's lots of nice things. One is it's much cheaper to run afterwards. Another is you can verify it. And the program becomes a kind of, what in design we call a boundary object, where it's a shared thing that exists both in the sphere of the human and the sphere of the computer. And you can iterate on it to fix bugs. And you can co-evolve this object that is now like a representation of this decision that you want the computer to make. But it's auditable and dependable and reliable. And so I'm pretty bullish on code generation and other sorts of program synthesis and program verification techniques, but using the model to write the initial program and helping the people maintain the software. [01:06:36]Swyx: Yeah, I'm so excited by that. Just in terms of reliability, I'll call out our previous guest, Shreya Rajpal. Yeah, yeah. And she's working on Guardrails AI. There's also LMQL. And then Microsoft recently put out Guidance, which is their custom language thing. Have you explored any of those? [01:06:51]Linus: I've taken a look at all of them. I've spoken to Shreya. I think this general space of, speaking of adding constraints to general systems, adding constraints, adding program verification, all of these things I think are super fascinating. I also personally like it a lot, because before I was spending a lot of my time in AI, I spent a bunch of time looking at programming languages and compilers and interpreters. 
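A minimal sketch of the "model writes the program" idea from above, using the date example: ask the model for a regex once, then verify the candidate against known examples before trusting it. The model call is left as a parameter, and the sample pattern below stands in for whatever a model would return.

```python
import re

def synthesize_date_regex(llm) -> str:
    """Ask the model for a program we can audit (a regex), not for the dates themselves."""
    return llm(
        "Write one Python regular expression that matches dates in the formats "
        "YYYY-MM-DD and MM/DD/YYYY. Reply with only the regex."
    ).strip()

def verify(pattern: str, positives, negatives) -> bool:
    """Cheap, deterministic check we can run on every candidate the model produces."""
    rx = re.compile(pattern)
    return all(rx.search(p) for p in positives) and not any(rx.search(n) for n in negatives)

POSITIVES = ["Shipped on 2023-06-08.", "Due 06/30/2023 at noon."]
NEGATIVES = ["Version 1.2.3 released.", "Call 555-0123 for details."]

if __name__ == "__main__":
    candidate = r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b"  # stand-in for a model-written regex
    if verify(candidate, POSITIVES, NEGATIVES):
        # From here on, extraction is a cheap, auditable re.findall with no model call.
        print(re.findall(candidate, "Kickoff 2023-06-08, review 06/30/2023."))
```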
And there is just so much amazing work that has gone into how do you build automated ways to reason about a program? Like compilers and type checkers and so on. And it would be a real shame if the whole field of program synthesis and verification just became like ask GPT-4. [01:07:30]Swyx: But actually, it's not. [01:07:30]Linus: Like they work together. You write the program, you synthesize the program with GPT-4 from human descriptions. And then now we have this whole set of powerful techniques that we can use to more formally understand and prove things about programs. And I think the synergy of them, I'm excited to see. [01:07:44]Swyx: Awesome. This was great, Linus. [01:07:47]Alessio: Our last question is always, what's one message you want everyone to remember today about the space, exciting challenges? [01:07:54]Swyx: We were at the beginning. [01:07:57]Linus: Maybe this is really cliche. But one thing that I always used to say about when I was working on text interfaces last year [01:08:05]Swyx: was that I would be really disappointed [01:08:07]Linus: if in a thousand years humans are still using the same kind of like writing tools and writing systems that we are today. Like it would be pretty surprising if we're still sort of like writing documents in the same way that we are today in a thousand years. And the language and the writing system hasn't evolved at all. If humans plan to be around for many thousands of years into the future, writing has really only been around for like two, three thousand years. And it's like sort of modern form. And we should, I think, care a lot more about building flexible, powerful tools than about backwards compatibility if we plan to be around for many more times the number of years that we've been around. And so I think whether we look at something as simple as language models or as expansive as like humans interacting with text documents, I think it's worth reminding yourself often that the things that we have today are sometimes that way for a reason but often just because an artifact of like the way that we've gotten here. And text can look very different. Language models can look very different. I personally think in a couple of years we're going to do something better than transformers. So all of these things are going to change. And I think it's important to have your eyes sort of looking over the horizon at what's coming far into the future. [01:09:24]Swyx: Nice way to end it. [01:09:25]Alessio: Well, thank you, Linus, for coming on. This was great. Thank you. This was lovely. [01:09:29]Linus: Thanks for having me. [01:09:31] This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space
    1/6/2023
    1:09:50
  • Debugging the Internet with AI agents – with Itamar Friedman of Codium AI and AutoGPT
    We are hosting the AI World’s Fair in San Francisco on June 8th! You can RSVP here. Come meet fellow builders, see amazing AI tech showcases at different booths around the venue, all mixed with elements of traditional fairs: live music, drinks, games, and food! We are also at Amplitude’s AI x Product Hackathon and are hosting our first joint Latent Space + Practical AI Podcast Listener Meetup next month!We are honored by the rave reviews for our last episode with MosaicML! They are also welcome on Apple Podcasts and Twitter/HN/LinkedIn/Mastodon etc!We recently spent a wonderful week with Itamar Friedman, visiting all the way from Tel Aviv in Israel: * We first recorded a podcast (releasing with this newsletter) covering Codium AI, the hot new VSCode/Jetbrains IDE extension focused on test generation for Python and JS/TS, with plans for a Code Integrity Agent. * Then we attended Agent Weekend, where the founders of multiple AI/agent projects got together with a presentation from Toran Bruce Richards on Auto-GPT’s roadmap and then from Itamar on Codium’s roadmap* Then some of us stayed to take part in the NextGen Hackathon and won first place with the new AI Maintainer project.So… that makes it really hard to recap everything for you. But we’ll try!Podcast: Codium: Code Integrity with Zero BugsWhen it launched in 2021, there was a lot of skepticism around Github Copilot. Fast forward to 2023, and 40% of all code is checked in unmodified from Copilot. Codium burst on the scene this year, emerging from stealth with an $11m seed, their own foundation model (TestGPT-1) and a vision to revolutionize coding by 2025.You might have heard of "DRY” programming (Don’t Repeat Yourself), which aims to replace repetition with abstraction. Itamar came on the pod to discuss their “extreme DRY” vision: if you already spent time writing a spec, why repeat yourself by writing the code for it? If the spec is thorough enough, automated agents could write the whole thing for you.Live Demo Video SectionThis is referenced in the podcast about 6 minutes in.Timestamps, show notes, and transcript are below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice!Auto-GPT: A Roadmap To The Future of WorkMaking his first public appearance, Toran (perhaps better known as @SigGravitas on GitHub) presented at Agents Weekend:Lightly edited notes for those who want a summary of the talk:* What is AutoGPT?AutoGPT is an Al agent that utilizes a Large Language Model to drive its actions and decisions. It can be best described as a user sitting at a computer, planning and interacting with the system based on its goals. Unlike traditional LLM applications, AutoGPT does not require repeated prompting by a human. Instead, it generates its own 'thoughts', criticizes its own strategy and decides what next actions to take.* AutoGPT was released on GitHub in March 2023, and went viral on April 1 with a video showing automatic code generation. 2 months later it has 132k+ stars, is the 29th highest ranked open-source project of all-time, a thriving community of 37.5k+ Discord members, 1M+ downloads.* What’s next for AutoGPT? The initial release required users to know how to build and run a codebase. They recently announced plans for a web/desktop UI and mobile app to enable nontechnical/everyday users to use AutoGPT. 
They are also working on an extensible plugin ecosystem called the Abilities Hub, also targeted at nontechnical users.* Improving Efficacy. AutoGPT has many well documented cases where it trips up: getting stuck in loops, using placeholder text instead of actual content in commands, and making obvious mistakes like execute_code("write a cookbook"). The plan is a new design called Challenge Driven Development - Challenges are goal-oriented tasks or problems that Auto-GPT has difficulty solving or has not yet been able to accomplish. These may include improving specific functionalities, enhancing the model's understanding of specific domains, or even developing new features that the current version of Auto-GPT lacks. (AI Maintainer was born out of one such challenge). Itamar compared this with Software 1.0 (Test Driven Development), and Software 2.0 (Dataset Driven Development).* Self-Improvement. Auto-GPT will analyze its own codebase and contribute to its own improvement. AI Safety (aka not-kill-everyone-ists) people like Connor Leahy might freak out at this, but for what it's worth we were pleasantly surprised to learn that Itamar and many other folks on the Auto-GPT team are equally concerned and mindful about x-risk as well.The overwhelming theme of Auto-GPT's roadmap was accessibility - making AI Agents usable by all instead of the few.Podcast Timestamps* [00:00:00] Introductions* [00:01:30] Itamar's background and previous startups* [00:03:30] Vision for Codium AI: reaching "zero bugs"* [00:06:00] Demo of Codium AI and how it works* [00:15:30] Building on VS Code vs JetBrains* [00:22:30] Future of software development and the role of developers* [00:27:00] The vision of integrating natural language, testing, and code* [00:30:00] Benchmarking AI models and choosing the right models for different tasks* [00:39:00] Codium AI spec generation and editing* [00:43:30] Reconciling differences in languages between specs, tests, and code* [00:52:30] The Israeli tech scene and startup culture* [01:03:00] Lightning RoundShow Notes* Codium AI* Visualead* AutoGPT* StarCoder* TDD (Test-Driven Development)* AST (Abstract Syntax Tree)* LangChain* ICON* AI21TranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space.Swyx: Today we have a special guest, Itamar Friedman, all the way from Tel Aviv, CEO and co-founder of Codium AI. Welcome.Itamar: Hey, great being here. Thank you for inviting me.Swyx: You like the studio? It's nice, right?Itamar: Yeah, they're awesome.Swyx: So I'm gonna introduce your background a little bit and then we'll learn a bit more about who you are. So you graduated from the Technion, Israel Institute of Technology, which is kind of like the MIT of Israel. You did a BS in CS, and then you also did a Master's in Computer Vision, which is kind of relevant.You had other startups before this, but your sort of claim to fame is Visualead, which you started in 2011 and got acquired by Alibaba Group. You showed me your website, which does these sort of QR codes with different forms of visibility. And in China that's a huge, huge deal. It's starting to become a bigger deal in the west. My favorite anecdote that you told me was something about how much money in sales you saved, or something. I forget what the number was.Itamar: Generally speaking, there's a lot of peer-to-peer transactions going on, like payments, in China with QR codes. 
So basically if, for example, 5% of the scanning does not work and with our scanner we [00:01:30] reduce it to 4%, that's a lot of money. Could be tens of millions of dollars a day.Swyx: And at the scale of Alibaba, it serves all of China. It's crazy. You did that for seven years and you were at Alibaba until 2021, when you took some time off and then hooked up with Dedy, who you've known for 25 years, to start Codium AI, and you just raised your $11 million seed round with TLV Partners and Vine. Congrats. Should we go right into Codium? What is Codium?Itamar: So we are an AI coding assistant / agent to help developers reach zero bugs. We don't do that today. Right now, we help to reduce the amount of bugs. Actually you can see people commenting on our marketplace page saying that they found bugs with our tool, and that's like our premise. Our vision is, like Tesla's is zero emissions or something like that, for us it's zero bugs.We started with building an IDE extension, either in VS Code or in JetBrains. And that actually works alongside the main panel where you write your code, and I can show later what we do: we analyze the code, whether you started writing it or you completed it.Like you can go both TDD (Test-Driven Development) or classical coding. And we offer analysis, tests, whether they pass or not, we further self debug [00:03:00] them and make suggestions, eventually helping to improve the code quality, specifically on code logic testing.Alessio: How did you get there? Obviously it's a great idea. Like, what was the idea maze? How did you get here?Itamar: I'll go back a long way. So, yes, I was two and a half times a VC-backed startup CTO, and we talked about the last one that I sold to Alibaba. But basically, it's weird to say, but with 20 years already as an R&D manager, I'm not like the best programmer, because like you mentioned, I'm coming more from the machine learning / computer vision side, one of the main applications, but a lot of optimization. So I'm not necessarily the best coder, but I am like a 20 year R&D manager. And I found that verifying code logic is a very hard thing, and one of the things that really makes it difficult to increase the development velocity.So you have tools related to checking performance.You have tools for vulnerabilities and security, Israelis are really good at that. But do you have a tool that actually helps you test code logic? I think we have like dozens or hundreds, even thousands, that help you on the end to end, maybe on the microservice integration system. But when you talk about code level, there isn't anything.So that was the pain I always had, especially since I did have tools for that for hardware. Like I worked at Mellanox, later sold to Nvidia, as a student, and we had formal tools, et cetera. [00:04:30] So that's one part.The second thing is that after being sold to Alibaba, the team and I, quite a big team, worked on machine learning, large language models, et cetera, building developer tools related to LLMs throughout the golden years of 2017 to 2021, 2022. 
And we saw how powerful they became.So basically, if I frame it this way, because we develop it for so many use cases, we saw that if you're able to take a problem put a framework of a language around it, whether it's analyzing browsing behavior, or DNA, or etc, if you can put a framework off a language, then LLMs take you really far.And then I thought this problem that I have with code logic testing is basically a combination of a few languages: natural language, specification language, technical language. Even visual language to some extent. And then I quit Alibaba and took a bit of time to maybe wrap things around and rest a bit after 20 years of startup and corporate and joined with my partner Dedy Kredo who was my ever first employee.And that's how we like, came to this idea.Alessio: The idea has obviously been around and most people have done AST analysis, kinda like an abstract syntax tree, but it's kind of hard to get there with just that. But I think these models now are getting good enough where you can mix that and also traditional logical reasoning.Itamar: Exactly.Alessio: Maybe talk a little bit more about the technical implementation of it. You mentioned the agent [00:06:00] part. You mentioned some of the model part, like what happens behind the scenes when Codium gets in your code base?Itamar: First of all, I wanna mention I think you're really accurate.If you try to take like a large language model as is and try to ask it, can you like, analyze, test the code, etc, it'll not work so good. By itself it's not good enough on the other side, like all the traditional techniques we already started to invent since the Greek times. You know, logical stuff, you mentioned ASTs, but there's also dynamic code analysis, mutation testing, etc. There's a lot of the techniques out there, but they have inefficiencies.And a lot of those inefficiencies are actually matching with AI capabilities. Let me give you one example. Let's say you wanna do fuzzy testing or mutation testing.Mutation testing means that you either mutate the test, like the input of the test, the code of the test, etc or you mutate the code in order to check how good is your test suite.For example, if I mutate some equation in the application code and the test finds a bug and it does that at a really high rate, like out of 100 mutation, I [00:07:30] find all of the 100 problems in the test. It's probably a very strong test suite.Now the problem is that there's so many options for what to mutate in the data, in the test. And this is where, for example, AI could help, like pointing out where's the best thing that you can mutate. Actually, I think it's a very good use case. Why? Because even if AI is not 100% accurate, even if it's 80% accurate, it could really take you quite far rather just randomly selecting things.So if I wrap up, just go back high level. I think LLM by themselves cannot really do the job of verifying code logic and and neither can the traditional ones, so you need to merge them. But then one more thing before maybe you tell me where to double click. I think with code logic there's also a philosophy question here.Logic different from performance or quality. If I did a three for in loop, like I loop three things and I can fold them with some vector like in Python or something like that. We need to get into the mind of the developer. What was the intention? Like what is the bad code? Not what is the code logic that doesn't work. It's not according to the specification. 
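To make the mutation-testing idea above concrete, here is a toy mutation score: swap in altered implementations ("mutants"), re-run the tests, and count how many the suite catches. A real tool would generate mutants by editing the AST; these are hand-written for illustration.

```python
# Original implementation and a tiny test suite.
def total(prices):
    return sum(prices) if prices else 0

def suite_passes(fn) -> bool:
    try:
        assert fn([1, 2, 3]) == 6
        assert fn([]) == 0
        return True
    except AssertionError:
        return False

# Hand-written mutants standing in for automatic operator/constant mutations.
mutants = [
    lambda prices: sum(prices) + 1 if prices else 0,  # off-by-one
    lambda prices: sum(prices) if prices else 1,      # wrong empty-list default
    lambda prices: max(prices) if prices else 0,      # wrong aggregation
]

killed = sum(1 for mutant in mutants if not suite_passes(mutant))
print(f"mutation score: {killed}/{len(mutants)}")  # every mutant the tests catch counts as killed
```

The search space of possible mutations is huge, which is exactly where Itamar suggests an imperfect model can still help by ranking what is worth mutating.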
So I think like one more thing that AI could really help is help to match, like if there is some natural language description of the code, we can match it. Or if there's missing information in natural language that needs [00:09:00] to be asked for the AI could help asking the user.It's not like a closed solution. Rather open and leaving the developer as the lead. Just like moving the developer from, from being the coder to actually being like a pilot that that clicks button and say, ah, this is what I meant, or this is the fix, rather actually writing all the code.Alessio: That makes sense. I think I talked about it on the podcast before, but like the switch from syntax to like semantics, like developers used to be focused on the syntax and not the meaning of what they're writing. So now you have the models that are really good at the syntax and you as a human are supposed to be really good at the semantics of what you're trying to build.How does it practically work? So I'm a software developer, I want to use Codium, like how do I start and then like, how do you make that happen in the, in the background?Itamar: So, like I said, Codium right now is an IDE extension. For example, I'm showing VS code. And if you just install it, like you'll have a few access points to start Codium AI, whether this sidebar or above every component or class that we think is very good to check with Codium.You'll have this small button. There's other way you can mark specific code and right click and run code. But this one is my favorite because we actually choose above which components we suggest to use code. So once I click it code, I starts analyzing this class. But not only this class, but almost everything that is [00:10:30] being used by the call center class.But all and what's call center is, is calling. And so we do like a static code analysis, et cetera. What, what we talked about. And then Codium provides with code analysis. It's right now static, like you can't change. It can edit it, and maybe later we'll talk about it. This is what we call the specification and we're going to make it editable so you can add additional behaviors and then create accordingly, test that will not pass, and then the code will, will change accordingly. So that's one entrance point, like via natural language description. That's one of the things that we're working on right now. What I'm showing you by the way, could be downloaded as is. It's what we have in production.The second thing that we show here is like a full test suite. There are six tests by default but you can just generate more almost as much as you want every time. We'll try to cover something else, like a happy pass edge case et cetera. You can talk with specific tests, okay? Like you can suggest I want this in Spanish or give a few languages, or I want much more employees.I didn't go over what's a call center, but basically it manages like call center. So you can imagine, I can a ask to make it more rigorous, etc, but I don't wanna complicate so I'm keeping it as is.I wanna show you the next one, which is run all test. First, we verify that you're okay, we're gonna run it. I don't know, maybe we are connected to the environment that is currently [00:12:00] configured in the IDE. I don't know if it's production for some reason, or I don't know what. Then we're making sure that you're aware we're gonna run the code that and then once we run, we show if it pass or fail.I hope that we'll have one fail. But I'm not sure it's that interesting. 
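Codium's actual generated code isn't reproduced in this transcript, but to give a sense of the shape being described (a handful of tests covering a happy path plus edge cases for a class like the demo's call center), here is a hand-written, purely illustrative pytest sketch; the class and method names are invented.

```python
import pytest

class CallCenter:
    """Stand-in for the demo class; just enough behavior for the tests below."""
    def __init__(self):
        self.agents = []

    def add_agent(self, name: str) -> None:
        if not name:
            raise ValueError("agent name must be non-empty")
        self.agents.append(name)

    def route_call(self) -> str:
        if not self.agents:
            raise RuntimeError("no agents available")
        return self.agents[0]

def test_route_call_goes_to_first_agent():  # happy path
    center = CallCenter()
    center.add_agent("Ada")
    center.add_agent("Lin")
    assert center.route_call() == "Ada"

def test_route_call_with_no_agents_raises():  # edge case
    with pytest.raises(RuntimeError):
        CallCenter().route_call()

def test_add_agent_rejects_empty_name():  # edge case
    with pytest.raises(ValueError):
        CallCenter().add_agent("")
```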
So I'll go to another example soon, but just to show you what's going on here, we actually give an example of what the problem is. We give the log of the error and then you can do whatever you want.You can fix it by yourself, or you can click reflect and fix, and what's going on right now is a bit of a longer process where we do like chain of thought or reflect and fix. And we can suggest a solution. You can run it, and in this case it passes. Just an example, this is a very simple example.Maybe later I'll show you a bug. I think I'll do that, and I'll show you a bug and how we recognize that actually it's not a problem in the test, it's a problem in the code, and then suggest you fix the code instead of the test. I think you see where I'm getting at.The other thing is that there are a few code suggestions, and there could be a dozen types that could be related to performance, modularity, or, as I see in this case, maintainability.There could also be vulnerability or best practices or even suggestions for bugs, like if we notice, if we think one of the tests, for example, is failing because of a bug. So that's also presented in the code suggestions. You can choose a few, for example, if you like, and then prepare a code change, which I didn't show you exactly.We're making a diff now that you can apply on your code. So basically what we're seeing here is that [00:13:30] there are three main tabs: the code, the test and the code analysis. Let's call it the spec.And then there's a fourth tab, which is code suggestions, if you wanna look at analytics, etc. Mm-hmm. Right now the code, okay, this is the change, or quite a big change, I probably clicked on something. So that's the basic demo.Right now let's be frank. Like I wanted to show a simple example. So it's a call center. All the inputs to the class are relatively simple. There is no JSON input, like if you're Expedia or whatever, you have a JSON with the hotels, Airbnb, you know, so the tests will be almost too simple or not covering enough of your code if you don't provide it with some valuable input, like a JSON with all the information, or YAML or whatever. So you can actually add input data, and the AI or model (it's actually, by the way, a set of models and algorithms) will use that input to create interesting tests. And another thing is many people have some reference tests that they already made. It could be because they already made them, or because they want something very specific, they have an idea of how they imagine the test. So they just write one, and then you add a reference and that will inspire all the rest of the tests. And also you can give hints. [00:15:00] These are, by the way, planned to be dynamic hints, like for different types of code.We will provide different hints. So we can help you become a bit more knowledgeable about how to test your code. So you can ask for having a given/when/then style, or you can have like a funny pirate one, like make a different joke for each test, or for example,Swyx: I'm curious, why did you choose that one? This is the pirate one. Yeah.Itamar: Interesting choice to put on your product. It could be like 11:00 PM, people sitting around, let's choose one funny thing.Swyx: And yeah. So two serious ones and one funny one. Yeah. 
Just for the listening audience, can you read out the other hints that you decided on as well?Itamar: Yeah, so specifically, like for this case, relatively very simple class, so there's not much to do, but I'm gonna go to one more thing here on the configuration. But it basically is given when then style, it's one of the best practices and tests. So even when I report a bug, for example, I found a bug when someone else code, usually I wanna say like, given, use this environment or use that this way when I run this function, et cetera.Oh, then it's a very, very full report. And it's very common to use that in like in unit test and perform.Swyx: I have never been shown this format.Itamar: I love that you, you mentioned that because if you go to CS undergrad you take so many courses in development, but none of them probably in testing, and it's so important. So why would you, and you don't go to Udemy or [00:16:30] whatever and, and do a testing course, right? Like it's, it's boring. Like people either don't do component level testing because they hate it or they do it and they hate it. And I think part of it it’s because they're missing tool to make it fun.Also usually you don't get yourself educated about it because you wanna write your code. And part of what we're trying to do here is help people get smarter about testing and make it like easy. So this is like very common. And the idea here is that for different type of code, we'll suggest different type of hints to make you more knowledgeable.We're doing it on an education app, but we wanna help developers become smarter, more knowledgeable about this field. And another one is mock. So right now, our model decided that there's no need for mock here, which is a good decision. But if we would go to real world case, like, I'm part of AutoGPT community and there's all of tooling going on there. Right? And maybe when I want to test like a specific component, and it's relatively clear that going to the web and doing some search and coming back, I don't really need to do that. Like I know what I expect to do and so I can mock that part of using to crawl the web.A certain percentage of accuracy, like around 90, we will decide this is worth mocking and we will inject it. I can click it now and force our system to mock this. But you'll see like a bit stupid mocking because it really doesn't make sense. So I chose this pirate stuff, like add funny pirate like doc stringing make a different joke for each test.And I forced it to add mocks, [00:18:00] the tests were deleted and now we're creating six new tests. And you see, here's the shiver me timbers, the test checks, the call successful, probably there's some joke at the end. So in this case, like even if you try to force it to mock it didn't happen because there's nothing but we might find here like stuff that it mock that really doesn't make sense because there's nothing to mock here.So that's one thing I. I can show a demo where we actually catch a bug. And, and I really love that, you know how it is you're building a developer tools, the best thing you can see is developers that you don't know giving you five stars and sharing a few stuff.We have a discord with thousands of users. But I love to see the individual reports the most. This was one of my favorites. It helped me to find two bugs. I mentioned our vision is to reach zero bugs. Like, if you may say, we want to clean the internet from bugs.Swyx: So debugging the internet. 
I have my podcast title.Itamar: So, so I think like if we move to another exampleSwyx: Yes, yes, please, please. This is great.Itamar: I'm moving to a different example, it is the bank account. By the way, if you go to ChatGPT and, and you can ask me what's the difference between Codium AI and using ChatGPT.Mm-hmm. I'm, I'm like giving you this hard question later. Yeah. So if you ask ChatGPT give me an example to test a code, it might give you this bank account. It's like the one-on-one stuff, right? And one of the reasons I gave it, because it's easy to inject bugs here, that's easy to understand [00:19:30] anyway.And what I'm gonna do right now is like this bank account, I'm gonna change the deposit from plus to minus as an example. And then I'm gonna run code similarly to how I did before, like it suggests to do that for the entire class. And then there is the code analysis soon. And when we announce very soon, part of this podcast, it's going to have more features here in the code analysis.We're gonna talk about it. Yep. And then there is the test that I can run. And the question is that if we're gonna catch the bag, the bugs using running the test, Because who knows, maybe this implementation is the right one, right? Like you need to, to converse with the developer. Maybe in this weird bank, bank you deposit and, and the bank takes money from you.And we could talk about how this happens, but actually you can see already here that we are already suggesting a hint that something is wrong here and here's a suggestion to put it from minus to to plus. And we'll try to reflect and, and fix and then we will see actually the model telling you, hey, maybe this is not a bug in the test, maybe it's in the code.Swyx: I wanna stay on this a little bit. First of all, this is very impressive and I think it's very valuable. What user numbers can you disclose, you launched it and then it's got fairly organic growth. You told me something off the air, but you know, I just wanted to show people like this is being adopted in quite a large amount.Itamar:  [00:21:00] First of all, I'm a relatively transparent person. Like even as a manager, I think I was like top one percentile being transparent in Alibaba. It wasn't five out of five, which is a good thing because that's extreme, but it was a good, but it also could be a bad, some people would claim it's a bad thing.Like for example, if my CTO in Alibaba would tell me you did really bad and it might cut your entire budget by 30%, if in half a year you're not gonna do like much better and this and that. So I come back to a team and tell 'em what's going on without like trying to smooth thing out and we need to solve it together.If not, you're not fitting in this team. So that's my point of view. And the same thing, one of the fun thing that I like about building for developers, they kind of want that from you. To be transparent. So we are on the high numbers of thousands of weekly active users. Now, if you convert from 50,000 downloads to high thousands of weekly active users, it means like a lot of those that actually try us keep using us weekly.I'm not talking about even monthly, like weekly. And that was like one of their best expectations because you don't test your code every day. Right now, you can see it's mostly focused on testing. So you probably test it like once a week. Like we wanted to make it so smooth with your development methodology and development lifecycle that you use it every day.Like at the moment we hope it to be used weekly. 
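As an aside, the bank-account demo above is easy to reproduce by hand: inject the plus-to-minus deposit bug Itamar describes, and a given/when/then style test catches it immediately. This is a hand-written reconstruction, not Codium's generated output.

```python
class BankAccount:
    def __init__(self, balance: float = 0.0):
        self.balance = balance

    def deposit(self, amount: float) -> None:
        # Injected bug from the demo: this should be +=.
        self.balance -= amount

def test_deposit_increases_balance():
    # Given a fresh account
    account = BankAccount()
    # When we deposit 100
    account.deposit(100)
    # Then the balance reflects the deposit
    assert account.balance == 100  # fails against the buggy code, pointing at the code rather than the test
```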
And that's what we're getting. And the growth is about, like every two, three weeks we double the amount of weekly actives and downloads. It's still very early, like seven weeks. So I don't know if it'll keep that way, but we hope so. Well [00:22:30] actually I hope that it'll be much more than double every two, three weeks, maybe. Thanks to the podcast.Swyx: Well, yeah, we'll add, you know, a few thousand hopefully. The reason I ask this is because I think there's a lot of organic growth, that people are sharing it with their friends, and also I think you've also learned a lot from your earliest days in the private beta test.Like what have you learned since launching about how people want to use these testing tools?Itamar: One thing I didn't share with you is, when you say virality, there is inter virality and intra virality. Okay. Like within the company and outside the company. So which teams are using us? I can't say, but I can tell you that a lot of San Francisco companies are using us.And one of the things I'm really surprised by is that at one company, I saw one user two weeks ago, and I was so happy. And then I came yesterday and I saw 48 from that company. So what I'm trying to say, to be frank, is that we see more intra virality right now than inter virality. I don't see like a video being shared all around Twitter, see what's going on here. Yeah. But I do see people share it within the company: you need to use it because it's really helpful with productivity. And the inter virality is something that we will work on.But to be frank, first I wanna make sure that it's helpful for developers. So I care more about intra virality, and that we see working really well, because that means that the tool is useful. Me telling my colleague means it's useful; sharing it on Twitter means that I also feel it will make me look cool, and that's something we maybe still need, like, to test.Swyx: You know, well, you're working on that. We're gonna announce something like that. Yeah. You are generating these tests, you know, based on what I saw there. You're generating these tests basically based on the name of the functions. And the doc strings, I guess?Itamar: So I think if you obfuscate the entire code, our accuracy will drop by 50%. So you're right, we're using a lot of hints that you see there. Like for example, the function name, the docstring, the variable names, et cetera. It doesn't have to be perfect, but it has a lot of hints.By the way, in some cases, in the code suggestions, we will actually suggest renaming some of the stuff in a way that will help us. Like there's a renaming suggestion, for example. Like in this case, instead of calling this variable client, you'll see "preferred client", because basically it gives a different commission for that.So we do suggest it, because if you accept it, it also means it will be easier for our model or system to keep improving.Swyx: Is that a different model?Itamar: Okay. That brings us a bit to the topic of model properties. Yeah. I'll share it really quickly, even though it might take us a bit off road. Yes, it's relevant.I think [00:25:30] different models are better on different properties, for example, how obedient you are to instructions, how good you are at prompt forcing, like format forcing, 
whether I want the results to be in a certain format, or how accurate you are, or how good you are at understanding code.There are so many calls happening here to models, by the way. Just by clicking one, hey Codium AI, can you help me with this bank account? We do a dozen different calls, and each feature you click could be like that, like with that reflect and fix, and then we choose the best one.I'm not talking about hundreds of models, but we could use different APIs of OpenAI, for example, and other models, et cetera. So basically different models are better on different aspects. Going back to what we talked about, all the models will benefit from having those hints, whether in the code itself or in documentation, et cetera.And also the code analysis: we consider the code analysis to be the ground truth to some extent. And soon we're also going to allow you to edit it, and we will use that as well.Alessio: Yeah, maybe talk a little bit more about that. How do I actually get all these models to work together? I think there's a lot of people that have only been exposed to Copilot so far, which is one use case, just complete what I'm writing. You're doing a lot more things here. A lot of people listening are engineers themselves, some of them build these tools, so they would love to [00:27:00] hear more about how do you orchestrate them, how do you decide which model does what, stuff like that.Itamar: So I'll start with the end, because that is a very deterministic answer: we benchmark different models.Like every time there's a new model in town. Like recently, it's already old news, StarCoder. It's already like so old news, from a few days ago.Swyx: No, no, no. Maybe you want to fill in what StarCoder is?Itamar: I think StarCoder is a new up and coming model. We immediately test it on different benchmarks and see if it's better on some properties, et cetera.We're gonna talk about it, like a chain of thought: different parts in the chain would benefit from different properties. If I wanna do code analysis and convert it to natural language, maybe one model would be better. If I want to output a result in a certain format, maybe another model is better at forcing a certain format. You probably saw on Twitter, et cetera, people talk about how it's hard to ask a model to output JSON, et cetera. So basically we predefine: for different tasks, we use different models. And I think this is something for individual developers to check too, to try to think, for the test that you are working on now, what is most important for you to get. Do you want the semantic understanding, is that most important? Do you want the output, like are you asking for a very specific [00:28:30] output? Is it just like a chat, or are you asking it to give an output of code and have only code, no description? Or a description in the top docstring and not somewhere else? And then we use different models. We are aiming to have our own models in 2024, being independent of any third party, like OpenAI or so. But since our product is very challenging, it has UI/UX challenges, engineering challenges, static and dynamic analysis, and AI.As an entrepreneur, you need to choose your battles. And we thought that it's better for us to focus on everything around the model. And one day, when we think that we have the right UX/UI, engineering, et cetera, we'll focus on model building. 
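A simplified sketch of the routing Itamar describes: decide which property matters for each step of the chain, and pick whichever model your own benchmarks say is best at that property. The property names and model names are placeholders, not Codium's actual mapping.

```python
# Which property matters most for each step of the chain.
STEP_PRIORITY = {
    "code_to_natural_language": "code_understanding",
    "emit_structured_json": "format_adherence",
    "suggest_fix": "code_understanding",
}

# Which model our own benchmarks rate best at each property (placeholder names).
BEST_FOR = {
    "code_understanding": "model-a",
    "format_adherence": "model-b",
}

def pick_model(step: str) -> str:
    return BEST_FOR[STEP_PRIORITY[step]]

print(pick_model("emit_structured_json"))  # -> model-b
```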
This is also, by the way, what we did in in Alibaba.Even when I had like half a million dollar a month for trading one foundational model, I would never start this way. You always try like first using the best model you can for your product. Then understanding what's the glass ceiling for that model? Then fine tune a foundation model, reach a higher glass ceiling and then training your own.That's what we're aiming and that's what I suggest other developers like, don't necessarily take a model and, and say, oh, it's so easy these days to do RLHF, et cetera. Like I see it’s like only $600. Yeah, but what are you trying to optimize for? The properties. Don't try to like certain models first, organize your challenges.Understand the [00:30:00] properties you're aiming for and start playing with that. And only then go to train your own model.Alessio: Yeah. And when you say benchmark, you know, we did a one hour long episode, some benchmarks, there's like many of them. Are you building some unique evals to like your own problems? Like how are you doing that? And that's also work for your future model building, obviously, having good benchmarks. Yeah.Itamar:. Yeah. That's very interesting. So first of all, with all the respect, I think like we're dealing with ML benchmark for hundreds of years now.I'm, I'm kidding. But like for tens of years, right? Benchmarking statistical creatures is something that, that we're doing for a long time. I think what's new here is the generative part. It's an open challenge to some extent. And therefore, like maybe we need to re rethink some of the way we benchmark.And one of the notions that I really believe in, I don't have a proof for that, is like create a benchmark in levels. Let's say you create a benchmark from level one to 10, and it's a property based benchmark. Let's say I have a WebGPT ask something from the internet and then it should fetch it for me.So challenge level one could be, I'm asking it and it brings me something. Level number two could be I'm asking it and it has a certain structure. Let's say for example, I want to test AutoGPT. Okay. And I'm asking it to summarize what's the best cocktail I could have for this season in San Francisco.So [00:31:30] I would expect, like, for example, for that model to go. This is my I what I think to search the internet and do a certain thing. So level number three could be that I want to check that as part of this request. It uses a certain tools level five, you can add to that. I expect that it'll bring me back something like relevance and level nine it actually prints the cocktail for me I taste it and it's good. So, so I think like how I see it is like we need to have data sets similar to before and make sure that we not fine tuning the model the same way we test it. So we have one challenges that we fine tune over, right? And few challenges that we don't.And the new concept may is having those level which are property based, which is something that we know from software testing and less for ML. And this is where I think that these two concepts merge.Swyx: Maybe Codium can do ML testing in the future as well.Itamar: Yeah, that's a good idea.Swyx: Okay. I wanted to cover a little bit more about Codium in the present and then we'll go into the slides that you have.So you have some UI/UX stuff and you've obviously VS Code is the majority market share at this point of IDE, but you also have IntelliJ right?Itamar: Jet Brains in general.Swyx: Yeah. Anything that you learned supporting JetBrains stuff? 
You were very passionate about this one user who left you a negative review.What is the challenge of that? Like how do you think about the market, you know, maybe you should focus on VS Code since it's so popular?Itamar: Yeah. [00:33:00] So currently the VS Code extension is leading over JetBrains. And we were for a long time and, and like when I tell you long time, it could be like two or three weeks with version oh 0.5, point x something in, in VS code, although oh 0.4 or so a jet brains, we really saw the difference in, in the how people react.So we also knew that oh 0.5 is much more meaningful and one of the users left developers left three stars on, on jet brands and I really remember that. Like I, I love that. Like it's what do you want to get at, at, at our stage? What's wrong? Like, yes, you want that indication, you know, the worst thing is getting nothing.I actually, not sure if it's not better to get even the bad indication, only getting good ones to be re frank like at, at, at least in our stage. So we're, we're 9, 10, 10 months old startup. So I think like generally speaking We find it easier and fun to develop in vs code extension versus JetBrains.Although JetBrains has like very nice property, when you develop extension for one of the IDEs, it usually works well for all the others, like it's one extension for PyCharm, and et cetera. I think like there's even more flexibility in the VS code. Like for example, this app is, is a React extension as opposed that it's native in the JetBrains one we're using. What I learned is that it's basically is almost like [00:34:30] developing Android and iOS where you wanna have a lot of the best practices where you have one backend and all the software development like best practices with it.Like, like one backend version V1 supports both under Android and iOS and not different backends because that's crazy. And then you need all the methodology. What, what means that you move from one to 1.1 on the backend? What supports whatnot? If you don't what I'm talking about, if you developed in the past, things like that.So it's important. And then it's like under Android and iOS and, and you relatively want it to be the same because you don't want one developer in the same team working with Jet Brains and then other VS code and they're like talking, whoa, that's not what I'm seeing. And with code, what are you talking about?And in the future we're also gonna have like teams offering of collaboration Right now if you close Codium Tab, everything is like lost except of the test code, which you, you can, like if I go back to a test suite and do open as a file, and now you have a test file with everything that you can just save, but all the goodies here it's lost. One day we're gonna have like a platform you can save all that, collaborate with people, have it part of your PR, like have suggested part of your PR. And then you wanna have some alignment. So one of the challenges, like UX/UI, when you think about a feature, it should, some way or another fit for both platforms be because you want, I think by the way, in iOS and Android, Android sometimes you don’t care about parity, but here you're talking about developers that might be on the same [00:36:00] team.So you do care a lot about that.Alessio: Obviously this is a completely different way to work for developers. I'm sure this is not everything you wanna build and you have some hint. 
So maybe take us through what you see the future of software development look like.Itamar: Well, that's great and also like related to our announcement, what we're working on.Part of it you already start seeing in my, in my demo before, but now I'll put it into a framework. I'll be clearer. So I think like the software development world in 2025 is gonna look very different from 2020. Very different. By the way. I think 2020 is different from 2000. I liked the web development in 95, so I needed to choose geocities and things like that.Today's much easier to build a web app and whatever, one of the cloud. So, but I think 2025 is gonna look very different in 2020 for the traditional coding. And that's like a paradigm I don't think will, will change too much in the last few years. And, and I'm gonna go over that when I, when I'm talking about, so j just to focus, I'm gonna show you like how I think the intelligence software development world look like, but I'm gonna put it in the lens of Codium AI.We are focused on code integrity. We care that with all this advancement of co-generation, et cetera, we wanna make sure that developers can code fast with confidence. That they have confidence on generated code in the AI that they are using that. That's our focus. So I'm gonna put, put that like lens when I'm going to explain.So I think like traditional development. Today works like creating some spec for different companies, [00:37:30] different development teams. Could mean something else, could be something on Figma, something on Google Docs, something on Jira. And then usually you jump directly to code implementation. And then if you have the time or patience, or will, you do some testing.And I think like some people would say that it's better to do TDD, like not everyone. Some would say like, write spec, write your tests, make sure they're green, that they do not pass. Write your implementation until your test pass. Most people do not practice it. I think for just a few, a few reason, let them mention two.One, it's tedious and I wanna write my code like before I want my test. And I don't think, and, and the second is, I think like we're missing tools to make it possible. And what we are advocating, what I'm going to explain is actually neither. Okay. It's very, I want to say it's very important. So here's how we think that the future of development pipeline or process is gonna look like.I'm gonna redo it in steps. So, first thing I think there do I wanna say that they're gonna be coding assistance and coding agents. Assistant is like co-pilot, for example, and agents is something that you give it a goal or a task and actually chains a few tasks together to complete your goal.Let's have that in mind. So I think like, What's happening right now when you saw our demo is what I presented a few minutes ago, is that you start with an implementation and we create spec for you and test for you. And that was like a agent, like you didn't converse with it, you just [00:39:00] click a button.And, and we did a, a chain of thought, like to create these, that's why it's it's an agent. And then we gave you an assistant to change tests, like you can converse it with it et cetera. So that's like what I presented today. What we're announcing is about a vision that we called the DRY. Don't repeat yourself. I'm gonna get to that when I'm, when I'm gonna show you the entire vision. But first I wanna show you an intermediate step that what we're going to release. So right now you can write your code. 
Or part of it, like for example, just a class abstract or so with a coding assistant like copilot and maybe in the future, like a Codium AI coding assistant.And then you can create a spec I already presented to you. And the next thing is that you going to have like a spec assistant to generate technical spec, helping you fill it quickly focused on that. And this is something that we're working on and, and going to release the first feature very soon as part of announcement.And it's gonna be very lean. Okay? We're, we're a startup that going bottom up, like lean features going to more and more comprehensive one. And then once you have the spec and implementation, you can either from implementation, have tests, and then you can run the test and fix them like I presented to you.But you can also from spec create tests, okay? From the spec directly to tests. [00:40:30]So then now you have a really interesting thing going on here is that you can start from spec, create, test, create code. You can start from test create code. You can start from a limitation. From code, create, spec and test. And actually we think the future is a very flexible one. You don't need to choose what you're practicing traditional TDD or whatever you wanna start with.If you have already some spec being created together with one time in one sprint, you decided to write a spec because you wanted to align about it with your team, et cetera, and now you can go and create tests and implementation or you wanted to run ahead and write your code. Creating tests and spec that aligns to it will be relatively easy.So what I'm talking about is extreme DRY concept; DRY is don't repeat yourself. Until today when we talked about DRY is like, don't repeat your code. I claim that there is a big parts of the spec test and implementation that repeat himself, but it's not a complete repetition because if spec was as detailed as the implementation, it's actually the implementation.But the spec is usually in different language, could be natural language and visual. And what we're aiming for, our vision is enabling the dry concept to the extreme. With all these three: you write your test will help you generate the code and the spec you write your spec will help you doing the test and implementation.Now the developers is the driver, okay? You'll have a lot [00:42:00] of like, what do you think about this? This is what you meant. Yes, no, you wanna fix the coder test, click yes or no. But you still be the driver. But there's gonna be like extreme automation on the DRY level. So that's what we're announcing, that we're aiming for as our vision and what we're providing these days in our product is the middle, is what, what you see in the middle, which is our code integrity agents working for you right now in your id, but soon also part of your Github actions, et cetera, helping you to align all these three.Alessio: This is great. How do you reconcile the difference in languages, you know, a lot of times the specs is maybe like a PM or it's like somebody who's more at the product level.Some of the implementation details is like backend developers for something. Frontend for something. How do you help translate the language between the two? And then I think in the one of the blog posts on your blog, you mentioned that this is also changing maybe how programming language themselves work. How do you see that change in the future? 
Like, are people gonna start From English, do you see a lot of them start from code and then it figures out the English for them?Itamar: Yeah. So first of all, I wanna say that although we're working, as we speak on managing we front-end frameworks and languages and usage, we are currently focused on the backend.So for example, as the spec, we won't let you input Figma, but don't be surprised if in 2024 the input of the spec could be a Figma. Actually, you can see [00:43:30] demos of that on a pencil drawing from OpenAI and when he exposed the GPT-4. So we will have that actually.I had a blog, but also I related to two different blogs. One, claiming a very knowledgeable and respectful, respectful person that says that English is going to be the new language program language and, and programming is dead. And another very respectful person, I think equally said that English is a horrible programming language.And actually, I think both of are correct. That's why when I wrote the blog, I, I actually related, and this is what we're saying here. Nothing is really fully redundant, but what's annoying here is that to align these three, you always need to work very hard. And that's where we want AI to help with. And if there is inconsistency will raise a question, what do, which one is true?And just click yes or no or test or, or, or code that, that what you can see in our product and we'll fix the right one accordingly. So I think like English and, and visual language and code. And the test language, let's call it like, like that for a second. All of them are going to persist. And just at the level of automation aligning all three is what we're aiming for.Swyx: You told me this before, so I I'm, I'm just actually seeing Alessio’s reaction to it as a first time.Itamar: Yeah, yeah. Like you're absorbing like, yeah, yeah.Swyx: No, no. This is, I mean, you know, you can put your VC hat on or like compare, like what, what is the most critical or unsolved question presented by this vision?Alessio: A lot of these tools, especially we've seen a lot in the past, it's like the dynamic nature of a lot of this, you know?[00:45:00] Yeah. Sometimes, like, as you mentioned, sometimes people don't have time to write the test. Sometimes people don't have time to write the spec. Yeah. So sometimes you end up with things. Out of sync, you know? Yeah. Or like the implementation is moving much faster than the spec, and you need some of these agents to make the call sometimes to be like, no.Yeah, okay. The spec needs to change because clearly if you change the code this way, it needs to be like this in the future. I think my main question as a software developer myself, it's what is our role in the future? You know? Like, wow, how much should we intervene, where should we intervene?I've been coding for like 15 years, but if I've been coding for two years, where should I spend the next year? Yeah. Like focus on being better at understanding product and explain it again. Should I get better at syntax? You know, so that I can write code. Would love have any thoughts.Itamar: Yeah. You know, there's gonna be a difference between 1, 2, 3 years, three to six, six to 10, and 10 to 20. Let's for a second think about the idea that programming is solved. Then we're talking about a machine that can actually create any piece of code and start creating, like we're talking about singularity, right?Mm-hmm. If the singularity happens, then we're talking about this new set of problems. Let's put that aside. 
Like even if it happens in 2041, that's my prediction. I'm not sure like you should aim for thinking what you need to do, like, or not when the singularity happens. So I, [00:46:30] I would aim for mm-hmm.Like thinking about the future of the next five years or or, so. That's my recommendation because it's so crazy. Anyway. Maybe not the best recommendation. Take that we're for grain of salt. And please consult with a lawyer, at least in the scope of, of the next five years. The idea that the developers is the, the driver.It actually has like amazing team members. Agents that working for him or her and eventually because he or she's a driver, you need to understand especially what you're trying to achieve, but also being able to review what you get. The better you are in the lower level of programming in five years, it it mean like real, real program language.Then you'll be able to develop more sophisticated software and you will work in companies that probably pay more for sophisticated software and the more that you're less skilled in, in the actual programming, you actually would be able to be the programmer of the new era, almost a creator. You'll still maybe look on the code levels testing, et cetera, but what's important for you is being able to convert products, requirements, et cetera, to working with tools like Codium AI.So I think like there will be like degree of diff different type developers now. If you think about it for a second, I think like it's a natural evolution. It's, it's true today as well. Like if you know really good the Linux or assembly, et cetera, you'll probably work like on LLVM Nvidia [00:48:00] whatever, like things like that.Right. And okay. So I think it'll be like the next, next step. I'm talking about the next five years. Yeah. Yeah. Again, 15 years. I think it's, it's a new episode if you would like to invite me. Yeah. Oh, you'll be, you'll be back. Yeah. It's a new episode about how, how I think the world will look like when you really don't need a developer and we will be there as Cody mi like you can see.Mm-hmm.Alessio: Do we wanna dive a little bit into AutoGPT? You mentioned you're part of the community. Yeah.Swyx: Obviously Try, Catch, Finally, Repeat is also part of the company motto.Itamar: Yeah. So it actually really. Relates to what we're doing and there's a reason we have like a strong relationship and connection with the AutoGPT community and us being part part of it.So like you can see, we're talking about agent for a few months now, and we are building like a designated, a specific agent because we're trying to build like a product that works and gets the developer trust to have developer trust us. We're talking about code integrity. We need it to work. Like even if it will not put 100% it's not 100% by the way our product at all that UX/UI should speak the language of, oh, okay, we're not sure here, please take the driving seat.You want this or that. But we really not need, even if, if we're not close to 100%, we still need to work really well just throwing a number. 90%. And so we're building a like really designated agents like those that from code, create tests.So it could create tests, run them, fix them. It's a few tests. 
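As a rough illustration of the "create tests, run them, fix them" loop described here, a minimal sketch could look like the following. It assumes pytest as the test runner; generate_tests and fix_tests stand in for hypothetical LLM calls, and none of this is Codium AI's actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

def generate_tests(source_code: str) -> str:
    """Hypothetical LLM call that writes pytest tests for the given code."""
    raise NotImplementedError

def fix_tests(test_code: str, failure_log: str) -> str:
    """Hypothetical LLM call that repairs failing tests given the failure log."""
    raise NotImplementedError

def generate_run_fix(source_code: str, max_iterations: int = 3) -> str:
    """Create tests, run them, and feed failures back to the model a few times."""
    tests = generate_tests(source_code)
    for _ in range(max_iterations):
        with tempfile.TemporaryDirectory() as tmp:
            test_file = Path(tmp) / "test_generated.py"
            test_file.write_text(tests)
            result = subprocess.run(
                ["pytest", str(test_file), "-q"],
                capture_output=True, text=True,
            )
        if result.returncode == 0:   # all tests pass: hand them to the developer
            return tests
        tests = fix_tests(tests, result.stdout + result.stderr)
    return tests                      # best effort; the developer stays the driver
```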
So we really believe in that we're [00:49:30] building a designated agent while Auto GPT is like a swarm of agents, general agents that were supposedly you can ask, please make me rich or make me rich by increase my net worth.Now please be so smart and knowledgeable to use a lot of agents and the tools, et cetera, to make it work. So I think like for AutoGPT community was less important to be very accurate at the beginning, rather to show the promise and start building a framework that aims directly to the end game and start improving from there.While what we are doing is the other way around. We're building an agent that works and build from there towards that. The target of what I explained before. But because of this related connection, although it's from different sides of the, like the philosophy of how you need to build those things, we really love the general idea.So we caught it really early that with Toran like building it, the, the maker of, of AutoGPT, and immediately I started contributing, guess what, what did I contribute at the beginning tests, right? So I started using Codium AI to build tests for AutoGPT, even, even finding problems this way, et cetera.So I become like one of the, let's say 10 contributors. And then in the core team of the management, I talk very often with with Toran on, on different aspects. And we are even gonna have a workshop,Swyx: a very small [00:49:00] meetingItamar: work meeting workshop. And we're going to compete together in a, in a hackathons.And to show that AutoGPT could be useful while, for example, Codium AI is creating the test for it, et cetera. So I'm part of that community, whether is my team are adding tests to it, whether like advising, whether like in in the management team or whether to helping Toran. Really, really on small thing.He is the amazing leader like visionaire and doing really well.Alessio: What do you think is the future of open source development? You know, obviously this is like a good example, right? You have code generating the test and in the future code could actually also implement the what the test wanna do. So like, yeah.How do you see that change? There's obviously not enough open source contributors and yeah, that's one of the, the main issue. Do you think these agents are maybe gonna help us? Nadia Eghbal has this  great book called like Working in Public and there's this type of projects called Stadium model, which is, yeah, a lot of people use them and like nobody wants to contribute to them.I'm curious about, is it gonna be a lot of noise added by a lot of these agents if we let them run on any repo that is open source? Like what are the contributing guidelines for like humans versus agents? I don't have any of the answers, but like some of the questions that I've been thinking about.Itamar: Okay. So I wanna repeat your question and make sure I understand you, but like, if they're agents, for example, dedicated for improving code, why can't we run them on, mm-hmm.Run them on like a full repository in, in fixing that? The situation right now is that I don't think that right now Auto GPT would be able to do that for you. Codium AI might but it's not open sourced right now. And and like you can see like in the months or two, you will be able to like running really quickly like development velocity, like our motto is moving fast with confidence by the way.So we try to like release like every day or so, three times even a day in the backend, et cetera. 
And we'll develop more features, enabling you, for example, to run it over an entire repo, but it's not open source. So about open source: I think with AutoGPT or LangChain, you can't really ask "please improve my repository, make it better." I don't think it will work right now, because, to loosely quote Ilya from OpenAI, let's say that a certain LLM is 95% accurate per step; now you're concatenating the results, so the accuracy decays. For example, if each step is 95% accurate, a chain of ten steps ends up around 0.95^10, roughly 60%, end to end. What you need is more engineering frameworks and work to be done there in order to be able to deal with inaccuracies, et cetera. And that's what we specialize in at Codium. But I'm not saying that AutoGPT won't be able to get there. The more tools that get added, the [00:52:30] more prompt engineering dedicated to this idea will be added. By the way, I'm talking with Toran about Codium, for example, being one of the agents for AutoGPT. Think about it: AutoGPT is there for any goal, like "increase my net worth"; it's not focused, like us, on fixing or improving code. We might be another agent. We might also be, and we're working on it, a plugin for ChatGPT; we're actually almost finished with it. So that's how I think it's gonna be done. Again, open sourcing is not something we're thinking about right now. We want it to be really good before we open source it.
Swyx: That was all very impressive. Your vision is actually very encouraging as well, and I'm very excited to try it out myself. I'm just curious on the Israel side of things, right? You're visiting San Francisco for a two-week trip for this special program you can tell us about. But also, I think a lot of American developers have heard that, you know, Israel has a really good tech scene. Mostly it's just security startups, you know, "I was in some special unit in the IDF and I come out and I'm doing the same thing again, but for enterprises." But maybe just describe, for the rest of the world: what is the Israeli tech scene like? What is this program that you're on, and what should people know?
Itamar: So I think Israel is the most condensed startup scene per capita. I think we're number one, really. Or startups per square meter; I think we're number one there as well. Because of these properties, there is a very strong community, and everyone around you is either an entrepreneur or working in a startup. [00:57:00] And when you go to the bar or the coffee shop, you hear, if it's 2021, people talking about secondaries; if it's 2023, about how amazing gen AI is. But everyone around you is in the scene, and that's a lot of networking and data propagation, I think. Somewhat similar to the Bay Area and San Francisco here, and that helps, right? So I think that's one of our strong points. You mentioned some others, and I'm not saying they don't help. Yes, being in the IDF, the army: at age 19 you go and start dealing with very advanced technology, and that helps a lot. And then, going back to the community, this community is all over the world. And for example, there is this program called ICON.
It's basically Israelis in the Valley who created a program for Israelis from Israel to come over, called Silicon Valley 101, to learn what's going on here. Because, with all the respect to the tech scene in Israel, here it's the real thing, right? So it's a non-profit organization by Israelis that moved here; it brings you over and then brings in people from a16z or Google or Navon or, like, amazing people from unicorns or up-and-coming startups or accelerators, and gives you up-to-date talks and also connects you to relevant people. And that's why I'm here, in addition to, you know, [00:58:30] meeting you and participating in this amazing podcast, et cetera.
Swyx: Yeah. Oh, well, I think there's a lot of exciting tech talent, you know, in Tel Aviv, and I'm glad that your office is Israeli.
Itamar: One thing I wanted to say: yeah, of course, because of what we said, security is a very strong scene. But actually water purification, agriculture tech, there's an awful lot of other things; usually it comes from necessity. Like, a big part of our country is desert. So there are other things. AI, by the way, is also big in Israel. For example, I think there's an Israeli competitor to OpenAI. I'm not saying it's as big, but it's AI21.
Swyx: Oh yeah, AI21. Really?
Itamar: Yeah. I think they're one of the ten most profound research labs, for example. I love their work.
Swyx: I think we should try to talk to one of them. But yeah, when you and I met, we connected a little bit over Singapore, you know; I was in the Singapore Army and you in the Israeli army. We do have a lot of connections between our countries: small countries that don't have a lot of natural resources and have to make do in the world by figuring out some other services. I think the Singapore startup scene has not done as well as the Israeli startup scene, so I'm very interested in how small countries can have a world impact, essentially.
Itamar: It's a question we're being asked a lot: why? For example, let's go to the soft skills. I think failing is not seen as a bad thing. Like, sometimes VCs prefer to put money on an entrepreneur that failed in his first startup and then succeeded, because now that person knows what it means to fail and is very hungry to succeed. [01:00:00] So generally, I think there are a few reasons; it's hard to put the finger on it exactly, but we talked about a few things. One other thing: failing is not, like... this is my fourth company. I did one as a teenager; it wasn't a startup, it was a company. And then I had my first startup, my second company, which had an amazing run but then a very beautiful collapse. And then my third company, my second startup, eventually exited successfully to Alibaba. So I think there's a lot of trial and error, which is appreciated, not suppressed. I guess that's one of the reasons.
Alessio: Wanna jump into the lightning round?
Swyx: Yes. I think we sent you the prep, but there's just three questions now. We've actually reduced it quite a bit.
Alessio: And we can read them, and you can take time to answer; you don't have to right away.
First question: what is something that already happened in AI that you thought would take much longer?
Itamar: Okay, so, I hope it doesn't sound arrogant, but I started coding AI BC, before ChatGPT. And I was going to VCs and VPs of R&D and directors, et cetera, and telling them, listen, we're gonna help with code logic testing, and we're going to do it in that interactive, conversational way. And they were like, no way. I even had two of them saying, "I won't let your silly AI get close to my code." [01:01:30] That was BC. AC, it's really different. And so we kind of saw it coming: if you played with GPT-3, especially 3.5, you felt it working really well with instructions and conversation. So having said that, I think OpenAI still did an amazing job building the product. Of course also building the model, but that's a given, they're the leaders. They did an amazing job building a product that's so accessible, and I think that was maybe a bit surprising. Many tried to do a chatbot or so with these GPTs, but since OpenAI are the ones developing these models, they probably felt, and I think that's what happened, that it wasn't being used correctly. So the fact that they actually built the product so well, that was maybe surprising for me. Again, I hope it doesn't sound too arrogant, but I don't feel like there was a step function here. We might reach an inflection point and things are gonna be really surprising, but that's, as we said, a different episode.
Swyx: When the agents take over. So what do you think is the most interesting unsolved question in AI? Like, what's an open question that you think, man, somebody should solve that?
Itamar: Okay, so here I am going to go with the obvious answer: AI alignment. It's a technical question, it's a philosophy question, et cetera. It's not easy. It raises so many questions, even about ourselves [01:03:00] as humans. I saw one tweet by someone that I've been thinking about for a while; he wrote, are we actually, in essence, like LLMs? So we've been trying to look into these LLMs for years. Back in 2014 there were already a few works on CNNs trying to visualize the feature detectors in the hidden layers. We've been trying to work on it for years, and lately, like a few days ago or so, even though it feels like a really long time ago, we saw work by OpenAI trying to look at different parts of the LLM and provide a natural language description for them. So I think this is very important and very interesting, tech-wise, philosophy-wise, et cetera, and it needs to be explored more.
Alessio: And just one takeaway for all the listeners: what's one message you want everyone to remember about AI?
Itamar: I would say, again, something that might be a bit obvious, but I think right now, as of this month, we're actually overestimating what gen AI can do, but we're underestimating what it can do in the future. Okay, so why am I saying that? Because if you're a builder, I really encourage you: speak less and do more. Play with it. Try it for specific use cases and see what's easy to do.
And then if your purpose is just incorporating this stuff, and that's what you wanna do, [01:04:30] then do it. But don't tell everyone you're gonna do it before you do it, because you might find that it's actually really hard and there are a lot of problems. It works amazing, it wows you for two examples, but then for eight other examples it works poorly. If you wanna build a startup, find that case where you believe you can build a solution around LLMs, or around what they're going to be in one or two years, because you want to try to predict that and what's challenging around it, and do that through trying, trying, trying. For example, if you're really excited about AutoGPT, try to find five different cases that you manage to make it work for. Again, you might find you can't. I think it will do a lot, and I think it was good that somebody brought these frameworks, and now we'll progress through the levels that I talked about before. So that's my message, really: if you think of an idea, first try it. It's easier than ever; there are so many tools to try. And that's one of the things that brought us to Codium: large language models as-is do not work for verifying code logic, but we see the path, how to combine them with other technical elements and how AI is going to evolve, so that we can actually bring to fruition this idea, this notion of the DRY concept that I mentioned.
Alessio: Well, Itamar, thank you so much for coming on. This was great.
Itamar: Thank you for inviting me. It was a pleasure.
[01:06:00] This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space
    25/5/2023
    1:02:36
  • MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML
    We are excited to be the first podcast in the world to release an in-depth interview on the new SOTA in commercially licensed open source models - MosiacML MPT-7B!The Latent Space crew will be at the NYC Lux AI Summit next week, and have two meetups in June. As usual, all events are on the Community page! We are also inviting beta testers for the upcoming AI for Engineers course. See you soon!One of GPT3’s biggest limitations is context length - you can only send it up to 4000 tokens (3k words, 6 pages) before it throws a hard error, requiring you to bring in LangChain and other retrieval techniques to process long documents and prompts. But MosaicML recently open sourced MPT-7B, the newest addition to their Foundation Series, with context length going up to 84,000 tokens (63k words, 126 pages):This transformer model, trained from scratch on 1 trillion tokens of text and code (compared to 300B for Pythia and OpenLLaMA, and 800B for StableLM), matches the quality of LLaMA-7B. It was trained on the MosaicML platform in 9.5 days on 440 GPUs with no human intervention, costing approximately $200,000. Unlike many open models, MPT-7B is licensed for commercial use and it’s optimized for fast training and inference through FlashAttention and FasterTransformer.They also released 3 finetuned models starting from the base MPT-7B: * MPT-7B-Instruct: finetuned on dolly_hhrlhf, a dataset built on top of dolly-5k (see our Dolly episode for more details). * MPT-7B-Chat: finetuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.* MPT-7B-StoryWriter-65k+: it was finetuned with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. While 65k is the advertised size, the team has gotten up to 84k tokens in response when running on a single node A100-80GB GPUs. ALiBi is the dark magic that makes this possible. Turns out The Great Gatsby is only about 68k tokens, so the team used the model to create new epilogues for it!On top of the model checkpoints, the team also open-sourced the entire codebase for pretraining, finetuning, and evaluating MPT via their new MosaicML LLM Foundry. The table we showed above was created using LLM Foundry in-context-learning eval framework itself!In this episode, we chatted with the leads of MPT-7B at Mosaic: Jonathan Frankle, Chief Scientist, and Abhinav Venigalla, Research Scientist who spearheaded the MPT-7B training run. We talked about some of the innovations they’ve brought into the training process to remove the need for 2am on-call PagerDutys, why the LLM dataset mix is such an important yet dark art, and why some of the traditional multiple-choice benchmarks might not be very helpful for the type of technology we are building.Show Notes* Introducing MPT-7B* Cerebras* Lottery Ticket Hypothesis* Hazy Research* ALiBi* Flash Attention* FasterTransformer* List of naughty words for C4 https://twitter.com/code_star/status/1661386844250963972* What is Sparsity?* Hungry Hungry Hippos* BF16 FPp.s. yes, MPT-7B really is codenamed LLongboi!Timestamps* Introductions [00:00:00]* Intro to Mosaic [00:03:20]* Training and Creating the Models [00:05:45]* Data Choices and the Importance of Repetition [00:08:45]* The Central Question: What Mix of Data Sets Should You Use? 
[00:10:00]* Evaluation Challenges of LLMs [0:13:00]* Flash Attention [00:16:00]* Fine-tuning for Creativity [00:19:50]* Open Source Licenses and Ethical Considerations [00:23:00]* Training Stability Enhancement [00:25:15]* Data Readiness & Training Preparation [00:30:00]* Dynamic Real-time Model Evaluation [00:34:00]* Open Science for Affordable AI Research [00:36:00]* The Open Approach [00:40:15]* The Future of Mosaic [00:44:11]* Speed and Efficiency [00:48:01]* Trends and Transformers [00:54:00]* Lightning Round and Closing [1:00:55]TranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space.Swyx: Hey, and today we have Jonathan and Abhi from Mosaic ML. Welcome to our studio.Jonathan: Guys thank you so much for having us. Thanks so much.Swyx: How's it feel?Jonathan: Honestly, I've been doing a lot of podcasts during the pandemic, and it has not been the same.Swyx: No, not the same actually. So you have on your bio that you're primarily based in Boston,Jonathan: New York. New York, yeah. My Twitter bio was a probability distribution over locations.Swyx: Exactly, exactly. So I DMd you because I was obviously very interested in MPT-7B and DMd you, I was like, for the 0.2% of the time that you're in San Francisco, can you come please come to a podcast studio and you're like, I'm there next week.Jonathan: Yeah, it worked out perfectly. Swyx: We're really lucky to have you, I'll read off a few intros that people should know about you and then you can fill in the blanks.So Jonathan, you did your BS and MS at Princeton in programming languages and then found your way into ML for your PhD at MiT where you made a real splash with the lottery ticket hypothesis in 2018, which people can check up on. I think you've done a few podcasts about it over the years, which has been highly influential, and we'll talk about sparse models at Mosaic. You have also had some side [00:01:30] quest. You taught programming for lawyers and you did some law and privacy stuff in, in DC and also did some cryptography stuff. Um, and you've been an assistant professor at Harvard before earning your PhD.Jonathan:  I've yet to start.Swyx: You, you yet to start. Okay. But you just got your PhD.Jonathan:. I technically just got my PhD. I was at Mosaic which delayed my defense by about two years. It was, I was at 99% done for two years. Got the job at Harvard, Mosaic started, and I had better things to do than write my dissertation for two years. Swyx: You know, you know, this is very out of order.Jonathan: Like, oh, completely out of order, completely backwards. Go talk to my advisor about that. He's also an advisor at Mosaic and has been from the beginning. And, you know, go talk to him about finishing on time.Swyx: Great, great, great. And just to fill it out, Abhi, you did your BS and MS and MIT, you were a researcher at Cerebras, and you're now a research scientist at Mosaic. Just before we go into Mosaic stuff, I'm actually very curious about Cereus and, uh, just that, that space in general. Um, what are they doing that people should know about?Abhinav: Yeah, absolutely. 
Um, I think the biggest thing about Cerebras is that they're really building, you know, kind of the next-gen computing platform beyond, like, GPUs. They're trying to build a system that uses an entire wafer, rather than cutting up a wafer into smaller chips, and trying to train a model on that entire system, or actually, more recently, on many such wafers. And it's really extraordinary. I think it's like the first time ever that kind of wafer-scale computing has ever really worked. So it's a really exciting time to be there, trying to figure out how we can map ML workloads to work on a much, much bigger chip.
Swyx: And do you use like [00:03:00] a different programming language or framework to do that? Or is that like..
Abhinav: Yeah, so I mean, things have changed a bit since I was there. I think you can actually run just normal TensorFlow and PyTorch on there. They've built a kind of software stack that compiles it down, so it actually just kind of works naturally. But yeah.
Jonathan: Compiled versions of Python is a hot topic at the moment with Mojo as well.
Swyx: And then Mosaic, you, you spearheaded the MPT-7B effort.
INTRO TO MOSAIC [00:03:20]
Abhinav: Uh, yeah. It's been maybe six months, 12 months in the making. We kind of started working on LLMs back in the summer of last year, and then we came out with this blog post where we profiled a lot of LLMs and saw, hey, the cost of training is actually a lot lower than what people might think. And then since then, being inspired by Meta's release of the LLaMA models and lots of other open source work, we started working towards, well, what if we were to release a really good 7 billion parameter model? And that's what MPT is.
Alessio: You know, we mentioned some of the podcasts you had done, Jonathan. I think in one of them you mentioned Mosaic was not planning on building a model and releasing it, and obviously you eventually did. So what are some of the things that got you there? Obviously LLaMA, you mentioned, was an inspiration. You now have both the training and, like, inference products that you offer. Was this more of a research challenge, in a way, that you wanted to do? Or how did the idea come to be?
Jonathan: I think there were a couple of things. So we still don't have a first-class model. We're not an OpenAI where, you know, businesses come to use our one great model. Our business is built around customers creating their own models. But at the end of the day, if customers are gonna create their own models, we have to have the tools to help them do that, and to have the tools to help them do that and know that they work, we have to create our own models to start. We have to know that we can do something great if customers are gonna do something great. And one too many people may have challenged me on Twitter about the fact that, you know, Mosaic claims all these amazing numbers, but, not to call out Ross Wightman here, I believe he said at some point, you know, show us the pudding. And so Ross, you know, please let me know how the pudding tastes. But in all seriousness, I think there is something here; this is a demo in some sense. This is to say we did this in 9.5 days, for a really reasonable cost, straight through, no human intervention, 200K. Yep.
Um, you can do this too.Swyx: Uh, and just to reference the numbers that you're putting out, this is the, the last year you were making a lot of noise for trading GPT 3 under 450 K, which is your, your initial estimate.Um, and then it went down to a 100 K and stable diffusion 160 k going down to less than 50 K as well.Jonathan: So I will be careful about that 100 K number. That's certainly the challenge I've given Abhi to hit. Oh, I wouldn't make the promise that we’ve hit yet, but you know, it's certainly a target that we have.And I, you know, Abhi may kill me for saying this. I don't think it's crazy. TRAINING AND CREATING THE MODELS [00:05:45] Swyx: So we definitely want to get into like estimation math, right? Like what, what needs to happen for those big order magnitude changes to in, in infrastructure costs. But, uh, let's kind of stick to the MPT-7B story. Yeah. Tell us everything.Like you have, uh, three different models. One of them. State of the art essentially on context length. Let's talk about the process of training them, the, uh, the decisions that you made. Um, I can go into, you know, individual details, but I just wanna let you let you rip.Abhinav: Yeah, so I mean, I think, uh, we started off with the base model, which is kind of for all practical purposes, a recreation of LLaMA 7B.Um, so it's a 7 billion perimeter model trained on the trillion tokens. Um, and our goal was like, you know, we should do it efficiently. We should be able to do it like, kind of hands free so we don't have to babysit the runs as they're doing them. And it could be kind of a, a launching point for these fine tune models and those fine tune models, you know, on, on the one hand they're kind of really fun for the community, like the story writer model, which has like a 65,000 length context window and you can even kind of extrapolate beyond that. Um, but they're, they're also kind of just tr inspirations really. So you could kind of start with an MPT-7B base and then build your own custom, you know, downstream. If you want a long context code model, you could do that with our platform. If you wanted one that was for a particular language, you could do that too.But yeah, so we picked kind of the three variance chat and instruct and story writer just kind of like inspirations looking at what people were doing in the community today. Yeah. Alessio: And what's the beginning of the math to come up with? You know, how many tokens you wanna turn it on? How many parameters do you want in a bottle? 7 billion and 30 billion seem to be kind of like two of the magic numbers going around right now. Abhinav: Yeah, definitely. Definitely. Yeah, I think like there's sort of these scaling laws which kind of tell you how to best spend your training compute if that's all you cared about. So if you wanna spend $200,000 exactly in the most efficient way, there'd be a recipe for doing that.Um, and that we usually go by the Chinchilla laws. Now for these models, we actually didn't quite do that because we wanted to make sure that people could actually run these at home and that they [00:07:30] were good for inference. So we trained them kind of beyond those chinchilla points so that we're almost over-training them.I think there's like a joke going on online that they're like long boy and that that came up internally because we were training them for really, really long durations. So that 7B model, the chinchilla point might be 140 billion tokens. 
Instead, we trained a trillion, so almost seven times longer than you normally would.Swyx: So longboi was the code name. So is it, is it the trading method? Is it the scaling law that you're trying to coin or is it the code name for the 64 billion?Jonathan: Uh, 64. It was just an internal joke for the, for training on way more tokens than you would via chinchilla. Okay. Um, we can coin it long boy and it, it really stuck, but just to, you know, long boys filled with two ELs at the beginning.Yeah. Cause you know, we wanted the lLLaMA thing in there as well. Jonathan: Yeah, yeah, yeah. Our darn CEO we have to rein him in that guy, you know, you can't, yeah. I'm gonna take away his Twitter password at some point. Um, but you know, he had to let that one out publicly. And then I believe there was a YouTube video where someone happened to see it mentioned before the model came out and called it the Long G boy or something like that.Like, so you know, now it's out there in the world. It's out there. It's like Sydnee can't put it back inSwyx: There's a beautiful picture which I think Naveen tweeted out, which, um, shows a long boy on a whiteboard.Jonathan: That was the origin of Long Boy. In fact, the legs of the lLLaMA were the two Ls and the long boy.DATA CHOICES AND THE IMPORTANCE OF REPETITION [00:08:45]Swyx: Well, talk to me about your data choices, right? Like this is your passion project. Like what can you tell us about it?Jonathan: Yeah, I think Abhi wanted to kill me by the end for trying to use all the GPUs on data and none of them on actually training the model. Um, at the end of the day, We know that you need to train these models and [00:09:00] lots of data, but there are a bunch of things we don't know.Number one is what kinds of different data sources matter. The other is how much does repetition really matter? And really kind of repetition can be broken down into how much does quality versus quantity matter. Suppose I had the world's best 10 billion tokens of data. Would it be better to train on that a hundred times or better to train on a trillion tokens of low quality, fresh data?And obviously there's, there's a middle point in between. That's probably the sweet spot. But how do you even know what good quality data is? And. So, yeah, this is, nobody knows, and I think the more time I spent, we have a whole data team, so me and several other people, the more time that we spent on this, you know, I came away thinking, gosh, we know nothing.Gosh, if I were back in academia right now, I would definitely go and, you know, write a paper about this because I have no idea what's going on.Swyx: You would write a paper about it. I'm interested in such a paper. I haven't come across any that exists. Could you frame the central question of such a paper?THE CENTRAL QUESTION: WHAT MIX OF DATA SETS SHOULD YOU USE? [00:10:00]Jonathan: Yeah. The central question is what mix of data sets should you use? Okay. Actually I've, you know, you had mentioned my law school stuff. I went back to Georgetown Law where I used to teach, um, in the midst of creating this model, and I actually sat down with a class of law students and asked them, I gave them our exact data sets, our data mixes, um, like how many tokens we had, and I said, Create the best data set for your model.Knowing they knew nothing about large language models, they just know that data goes in and it's going to affect the behavior. Um, and I was like, create a mix and they basically covered all the different trade-offs. 
Um, you probably want a lot of English language [00:10:30] text to start with. You get that from the web, but do you want it to be multilingual? If so, you're gonna have a lot less English text; maybe it'll be worse. Do you wanna have code in there? There are all these beliefs that code leads to models being better at logical reasoning, of which I've seen zero evidence. Replit, I mean, really made a great code model, but code being in the training set leading to better chain-of-thought reasoning on the language side? People claim this all the time, but I've still never seen any real evidence of it. You know, one of the generations of the GPT-3 model supposedly started from Code Davinci, yes, and so there's a belief that maybe that helped. But again, no evidence. You know, there's a belief that spending a lot of time on good sources like Wikipedia is good for the model. Again, no evidence. At the end of the day, we tried a bunch of different data mixes, and the answer was that there are some that are better or worse than others. We did find that The Pile, for example, was a really solid data mix, but there were stronger data mixes by our evaluation metrics. And I'll get back to the evaluation question in a minute cuz that's a really important one. This data set called C4, which is what the original T5 model was trained on, is weirdly good. When I posted on this on Twitter, Stella Biderman from EleutherAI mentioned this, and I think someone else mentioned it as well: C4 does really well in the metrics, and we have no idea why. We de-duplicated it against our evaluation set, so it's not like it memorized the data; it is just one web scrape from 2019. If you actually look at the T5 paper and see how it was pre-processed, it looks very silly. Mm-hmm. They removed anything that had the word JavaScript in it because they didn't want to get those "please enable JavaScript" [00:12:00] warnings. They removed anything with curly braces cuz they didn't wanna get JavaScript in it. They looked at this list of bad words and removed anything that had those bad words. If you actually look at the list of bad words, words like "gay" are on that list, so it is a very problematic list of words. But that was the cleaning that leads to a data set that seems to be unbeatable. So that to me says that we know nothing about data. We in fact used a data set called mC4 as well, which supposedly did the same pre-processing as C4, just on more web crawls. The English portion is much worse than C4, for reasons that completely escape us. So in the midst of all that, basically I set two criteria. One was I wanted to be at least as good as mC4 English; like, make sure that we're not making things actively worse. And mC4 English is a nice step up over other stuff that's out there. And two was to go all in on diversity after that, making sure that we had some code, we had some scientific papers, we had Wikipedia, because people are gonna use this model for all sorts of different purposes. But I think the most important thing, and I'm guessing Abhi had a million opinions on this, is you're only as good as your evaluation. And we don't know how to evaluate models for the kind of generation we ask them to do. So past a certain point, you have to kinda shrug and say, well, my evaluation's not even measuring what I care about. Mm-hmm. So let me just make reasonable choices.
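For readers curious what that kind of heuristic cleaning looks like in code, here is a simplified sketch in the spirit of the C4 preprocessing described above: dropping lines that mention JavaScript and rejecting documents that contain curly braces or blocklisted words. It is an illustration only, not the actual T5/C4 pipeline, and the blocklist here is a placeholder.

```python
# Simplified sketch of C4-style heuristic cleaning (not the actual T5 code).
BAD_WORDS = {"exampleblockedword"}  # the real pipeline used a long published blocklist

def clean_document(text: str) -> str | None:
    """Return a cleaned document, or None if the heuristics reject it outright."""
    if "{" in text or "}" in text:           # curly braces: assumed to be code or JS
        return None
    kept_lines = []
    for line in text.splitlines():
        lowered = line.lower()
        if "javascript" in lowered:          # drop "please enable JavaScript" boilerplate
            continue
        if any(word in lowered for word in BAD_WORDS):
            return None                      # a single blocklisted word drops the document
        kept_lines.append(line)
    return "\n".join(kept_lines) if kept_lines else None
```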
EVALUATION CHALLENGES OF LLMs [00:13:00]
Swyx: So you're saying MMLU, BIG-bench, that kind of stuff is not convincing for you?
Jonathan: A lot of this stuff is, you've got two kinds of tasks. Some of these are more of multiple choice style tasks where there is a right answer. Um, either you ask the model to spit out A, B, C, or D, or, you know, if you're more [00:13:30] sophisticated, you look at the perplexity of each possible answer and pick the one that the model is most likely to generate. But we don't ask these models to do multiple choice questions. We ask them to do open-ended generation. There are also open-ended generation tasks like summarization. You compare using things like a BLEU score or a ROUGE score, which are known to be very bad ways of comparing text. At the end of the day, there are a lot of great summaries of a paper. There are a lot of great ways to do open form generation, and so humans are, to some extent, the gold standard. Humans are very expensive. It turns out we can't put them into our eval pipeline and just have the humans look at our model every, you know, 10 minutes? Not yet. Not yet. Maybe soon. Um, are you volunteering, Abhi?
Abhinav: I just know we have a great eval team who's, uh, who's helping us build new metrics. So if they're listening -
Jonathan: But, you know, evaluation of large language models is incredibly hard, and I don't think any of these metrics really truly capture what we expect from the models in practice.
Swyx: Yeah. And we might draw wrong conclusions. There's been a debate recently about the emergence phenomenon, whether or not it's a mirage, right? I don't know if you guys have opinions about that.
Abhinav: Yeah, I think I've seen, like, this paper, and all in all, even just kind of plots from different people where, like, well, maybe it's just an artifact of, like, log scaling, or metrics, or, you know, we're measuring accuracy, which is this very harsh zero-one thing, yeah, rather than kind of something more continuous. But yeah, similar to what Jonathan was saying about evals, there's one issue of just, like, our diversity of eval metrics. Like, when we put these models up, even the chat ones, the instruct ones, people are using 'em for such a variety of tasks. There's just almost no way we get ahead of time, like, measuring individual dimensions. And then also, particularly, you know, at the 7B scale, [00:15:00] um, these models still are not super great yet at the really hard tasks, like some of the hardest tasks in MMLU and stuff. So sometimes they're barely scoring above random chance, you know, on really, really hard tasks. So potentially as we, you know, aim for higher and higher quality models, some of these things will be more useful to us. But we kind of had to develop MPT-7B flying a little bit blind on what we knew was coming out, and just going off of, like, you know, a small set of common sense reasoning tasks. And of course, you know, just comparing those metrics versus other open source models.
Alessio: I think fast training and inference was like one of the goals, right? So there's always the trade-off between doing the hardest thing and, like, doing all the other things quickly.
Abhinav: Yeah, absolutely.
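A minimal sketch of the perplexity-style multiple-choice scoring Jonathan mentions above, assuming a Hugging Face causal LM. The model name is a placeholder, and the prefix-tokenization assumption is approximate:

```python
# Score each candidate answer by the model's average log-likelihood of the
# answer tokens given the question, and pick the most likely one (lowest
# perplexity). Uses "gpt2" purely as a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(question: str, answer: str) -> float:
    # Approximation: assumes the question's tokens are a prefix of the joint
    # tokenization, which is usually (not always) true for BPE tokenizers.
    q_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    answer_lp = token_lp[:, q_len - 1:]              # only the answer span
    return answer_lp.mean().item()                   # higher = lower perplexity

def pick_choice(question: str, choices: list[str]) -> str:
    return max(choices, key=lambda c: answer_logprob(question, c))
```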
Yeah, I mean, I think, like, you know, even at the 7B scale, you know, uh, people are trying to run these things on CPUs at home. You know, people are trying to port these to their phones, basically prioritizing the fact that the small scale would lead to broad adoption. That was like a big, um, big thing going on.
Alessio: Yeah, and you mentioned, um, FlashAttention and FasterTransformer as like two of the core things. Can you maybe explain some of the benefits and maybe why other models don't use it?

FLASH ATTENTION [00:16:00]
Abhinav: Yeah, absolutely. So FlashAttention is this basically faster implementation of full attention. Um, it's like a mathematical equivalent developed by, like, actually some of our collaborators, uh, at Stanford. Uh, the Hazy Research. Hazy Research, yeah, exactly.
Jonathan: What's the name Hazy Research mean?
Abhinav: I actually have no idea.
Swyx: I have no clue. All these labs have fun names. I always like the stories behind them.
Abhinav: Yeah, absolutely. We really, really liked FlashAttention. We, I think, had it integrated into our repo even as [00:16:30] early as September of last year. And it really just helps, you know, with training speed and also inference speed, and we kind of bake that into the model architecture. And this is kind of unique amongst all the other Hugging Face models you see out there. So with ours, you can actually toggle between normal torch attention, which will work anywhere, and FlashAttention, which will work on GPUs, right out of the box. And that way I think you get almost like a 2x speed up at training time and somewhere between, like, 50% to a hundred percent speed up at inference time as well. So again, this is just, like, we really, really wanted people to use these and, like, feel like an improvement, and we have the team to help deliver that.
Swyx: Another part, um, of your choices was ALiBi position encodings, which people are very interested in. Maybe a lot of people just, uh, sort of take position encodings as a given, but there's actually a lot of active research, and honestly, it's very opaque as well. Like, people don't know how to evaluate encodings, including position encodings. But could you explain, um, ALiBi and, um, your choice?
Abhinav: Yeah, for sure. The ALiBi and, uh, kind of FlashAttention thing all kind of goes together in interesting ways, and even with training stability too. What ALiBi does really is that it eliminates the need to have positional embeddings in your model. Where previously, if you're at token position one, you have a particular embedding that you add, and you can't really go beyond your max position, which usually is, like, about 2000. With ALiBi, they get rid of that. Instead, they just add a bias to the attention map itself. That's kind of like this slope. And if at inference time you wanna go much, much larger, they just kind of stretch that slope out to a longer, longer number of positions. And because the slope is kind of continuous and you can interpret it, it all works out. Now, one of [00:18:00] the funny things we found is, like, with FlashAttention, it saved so much memory and, like, improved performance so much that even as early as, kind of, last year, we were profiling models with very long context lengths, up to, like, you know, the 65k that you've seen in the release. We just never really got around to using it cuz we didn't really know what we might use it for. And also it's very hard to train stably.
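A simplified sketch of the ALiBi bias Abhinav describes: a per-head linear penalty added to the attention scores instead of positional embeddings. The slope schedule here is the simple power-of-two version (exact for 8 heads); the ALiBi paper generalizes it to other head counts:

```python
# ALiBi in miniature: build a (heads, seq, seq) bias where distant key
# positions get a larger negative penalty, with a different slope per head.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # slopes 1/2, 1/4, 1/8, ... one per head (simplified schedule)
    slopes = torch.tensor([2.0 ** -(i + 1) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()   # rel[i, j] = j - i, <= 0 for past keys
    return slopes.view(-1, 1, 1) * rel            # shape: (n_heads, seq_len, seq_len)

# At attention time, per head h (before the causal mask and softmax):
#   scores = q @ k.transpose(-2, -1) / sqrt(head_dim) + alibi_bias(H, n)[h]
```

Because the bias is just a linear function of distance, stretching it to longer sequences at inference time is what lets the context length extrapolate.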
So we started experimenting with ALiBi integration, and then we suddenly found that, oh wow, stability improves dramatically, and now we can actually work with ALiBi at long context lengths. That's how we got to, like, our StoryWriter model, where we can stably train these models out to very, very long context lengths and use them performantly.
Jonathan: Yeah.
Swyx: And it's also why you don't have a firm number. Most people now have a firm number on the context length. Now you're just like, eh, 65 to 85.
Abhinav: Oh yeah, there's, there's a big, uh, debate: is it 64k or 65k? 65k plus.
Swyx: Just do powers of two. So 64, you know.
Jonathan: Right, right. Yeah. Yeah. But we could, I mean, technically the context length is infinite. If you give me enough memory, um, you know, we can just keep going forever. We had a debate over what number to say is the longest that we could handle. We picked 84k. It's the longest I expect people to see easily in practice. But, you know, we played around for even longer than that, and I don't see why we couldn't go longer.
Swyx: Yeah. Um, and so for those who haven't read the blog post, you put The Great Gatsby in there and, uh, asked it to write an epilogue, which seemed pretty impressive.
Jonathan: Yeah. There are a bunch of epilogues floating around internally at Mosaic. Yeah. That wasn't my favorite. I think we all have our own favorites. Yeah. But there are a bunch of really, really good ones. There was one where, you know, it's Gatsby's funeral and then Nick starts talking to Gatsby's ghost, and Gatsby's father shows up, and, you know, then he's [00:19:30] at the police station with Tom. It was very plot heavy, like this is what comes next. And a bunch of them were just very Fitzgerald-esque, like, you know, beautiful writing. Um, but it was cool to just see that, wow, the model seemed to actually be working with, you know, all this input. Yeah, yeah. Like, it's exciting. You can think of a lot of things you could do with that kind of context length.

FINE-TUNING FOR CREATIVITY [00:19:50]
Swyx: Is there a trick to fine-tuning for a creative task rather than, um, a factual task?
Jonathan: I don't know what that is, but probably, yeah. I think, you know, the person, um, Alex, who did this, he did fine-tune the model explicitly on books. The goal was to try to get a model that was really a story writer. But, you know, beyond that, I'm not entirely sure. Actually, it's a great question. Well, no, I'll ask you back. How would you measure that?
Swyx: Uh, God, human feedback is the solve to all things. Um, I think there is a labeling question, right? Uh, in computer vision, we had a really, really good episode with Roboflow on the Segment Anything Model, where you actually start with human feedback on, like, I think it's something like 0.5% of the overall final labels that you had. But then you sort of augment them and then you fully automate them, um, which I think could be applied to text. It seems intuitive, and probably people like Snorkel have already raced ahead on this stuff, but I just haven't seen this applied in the language domain yet.
Jonathan: I mean, there are a lot of things that seem like they make a lot of sense in machine learning that never work, and a lot of things that make zero sense that seem to work. So, you know, I've given up trying to even predict. Yeah, yeah. Until I see the data or try it, I just kind of shrug my shoulders and, you know, you hope for the best. Bring data or else, right?
Yeah, [00:21:00] exactly. Yeah, yeah, yeah.
Alessio: The fine-tuning on books: Books3 is like one of the big data sets, and there was the whole Twitter thing about it. And, like, you know, I used to be a community moderator at Genius, and we ran into a lot of this: well, if you're explaining lyrics, do you have the right to redistribute the lyrics? I know you ended up changing the license on the model from commercial use permitted.
Swyx: Yeah, let's, let's... I'm not sure they did.
Jonathan: So we flipped it for about a couple hours.
Swyx: Um, okay. Can we introduce the story from the start, just for people who are out of the loop?
Jonathan: Yeah. So I can tell the story very simply. So, you know, the Books3 data set does contain a lot of books. And it is, you know, as I discovered, um, it is a data set that provokes very strong feelings from a lot of folks. Um, that was one guy, one person in particular, in fact. Um, and that's about it. But it turns out one person who wants a lot of attention can, you know, get enough attention that we're talking about it now. And so we had a discussion internally after that conversation, and we talked about flipping the license, and, you know, very late at night I thought, you know, maybe it's a good thing to do. And then decided, you know, actually probably better to just, you know, stand pat; the license is still Apache 2.0. And one of the conversations we had was kind of, we hadn't thought about this cuz we had our heads down, but the Hollywood writers' strike took place basically the moment we released the model. Mm-hmm. Um, we were releasing a model that could do AI generated creative content, and that is one of the big sticking points during the strike. Oh, the optics are not good. So the optics aren't good, and that's not what we want to convey. This is really a demo of the ability to do really long sequence lengths, and, boy, you know, [00:22:30] that's not timing that we appreciated. And so we talked a lot internally that night about, like, oh, we've had time to read the news, we've had time to take a breath, we don't really love this. We came to the conclusion that it's better to just leave it as it is now and learn the lesson for the future. But certainly that was one of my takeaways: there's a societal context around this that it's easy to forget when you're in the trenches just trying to get the model to train. And, you know, in hindsight, I might've gone with a different thing than a story writer. I might've gone with, you know, a coder, because we seem to have no problem putting programmers out of work with these models.
Swyx: Oh yeah. Please, please, you know, take away this stuff from me.

OPEN SOURCE LICENSES AND ETHICAL CONSIDERATIONS [00:23:00]
Jonathan: Right. You know, so it's, I think, you know, really, the copyright concerns I leave to the lawyers. Um, if I learned one thing teaching at a law school, it was that I'm not a lawyer, and all this stuff is a little complicated, especially since open source licenses were not designed for this kind of world. They were designed for a world of forcing people to be more open, not forcing people to be more closed. And I think, you know, that was part of the impetus here, was to try to use licenses to make things more closed. Um, which is, I think, against the grain of the open source ethos.
So that struck me as a little bit strange, but I think the most important part is, you know, we wanna be thoughtful and we wanna do the right thing. And in that case, you know, I hope with all that interesting licensing fun you saw, we're trying to be really thoughtful about this, and it's hard. I learned a lot from that experience.
Swyx: There's also, I think, an open question of fair use, right? Is training on words fair use? Because you don't have a monopoly on words, but certain arrangements of words you do. And who is to say how much is memorization by a model versus actually learning and internalizing, and then sometimes happening to land at the [00:24:00] same result?
Jonathan: And if I've learned one lesson, I'm not gonna be the person to answer that question. Right, exactly. And so my position is, you know, we will try to make this stuff open and available. Yeah. And, you know, let the community make decisions about what they are or aren't comfortable using. Um, and at the end of the day, you know, it still strikes me as a little bit weird that someone is trying to use these open source licenses to, you know, close the ecosystem and not to make things more open. That's very much against the ethos of why these licenses were created.
Swyx: So the official Mosaic position, I guess, is like, before you use MPT-7B for anything commercial, check with your own lawyers, don't trust Mosaic's lawyers.
Jonathan: Yeah, okay. Yeah. You know, our lawyers are not your lawyers. Exactly. And, you know, make the best decision for yourself. We've tried to be respectful of the content creators, and, you know, at the end of the day, this is complicated. And this is new law. It's new law that hasn't been established yet. Um, but it's a place where we're gonna continue to try to do the right thing. Um, and, I think, one of the commenters, you know, I really appreciated this, said, well, they're trying to do the right thing, but nobody knows what the right thing is to even do. You know, I guess the most right thing would've been to literally not release a model at all, but I don't think that would've been the best thing for the community either.
Swyx: Cool. Well, thanks. Well handled. Uh, we had to cover it, just cause -
Jonathan: Oh, yes, no worries. A big piece of news. It's been on my mind a lot.

TRAINING STABILITY ENHANCEMENT [00:25:15]
Swyx: Yeah. Yeah. Well, you've been very thoughtful about it. Okay. So a lot of these other ideas in terms of architecture, FlashAttention, ALiBi, and the other data sets, were contributions from the rest of the, let's just call it open community of machine learning advancements. Uh, but Mosaic in [00:25:30] particular had some stability improvements to mitigate loss spikes, quote unquote, uh, which I took to mean your existing set of tools. Uh, maybe we just kind of covered that, I don't wanna put words in your mouth, but when you say things like, uh, "please enjoy my empty logbook," how much of an oversell is that? How much is that marketing versus how much is that reality?
Abhinav: Oh yeah. That one's real. Yeah. It's like fully end-to-end. Um, and I think -
Swyx: So maybe, like, what specific features of MosaicML?
Abhinav: Totally, totally. Yeah. I think I'll break it into two parts. One is, like, training stability, right? Knowing that your model's gonna basically get to the end of the training without loss spikes.
Um, and I think, you know, at the 7B scale, you know, for some models it's not that big of a deal. As you train for longer and longer durations, we found that it's trickier and trickier to avoid these loss spikes. And so we actually spent a long time figuring out, you know, what can we do about our initialization, about our optimizers, about the architecture, that basically prevents these loss spikes. And, you know, even in our training run, if you zoom in, you'll see small intermittent spikes, but they recover within a few hundred steps. And so that's kind of the magical bit. Our line one of defense is we recover from loss spikes, like, just naturally, right? Mm-hmm. Our line two of defense was that we used determinism and basically really smart resumption strategies, so that if something catastrophic happened, we can resume very quickly, like a few batches before, and apply some of these, uh, interventions. So we had these kinds of preparations, like a plan B, but we didn't have to use them at all for MPT-7B training. So that was kind of like a lucky break. And the third part of basically getting all the way to the empty logbook is having the right training infrastructure. [00:27:00] So this is basically one of the big selling points of the platform: when you try to train these models on hundreds of GPUs, not many people outside, you know, like, deep industry research know this, but the GPUs fail, like, a lot. Um, I would say, like, almost once every thousand A100-days. So for us, on a big 512-GPU cluster, every two days, basically, the run will fail. Um, and this is either due to GPUs, like, falling off the bus, like, that's a real error we see, or kind of networking failures or something like that. And so in those situations, what people have normally done is they'll have an on-call team that's just sitting round the clock, 24/7, on Slack for when something goes wrong, and then they'll basically try to inspect the cluster, take nodes out that are broken, restart it, and it's a huge pain. Like, we ourselves did this for a few months. And as a result of that, because we're building such a platform, we basically step by step automated every single one of those processes. So now when a run fails, we have this automatic kind of watchdog that's watching. It'll basically stop the job, test the nodes, cordon any ones that are broken, and relaunch it. And because our software's all deterministic and has fast resumption stuff, it just continues on gracefully. So within that log you can see, sometimes, I think, maybe at like 2:00 AM or something, the run failed, and within a few minutes it's back up and running, and all of us are just sleeping peacefully.
Jonathan: I do wanna say that was hard won. Mm-hmm. Um, certainly this is not how things were going, you know, many months ago. Hardware failures: we had on-calls who were, you know, getting up at two in the morning to, you know, figure out which node had died for what reason, restart the job, have to cordon the node. [00:28:30] Um, we were seeing catastrophic loss spikes really frequently, even at the 7B scale, that were just completely derailing runs. And so this was step by step just ratcheting our way there, as Abhi said, to the point where many models are training at the moment and I'm sitting here in the studio and not worrying one bit about whether the runs are gonna continue. Yeah.
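A generic sketch of the kind of state that the deterministic, fast resumption Abhinav describes has to capture. This is plain PyTorch, not Composer's actual API:

```python
# To replay a run exactly from "a few batches before" a failure, a checkpoint
# needs model weights, optimizer state, the position in the data stream, and
# the RNG state, all saved together.
import torch

def save_state(path, step, model, optimizer, sampler_state):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "sampler": sampler_state,                 # where we are in the data stream
        "rng": {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all(),
        },
    }, path)

def load_state(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"], ckpt["sampler"]
```

With all of that restored, and a deterministic data order, the resumed run reproduces the original step for step, which is what lets an automated watchdog relaunch a failed job without a human in the loop.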
Swyx: I'm not so much of a data center hardware kind of guy, but isn't there existing software to do this for CPUs, and, like, what's different about this domain? Does this question make sense at all?
Jonathan: Yeah, so when I think about this, I think back to all the Google fault tolerance papers I read, you know, as an undergrad or grad student, mm-hmm, about, you know, building distributed systems. A lot of it is that, you know, each CPU is doing, say, an individual unit of work. You've got a database that's distributed across your cluster. You wanna make sure that one CPU failing, or one machine failing, can't, you know, delete data. So you replicate it. You know, you have protocols like Paxos where you've literally got state machines that are replicated with, you know, leaders and backups and things like that. And in this case, you're performing one giant computation where you cannot afford to lose any node. If you lose a node, you lose model state. If you lose a node, you can't continue. It may be that in the future we actually, you know, create new versions of a lot of our distributed training libraries that do have backups and where data is replicated, so that if you lose a node, you can detect what node you've lost and just continue training without having to stop the run, you know, pull from a checkpoint, restart again on different hardware. But for now, we're certainly in a world where if anything dies, that's the end of the run and you have to go back and recover from it. [00:30:00]

DATA READINESS & TRAINING PREPARATION [00:30:00]
Abhinav: Yeah. Like, I think a big part, a big word there is, like, synchronous data parallelism, right? So, like, we're basically saying that on every step, every GPU is gonna do some work. They're gonna stay in sync with each other and average their gradients and continue. Now, there are algorithmic techniques to get around this. Like, you could say, oh, if a GPU dies, just forget about it; all the data it was gonna see, we'll just forget about it, we're not gonna train on it. But we don't like to do that currently because, um, it makes us give up determinism, stuff like that. Maybe in the future, as you go to extreme scales, we'll start looking at some of those methods. But at the current time it's like, we want determinism. We wanted to have a run that we could perfectly replicate if we needed to. And the goal is to figure out how to run it on a big cluster without humans having to babysit it. Babysit it.
Alessio: So as you mentioned, these models are kind of the starting point for a lot of your customers. You have an inference product, you have a training product. You previously had a Composer product that is now kind of, not rolled into, but you have like a superset of it, which is the LLM Foundry. How are you seeing that change, you know, like, from the usual MLOps stack and, like, how people trained things before, versus now they're starting from, you know, one of these MPT models and coming from there? Like, what should teams think about as they come to you and start their journey?
Jonathan: So I think there's a key distinction to make here, which is, you know, when you say starting from MPT models, you can mean two things. One is actually starting from one of our checkpoints, which I think very few of our customers are actually going to do, and one is starting from our configuration.
You can look at our friends at Replit for that, where, you know, MPT was in progress when Replit [00:31:30] came to us and said, hey, we need a 3 billion parameter model by next week on all of our data. We're like, well, here you go, this is what we're doing, and if it's good enough for us, um, hopefully it's good enough for you. And that's basically the message we wanna send to our customers. MPT is basically clearing a path all the way through, where they know that they can come bring their data, they can use our training infrastructure, they can use all of our amazing orchestration and other tools that Abhi just mentioned for fault tolerance. They can use Composer, which is, you know, still at the heart of our stack. And then the LLM Foundry is really the specific model configuration. They can come in and they know that thing is gonna train well, because we've already done it multiple times.
Swyx: Let's dig in a little bit more on what should people have ready before they come talk to you? So data, architecture, evals that they're looking at, et cetera.
Abhinav: Yeah, I mean, I think we'll accept customers at any kind of stage in their pipeline. You know, like, I'd say there's archetypes of people who have built products around, like, some of these API companies and reach a stage or maturity level where it's like, we want our own custom models now, either for the purpose of reducing cost, right, like, our inference service is quite a bit cheaper than using APIs, or because they want some kind of customization that you can't really get from the other API providers. I'd say the most important things to have before training a big model: you know, you wanna have good eval metrics, you know, some kind of score that you can track as you're training your models and scaling up, that can tell you you're progressing. And it's really funny, like, a lot of times customers will be really excited about training the models, right? It's really fun to, like, launch jobs on hundreds of GPUs. It's super fun. But then they'll be like, but wait, what are we gonna measure? Not just the training loss, right? I mean, it's gotta be more than that. [00:33:00] So eval metrics is, like, a good pre-req. Also, you know, your data: either coming with your own pre-training or fine-tune data and having, like, a strategy to clean it, or we can help clean it too. I think we're building a lot of tooling around that. And I think once you have those two kinds of inputs and sort of the budget that you want, we can pretty much walk you through the rest of it, right? Like, that's kind of what we do. Recently we helped build CRFM's model for biomedical language a while back.
Jonathan: That's the Center for Research on Foundation Models.
Abhi: Exactly, exactly.
Jonathan: Spelling it out for people. Of course.
Abhinav: No, absolutely. Yeah, yeah. No, you've done more of these than I have. Um, I think, uh, basically it's sort of, we can help you figure out what model I should train to scale up, so that when I go for my big hero run, it's, uh, it's predictable. You can feel confident that it's gonna work, and you'll kind of know what quality you're gonna get out before you have to spend, like, a few hundred thousand dollars.

DYNAMIC REAL-TIME MODEL EVALUATION [00:34:00]
Alessio: Reza from Replit was on the podcast last week, and, uh, they had HumanEval and then, uh, AmjadEval, which is like vibe based.
Jonathan: And I do think the vibe based eval cannot be, you know, underrated. Really, at the end of the day, we did stop our models and do vibe checks, and, as we monitor our models, one of our evals was we just had a bunch of prompts and we would watch the answers as the model trained and see if they changed. Cuz honestly, you know, I don't really believe in any of these eval metrics to capture what we care about. Mm-hmm. But when you ask it, uh, you know, I think one of our prompts was to suggest games for a three-year-old and a seven-year-old that would be fun to play. Like, that was a lot more [00:34:30] valuable to me personally, to see how that answer evolved and changed over the course of training. So, you know, and HumanEval, just to clarify for folks, HumanEval is an automated evaluation metric. There's no humans in it at all. There's no humans in it at all. It's really badly named. I got so confused the first time that someone brought that to me and I was like, no, we're not bringing humans in. It's like, no, it's automated. They just called it a bad name, and there's only a hundred-some problems in it or something.
Abhinav: Yeah. Yeah. And it's for code specifically, right?
Jonathan: Yeah. Yeah. It's very weird. It's a weird, confusing name that I hate, but, you know, when other metrics are called HellaSwag, like, you know, you just gotta roll with it at this point.
Swyx: You're doing live evals now. So one of the tweets that I saw from you was that it is, uh, important that you do it parallelized. Uh, maybe you kind of wanna explain, uh, what you guys did.
Abhinav: Yeah, for sure. So with LLM Foundry, there's many pieces to it. There's obviously the core training piece, but there's also, you know, tools for evaluation of models. And we've kind of had one of the, I think it's, like, the fastest, like, evaluation frameworks. Um, basically it's multi-GPU compatible, it runs with Composer, it can support really, really big models. So basically our framework runs so fast that even as our models are training, we can run these metrics live during the training. So, like, if you have a dashboard, like Weights and Biases, you kind of watch all these eval metrics. We have, like, 15 or 20 of them, honestly, that we track during the run, and they add negligible overhead. So we can actually watch as our models go and feel confident. Like, it's not like we wait until the very last day to test if the model's good or not.
Jonathan: That's amazing. Yeah. I love that we've gotten this far into the conversation and we still haven't talked about efficiency and speed. Those are usually our two watchwords at Mosaic, which is, you know, that's great, that says that we're [00:36:00] doing a lot of other cool stuff, but at the end of the day, um, you know, cost comes first. If you can't afford it, it doesn't matter. And so, you know, getting things down cheap enough that, you know, we can monitor in real time, getting things down cheap enough that we can even do it in the first place, that's the basis for everything we do.

OPEN SCIENCE FOR AFFORDABLE AI RESEARCH [00:36:00]
Alessio: Do you think a lot of the questions that we have around, you know, what data sets we should use and things like that are just because training was so expensive before, that we just haven't run enough experiments to figure that out?
And is that one of your goals, trying to make it cheaper so that we can actually get the answers?
Jonathan: Yeah, that's a big part of my personal conviction for being here. I think I'm still, in my heart, the second-year grad student who was jealous of all his friends who had GPUs when he didn't, and I couldn't train any models except on my laptop. And, I mean, the lottery ticket experiments began on my laptop; I had to beg for one K80 so that I could run MNIST. And I'm still that person deep down in my heart. And I'm a believer that, you know, if we wanna do science and really understand these systems and understand how to make them work well, understand how they behave, understand what makes them safe and reliable, we need to make it cheap enough that we can actually do science, and science involves running dozens of experiments. When I finally, you know, cleaned out my GCS bucket from my PhD, I deleted a million model checkpoints. I'm not kidding. There were over a million model checkpoints. That is the kind of science we need; you know, that's just what it takes. In the same way that if you're in a biology lab, you don't just grow one cell and say, like, eh, the drug seems to work on that cell. Like, there's a lot more science you have to do before you really know.
Abhinav: Yeah. And I think one of the special things about Mosaic's kind of [00:37:30] position as well is that we have so many customers all trying to train models that basically we have the incentive to devote all these resources and time to do this science. Because when we learn which pieces actually work, which ones don't, we get to help many, many people, right? And so that kind of aggregation process, I think, is really important for us. I remember way back there was a paper from Google that basically investigated batch sizes or something like that. And it was this paper that must have cost a few million dollars during all the experiments. And it was just like, wow, what a benefit to the whole community. Now we all get to learn from that and we get to save; we don't have to spend those millions of dollars anymore. So I think, um, kind of Mosaic's science, like the insights we get on data, on pre-training, on architecture, on all these different things, um, that's why customers come to us.
Swyx: Yeah, you guys did some really good stuff on PubMedGPT as well. That's the first time I heard of you. And that's also published to the community.
Abhinav: Yeah, that one was really fun. We were like, well, no one's really trained, like, fully from scratch domain-specific models before. Like, what if we just did a biomed one? Would it still work? And, uh, yeah, we were really excited that it did. Um, we'll probably have some follow up soon, I think, later this summer.
Jonathan: Yeah. Yes. Stay tuned on that. Um, but I will say, just in general, it's a really important value for us to be open in some sense. We have no incentive not to be open. You know, we make our money off of helping people train better. There's no cost to us in sharing what we learn with the community. Cuz really, at the end of the day, we make our money off of those custom models and great infrastructure and putting all the pieces together. That's honestly where the Mosaic name came from.
Not off of, like, oh, we've got, you know, this one cool secret trick [00:39:00] that we won't tell you, or, you know, closing up. I sometimes, you know, in the past couple weeks I've talked to my friends at places like Brain, or, you know, what used to be Brain, now Google DeepMind. Oh, RIP Brain. Yeah, RIP Brain. I spent a lot of time there and it was really a formative time for me. Um, so I miss it, but, you know, I kind of feel like we're one of the biggest open research labs left in industry, which is a very sad state of affairs, because we're not very big. Um, but at least, can you say how big the team is actually? Yeah. We're about 15 researchers, so we're tiny compared to, you know, the huge army of researchers I remember at Brain or at FAIR or at DeepMind back, you know, when I was there during their heydays. Um, you know, but everybody else is kind of, you know, closed up and isn't saying very much anymore. Yeah. And we're gonna keep talking and we're gonna keep sharing, and, you know, we will try to be that vanguard to the best of our ability. We're very small, and I can't promise we're gonna do what those labs used to do in terms of scale or quantity of research, but we will share what we learn and we will try to create resources for the community. Um, I dunno, I just believe in openness fundamentally. I'm an academic at heart and it's sad to me to watch that go away from a lot of the big labs.

THE OPEN APPROACH [00:40:15]
Alessio: We just had a live pod about the, you know, "no moat" post that came out, and it was one of the first times I really dove into LoRA and some of these new technologies. Like, how are you thinking about what it's gonna take for, like, the open approach to really work? Obviously today, GPT-4 is still, you know, kind of the state-of-the-art model for a [00:40:30] lot of tasks. Do you think some of the innovation and kind of current methods that we have today are enough, if enough people like you guys are running these research groups that are open? Or do you think we still need a step function improvement there?
Jonathan: I think one important point here is the idea of coexistence. I think when you look at, I don't know, who won, Linux or Windows? The answer is yes. Microsoft bought GitHub and has a Windows Subsystem for Linux. Linux runs a huge number of our servers, and Microsoft is still a wildly profitable company, probably the most successful tech company right now. So who won, open source or closed source? Yes. Um, and I think that's a similar world that we're gonna be in here, where, you know, it's gonna be different things for different purposes. I would not run Linux on my laptop personally, cuz I like connecting to wifi and printing things. But I wouldn't run Windows on one of my servers. And so I do think what we're seeing with a lot of our customers is, do they choose OpenAI or Mosaic? Yes. There's a purpose for each of these. You have to send your data off to somebody else with OpenAI's models. That's a risk. GPT-4 is amazing, and I would never promise someone that if they come to Mosaic, they're gonna get a GPT-4 quality model. That's way beyond our means and not what we're trying to do anyway. But there's also a whole world for, you know, domain specific models, context specific models that are really specialized, proprietary, trained on your own data, that can do things that you could never do with one of these big models. You can customize in crazy ways, like, GPT-4 is not gonna hit 65k context length for a very long time, cuz they've already trained that [00:42:00] model and, you know, they haven't even released the 32k version yet. So we can do things differently, you know, by being flexible. So I think the answer to all this is yes. But we can't see the open source ecosystem disappear, and that's the scariest thing for me. I hear a lot of talk in academia about, you know, whatever happened to that academic research on this field called information retrieval? Well, in 1999 it disappeared. Why? Because Google came along, and who cares about information retrieval research when, you know, you have a Google-scale, you know, web-scale database? So, you know, there's a balance here. We need to have both.
Swyx: I wanna applaud you. We'll maybe edit in a little, like, crowd applause, uh, line. Cuz I think that, um, that is something that, as a research community, as people interested in progress, we need to see these things, instead of just, uh, seeing marketing papers advertising GPT-4.
Jonathan: Yeah. I think, you know, to get on my soapbox for 10 more seconds. Go ahead. When I talk to policymakers about, you know, the AI ecosystem, the usual fear that I bring up is: innovation will slow because of lack of openness. I've been complaining about this for years, and it's finally happened. Hmm. Why is Google sharing, you know, these papers? Why is OpenAI sharing these papers? There are a lot of reasons. You know, I have my own beliefs, but it's not something we should take for granted that everybody's sharing the work that they do, and it turns out, well, I think we took it for granted for a while and now it's gone. I think it's gonna slow down the pace of progress. In a lot of cases, each of these labs has a bit of a monoculture, and being able to pass ideas [00:43:30] back and forth was a lot of what kept, you know, scientific progress moving. So it's imperative, not just, you know, for the open source community and for academia, but for the progress of technology, that we have a vibrant open source research community.
THE FUTURE OF MOSAIC [00:44:11]
Swyx: That's a preview of the ecosystem commentary that we're gonna do. But I wanna close out some stuff on Mosaic. You launched a bunch of stuff this month, a lot of stuff. Uh, actually, I was listening to you on Gradient Dissent, uh, and other podcasts we know and love, uh, and you said you were not gonna do inference. And last week you were like, here's MosaicML Inference. Oops. So maybe just at a high level, what was MosaicML and, like, what is it growing into? Like, how do you conceptualize this?
Jonathan: Yeah, and I will say, when Gradient Dissent was recorded, we weren't doing inference and had no plans to do it. It took a little while for the podcast to get out. Um, in the meantime, basically, you know, one thing I've learned at a startup, and I'm sure Abhi can comment on this as well, focus is the most important thing. We have done our best work when we've been focused on doing one thing really well, and our worst work when we've tried to do lots of things. Yeah. So, we didn't want to do inference, we didn't want to have to do inference. Um, and at the end of the day, our customers were begging us to do it, because they wanted a good way to serve the models and they liked our ecosystem. And so, in some sense, we got dragged into it kicking and screaming. We're very excited to have a product. We're going to put our best foot forward and make something really, truly amazing. But that's something that we were reluctant to do. You know, our customers convinced us it would be good for our business. It's been wonderful for business, and we are gonna put everything into this. But, you know, back when Gradient Dissent came out, I [00:45:00] was thinking, like, when we recorded it, oh God, like, focus is the most important thing. I've learned that the hard way multiple times at Mosaic. Abhi can tell you, like, you know, I've made a lot of mistakes by not focusing enough. Um, boy, inference, that's a whole second thing, and a whole different animal from training. And at the end of the day, when we founded the company, our belief was that inference was relatively well served at that time. There were a lot of great inference companies out there. Um, training was not well served, especially efficient training, and we had something to add there. I think we've discovered that, as the nature of the models has changed, the nature of what we had to add to inference changed a lot, and there became an opportunity for us to contribute something. But that was not the plan. But now we do wanna be the place that people come when they wanna train these big, complex, difficult models and know that it's gonna go right the first time, and they're gonna have something they can serve right away. Um, you know, really the Replit example of, you know, with 10 days to go, saying, hey, can you please train that model? And, you know, three or four days later the model was trained, and we were just having fun doing interesting fine-tuning work on it for the rest of the 10 days, you know. That also requires good inference.
Swyx: That's true, that's true. Like, so running evals and fine-tuning. I'm just putting my business hat on, and, you know, Alessio as well, like, uh, I've actually had fights with potential co-founders about this, on the primary business almost being training, right? Like, essentially a one-time cost.
Jonathan: Who told you it was a one-time cost? Who told you that?
Swyx: No, no, no, no. Correct me.
Jonathan: Yeah. Yeah. Let me correct you in two ways. Um, as our CEO Naveen would say, if he were here: when you create version 1.0 of your software, do you then fire all the engineers? Of [00:46:30] course not. You never, like, MPT has a thousand different things we wanted to do that we never got to. So, you know, there will be future models.
Abhinav: And the data that it's been trained on is also changing over time too, right? If you wanna ask anything about, I guess, like, May of 2023, we'll have to retrain it further and so on, right? And I think this is especially true for customers who run, like, the kind of things that need to be up to date on world knowledge. So I think, like, you know, the other thing I would say too is that the models we have today are certainly not the best models we'll ever produce, right? They're gonna get smaller, they're gonna get faster, they're gonna get cheaper, they're gonna get lower latency, they're gonna get higher quality, right? And so you always want the next-gen version of MPT, and the one after that, and the one after that. There's a reason that even the GPT series goes three, four, and we know there's gonna be a five, right? Um, so I also don't see it as a one-time cost.
Jonathan: Yeah. Yeah.
And if you wanna cite a stat on this, there are very, very few stats floating around on training versus inference cost. Mm-hmm. One is this blog post from, I think, David Patterson at Google, um, on the energy usage of ML at Google. And they break down and say three fifths of energy over the previous three years, I think this 2022 article, was for inference, and two fifths were for training. And so actually, you know, this is Google, which is serving models to billions of users. They're probably the most inference-heavy place in the world, and it's only a two-fifths, three-fifths breakdown, and that's energy. Training hardware is probably more expensive because it has fancier networking. That could be a 50-50 cost breakdown. And that's Google; for a lot of other folks, it's gonna be weighted even more heavily in favor of training.

SPEED AND EFFICIENCY [00:48:01]
Swyx: Amazing answer. Well, thanks. Uh, we can touch a little bit [00:48:00] on, uh, efficiency and speed, because we didn't mention that. So right now people spend between three to 10 days: you spent 10 days on MPT-7B, Replit spent three days. What's feasible? What do you wanna get it down to?
Abhinav: Oh, for these original models? Yeah. Yeah. So I think, um, this is probably one of the most exciting years, I think, for training efficiency, just generally speaking, because we have the combination of a couple things. Like, one is this next generation of hardware, like the H100s coming out from Nvidia, which on their own should be, like, at least a 2x improvement over A100s. On top of that, there's also a new floating point format, FP8, um, which could also deliver that alone. Does it? Yes. Yeah. Yeah. How, what, why? Oh, the FP thing? Yeah. Yeah. So basically what's happening is that, you know, when we do all of our math, like the model's matrix multiplication math, we do it in a particular precision. We started off in 32-bit precision a few years ago, and then Nvidia came with 16-bit, and over the course of several years, we've all figured out how to do 16-bit training, and that basically, you know, due to the hardware, like, increased the throughput by 2x, reduced the cost by 2x. That's about to happen again with FP8, like, starting this year. And with Mosaic, you know, we've already started profiling LLM training with FP8 on H100s. We're seeing really, really good improvements there. And so you're gonna see a huge cost reduction this year just from this hardware fact alone. On top of that, you know, there's a lot of architectural optimizations. We're looking at ways to introduce some forms of sparsity, not necessarily, like, the super unstructured sparsity like lottery tickets, um, which, uh, I'm not sure I'm really happy to talk about. Um, but are there ways of doing, like, [00:49:30] gating or, like, kind of like MoE-style architectures? So, you know, I think originally it was, like, 500k, I think, to try and train a GPT-3 quality model; if at the end of the year we could get that down to a hundred k, that would be fantastic.
Swyx: That is this year's type of thing.
Jonathan: Not, like, that's not a pie in the sky thing. Okay. It is not, it's not a place we are now, but I think it is, you know, I don't think more than a year in the future these days, cuz it's impossible. I think that is very much a 2023 thing. Yeah. Yeah. Okay.
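For context, this is roughly what the 16-bit step Abhinav describes looks like in plain PyTorch. A sketch only; FP8 on H100s needs additional library and hardware support that isn't shown here:

```python
# Mixed-precision training sketch: run the forward pass matmuls in bfloat16
# via autocast while the parameters (and the optimizer state) stay in FP32.
import torch

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in for a transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                              # gradients land on the FP32 params
    optimizer.step()
    return loss.item()
```

Halving the width of the numbers in the matmuls is where the roughly 2x throughput step comes from each time the hardware supports a narrower format.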
And hold me to that later this year.
Swyx: GPT-3 for a hundred k, let's go. Um, and then also Stable Diffusion: originally reported to be 600k, uh, you guys can get it done for under 50. Anything different about image models that we should know about, versus text?
Jonathan: Um, I mean, I think the most important part in all this is, you know, it took us a while to get ResNet-50 down by almost 7x. That was our original kind of proof-of-concept project for Mosaic, you know, just at the beginning, to show, like, you know, we can even do this and our investors should give us more money. But what I love about newer models that come out is they're always really slow. We haven't figured out how to optimize them yet. And so there's so much work to be done. So getting, you know, in that case, I guess from the cost you mentioned, like a 12x cost reduction in Stable Diffusion, honestly it was a lot easier than getting a 7x for ResNet-50 on ImageNet or a 3x for BERT, cuz the architecture was much newer and there were a lot of inefficiencies to improve. Um, you know, I'm guessing that's gonna continue to be the case as we lean toward the bleeding edge and try to, you know, push the bleeding edge. I hope that, you know, in some sense you'll see smaller speedups from us, because the new models will come from us and they'll already be fast.
Alessio: So that's making existing [00:51:00] things better. With the longboi, the 65k context window, uh, you've doubled it. There was the RMT a couple weeks ago that had a possible 1 million. Uh, there's the Unlimiformer thing that came out last week, which is theoretically limitless context. What should people think about? Trade-offs, implications? You mentioned memory kind of starts to become one of the bounds. Yeah. What's the right number? Like, is it based on the customer's needs? Like, how would you advise customers and startups who might be building their own models?
Jonathan: It's all contextual. You know, there's a lot of buzz coming for long contexts lately, with a lot of these papers. None of them are exact in terms of the way that they're doing attention. And so there's, you know, to some extent, an approximation or a trade-off between doing some kind of inexact or approximate or hierarchical or, you know, non-quadratic attention, versus doing it explicitly correctly, the quadratic way. I'm a big fan of approximation, so I'm eager to dig into these papers. If I've learned one thing from writing and reading papers, it's to believe nothing until I've implemented it myself. And we've certainly been let down many, many times at Mosaic by papers that look very promising until we implement them and realize, you know, here's how they cooked the books on their data; here's, you know, the one big caveat that didn't show up in the paper. So I look at a lot of this with skepticism; I believe nothing until I re-implement it. And in general, I'm rewarded for doing that, because a lot of this stuff doesn't end up working quite as well in practice as [00:52:30] is promised in the paper. The incentives just aren't there, which is part of the reason we went with just pure quadratic attention here. Like, it's known to work. We didn't have to make an approximation. There's no asterisk or caveat.
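The "pure quadratic attention, no asterisk" option Jonathan is describing is just standard scaled dot-product attention over the full sequence. A minimal sketch, where the n-by-n score matrix is the quadratic part that FlashAttention makes affordable:

```python
# Exact causal attention: a full (seq_len x seq_len) score matrix per head,
# no approximation. Memory and compute grow as O(n^2) in sequence length.
import math
import torch

def exact_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, heads, seq_len, head_dim)
    n = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # the quadratic part
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))            # causal: no peeking ahead
    return torch.softmax(scores, dim=-1) @ v
```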
This was in some sense a sheer force of will by our amazing engineers.
Alessio: So people want super long context because, you know, they wanna feed more documents, and right now people do it with embeddings and feed them into the context window. How do you kind of see that changing? Are we gonna get to a point where, like, you know, maybe it's 64k, maybe it's 120k, where it's like, okay, you know, semantic search and embeddings are gonna work better than just running, like, a million-token context window?
Jonathan: Do you wanna say the famous thing about 64K? Does somebody wanna say that statement, the, you know, "64K is all you'll ever need"? The Bill Gates statement about RAM.
Swyx: Andrej Karpathy actually made that comparison before, that, uh, context is essentially RAM.
Jonathan: If I get quoted here saying 64K is all you need, I will be wrong. We have no idea. People are gonna get ambitious. Yes. Um, GPT-4 is probably taking an image and turning it into a bunch of tokens and plugging it in. I'm guessing each image is worth a hell of a lot of tokens. Um, maybe that's not a thousand words, but, you know, probably a thousand words' worth of tokens, if not even more. So maybe that's the reason they did 32k. Maybe, you know, who knows? Maybe we'll wanna put videos in these models. Like, every time that we say, ah, isn't that model big enough, somebody just gets more ambitious. Who knows?

TRENDS AND TRANSFORMERS [00:54:00]
Swyx: Right? Um, you've famously made one [00:54:00] countertrend, uh, bet, which is, uh, you're actually betting that, uh, transformers will stick around for a long time.
Jonathan: How is that countertrend?
Swyx: Countertrend is in, you just said a lot of things won't last. Right. A lot of things will get replaced, uh, really easily, but -
Jonathan: Transformers will stick around. I mean, look at the history here. How long did the convolutional neural network stick around for? Oh wait, they're still here, and Vision Transformers still haven't replaced them. Mm-hmm. How long did RNNs stick around for? Decades. And, you know, they're still alive and kicking in a bunch of different places. So, you know, fundamental architecture improvements are really hard to come by. I can't wait to collect from Sasha on that bet.
Abhinav: I think a lot of your bet hinges on what counts as attention, right?
Swyx: Wait, what do you mean? Well, how can that change? Oh, because it'll be approximated.
Abhinav: Well, I suppose if we ever replace, like, the QK multiplication with something that looks sort of like it, I wonder who comes out on top here.
Jonathan: Yeah. I mean, at the end of the day, a feed-forward network, you know, that's fully connected, is just a transformer with very simple attention. Mm-hmm. Um, so Sasha better be very generous to me, cause it's possible that could change, but at the end of the day, we're still doing transformers the way, you know, Vaswani et al intended back six years ago now. So, I don't know, six years is a pretty long time. What's another four years at this point?
Alessio: Yeah. What do you think will replace it if you lose the bet? What do you think you would've lost it to?
Jonathan: If I knew that, I'd be working on it.
Abhinav: I think it's gonna be just, like, MLPs, you know? That's the only way we can go, I think, at this point, because the MLP, I dunno, just basically down to, um, linear layers. [00:55:30] Oh, mostly the perceptrons. Exactly.
Got it, yeah. Yeah. Cuz the architecture's been stripped down, simplified so much at this point. I think, uh, there's very little left other than, like, some linear layers, some, like, residual connections, and, of course, the attention, um, dot product.
Jonathan: But you're assuming things will get simpler. Maybe things will get more complicated.
Swyx: Yeah, there's some buzz about, like, the HiPPO models. Hungry Hungry Hippos.
Jonathan: I mean, there's always buzz about something, um, you know, that's not to dismiss this work or any other work, but there's always buzz about something. I tend to wait a little bit to see if things stand the test of time, for like two weeks. Um, at this point; it used to be, you know, a year, but now it's down to two weeks. Oh. But, you know, I don't know. I don't like to follow the hype. I like to see what sticks around, what people actually manage to build off of.
Swyx: I have a follow-up question actually on that. Uh, what's an egregiously overrated paper that, once you actually looked into it, fell apart completely?
Jonathan: I'm not going down that path. Okay. You know, even though I think there are papers that, you know, did not hold up under scrutiny, I don't think any of this was out of malice, and so I don't wanna go down that path.
Alessio: Yeah. I know you already talked about your focus on open research. Are you mostly gonna focus on open models, or are you also working on configurations that are more just for your customers and private? Like, what percentage of your time are you focusing on open work?
Jonathan: It's a little fuzzy. I mean, I think at the end of the day you have to ask, what is the point of our business? Our business is not just to train a bunch of open models and give them to the world. Our VCs probably wouldn't be very happy if that were the case. The open [00:57:00] models serve our business because they're demos. A demo does not mean we give away everything. Um, a demo does not mean every single thing we do is shared with the world, but we do have a business imperative to share with the world, which I kind of like. That was part of the design of the company, making sure we had an imperative to do science and an imperative to share. But we are still a company and we do have to make money, but it would be a disaster for our business if we didn't share. And that's by design from the start. So, you know, there's certainly going to be some work that we do that is for our customers only, but by and large, for anything that we wanna advertise to customers, there has to be something that is meaningful and useful that's out there in the world. Otherwise we can't convince people that we have it.
Abhinav: Yeah, I think, like, our recent inference product also makes the decision easier for us, right? So even these open models we've developed so far, um, you can actually, like, you know, query them on our inference API, like our starter tier, and we basically charge, like, a per-token fee, very, very similar to the other API providers. So there are pathways by which, you know, even the open models we provide for free still end up, like, helping our business out, right? You can customize them, deploy them on our platform, and that way we still make money off of them.
Alessio: Do you wanna jump into the lightning round? Anything else that you guys wanna cover that we didn't get to?
Jonathan: This has been great. These are great questions.
Swyx: Do you want to dish on why sparsity is not a focus for Mosaic?

Jonathan: I can just say that sparsity is not a focus for Mosaic, and I am definitely over lottery tickets. When I give my Mosaic talk, the first slide is a circle with a slash through it over a lottery ticket. [00:58:30] Anyone who mentions lottery tickets, I ask to leave the room, because there's other work out there. But Abhi, please feel free to dish on sparsity.

Abhinav: Yeah, I think it really comes down to the fact that we don't have hardware yet that can accelerate it, or at least that's been mostly true for a long period of time. The kind of sparsity the lottery ticket work looked at was putting random zeros in the weights, and the finding was basically that yes, you can turn most of the weights to zeros and the model still kind of works, but there's no hardware out there that can take a matrix with a bunch of zeros and one without and make it go fast. Now, the one caveat for this, and this is going to sound like a bit of an advertisement, is Cerebras. Since the beginning they've built their architecture for sparsity, and they published some research papers just earlier this year showing that yes, they really can train with sparsity and get, this is, uh, sparse U P T. Exactly, yeah, exactly right. So the final missing piece is really: okay, we have the science to show you can train sparse models, even from initialization or close to initialization. The last piece is just, is there a piece of hardware that actually speeds it up and gives you a cost savings? In which case the field is wide open.

Jonathan: The other big challenge here is that if you want to make sparsity go fast on standard hardware right now, you do need it to be structured in various ways, and any incremental amount of structure you force on the sparsity dramatically reduces the quality of the resulting model, up to the point where, if you remove entire neurons from the model, you're just making the layers smaller, and that really hurts the quality of the model. So these models, steel is all you need, these models love unstructured [01:00:00] sparsity. If there were a chip and a software package that made it really easy to accelerate it, I bet we would be doing it at Mosaic right now.

Alessio: This is like Sarah Hooker's point in the hardware lottery post, talking about lotteries: if you don't have the right hardware, some model architectures just can't emerge quickly enough.

Abhinav: Absolutely. There's an invariant to think of here, which is that today's popular models always run fast on today's hardware. This has to be true, right? There's no such thing as a popular model that runs slow, because no one would have developed it. So it's kind of like with new architectures: if there's new hardware that can do sparsity, you have to co-evolve a new architecture that works with it, and then those two pair together really well. Transformers and GPUs are like a match made in heaven.

Jonathan: Well, I don't know that transformers and GPUs are a match made in heaven.
We're lucky that they work on GPUs, but the folks at Google designed them for TPUs, because TPUs and RNNs were not a match made in heaven.

LIGHTNING ROUND AND CLOSING [1:00:55]

Alessio: All right, we have three questions. One is on acceleration, one on exploration, and then just a takeaway for the audience. Either of you can start and the other can finish. So the first one is: what has already happened in AI that you thought would take much longer than it has?

Abhinav: Do you have an answer, Jon?

Jonathan: Yeah, I have an answer: everything. I remember when GPT-2 came out and I looked at that and went, eh, that doesn't seem very exciting, and gosh, it's already 1.5 billion parameters; they can't possibly keep getting better as they make it bigger. And then GPT-3 came out and I was like, eh, it's slightly better at [01:01:30] generating text. Who cares? And I've been wrong again and again and again that next-token prediction and making things big can produce useful models. To be fair, pretty much all of us were wrong about that, so I can't take that entirely on myself; otherwise Google, Facebook, and Microsoft Research would all have had killer large language models way before OpenAI ever got the chance to do it. OpenAI made a very strange bet and it happened to work out very well. And diffusion models, they're pretty stupid at the end of the day, and they produce beautiful images. It's astounding.

Abhinav: Yeah, I think my answer is going to be the chatbots-at-scale idea. I thought it would be quite a while before hundreds of millions of people were talking to AI models for a large portion of the day, but now there are many startups and companies, not just OpenAI with ChatGPT but also Character and others, and it's really astounding how many people are developing emotional connections to these AI models. I don't think I would have predicted that in September or October of last year. The inflection point of the last six months has been really surprising.

Swyx: I haven't actually tried any of these models, but it seems like a very educational thing. It's like, oh, talk to Genghis Khan, but that's a very educational use case. What do you think they're using them for, I guess, emotional support?

Abhinav: Well, yes. I think some of them are for emotional support, or honestly just friends and stuff. Loneliness and mental health is a really big problem everywhere, and the most interesting thing I've found is that if you go to the subreddits for those communities and you see how they [01:03:00] talk about and think about their AI friends and these characters, it's like something out of a science fiction book. I would never expect this to be reality.

Swyx: Yeah. What do you think are the most interesting unsolved questions in AI?

Abhinav: I'm really interested in seeing how far down we can go in terms of precision and things like that, similar to the BF16 versus FP16 discussion.

Swyx: Okay. There's also just quantizing until it's two bits.

Abhinav: Yeah, exactly. Or even down to analog or something like that.
Because our brains obviously are not running on digital logic, and so how many orders of magnitude do we have remaining in these things? And I wonder if some of these problems just get easier with scale. There have been hints in some papers that it becomes easier to quantize or easier to prune as the model gets bigger and bigger. So maybe, almost as a natural consequence of scaling up over the next few years, it will just naturally become easier and easier to go down to four bits, or two, or even binary weights.

Jonathan: I want to know how small we can go in a different way. I just want to know how efficient we can make it to get models that are this good. That was my research question for my entire PhD; lottery tickets were one way to get at that, and that's now kind of the research question I'm chasing at Mosaic, in a sense. OpenAI has shown us that there is one path to getting these incredible capabilities, and that is scale. I hope that's not the only path. I hope there are lots of ways of getting there: better modeling, better algorithms. I hate the neuroscience metaphors, but in some sense our existence and our brains are evidence that there is at least one other way to get to these kinds of incredible capabilities that doesn't require [01:04:30] a trillion parameters and megawatts and megawatts and gazillions of dollars. So I do wonder how small we can go. Is there another path to these capabilities without having to do it this way? If it's there, I hope we find it at Mosaic.

Swyx: Yeah, my favorite fact is that the human brain runs on something on the order of 30 watts of energy, so we're many orders of magnitude off on that one.

Abhinav: I don't think you can even run one GPU on that. Yeah.

Alessio: If there's one message you want everyone to remember when thinking about this: there's a lot of fear mongering, there's a lot of messaging being spread around. What should people think about in AI? What should be top of mind for them?

Jonathan: I'll go for it: stay balanced. There are the people who really feed into the hype, or who eat up the hype, and there are the people who are big pessimists or react very strongly against the hype, or to some extent are in denial. Stay balanced. Embrace the fact that we've built extraordinarily useful tools, but we haven't built AGI, and personally I don't think we're anywhere close to that. So stay balanced and follow the science. That's what we try to do around Mosaic. We try to focus on what's useful to people, what will hopefully make the world a better place. We try our best on that, and especially on how we can follow the science and use data to be our guide. Not just talk a lot; we try to let our work speak instead.

Abhinav: And I would also say: research done in the open. I think there's no competing with the open community, [01:06:00] just in sheer volume, the number of eyeballs you have looking at your models, even at the problems with the models, and at ways to improve them. So yeah, research done in the open.
It will be the way forward, both to keep our models safe and to really examine the consequences of these AI models in the world.

Alessio: Awesome. Thank you so much, guys, for coming on.

Swyx: And thanks for keeping AI open.

Abhinav: Thank you for having us.

Jonathan: Yeah, thank you so much for having us.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space
    20/5/2023
    1:06:43
  • Guaranteed quality and structure in LLM outputs - with Shreya Rajpal of Guardrails AI
    Tomorrow, 5/16, we’re hosting Latent Space Liftoff Day in San Francisco. We have some amazing demos from founders at 5:30pm, and we’ll have an open co-working starting at 2pm. Spaces are limited, so please RSVP here!One of the biggest criticisms of large language models is their inability to tightly follow requirements without extensive prompt engineering. You might have seen examples of ChatGPT playing a game of chess and making many invalid moves, or adding new pieces to the board. Guardrails AI aims to solve these issues by adding a formalized structure around inference calls, which validates both the structure and quality of the output. In this episode, Shreya Rajpal, creator of Guardrails AI, walks us through the inspiration behind the project, why it’s so important for models’ outputs to be predictable, and why she went with an XML-like syntax. Guardrails TLDRGuardrails AI rules are created as RAILs, which have three main “atomic objects”:* Output: what should the output look like?* Prompt: template for requests that can be interpolated* Script: custom rules for validation and correctionEach RAIL can then be used as a “guard” when calling an LLM. You can think of a guard as a wrapper for the API call. Before returning the output, it will validate it, and if it doesn’t pass it will ask the model again. Here’s an example of a bad SQL query being returned, and what the ReAsk query looks like: Each RAIL is also model-agnostic. This allows for output consistency across different models, even if they have slight differences in how they are prompted. Guardrails can easily be used with LangChain and other tools to structure your outputs!Show Notes* Guardrails AI* Text2SQL* Use Guardrails and GPT to play valid chess* Shreya’s AI Tinkerers demo* Hazy Research Lab* AutoPR* Ian Goodfellow* GANs (Generative Adversarial Networks)Timestamps* [00:00:00] Shreya's Intro* [00:02:30] What's Guardrails AI?* [00:05:50] Why XML instead of YAML or JSON?* [00:10:00] SQL as a validation language?* [00:14:00] RAIL composability and package manager?* [00:16:00] Using Guardrails for agents* [00:23:50] Guardrails "contracts" and guarantees* [00:31:30] SLAs for LLMs* [00:40:00] How to prioritize as a solo founder in open source* [00:43:00] Guardrails open source community involvement* [00:46:00] Working with Ian Goodfellow* [00:50:00] Research coming out of Stanford* [00:52:00] Lightning RoundTranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio partner and CTO-in-Residence at Decibel Partners. I'm joined by my cohost Swyx, writer and editor of Latent Space.Swyx: And today we have Shreya Rajpal in the studio. Welcome Shreya.Shreya: Hi. Hi. Excited to be here.Swyx: Excited to have you too.This has been a long time coming, you and I have chatted a little bit and excited to learn more about guardrails. We do a little intro for you and then we have you fill in the blanks. So you, you got your bachelor's at IIT Delhi minor in computer science with focus on AI, which is super relevant now. I bet you didn't think about that in undergrad.Shreya: Yeah, I think it's, it's interesting because like, I started working in AI back in 2014 and back then I was like, oh, it's, it's here. This is like almost changing the world already. So it feels like that that like took nine years, that meme of like, almost like almost arriving the thing.So yeah, I, it's felt this way where [00:01:00] it's almost shared. It's almost changed the world for as long as I've been working in it.Swyx: Yeah. 
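To make the TLDR above concrete, here is a minimal sketch of what a RAIL spec and a guard wrapping an LLM call looked like around the time of this episode. Treat it as illustrative rather than authoritative: the element names, the Guard.from_rail_string constructor, the tuple return value, and the call signature are recalled from the early Guardrails docs and may differ from the current API, and the movie schema is just a made-up example.

```python
# Illustrative sketch of the flow described above, not a copy of the official docs:
# a RAIL spec defines the desired output structure and the prompt template, and a
# guard wraps the LLM call, validating the output and re-asking on failure.
import openai
import guardrails as gd

rail_spec = """
<rail version="0.1">
<output>
    <object name="movie">
        <string name="title" description="Title of a movie to recommend"/>
        <integer name="year" description="Release year"/>
    </object>
</output>
<prompt>
Recommend one movie about {{topic}}.
</prompt>
</rail>
"""

# The guard acts as a wrapper around the API call: it renders the prompt,
# parses and validates the raw output against the spec, and if validation
# fails it sends a re-ask request back to the model.
guard = gd.Guard.from_rail_string(rail_spec)

raw_output, validated_output = guard(
    openai.Completion.create,            # any text-in/text-out callable works here
    prompt_params={"topic": "space exploration"},
    engine="text-davinci-003",
    max_tokens=256,
)
print(validated_output)                  # e.g. {"movie": {"title": "...", "year": 1997}}
```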
Swyx: That's awesome. Maybe we can explore the origins of your interests, because then you went on to UIUC to do your master's, also in AI. And then it looks like you went to drive.ai to work on perception, and then to Apple SPG, as the cool kids call it, the Special Projects Group, working with Ian Goodfellow.

Shreya: Yeah, that's right.

Swyx: And then you were at Predibase up until recently? Actually, I don't know if you've quit yet.

Shreya: I have, yeah.

Swyx: Okay, good. You haven't updated your LinkedIn, but we're getting the breaking news here that you're working on Guardrails full-time. Well, that's the professional history. We can double back to fill in the blanks on anything. But what's the personal side? What's not on your LinkedIn that people should know about you?

Shreya: I think the most obvious thing, and this is still professional, but the most obvious thing that isn't on my LinkedIn yet is Guardrails. Like you mentioned, I haven't updated my LinkedIn, but I quit some time ago and I've been devoting all of my energy to working on Guardrails full-time, growing the open source package, building out exciting features, et cetera. So that's probably the thing that's missing the most. Another, more personal skill, which I think I'm kind of okay at for an amateur and which isn't on my LinkedIn, is pottery. I really enjoy pottery, and I don't know how to slot that in amongst all of the AI, so that's not in there.

Swyx: Well, you like shaping containers where unstructured things can kind of flow in, so, yeah. See, I can spin it for you.

Shreya: I should use that. Yeah.

Alessio: Maybe for the audience, you want to give a little bit of an intro on Guardrails AI: what it is and why you wanted to start it.

Shreya: Yeah, for sure. So Guardrails, or the need for Guardrails, really came up as I was building some of my own projects in the space and solving some of my own problems. This was back at the end of last year. I was building some applications, like everybody else who was very excited about the space, and I quickly realized that, yeah, it works pretty well a bunch of times, but a lot of other times it really does not work the way I, the developer of this tool, want my tool to work. And as a developer I can tell that there are very few tools available for me to get this to cooperate with me, to get it to follow directions, et cetera. The only tool I really have is the prompt, and there's only so far you can go with putting instructions in caps, adding a bunch of exclamation marks, and saying, follow my instructions, give me this output in this way. So part of it was that it's not reliable. But also, if I'm building an application for a user, I just want the user to have a certain experience using it, and there's just not enough control, not enough knobs for me to tune as a developer to do that. So Guardrails came up as a way to just manage this better. As I'm building this, I know from the ground up what experience I want the user to have, what a great LLM output looks like for me.
And so I wanted a tool that allows me to kind of specify that and enforce those constraints.As I was thinking of this, I was like, this should be very extensible, very flexible so that there's a bunch of use cases that can be handled, et cetera. But the need really like, kind of came up from my own from my own, like I was basically solving for my own pain points.[00:04:00]So that's a little bit of the history, but what the tool does is that it allows you to kind of like specify. It's this two-part system where there's a specification framework and then there's like a code that enforces that specification on the LLM outputs. So the specification framework allows you to be like as coarse or as fine grained as you care about.So you can essentially think about what is the, on a very like first order business, like where is the structure and what are the types, etc, of the output that I want. If you want structured outputs from LLMs. But you can also go like very into semantic correctness with this, with a. I just released something this morning, which is that if you're summarizing a bunch of documents, make sure that it's a very faithful summary.Make sure that there's like coherence amongst like what the output is, et cetera. So you can have like all of these semantic guarantees as well. And guardrails created like rails, like a reliable AI markup language that allows you to specify that. And along with that, there's like code that backs up that specification and it makes sure that a, you're just generating prompts that are more likely to get you the output in the right manner to start out with.And then once you get that output all of the specification criteria you entered is like [00:05:00] systematically validated and like corrected. And there's a bunch of like tools in there that allow you a lot of control to like handle failures much more gracefully. So that's in a nutshell what guardrails does.Awesome.Alessio: And this is model agnostic. People can use it on any model.Shreya: Yeah, that's right. When I was doing my prototyping, I like was developing with like OpenAI, as I'm sure like a bunch of other developers were. But since then I've added support where you can basically like plug in any, essentially any function or any callable as long as you, it has a string input.String output you can plug it in there and I've had people test it out with a bunch of other models and get pretty good results. Yeah.Alessio: That's awesome. Why did you start from XML instead of YAML or JSON?Shreya: Yeah. Yeah. I think it's a good question. It's also the question I get asked the most. Yes. I remember we chat about this as well the first chat and I was like, wait, okay, let's get it out of the way. Cause I'm sure you answered this a lot.Shreya: So it is I didn't start out with it is the truth. Like, I think I started out from this code first framework service initially like Python classes, et cetera. And I was like, wait, this is too verbose. This is like I, as I'm thinking about what I want, I truly just [00:06:00] want this is like, this is what this dictionary should look like for me, right?And having to like create classes on top of that just seemed like a higher upfront cost. Like obviously there's a balance there. Like there's some flexibility that classes and code affords you that maybe isn't there in a declarative markup language. 
But that that was my initial kind of like balance there.And then within markup languages, I experimented with the bunch, but the idea, like a few aesthetic things about xml, like really appeal to me, as unusual as that may sound. But I think one is this idea of like properties off. Any field that you're getting back from an LLM, right. So I think one of the initial ones that I was experimenting with was like TypeScript, et cetera.And with TypeScript, like all of the control you have is like, you try to like stuff as much information as possible in the name of the key, right? But that's not really sufficient because like in, in XML or, or what gars allows you to do is like maybe add like descriptions for each field that you're getting, which like is, is really very helpful because that almost acts as a proxy prompt.You know, and, and it gets you like better outputs. You can add in like what the correctness criteria or what the validity criteria is for this field, et [00:07:00] cetera. That also gets like passed through to the prompt, et cetera. And these are all like, Properties for a single field, right? But fields themselves can be containers and can have like other nested like fields within them.And so the separation of like what's a property of a field versus what's like child of a field, et cetera, was like nice to me. And having like all of this metadata contained within this one, like tag was like kind of elegant. It also mapped very well to this idea of like error handling or like event handling because like each field may fail in weird ways.It's very inspired from H T M L in that way, in that you have these like event handlers for like, oh, if this validity criteria for this field fails maybe I wanna re-ask the large language model and here's my re-asking parameters, et cetera. Whereas like, if other criteria fail there's like maybe other ways to do to handle that.Like maybe I don't care about it as much. Right. So, so that seemed pretty elegant to me. That said, I've talked to a lot of people who are very opinionated about it. My, like, the thing that I was optimizing for was essentially that it seemed clean to me compared to like other things I tried out and seemed as close to English as [00:08:00] possible.I tested it out with, with a bunch of friends you know, who did not have tag backgrounds or worked in tag but weren't like engineers and it like and they resonated and they were able to pick it up. But I think you'll see updates in the works where I meet people where they are in terms of like, people who, especially like really hate xml.Like there's something in the works where there'll be like a code first version of this. And also like other markup languages, which I'm actively exploring. Like what is a, what is a joyful experience to have for like other market languages. Yeah. DoSwyx: you think that non-technical people would.Use rail was because I was, I was just surprised by your mention that you tested it on non-technical people. Is that a design goal? Yeah, yeah,Shreya: for sure. Wow. Okay. We're seeing this big influx of, of of people who are building tools with these applications who are kind of like, not machine learning people.And I think like, that's truly the kind of like big explosion that we're seeing. Right. And a lot of them are like getting so much like value out of like lms, but because it allows you like earlier if you were to like, I don't know. 
Build a web scraper, you would need to do this like via code.[00:09:00] But now like you can get not all the way, but like a decent amount of way there, like with just English. And that is very, very powerful. So it is a design goal to like have like essentially low floor, high ceiling is, was like absolutely a design goal. So if, if you're used to plain English and prompting using Chad PK with plain English, then you can it should be very easy for you to kind of like pick this up and there's not a lot of gap there, but like you can also build like pretty complex workflows with guardrails and it's like very adaptable in that way.Swyx: The thing about having custom language is essentially other people can build. Stuff that compiles to you. Mm-hmm. Which is also super nice and, and visual layers on top. Like essentially HTML is, is xml, like mm-hmm. And people then build the WordPress that is for non-technical people to interface with html.Shreya: I don't know. Yeah, yeah. No, absolutely. I think like in the very first week that Guardrails was out, like somebody reached out to me and they were pm and they essentially were like, I don't, you know there's a lot of people on my team who would love to use this, but just do not write code.[00:10:00] Like what is the, where is a visual interface for building something like this? But I feel like that's, that's another reason for why XML was appealing, because it's essentially like a document structuring, like it's a way to think about like documents as trees, right? And so again, if you're thinking about like what a visual interface would be, then maps going nicely to xml.But yeah. So those are some of the design considerations. Yeah.Swyx: Oh, I was actually gonna ask this at the end, but I'm gonna bring it up now. Did you explore sql, like. Syntax. And obviously there's a project now l m qr, which I'm sure you've looked at. Yeah. Just compare, contrast, anything.Shreya: Yeah. I think from my use case, like I was very, how I wanted to build this package was like essentially very, very focused on developer ergonomics.And so I didn't want to like add a lot of overhead or add a lot of like, kind of like high friction essentially like learning a whole new dialect of sequel or a sequel like language is seems like a much bigger overhead to me compared to like doing things in XML or doing things in a markup language, which is much more intuitive in some ways.So I think that was part of the inspiration for not exploring sql. I'd looked into it very briefly, but I mean, I think for my, for my own workflows, [00:11:00] I wanted to make it like as easy as possible to like wrap whatever LLM API calls you make. And, and to me that design was in markup or like in XML, where you just define your desiredSwyx: structures.For what it's worth. I agree with you. I would be able to argue for LMQL because SQL is the proven language for business analysts. Right. Like less technical, like let's not have technical versus non-technical. There's also like less like medium technical people Yeah. Who learn sql. Yeah. Yeah. But I, I agree with you.Shreya: Yeah. I think it depends. So I have I've received like, I think the why XML question, like I mentioned is like one of the things I get most, but I also hear like this feedback from other people, which is like all of like essentially enterprises are also like very comfortable with xml, right? 
So I guess even within the medium technical people, it's like different cohorts of like Yeah.Technologies people are used to and you know, what they would find kind of most comfortable, et cetera. Yeah. And,Swyx: Well, you have a good shot at establishing the standard, which is pretty exciting. I'm someone who has come from a, a long background with React, the JavaScript framework. I don't know if you.And it's kind of has that approach of [00:12:00] taking a templating XML like language to describe something that was typically previously described in Code. I wonder if you took any inspiration from that? If you want to just exchange notes on anything from that like made React successful. Cuz I, I spent a few years studying that.Yeah.Shreya: I'm happy to talk about it, but I will say that I am very uneducated when it comes to front end, so Yeah, that's okay. So I might say some things that like aren't, aren't valid or like don't really, don't really map very well, but I'm gonna give it a shot anyway. So I don't know if it was React specifically.I think just this idea of marrying essentially like event handlers, like with the declarative framework. Yes. And with this idea of being able to like insert scripts, et cetera, and quote snippets into that. Like, that was super duper appealing to me. And that was like something like where you're programming with.Like Gabriels and, and Rail specifically is essentially a way to like program with large language models outside of using like just national language. Right? And so like just thinking of like what are the different like programming workflows that people typically need and like what would be the most elegant way to add that in there?I think that was an inspiration. So I basically looked at like, [00:13:00] If you're familiar with Guardrails and you know that you can insert like dynamic scripting into a rail specification, so you can register custom validators within rail. You can maybe have like essentially code snippets where things are like lists or things are like dynamically generated array, et cetera, within GAR Rail.So that kind of resonated a lot to like using JavaScript injected within like HTML files. And I think other inspiration was like I mentioned this before, but the event handlers was like something that was very appealing, how validators are configured in guardrails right now. How you tack on specific validators that's kind of inspired from like c s s and adding like style tags, et cetera, to specific Oh, inline styling.Okay. Yeah, yeah, yeah, exactly. Wow. So that was like some of the inspiration, I guess that and pedantic and like how pedantic kind of like does its validation. I think those two were probably like the two biggest inspirations while building building the current version of guardrails. Swyx: One part of the design of React is composability.Can I import a guardrails thing from into another guardrails project? [00:14:00] I see. That paves the way for guardrails package managers or libraries or Right. Reusable components, essentially. I think that'sShreya: pretty interesting. Do you wanna expand on that a little bit more? Swyx: Like, so for example, you have guardrails for a specific use case and you want to like, use that, use it in a bigger thing. And then just compose it up. Yeah.Shreya: Yeah. I wanna say that, I think that should be pretty straightforward. 
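As a concrete illustration of the "validators as event handlers" analogy just described, here is a small conceptual sketch. It deliberately does not use the Guardrails package itself; every name in it (FieldSpec, check_field, the on_fail policies) is invented for illustration of the idea that each field carries a description, some validators, and a policy for what happens when validation fails.

```python
# Conceptual sketch of the idea discussed above, not the Guardrails API:
# a field carries a description (which doubles as a proxy prompt), a set of
# validators, and an "event handler" style policy for what to do on failure.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class FieldSpec:
    name: str
    description: str                                   # injected into the prompt
    validators: list[Callable[[str], bool]] = field(default_factory=list)
    on_fail: str = "reask"                             # "reask" | "fix" | "filter"

def check_field(spec: FieldSpec, value: str,
                reask: Callable[[FieldSpec, str], str],
                fix: Callable[[FieldSpec, str], str]) -> Optional[str]:
    """Run each validator; dispatch to the field's on-fail policy if one trips."""
    for is_valid in spec.validators:
        if not is_valid(value):
            if spec.on_fail == "reask":
                return reask(spec, value)   # ask the model again with the error
            if spec.on_fail == "fix":
                return fix(spec, value)     # deterministic client-side repair
            return None                     # "filter": drop the bad value
    return value
```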
I'm trying to think about like, use cases where people have done that, but I think that kind of maps into like chaining or like building complex workflows generally. Right. So how I think about guardrails is that like, I.If you're doing something like chaining, you essentially are composing together these like multiple LLM API calls and you have these like different atomic units of each LLM API calls, right? So where guardrails kind of slots in is add like one of those nodes. It essentially adds guarantees, et cetera, and make sure that you know, that that one node is like water tied, et cetera, in terms of the, the output that is, that it has.So each node in your graph or tree or in your dag would essentially have like a guardrails config associated with it. And you can kind of like use your favorite chaining libraries, like nine chain, et cetera, to like then compose this further together. [00:15:00] I think I've seen like one of the first actually community projects that was like built using guardrails, like had chaining and then had like different rails for each node of that chain.Essentially,Alessio: I'm building an agent internally for us. And Guardrails are obviously very exciting because once you set the initial prompt, like the model creates its own prompts. Can the models create rails for themselves? Like, have you tried this out? Like, can they understand what the output is supposed to be and like where their ownShreya: specs?Yeah. Yeah. I think this is a very interesting question. So I haven't personally tried this out, but I've ha I've received this request you know, a few different times. So on the roadmap like seeing how this can be done, but I think in general, like in all of the prompt engineering experiments I've done, et cetera, I don't see like why with, especially with like few short examples that shouldn't be possible.But that's, that's a fun like experiment. I wanna try out,Alessio: I was just thinking about this because if you think about Baby a gi mm-hmm. And some of these projects mm-hmm. A lot of them are just loops of prompts. Yeah. You know so I can see a future [00:16:00] in which. A lot of these loops are kind off the shelf thing and then you bring your own rails mm-hmm.To make sure that they work the way you expect them to be instead of expecting the model to do everything for you. Yeah. What are your thoughts on agents and kind of like how this plays together? I feel like when you start it, people were mostly just using this for a single prompt. You know, now you have this like automated chainShreya: happening.Yeah. I think agents are like absolutely fascinating in how. Powerful they are, but also how unruly they are sometimes. Right? And how hard to control they are. But I think in general, this kind of like ties into even with machine learning or like all of the machine learning applications that I worked on there's a reason like you don't have like fully end-to-end ML applications even in you know, so I, I worked in self-driving for example, like a driveway.I at driveway you don't have a fully end-to-end deep learning driving system, right? You essentially have like smaller components of it that are deep learning and then you have some kind of guarantees, et cetera, at those interfaces of those boundaries. 
And then you have like other maybe more deterministic competence, et cetera.So essentially like the [00:17:00] interesting thing about the agent framework for me is like how we will kind of like break this up into smaller tasks and then like assign those guarantees kind of at e each outputs. It's a problem that I've been like thinking about, but it's also like frankly a hard problem to solve because you're.Because the goals are auto generated. You know, there's also like the, the correctness criteria for those goals also needs to be auto generated, right? Which is like a little bit antithetical to you knowing ahead of time, like, what, what a correct output for me for a developer or for your application kind of looking like.So I think like that's the interesting crossroads. But I do think, like with that said, I think guardrails are like absolutely essential for Asian frameworks, right? Like partially because like, not just making sure they're like constrained and they're safe, et cetera, but also, frankly, to just make sure that they're doing what you want them to do, right?And you get the right output from them. So it is a problem. Like I'm, I'm thinking a bunch about, I think just, just this idea of like, how do you make sure that it's not it's not just models checking each other, but there's like some more determinism, some more notion of like guarantees that can be backed up in there.I think like that's [00:18:00] the, that would be like super compelling to me, and that is kind of like the solution that I would be interested in putting out. But yeah, it's, it's something that I'm thinking about for sure. I'mSwyx: curious in the scope of the problem. I feel like we need to. I think a lot of people, when they hear about AI progress, they always assume that, oh, that just if it's not good now, just wait a year later.And I think obviously, I think that's something that you have to think about as well, right? Like how much of what guardrails is gonna do is going to be Threatens or competed with by GC four having 32,000 context tokens. Just like what do you think are like the invariables in model capabilities that you're betting on versus like stuff that you would not bet on because you just expected to get better?Yeah.Shreya: Yeah. I think that's a great question, and I think just this way of thinking about invariables, et cetera is something that is very core to how I've been thinking about this problem and like why I also chose to work on this problem. So, I think again, and this is like guided by some of my past experience in machine learning and also kind of like looking at like how these problems are, how like other applications that I've had a lot [00:19:00] of interest, like how some of the ML challenges have been solved in there.So I think like context, like longer context, length is going to arrive for sure. We are gonna start saying we're already seeing like some, some academic papers and you know, we're gonna start seeing a lot more of them like translated into actual applications.Swyx: This is the new transformer thing that was being sent around with like a millionShreya: context.Yeah. I also, I think my my husband is a PhD student you know, at Stanford and then his lab also does research basically in like some of the more efficient architectures for Oh, that'sSwyx: a secret weapon for guard rails. Oh my god. What? Tell us more.Shreya: Yeah, I think, I think their lab is pretty exciting.This is a shouted to the hazy research lab at Stanford. 
And yeah, I think like some of, there's basically some active research there about like, basically looking into like newer architectures, like not just transform. Yeah, it might not be the most I've been artifact more architecture.Yeah, more architectural research that allows for like longer context length. So longer context, length is arriving for sure. Yeah. Lower latency lower memory efficiency, et cetera. So that is actually some of my background. I worked in that in my previous jobs, something I'm familiar with.I think there's like known recipes for making [00:20:00] this work. And it's, it's like a problem like once, essentially it's a problem of just kind of like a lot of experimentation and like finding exactly what configurations kind of get you there. So that will also arrive, both of those things combined, you know will like drive down the cost of running inference on these models.So I, all of those trends are coming for sure. I think the trend that. Are the problem that is not solved by these trends is the problem of like determinism on machine learning models, like fundamentally machine learning models, deep learning models specifically, like are impossible to add guarantees on even with temperature zero.Oh, absolutely. Even with temperature zero, it's not the same as like seed equals zero or seed equals like a fixed amount. Mm-hmm. So even if with temperature zero with the same inputs, you run it multiple times, you'll essentially see that you don't get the same output multiple times. Right.Combined with this, System where you don't even actually own the model yourself, right? So the models are updated from under you all the time. Like for building guardrails, like I had to do a bunch of prompt engineering, right? So that users get like really great structured outputs, like share of the bat [00:21:00] without like having to do any work.And I had this where I developed something and it worked and then it ended up like for some internal model version, updated, ended up like not being functional anymore and I had to go back to the drawing board and you know, do that prompt engineering again. There's a bit of a digression, but I do see that as like a strength of guardrails in that like the contract that I'm providing is not between the user.So the user has a contract with me essentially. And then like I am making sure that we are able to do prompt engineering to get like the output from the LLM. And so it kind of like takes away a lot of that burden of having to figure that out for the user, right? So there's a little bit of a digression, but these models change all the time.And temperature zero does not equal like seed zero or fixed seed rather. And so even with all of the trends that we're gonna see arriving pretty soon over the next year, if not sooner, this idea of like determinism reproducibility is not gonna change, right? Ignoring reproducibility is a whole other problem of like the really, really, really long tail of like inputs and outputs that are not covered by, by tests and by training data, [00:22:00] et cetera.And it is like virtually impossible to cover that. You kind of like, this is not simply a problem where like, Throwing more data at the model is going to solve. Right? Yeah. 
Because like, people are building like genuinely really fascinating, really amazing complex applications and like, and these are just developers, like users are then using those applications in many diverse complex ways.And so it's hard to figure out like, what if you get like weird way word prompts that you know, like aren't, that you didn't kind of account for, et cetera. And so there's no amount of like scaling laws essentially that kind of account for those problems. They can be like internal guardrails, et cetera.Of course. And I would be very surprised if like open air, for example, like doesn't have their own internal guardrails. You can already see it in like some, some differences for example, like URLs like tend to be valid URLs now. Right. Whereas it really Yeah, I didn't notice that.It's my, it's my kind of my job to like keep track of, keep it, yeah. So I'm sure that's, If that's the case that like there's some internal guard rails, and I'm sure that that would be a trend that we would kind of see. But even with that there's like a ton of use cases and a [00:23:00] ton of kind of like application areas where like there's different requirements from different types of guard rails are valuable in different requirements.So this is a problem essentially that would be like, harder to solve or next to impossible to solve with just data, with just scaling up the models. So you would need kind of this ensemble basically of, of LLMs of like these really powerful models along with like deterministic guarantees, rule-based heuristics, et cetera, more traditional you know machine learning tools and like you ensemble all of these together and you end up getting something that you know, is greater than the sum of it.Its parts in terms of what it's able to do. So I think like that is the inva that I'm thinking of is like the way that people would be developing these applications. I will followSwyx: up on, on that because I'm super excited. So when you sent mentioned you have people have a contract with guardrails.I'm actually looking at the validators page on your docs, something, you have something like 20 different contracts that people can have. I'll name some of them just just so that people can have an, have an idea, but also highly encourage people to check it out. Is profanity free, is a, is a good one.Bug-free Python. And that's, that's also pretty, [00:24:00] pretty cool. You have similar to document and extracted summary sentences match. Which I think is, is like don't hallucinate,Shreya: right? Yeah. It's, it's essentially making sure that if you're generating summaries the summary should be very faithful.Yeah. Should be like citable attributable, et cetera to the source text.Swyx: Right. Valid url, which we talked about. Mm-hmm. Maybe open AI is doing a little bit more of internally. Mm-hmm. Maybe open AI uses card rails. You don know be a great endorsement. Uhhuh what is surprisingly popular and what is, what do you think is like underrated?Out of all your contracts? Mm-hmm.Shreya: Mm-hmm. Okay. I think that the, well, not surprisingly, but the most obvious popular ones for me that I've seen are like structure, structure type, et cetera. Anything that kind of guarantees that. So this isn't specifically in the validators, this is essentially like part of the gut, the core proposition.Yeah, the core proposition. I think that is like very popular, but that's also kind of like the first order. Problem that people are kind of solving. 
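Since structure and type come up here as the first-order guarantee people reach for, a bare-bones sketch of that check, written in plain Python rather than with the Guardrails package, might look like the following; the field names in the schema are hypothetical.

```python
# Minimal sketch of a structure-and-type "contract" on an LLM output,
# written without the Guardrails package; the schema below is made up.
import json
from typing import Optional

EXPECTED = {"title": str, "url": str, "reading_time_minutes": int}

def validate_structure(raw_llm_text: str) -> Optional[dict]:
    """Return the parsed object if it matches EXPECTED, else None so the caller can re-ask."""
    try:
        obj = json.loads(raw_llm_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in EXPECTED.items():
        if key not in obj or not isinstance(obj[key], expected_type):
            return None
    return obj
```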
I think the sequel thing, for example, it's very exciting because I had just released this like two days ago and then I already got some inbound with like people kinda swapping, like building these products and of swapping it out internally and you know, [00:25:00] getting a lot of value out of what the sequel bug-free SQL provides.So I think like the bug-free SQL is a great example because you can see like how complex these validators can really go because you end up seeing like bug-free sql. What it does is it kind of like takes a connection string or maybe a, a schema file, et cetera. It creates a sandbox SQL environment for you, like from that.And it does that at startups so that like every time you're getting like a text to SQL Query, you're not having to do pay that cost time and time again. It takes that query, it like executes that query on that sandbox in that sandbox environment and then sees if that query is executable or not.And then if there's any errors that you know, like. Packages of those errors very nicely. And if you've configured re-asking it sends it back to the model and you know, basically make sure that that like it tries to get corrected. Sequel. So I think I have an example up there in the docs to be in there, like in applications or something where you can kind of see like how it corrects like weird table names, like weird predicates, et cetera.I think there's other kind of like, You can build pretty complex systems with this. So other things in there are like it takes [00:26:00] information about your database and then injects it into the prompt with like, here's the schema of this table. It automatically, like given a national language query, it finds like what the most similar examples are from the history of like, serving this model and like injects those into the prompt, et cetera.So you end up getting like this very kind of well thought out validator and this very well thought out contract that is, is just way, way, way better than just asking in plain English, the large language model to give you something, right? So I think that is the kind of like experience that I wanna provide.And I basically, you'll see more often the package, my immediateSwyx: response is like, that's cool. It does more than I thought it was gonna do, which is just check the SQL syntax. But you're actually checking against schema, which is. Highly, highly variable. Yeah. It'sShreya: slow though. I love that question. Yeah. Okay.Yeah, so I think like, here's where this idea of like, it doesn't have to be like, you don't have to send every request to your L so you're sampling. Okay. So you can essentially figure out, so for example, like there's like how what guardrails essentially does is there's like corrective actions and re-asking is like one of those corrective actions, [00:27:00] right?But there's like a ton other ways to handle it. Like there's maybe deterministic fixes, like programmatic fixes, there's maybe default values. There's this doesn't work like quite work for sql, but if you're doing like a bunch of structured data and if you know there's an invalid value, you can just filter it or you can just refrain from asking, et cetera.So there's a ton of ways where you can like, just handle errors more gracefully. And the one I kind of wanna point out here is programmatically fixing something that is wrong, like on, on the client side instead of just sending over another request. To the large language model. 
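The bug-free-SQL validator and the "fix it on the client side" point above can be sketched roughly like this, again outside the Guardrails package: a sandbox database is built once from a schema file, each generated query is checked against it, and any error is either handed back for a re-ask or repaired locally. The schema file name and the re-ask prompt wording are placeholders.

```python
# Rough sketch of the sandbox-validation idea described above (not the actual
# bug-free-sql validator): compile each generated query against an in-memory
# copy of the schema and package any error for a re-ask or a local fix.
import sqlite3
from typing import Optional

class SQLSandbox:
    def __init__(self, schema_path: str):
        # Built once at startup so each per-query check stays cheap.
        self.conn = sqlite3.connect(":memory:")
        with open(schema_path) as f:
            self.conn.executescript(f.read())

    def check(self, query: str) -> Optional[str]:
        """Return None if the query compiles against the schema, else the error text."""
        try:
            self.conn.execute(f"EXPLAIN {query}")   # parses and plans without running
            return None
        except sqlite3.Error as exc:
            return str(exc)

# sandbox = SQLSandbox("schema.sql")                  # hypothetical schema file
# error = sandbox.check(generated_query)
# if error:
#     reask_prompt = f"The query failed with: {error}. Please return a corrected query."
```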
So for sql, I think the example that I talked about earlier that essentially has like an incorrect table name and to correct the table name, you end up sending another request.But you can think about like other ways to handle disgracefully, right? Like essentially looking at essentially a fuzzy matching with like the existing table names in the repository and in, in the database. And you know, like matching any incorrect names to that. And so you can think of like merging this re-asking thing with like, other error handling things that like smaller, easier errors are able, you can handle them programmatically by just Doing this in like the more patching, patching or I, I guess the more like [00:28:00] classical ML way essentially, like not the super fancy deep learning is like, I think ML 2.0.But like, and this, I, I've been calling it like ML 3.0, but like, even in like ML 1.0 ways you can like, think of how to do this, right? So you're not having to make these like really expensive calls. And so that builds a very powerful system, right? Where you essentially have this, like, depending on what your error is, you don't like, always use G P D three or, or your favorite L M API when you don't need to, you essentially are able to like combine these like other ways, other error handling techniques, like very gracefully so that you get correct outbursts, validated outbursts, and you get them for cheap and like faster, et cetera.So that's, I think there's some other SQL validation things that are in there. So I think like exclude SQL Predicates. Yeah, exclude SQL Predicates. And then there's one about columns that if like some columns are like sensitive columnSwyx: prisons. Yeah. Yeah. Oh, just check if it's there.Shreya: Check if it's there and you know, if there's like only certain columns that you wanna show it to the user and like, maybe like other columns have like private data or sensitive data you know, you can like exclude those and you can think of doing this on the table level.So this is very [00:29:00] easy to do just locally. Right. Like, so there's like different ways essentially to kind of like handle this, which makes for like a more compelling way to build theseSwyx: systems. Yeah. Yeah. By the way, I think we're proving out why. XML was a better choice than SQL Cause now, now you're wrapping sql.Yeah. Yeah. It's pretty cool. Cause you're talking about the text to SQL application example that you put out. It actually puts something, a design choice that isn't talked about very much in center focus, which is your logs. Your logs are gorgeous. I'm sure that took work. I'm sure that's a strong opinion of yours.Yeah. Why do you spend so much time on logs? Just like, how do you, how do you think about designing these things? Should everyone do it this way? What are the drawbacks? Like? Is any like,Shreya: yeah, I'm so excited about this idea of logs because you know, you're like, all of this data is like in there for free, right?Like if you're, if you're do like any validation that is run, like essentially in memory, and then also I write it out to file, et cetera. You essentially get like this you get a history of this was the prompt that was run. This was the this was the L raw LLM output. This was the validation that was run.This was the output of those validations. This [00:30:00] was any corrective actions, et cetera, that were taken. 
And as a developer, I'm so happy to see that. I use these logs personally as well.

Swyx: Yeah, they're colored, and there are nice double borders on the logs. I've never seen this in any ML tooling at all.

Shreya: Oh, thanks, I appreciate it. I think this was mostly, once again, about solving my own problems: I was building a lot of these things, doing a lot of dogfooding and a lot of application building in notebooks. And so in a notebook I wanted to see what the easiest way to interact with it was, and that was what I ended up building. I really appreciate that; that's very nice to hear. I'm also thinking about what interesting ways there are to whittle down very deeply into what went wrong or what is going right when you're running an application, and what the nice interface to design for that would be. So yeah, I'm thinking about that problem. I don't have anything on there yet, but I do really like this idea that, as a developer, you really want all the visibility you can get into what's [00:31:00] happening under the hood, and I want to be able to provide that.

Swyx: The downside I'll point out quickly, because we should move on, is that this is not machine readable. So how does it work with, say, a Datadog?

Shreya: Yeah, well, we can deal with that later. That's basically my answer as well. A problem for future Shreya, basically.

Alessio: Yeah. You call Guardrails "SLAs for LLM outputs." Historically SLAs are pretty objective: there's the five nines of availability, things like that. How do you build them on a stochastic system when, say, my query is "draft me a marketing article"? How do you write an SLA for something like that? And in terms of quality, and in terms of what we talked about with latency, sometimes I would rather wait more and have better copy. Have you thought about what the axes of measurement are for some of these things, and how should people think about it?

Shreya: Yeah, the copy example is interesting, because [00:32:00] I think for any of these things the SLAs are purely on content and output, not on time. I don't think Guardrails can even make any guarantees on the time it'll take to make these external API calls. But even within quality, it's this idea that if you're able to communicate what you desire, either programmatically or by using a model in the loop, then that is something that can be enforced, right? That is something that can be validated and checked. So for example, for writing content copy, what's interesting is that if you can break down the copy you want to write into, this is a title, this is maybe a TLDR description, this is a more detailed take on the changes or the product announcement, et cetera, and you want to hit some set of points in there, then you already start thinking of what was a monolith of copy to you in terms of smaller building blocks, et cetera.
And then on those building blocks you can essentially like then add like certain guarantees.So you can say that let's say like length or readability is a [00:33:00] guarantee. So some of the updates that I pushed today on, on summarization and like specific guards for summarization, one of them essentially was that like the reading time for the summary should be within like some certain amount, right?And so that's like you can start enforcing like all of those guarantees, like on each individual block. So I think like, Some of those things are. Naturally harder to do and you know, like are harder to automate ways. So essentially like, does this copy, I don't know, is this witty or something, right. Or is this Yeah.Something that I guess like the model doesn't have a good idea for, but like other things, as long as you can kind of like enforce them and like check them either via model or programmatically, it's something that you can like start building some some notion of like guarantees around. Yeah.Yeah. So that's why I think about it.Alessio: Yeah. This is super interesting because right now a lot of products are kind of the same because all I do is they call it the model and some are prompted a little differently, but you can only guess so much delta between them in the future. It's be, it'll be really interesting to have products differentiate with the amount of guardrails that they give you.Like you already [00:34:00] see that, Ooh, with open AI today when some people complain that too many of the responses have too much like, Well actually in it where it's like, oh, you ask a question, it's like, but you should remember that's actually not good. And remember this other side of the story and, and all of that.And some people don't want to have that in their automated generation. So, yeah. I'm really curious, and I think to Sean's point before about importing guardrails into products, like if there's a default amount of guardrails that you have and like you've being the provider of it, like that's really powerful.And then maybe there's a faction that is against guardrails and it's like they wanna, they wanna break out, they wanna be free. Yeah. So it's a. Interesting times. Yeah.Shreya: I think to that, like what I, I was actually chatting with someone who was building some application for content creators where like authenticity you know, was a big requirement, like of what they cared about in the right output.And so within authenticity, like why conventional models were not good for them is that they already have a lot of like quote unquote guardrails right. To, to I guess like [00:35:00] appeal to like certain certain sections of the audience to essentially be very cleaned up and then that was like an undesirable trade because that, for them, like, almost took away from that authenticity, et cetera.Right. So I think just this idea of like, I guess like what a guardrail means is like so different for different applications. Like I, I guess like I, there's like about 20 or so things in there. I think there's like a few more that I've added this morning, which Yes. Which are not Yeah. Which are not updated and then in the end.But there's like a lot of the, a lot of the common workflows, like you do have an understanding of like what the right. I guess like what is an appropriate constraint for this? Right. 
Of course, for things like summarization, for things like text-to-SQL, but there's also like so many like just this wide variety of like applications, which are so fascinating to learn about, where you, you would wanna build something in-house, which is like your, so which is your secret sauce. And so how Guardrails is kind of designed, or, or my intention with designing it, is that here's this way of breaking down what this problem is, right? Of like getting some determinism, getting some guarantees from your LLM outputs. [00:36:00] And you can use this framework and like go crazy with it. Like build whatever you want, right? Like if you want this output to be more authentic or, or, or less clean or whatever, you can like add that in there, like making sure that it does have maybe some profanity and that's a desirable output for you. So I think like the framework side of it is very exciting to me as this, as this way of solving the problem. And then you can build your custom validators or use the ones that I provide out of the box. Yeah. Yeah. Alessio: So chat plugins, it's another big piece of this, and a lot of the integrations are very thin specs and like a lot of prompting, for example, a lot of them are asking to not mention the competitors. I think the Expedia one said, please do not mention any other travel website on the internet. Do not give any other alternative to what we do. Yeah. How do you see all these things come together? Like, do you see guardrails as something that not only helps with the prompting, but also helps with bringing external data into these things, and especially with agents going on any website, do you see each provider having like their own [00:37:00] guardrail where it's like, Hey, this is what you can expect from us, or this is what we want to provide? Or do you think that's, that's not really what, what you're interested in guardrails being? Shreya: Yeah, I think agents are a very fascinating question for me. I don't think I like quite know what the right, who the right owner for this guardrail is. Right. And maybe, I don't know if you guys wanna keep this in there or like maybe cut this front of my answer out, up to, up to you guys. I'm, I'm fine either way, but I think like that problem is a harder problem to solve just from like a framework design perspective as well. Right. I think this idea of like, okay, right now it's just in the prompt, like don't mention competitors, et cetera. Like that is exactly that use case. Or I feel like, okay, if I was that business owner, right, and if I wanted to build this application, like, is that sufficient? There's like so much prompt injection, right? And you can get, or, or just so much like, just like an absolute lack of guarantees. Like, and, and it's hard to even detect that this is happening. Like let's say I have this running in production and then turns out that there was like some sort of leakage, et cetera, and you know, like my bot has actually been talking about like all of my competitors forever, [00:38:00] right? Like, that's a, that's a substantial risk. And so just this idea of like needing this like post-hoc validation to ensure deterministically that like it does what you want it to do is like, just so, like, as a developer, putting myself in the shoes of like people building business applications, like that is what gives me like peace of mind, right?
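To make the per-block guarantees Shreya describes above a little more concrete (for example, a reading-time ceiling on a generated summary), here is a minimal, framework-agnostic sketch. The function names, the 200-words-per-minute figure, and the character budgets are illustrative assumptions, not the actual Guardrails API.

```python
# Illustrative only: a hand-rolled version of the "per-block guarantees" idea
# discussed above. Names and thresholds are assumptions, not the Guardrails API.
from dataclasses import dataclass

WORDS_PER_MINUTE = 200  # assumed average reading speed


@dataclass
class ValidationResult:
    passed: bool
    message: str


def max_reading_time(text: str, max_minutes: float) -> ValidationResult:
    """Check that a block of copy can be read within `max_minutes`."""
    minutes = len(text.split()) / WORDS_PER_MINUTE
    return ValidationResult(
        passed=minutes <= max_minutes,
        message=f"estimated reading time {minutes:.2f} min (limit {max_minutes} min)",
    )


def max_length(text: str, max_chars: int) -> ValidationResult:
    """Check that a block of copy stays under a character budget."""
    return ValidationResult(
        passed=len(text) <= max_chars,
        message=f"{len(text)} chars (limit {max_chars})",
    )


# Treat the copy as smaller building blocks, each with its own guarantee.
copy_blocks = {
    "title": "Guardrails adds validation for LLM outputs",
    "tldr": "Define the structure and quality bars you expect, then re-ask on failure.",
    "body": "A longer product announcement would go here.",
}

checks = {
    "title": [lambda t: max_length(t, 80)],
    "tldr": [lambda t: max_reading_time(t, 0.25), lambda t: max_length(t, 280)],
    "body": [lambda t: max_reading_time(t, 2.0)],
}

for block, validators in checks.items():
    for validate in validators:
        result = validate(copy_blocks[block])
        print(block, "PASS" if result.passed else "FAIL", "-", result.message)
```

The point of the sketch is only the shape of the approach: each block of the output gets its own checkable guarantee, and anything that can be checked programmatically (or by a model in the loop) can be enforced.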
So this framework, I think, like applies very well within those settings. Swyx: I'll go right into, we're gonna broaden out a little bit into commentary on other parts of the ecosystem that might, that might be interesting. So I think you and I talked briefly about this, but I think the, the broader population should know about it, which is that you also have an LLM API wrapper. Mm-hmm. So, such that the way, part of the way that guardrails works is you in, inject part of the few shot example into the prompt. Mm-hmm. And then you also do re-asking and all the other stuff post, I dunno what the pipeline is in, in, in your terminology. So essentially you have an API wrapper for openai.Completion.create. But so does LangChain, so does Helicone, so does everyone, I can name like five other people who are all fighting essentially for [00:39:00] the base layer, LLM API wrapper. Mm-hmm. I think this is valuable real estate, but I don't know how you like, think about working with other people, or do you wanna be the base layer, like. Shreya: I feel pretty collaboratively about it. I also feel like there's, like LangChain is doing like, it's so flexible as a framework, right? Like you can solve so many of your problems in there. And I think like it's, I, I have like a LangChain integration. I have a GPT Index / LlamaIndex integration, et cetera. And I think my view on this is that I wanna integrate with everybody. I think it is valuable real estate. It's not personally real estate that I'm interested in. Like you can essentially bring the LLM callable or the LLM API that's in there. It's just like some stub of a function that you can just add your favorite thing in there, right? It just, the only requirement is that it's string in, string out, that is all the requirement. And then you can bring in your own favorite component from your own favorite library in order to do that. And so, yeah, it's, I think like I'm pretty focused on this problem of like what is the guardrail that you would wanna build for certain applications? So it's valuable real estate. I'm sure that people don't own [00:40:00] it. Swyx: It's, as long as people give you a way to insert your stuff, you're good. Shreya: Yeah, yeah. Yeah. I do think that, like I've chatted with a bunch of people and their different applications, and I do think that the abstractions that I have haven't failed me yet. Like it is very flexible. It is very easy to slot in into any workflow. Yeah. Swyx: I would love to ask about the meta elements of working on guardrails. This is your first company, but you launched five things this morning. The pace of the good AI projects that I've seen out there, like LangChain launches 10 things a week or whatever, I don't know. Surely that's something that you prioritize. How do you, how do you think about like, shipping versus like going, going back and like testing and working in community and all the other stuff that you're managing? How do you prioritize? Shreya: That's such a wonderful question. Yeah. A very hard question as well. I don't know if I would have a good answer for this. I think right now it's instinctive. Like I have a whole kind of stack ranked list of like things I wanna do and features I wanna build and like, support, et cetera. Combined with that is like a feature request I get or maybe some bugs, et cetera, that folks report. So I'm pretty focused on like any failures, any [00:41:00] feature requests from the community. So if those come up, those tend to trump like anything else that I'm working on.
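Shreya's "string in, string out" contract above is worth spelling out: any function with that shape can be slotted in as the LLM callable. A minimal sketch follows, where `run_with_guards` and `passes_checks` are hypothetical stand-ins for whatever framework orchestrates prompting, validation, and re-asking, not Guardrails' real API.

```python
# Illustrative sketch of the "string in, string out" contract described above.
from typing import Callable

LLMCallable = Callable[[str], str]  # the entire contract: prompt in, text out


def my_llm(prompt: str) -> str:
    """Swap in any provider here: an OpenAI call, a local model, a stub for tests."""
    return "stubbed model output for: " + prompt


def passes_checks(output: str) -> bool:
    # Whatever validators you care about; here just a trivial non-empty check.
    return len(output.strip()) > 0


def run_with_guards(llm: LLMCallable, prompt: str, max_retries: int = 2) -> str:
    """Call the model, validate the output, and re-ask on failure."""
    for attempt in range(max_retries + 1):
        output = llm(prompt)
        if passes_checks(output):
            return output
        prompt = f"{prompt}\n\nThe previous answer failed validation; please try again."
    raise ValueError("no valid output after retries")


print(run_with_guards(my_llm, "Draft a two-sentence product update."))
```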
But outside of that I have like this whole pool of ideas and like pool of features I wanna build and I kind of.Constantly kind of keep stack ranking them and like pushing something out. So I'm spending like I'm thinking about this problem constantly and as, as a function of that, I have like a ton of ideas for like what would be cool to build and, and what would be the right way to like, do certain things and yeah, wanna basically kind of like I keep jotting it down and keep thinking of like every time I cross something off the list.I think about like, what's the next exciting thing to work on. I think simultaneously with that we mentioned that at the beginning of this conversation, but like this idea of like what the right interface for rail is, right? Like, is it the xl, is it code, et cetera. So I think like those are like fundamental kind of design questions and I'm you know, collaborating with folks and trying to figure that out now.And yeah, I think that's like a parallel project that I'm hoping that yeah, you'll basically, that we'll be out soon. Like in termsSwyx: of the levers, how do you, like, let's just say in like a typical week, is it like 50% [00:42:00] calls with partners mm-hmm. And potential users and just understanding your use cases and the 50% building would you move that, that percentage anyway anywhere?Would you add in something that's significant?Shreya: I think it's frankly very variable week to week. So, yeah. I think early on when I released Guardrails I was like, here's how I'm thinking about this problem. Right? Yeah. Don't need anyone else. You just no, but actually to the contrary, it was like, this is like, I'm very opinionated about like what the right way to solve this is.And this is all of the problems I've thought about and like, and I know this framework maps well to these sets of problems, right? What are your problems? Like there's this whole other like big population of people that are building and you know, I basically wanna make sure that I have like user empathy and I have like I'm able to understand what people are doing and like make sure the framework like maps well.So I think I did a lot of that, like. Immediately after the release, like talking to a lot of teams and talking to a lot of users. I think since then, I basically feel like I have a fair idea of like, you know what's great about it, what's mediocre about it, and what's like, not good about it? And that helps kind of guide my prioritization list of like what I [00:43:00] wanna ship and what I wanna build.So now it's more kind of like, I would say, yeah, back to being more, more balanced. Alessio: All the companies we work with that are in open source, I always try and have them think through open source as a distribution model. Mm-hmm. Or like a development model. I was looking in the contributors list, and you have by far the most code, the second largest contributor. It's your husband. And after that it kind of goes, goes or magnitude lower. What have you found kind of working in, in open source in like a very fast moving project for, for the first time? You know, it's a, like with my husband, it's the community. No, no. It's the, it's the community like, A superpower to you?Do you feel like, do you feel like having to explain why you're doing things a certain way, like getting people buy in is maybe slowing you down when things move so quickly? I'm, I'm always interested to hears people's thoughts.Shreya: Oh that's a good question. 
I think like, there's part of like, I think guardrails at that stage, right? You know, I have like feature requests and I have [00:44:00] contributors, but I think right now, like I'm doing the bulk of like supporting those feature requests, et cetera. So I think a goal for me, and I remember we chatted about this as well, you know, when we, when we spoke last, we're just like, okay. You know, getting into that point where, yeah, you, you essentially like kind of start nurturing and like getting more contributions from like the open source community. So I think like that's one of the things that, yeah, is kind of the next goal for me. Yeah, it's been pretty fun. I, I would say like up until now, because I haven't made any big breaking API changes, et cetera, so I haven't like, needed that community input. I think like one of the big ones that is coming right now is like the code, right? Like the code-first API for creating rails. So I think like that was kind of important for like nailing that user experience, et cetera. So the, so the collaborators that I'm working with, there's basically an RFC and community input, et cetera, and you know, what the best way to do that would be. And so that's actually, frankly, been like pretty fun as well to see the community be like opinionated about like, here's how I'm doing it and like, this works for me, this doesn't work for me, et cetera. So that's been like new for me as well. Like, I [00:45:00] think at my previous company we also had like an open source project and it was built on open source, but like, this is the first time that I've created a project, an open source project, with like that level of engagement. So that's been pretty fun. Swyx: I'm always curious about like potential future business model, monetization, Shreya: anything like that. Yeah. I think I'm interested in entrepreneurship generally, honestly, trying to figure out like what the, all of those questions, right? Like business model. Swyx: I think a lot of people are in your shoes, right? They're developers. Mm-hmm. They see a lot of energy and they would like to start working on open source projects. Mm-hmm. What is a deciding factor? What do you think people should think about when deciding whether or not, Hey, this is just a project that I maintain versus, Nope, I'm going to do the whole thing, get funding and all Shreya: that. I think for me, so I'm already kind of like, I'm al, I'm working on the open source full time. I think like the motivating thing for me was that, okay, this is a problem that would need to get solved, like one way or another. This we talked about earlier, and I do think that this is a, like being able to, like, I think if, if there's a contraction or a correction and [00:46:00] the, these LLMs like don't have the kind of impact that we're, we're all hoping they would, I think it would be because of like, this problem, because people kind of find that it's not as useful when it's running at very large scales, when it's running in production, et cetera. So I think like that was very, that gave me a lot of conviction that it's something that I kind of wanted to work on, and that was a switch for me. That it gave me the conviction to, for example, quit my job. Yeah. Also, yeah. Slightly confidential. Off the record. Off the record, yeah. Yeah.
But you overlapped at Apple with Ian Goodfellow, who is obviously a, a very public figure in the AI space. Swyx: Actually, not that many people know what he did, so maybe we can, she can introduce Ian Goodfellow as well. Shreya: But, yeah, so Ian Goodfellow is the creator of GANs, or generative adversarial networks. So this was, I think, I'm gonna mess up between 2014 and 2015, I think '14, '15-ish, if I remember correctly. So he basically created GANs as a PhD student. As a PhD student. And he has a pretty interesting story of like how he thought of them and how [00:47:00] he kind of built them, and I, I'm sure there's like interviews and like podcasts, et cetera, with him where he talks about it, where like, how he got the idea for it and how he kind of like wrote the paper and did the experiments. So GANs essentially were kind of like the first wave of generative images where you would see essentially kind of like fake auto-generated images, you know, conditioned on like certain distributions. And so there were like very many variants of GANs, like DCGAN, and, I'm gonna mess up the pronunciation, but W, I'm just gonna call it WGAN. Mm-hmm. WGAN, yeah. That like, you would essentially see these like really wonderful generative art. And I do think that like, so I, I got the chance to work with him while at Apple. He had just moved to Apple from Google Brain and was building the cross-functional machine learning team within SPG. And I got the chance to work with him, which is very exciting. I learned so much and he is a fantastic manager and yeah, really, really enjoyed working with him. Alessio: And then he, he quit his job when they forced him to go back to the office. Right? That's the... Swyx: Oh, really? Oh, Alessio: I didn't see that. Oh, okay. I think he basically, Apple was like, you gotta go [00:48:00] back to the office. He said peace. That just... Swyx: I'm curious, like what's some, some things that you learned from Ian that, or maybe some stories that,
And like, one of my first, I think probably my first or second week, like Ian and I, we were para programming and I remember that we were working together and like some setup issues were happening.And he would wait like exactly 45 seconds before he would like, fire up a message on Slack and like, how do I, how do I kind of fix this? How do they do this? And it just like totally transformed like, like, they're just like us, you know? I think not even that, it's that like. I kind of realized that I was optimizing for the wrong thing, right?By trying to like solve this myself. And instead of just if I'm running into a problem posting on Slack and like getting collaborative information, it wasn't that, yeah, it was, it was more the idea of my job is not like to solve this myself. My job is to solve this period.Mm-hmm. And the fastest way to solve this is the most, is the most correct way to do it. And like, [00:50:00] yeah, I truly, like, he's one of my favorite people. And I truly enjoyed working with him a lot, but that was one of my, Super early into my job there. Like I, I learned that that was You're verySwyx: lucky to do that.Yeah. Yeah. That's awesome. I love learning about the people side. Mm-hmm. You know, because that's what we deal with on a day-to-day basis, so. Mm-hmm. It's really nice to Yeah. To hear about that kind of stuff. Yeah. I was gonna go into one more academia question and then we'll go into lighting rounds.So you're close to Stanford. There'sShreya: obviously a lot of By, by my, yeah. My, my husband basically. Yeah. He doesn't have aSwyx: choice. There's a lot of interesting things coming on to Stanford, right. Vicuna, Alpaca and, and Stanford home. Are you keeping a close eye on like, the academic outputs? What are you seeing that is interesting to you?Shreya: I think obviously because of I'm, I'm focused on this problem, definitely looking at like how people are, you know thinking about the guard rails and like kind of adding more constraints.Swyx: It's such a great name by the way. I love it. Every time I see people say Guardrails, I'm like, yeah. Shreya: Yeah, I appreciate that. So I think like that is definitely one of the things. I think other ones are kind of like more out of like curiosity because of like some ML problems that I worked on in the past. Like I, [00:51:00] I mentioned that I worked on a efficient ml, so looking into like how people are doing, like more efficient inference.I think that is very fascinating to me. Mm-hmm. So, yeah, looking into that. I think evaluation helm was pretty exciting, really looking forward to like longer context length and seeing what's possible with that. More better fine tuning with like maybe lower data, et cetera. I think those are all some of the themes that I'm interested in.Swyx: Yeah. Yeah. Okay. So just because you have more expertise with efficiency, are you talking about quantization? Are you talking about pruning? Are you talking about. Distillation. I doShreya: think that the right way to solve these problems is always like to a mix. Yeah. A mix. Everything of them and like ensemble, all of these methods together.So I think, yeah, basically there's this like constant like tug of war and like push and pull between adding like some of these colonization for example, like improved memory, improved latency, et cetera. But then immediately you get like a performance hit, right? 
So like there's this like balance between like making it smaller and making it more efficient, but like not losing out on like what that performance is. And it's a big kind of experimentation framework. It's like understanding like where the bottlenecks are. So it's very, it's [00:52:00] very, you know, exploratory and experimental in nature. And so it's hard to kind of like be prescriptive about, this is exactly what would work. It like, truly depends, like use case to use case, architecture to architecture, hardware to hardware, et cetera. Yeah. Wanna Alessio: jump into lightning round? Yeah. You ready? Shreya: I, I Alessio: hope so. Yeah. So we have five questions. Mm-hmm. And yeah, just respond in a sentence or two. Sean sometimes has the tendency to ask follow up questions. The light... Yeah. You wanna get more info, which is, which is, be ready. So the first one we always ask is what's your favorite AI product? Shreya: Very boring answer, but Copilot. Life changing. Yeah. Yeah. Absolutely. Love it. Yeah. Swyx: Surprisingly not that many people have called out Copilot in, Oh, really? In our interviews. Cuz everyone's going to art, like, they're like Midjourney, Stable Diffusion stuff. I see. Gotcha. But yeah, Copilot is, is great. Underrated. Yeah. It's still for $10 a month. Shreya: I mean, why not? Yeah. It's, it's, it's so wonderful. Swyx: I'm looking forward to Copilot X, which is sort of the next iteration. Yeah. Shreya: I was testing on my Copilot, so I [00:53:00] just got, upgraded my laptop and then setting up VS Code. And then I got Copilot Labs, I think is it? Or experimental. Yeah. Even that, like, yes, brushes and stuff. Yeah. Yeah. Yeah.
This is crazy. Yeah, it, it's, it's really cool. So I think like these types of workflows, it will take time before we can use them seamlessly, but yeah. Truly very fascinating. Alessio: There's another open source project called Wolverine by BioBootloader. Yeah. Yeah, it's cool. It's really cool. It's basically like self-healing code. Yeah. You just let it run and then it makes a mistake and runs in a REPL, takes the code and asks it to just give you the diff and [00:55:00] like drops out the code and runs it again. It just Swyx: automates what I do anyway. Exactly. Alessio: So we can focus on the podcast. Shreya: This is one of the things that won't be automated away. Yeah. I think like, yeah, I, I saw Wolverine, I think it was pretty cool, and I think I'm very excited about that problem also, because if you can think about it as like framing it within the context of these validators, et cetera, right? Like I think so, bug-free SQL. What that does is like exactly that workflow of like generates code, executes it, takes failures, re-asks, et cetera. So it implements that whole workflow like within a validator. Yeah. Swyx: The future is here. Alessio: Well, this kind of ties into the next question. A year from now, what will we be the most surprised by in AI? Shreya: Hmm. Yeah. Not to be a downer, but I do think that like how hard it is to truly take these things to production and like get consistently amazing user experiences from it. But I think like this, yeah, we're at that stage where there's basically like a little bit of a gap between like what, what you kind of [00:56:00] see as being very exciting. And I think it's like, it's a demonstration of what's possible with this, right? But like, closing that gap between like what's possible versus like what's consistently deliverable, I think it's, it's a harder problem to solve. So I do think that it's gonna take some time before all of these experiences are like absolutely wonderful. So yeah, I think like a year from now we'll kind of like find some of these things taking a little bit longer than expected. Swyx: Request for startups or request for product. What's an AI thing you would pay for if somebody Shreya: built it? I think this already exists and I just kind of maybe have to hook it up, et cetera, but I would a hundred percent pay for this: like, emails. Emails in my tone. Oh, I see. Yeah, no, keep, yeah, Swyx: emails, list your specs. Like what, what should it do? What should I Shreya: not do? Yeah. I think like, I basically have an idea always of like, this is, TLDR, what I want this email to say. Sure. I want it to be in my tone so that it's not super formal, it's not super like lax, et cetera. I want it to be like terse and short, and I want it to, like, I wanted to have context of like a previous history and maybe some [00:57:00] other like links, et cetera, that I'm adding. So I wanted to hook it up to like, some of my data sources and do that. I think that would, I would like pay, yeah, good money for that every month. Yeah. Nice. Alessio: I, I built one that only has the, the email thread as the context, but then has a bunch of things like, for example, for me it's like, if this company is not in the developer tool space, I'm gonna pass on it, so draft a pass email. If the person is asking to schedule, please ask them to send them, to send me their Calendly so I can pick a time from there. All these different things. I see.
But sometimes it's a new thread with somebody you've already spoken with a bunch of times, so it should pull all of that stuff too. But I open source all of it because I don't want to deal with storing people's email. It's Shreya: like the, the hardest thing. Do you find that it does tone well? Like does it match your tone or does Alessio: it... I have to use, right now, public figures as a, I see, thing. So it, I do things like "write like Paul Graham" or write like, people that are like, have a lot of variety. Oh, that's actually pretty cool. Yeah. You know? Yeah. Yeah. It works pretty well. I see. Nice. There's some things Paul Graham would not [00:58:00] say that it writes in the, in the emails, but overall I would say probably like 20% of the drafts it creates are like, usually good to go, like 70% it needs some work. And then there's like the 10% that is like, I have no idea why you just said that. It's completely like out of left field. I see. Yeah. But it will, it'll get better if I spend more time on it. But you know, it kind of adds up because I use GPT-4, I get a lot of emails, so like having autodraft responses for everything in my inbox, it, it adds up. So maybe the pattern of having, based on the label you put on the email, to auto generate, it's Shreya: it's good. Oh, that's pretty cool. Yeah. And actually, yeah, as a separate follow-up, I would love to know like all of the ways it messes up and, you know, if we can get it on Guardrails, let's talk about it now. Let's, Swyx: yeah. Sometimes it doesn't... Your project should use Guardrails. Alessio: Yeah. No, no, no. Definitely. I think sometimes it doesn't understand the, the email is not a pitch, so somebody emails me something that's like unrelated and then it's like, oh, thank you. [00:59:00] But since you're not working in the space, I'm not gonna be investing in you. But good luck with the rest of your fundraise. But it's like, never mention a fundraise, but because in the prompt, it, as part of the prompt is like, if it's a pitch and it's not in the space, pre-draft an email, it thinks it has to do it a lot more than it should. Or like, same with scheduling. Somebody, you know, any sales call that, any sales email that I get, it always wants to schedule a call with them. And I was like, I don't wanna meet with them, I don't wanna buy this thing. But the, the context of the email is like, they wanna schedule something, so the responder, you know, is helping you schedule, but it doesn't know that I don't want to. Does Shreya: it like autodraft all, like is there any input that you give for each email or does it autodraft everything? Alessio: I just give it the thread and then a blank slate. I don't give it anything else, because I want it to run while I'm not in the inbox. But yours, it's a little better. What I'm doing is draft generation. What you wanna do is like draft expansion. So instead of looking at the [01:00:00] inbox, in your case, you will look at the draft folder and look through each draft and expand the draft. Yeah, to be a full response, which makes a lot of sense.
Because if, if one conversation, you come away with like three guardrails and then another conversation, you come, none of three guardrails. How do you think about like, there's so many APIs that you could possibly do, right?You need to design for generally composable orShreya: reusable APIs. Yeah, so I would probably like break this down into like, like a relevant action item guardrail or something, right? And it's basically like essentially only talk about, or only like the action items should only be things that are within the context of those emails.And if something hasn't been mentioned, don't add context about that. So that would probably be a generic gar that I could, I could add. And then you, you could probably configure it with like, what are the sets of like [01:01:00] follow up action items that you typically have and, and correct for it that way.Swyx: We, we just heard a new API being designed live, which doesn't happen very often.Shreya: It's very cool. Yeah. AndAlessio: last but not least, if there's one thing you want people to take away about AI and kind of this moment that we're in, in technology, what would that be?Shreya: I do think this is the most exciting time in machine learning, as least as long as I've been working on it.And so I do think, like, frankly, we're all just so lucky to kind of be living through this and it's just very fascinating to be part of that. I think at the same time the technology is so exciting that you, you get like, Driven by wanting to use it. But I think like really thinking about like what's the best way to use it along with like other systems that have existed so that it's more kind of like task focused and like outcome focused rather than like technology focused.So this kind of like obviously I'm biased because I feel this way because I've designed guardrails this way that it kind of like merges LLMs with rules and heuristics and like traditional ML, et cetera. But I do think [01:02:00] that like this, this general framework of like thinking about how to build ML products is something that I'm bullish on and something I'd want people to like think about as well.Yeah.Alessio: Awesome. Well thank you so much for comingShreya: Yeah, absolutely. Thanks for inviting me. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space
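Capping off that closing exchange, here is a minimal sketch of the "relevant action items" check that gets designed live in the conversation above, applied to the auto-drafted email example. It is hand-rolled and purely illustrative; the trigger-word lists and function name are assumptions, not a real Guardrails validator.

```python
# Illustrative sketch of the "relevant action items" guardrail discussed above:
# only keep follow-up actions that are actually grounded in the email thread,
# and drop anything the thread never mentioned.

ALLOWED_ACTIONS = {
    "schedule": ["schedule", "calendar", "calendly", "meeting"],
    "fundraise": ["fundraise", "raise", "round", "investing"],
}


def relevant_action_items(thread: str, draft_actions: list[str]) -> list[str]:
    """Keep only draft action items whose trigger words appear in the thread."""
    thread_lower = thread.lower()
    kept = []
    for action in draft_actions:
        triggers = ALLOWED_ACTIONS.get(action, [])
        if any(word in thread_lower for word in triggers):
            kept.append(action)
    return kept


thread = "Hi Alessio, loved the pod. Could we find time next week? My Calendly is below."
draft = ["schedule", "fundraise"]  # actions the model proposed for the reply

print(relevant_action_items(thread, draft))  # -> ['schedule']; 'fundraise' is dropped
```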
    16/5/2023
    1:02:28
  • The AI Founder Gene: Being Early, Building Fast, and Believing in Greatness — with Sharif Shameem of Lexica
    Thanks to the over 42,000 latent space explorers who checked out our Replit episode! We are hosting/attending a couple more events in SF and NYC this month. See you if in town!Lexica.art was introduced to the world 24 hours after the release of Stable Diffusion as a search engine for prompts, gaining instant product-market fit as a world discovering generative AI also found they needed to learn prompting by example.Lexica is now 8 months old, serving 5B image searches/day, and just shipped V3 of Lexica Aperture, their own text-to-image model! Sharif Shameem breaks his podcast hiatus with us for an exclusive interview covering his journey building everything with AI!The conversation is nominally about Sharif’s journey through his three startups VectorDash, Debuild, and now Lexica, but really a deeper introspection into what it takes to be a top founder in the fastest moving tech startup scene (possibly ever) of AI. We hope you enjoy this conversation as much as we did!Full transcript is below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice!Timestamps* [00:00] Introducing Sharif* [02:00] VectorDash* [05:00] The GPT3 Moment and Building Debuild* [09:00] Stable Diffusion and Lexica* [11:00] Lexica’s Launch & How it Works* [15:00] Being Chronically Early* [16:00] From Search to Custom Models* [17:00] AI Grant Learnings* [19:30] The Text to Image Illuminati?* [20:30] How to Learn to Train Models* [24:00] The future of Agents and Human Intervention* [29:30] GPT4 and Multimodality* [33:30] Sharif’s Startup Manual* [38:30] Lexica Aperture V1/2/3* [40:00] Request for AI Startup - LLM Tools* [41:00] Sequencing your Genome* [42:00] Believe in Doing Great Things* [44:30] Lightning RoundShow Notes* Sharif’s website, Twitter, LinkedIn* VectorDash (5x cheaper than AWS)* Debuild Insider, Fast company, MIT review, tweet, tweet* Lexica* Introducing Lexica* Lexica Stats* Aug: “God mode” search* Sep: Lexica API * Sept: Search engine with CLIP * Sept: Reverse image search* Nov: teasing Aperture* Dec: Aperture v1* Dec - Aperture v2* Jan 2023 - Outpainting* Apr 2023 - Aperture v3* Same.energy* AI Grant* Sharif on Agents: prescient Airpods tweet, Reflection* MiniGPT4 - Sharif on Multimodality* Sharif Startup Manual* Sharif Future* 23andMe Genome Sequencing Tool: Promethease* Lightning Round* Fave AI Product: Cursor.so. Swyx ChatGPT Menubar App.* Acceleration: Multimodality of GPT4. Animated Drawings* Request for Startup: Tools for LLMs, Brex for GPT Agents* Message: Build Weird Ideas!TranscriptAlessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO on Residence at Decibel Partners. I'm joined by my co-host Wix, writer and editor of Latent Space. And today we have Sharish Amin. Welcome to the studio. Sharif: Awesome. Thanks for the invite.Swyx: Really glad to have you. [00:00] Introducing SharifSwyx: You've been a dream guest, actually, since we started drafting guest lists for this pod. So glad we could finally make this happen. So what I like to do is usually introduce people, offer their LinkedIn, and then prompt you for what's not on your LinkedIn. And to get a little bit of the person behind the awesome projects. So you graduated University of Maryland in CS. Sharif: So I actually didn't graduate, but I did study. Swyx: You did not graduate. You dropped out. Sharif: I did drop out. Swyx: What was the decision behind dropping out? 
Sharif: So first of all, I wasn't doing too well in any of my classes. I was working on a side project that took up most of my time. Then I spoke to this guy who ended up being one of our investors. And he was like, actually, I ended up dropping out. I did YC. And my company didn't end up working out. And I returned to school and graduated along with my friends. I was like, oh, it's actually a reversible decision. And that was like that. And then I read this book called The Case Against Education by Brian Kaplan. So those two things kind of sealed the deal for me on dropping out. Swyx: Are you still on hiatus? Could you still theoretically go back? Sharif: Theoretically, probably. Yeah. Still on indefinite leave. Swyx: Then you did some work at Mitra? Sharif: Mitra, yeah. So they're lesser known. So they're technically like an FFRDC, a federally funded research and development center. So they're kind of like a large government contractor, but nonprofit. Yeah, I did some computer vision work there as well. [02:00] VectorDashSwyx: But it seems like you always have an independent founder bone in you. Because then you started working on VectorDash, which is distributed GPUs. Sharif: Yes. Yeah. So VectorDash was a really fun project that we ended up working on for a while. So while I was at Mitra, I had a friend who was mining Ethereum. This was, I think, 2016 or 2017. Oh my God. Yeah. And he was mining on his NVIDIA 1080Ti, making around like five or six dollars a day. And I was trying to train a character recurrent neural network, like a character RNN on my iMessage text messages to make it like a chatbot. Because I was just curious if I could do it. Because iMessage stores all your past messages from years ago in a SQL database, which is pretty nifty. But I wanted to train it. And I needed a GPU. And it was, I think, $60 to $80 for a T4 on AWS, which is really slow compared to a 1080Ti. If you normalize the cost and performance versus the 1080Ti when someone's mining Ethereum, it's like a 20x difference. So I was like, hey, his name was Alex. Alex, I'll give you like 10 bucks if you let me borrow your 1080Ti for a week. I'll give you 10 bucks per day. And it was like 70 bucks. And I used it to train my model. And it worked great. The model was really bad, but the whole trade worked really great. I got a really high performance GPU to train my model on. He got much more than he was making by mining Ethereum. So we had this idea. I was like, hey, what if we built this marketplace where people could rent their GPUs where they're mining cryptocurrency and machine learning researchers could just rent them out and pay a lot cheaper than they would pay AWS. And it worked pretty well. We launched in a few months. We had over 120,000 NVIDIA GPUs on the platform. And then we were the cheapest GPU cloud provider for like a solid year or so. You could rent a pretty solid GPU for like 20 cents an hour. And cryptocurrency miners were making more than they would make mining crypto because this was after the Ethereum crash. And yeah, it was pretty cool. It just turns out that a lot of our customers were college students and researchers who didn't have much money. And they weren't necessarily the best customers to have as a business. Startups had a ton of credits and larger companies were like, actually, we don't really trust you with our data, which makes sense. Yeah, we ended up pivoting that to becoming a cloud GPU provider for video games. So we would stream games from our GPUs. 
Oftentimes, like many were located just a few blocks away from you because we had the lowest latency of any cloud GPU provider, even lower than like AWS and sometimes Cloudflare. And we decided to build a cloud gaming platform where you could pretty much play your own games on the GPU and then stream it back to your Mac or PC. Swyx: So Stadia before Stadia. Sharif: Yeah, Stadia before Stadia. It's like a year or so before Stadia. Swtx: Wow. Weren't you jealous of, I mean, I don't know, it sounds like Stadia could have bought you or Google could have bought you for Stadia and that never happened? Sharif: It never happened. Yeah, it didn't end up working out for a few reasons. The biggest thing was internet bandwidth. So a lot of the hosts, the GPU hosts had lots of GPUs, but average upload bandwidth in the United States is only 35 megabits per second, I think. And like a 4K stream needs like a minimum of 15 to 20 megabits per second. So you could really only utilize one of those GPUs, even if they had like 60 or 100. [05:00] The GPT3 Moment and Building DebuildSwyx: And then you went to debuild July 2020, is the date that I have. I'm actually kind of just curious, like what was your GPT-3 aha moment? When were you like GPT-3-pilled? Sharif: Okay, so I first heard about it because I was also working on another chatbot. So this was like after, like everything ties back to this chatbot I'm trying to make. This was after working on VectorDash. I was just like hacking on random projects. I wanted to make the chatbot using not really GPT-2, but rather just like it would be pre-programmed. It was pretty much you would give it a goal and then it would ask you throughout the week how much progress you're making to that goal. So take your unstructured response, usually a reply to a text message, and then it would like, plot it for you in like a table and you could see your progress over time. It could be for running or tracking calories. But I wanted to use GPT-3 to make it seem more natural because I remember someone on Bookface, which is still YC's internal forum. They posted and they were like, OpenAI just released AGI and it's GPT-3. I asked it like a bunch of logic puzzles and it solved them all perfectly. And I was like, what? How's no one else talking about this? Like this is either like the greatest thing ever that everyone is missing or like it's not that good. So like I tweeted out if anyone could get me access to it. A few hours later, Greg Brockman responded. Swyx: He is everywhere. Sharif: He's great. Yeah, he's on top of things. And yeah, by that afternoon, I was like messing around with the API and I was like, wow, this is incredible. You could chat with fake people or people that have passed away. You could like, I remember the first conversation I did was this is a chat with Steve Jobs and it was like, interviewer, hi. What are you up to today on Steve? And then like you could talk to Steve Jobs and it was somewhat plausible. Oh, the thing that really blew my mind was I tried to generate code with it. So I'd write the function for a JavaScript header or the header for a JavaScript function. And it would complete the rest of the function. I was like, whoa, does this code actually work? Like I copied it and ran it and it worked. And I tried it again. 
I gave more complex things and like I kind of understood where it would break, which was like if it was like something, like if it was something you couldn't easily describe in a sentence and like contain all the logic for in a single sentence. So I wanted to build a way where I could visually test whether these functions were actually working. And what I was doing was like I was generating the code in the playground, copying it into my VS code editor, running it and then reloading the react development page. And I was like, okay, cool. That works. So I was like, wait, let me just put this all in like the same page so I can just compile in the browser, run it in the browser and then submit it to the API in the browser as well. So I did that. And it was really just like a simple loop where you just type in the prompt. It would generate the code and then compile it directly in the browser. And it showed you the response. And I did this for like very basic JSX react components. I mean, it worked. It was pretty mind blowing. I remember staying up all night, like working on it. And it was like the coolest thing I'd ever worked on at the time so far. Yeah. And then I was like so mind blowing that no one was talking about this whole GPT three thing. I was like, why is this not on everyone's minds? So I recorded a quick 30 second demo and I posted on Twitter and like I go to bed after staying awake for like 20 hours straight. When I wake up the next morning and I had like 20,000 likes and like 100,000 people had viewed it. I was like, oh, this is so cool. And then I just kept putting demos out for like the next week. And yeah, that was like my GPT three spark moment. Swyx: And you got featured in like Fast Company, MIT Tech Review, you know, a bunch of stuff, right? Sharif: Yeah. Yeah. I think a lot of it was just like the API had been there for like a month prior already. Swyx: Not everyone had access. Sharif: That's true. Not everyone had access. Swyx: So you just had the gumption to tweet it out. And obviously, Greg, you know, on top of things as always. Sharif: Yeah. Yeah. I think it also makes a lot of sense when you kind of share things in a way that's easily consumable for people to understand. Whereas if you had shown a terminal screenshot of a generating code, that'd be pretty compelling. But whereas seeing it get rendered and compiled directly in front of you, there's a lot more interesting. There's also that human aspect to it where you want to relate things to the end user, not just like no one really cares about evals. When you can create a much more compelling demo explaining how it does on certain tasks. [09:00] Stable Diffusion and LexicaSwyx: Okay. We'll round it out soon. But in 2022, you moved from Debuild to Lexica, which was the search engine. I assume this was inspired by stable diffusion, but I can get the history there a little bit. Sharif: Yeah. So I was still working on Debuild. We were growing at like a modest pace and I was in the stable... Swyx: I was on the signup list. I never got off. Sharif: Oh yeah. Well, we'll get you off. It's not getting many updates anymore, but yeah, I was in the stable diffusion discord and I was in it for like many hours a day. It was just like the most exciting thing I'd ever done in a discord. It was so cool. Like people were generating so many images, but I didn't really know how to write prompts and people were like writing really complicated things. 
They would be like, like, a modern home, trending on ArtStation, by Greg Rutkowski, like, 4K, Unreal Engine. It's like, there's no way that actually makes the images look better. But everyone was just kind of copying everyone else's prompts and like changing like the first few words. Swyx: Yeah. Yeah. Sharif: So I was like using the Discord search bar and it was really bad because it showed like five images at a time. And I was like, you know what? I could build a much better interface for this. So I ended up scraping the entire Discord. It was like 10 million images. I put them in a database and I just pretty much built a very basic search engine where you could just type a word and then it returned all the prompts that had that word. And I built the entire website for it in like 20, in like about two days. And we shipped it, I shipped it, the day after the Stable Diffusion weights were open sourced. So about 24 hours later, and it kind of took off in a way that I never would have expected. Like I thought it'd be this cool utility that like hardcore Stable Diffusion users would find useful. But it turns out that almost anyone who mentioned Stable Diffusion would also kind of mention Lexica in conjunction with it. I think it's because it was like it captured the zeitgeist in an easy to share way, where it's like this URL and there's this gallery and you can search. Whereas running the model locally was a lot harder. You'd have to like, deploy it on your own GPU and like set up your own environment and like do all that stuff. Swyx: Oh, my takeaway. I have two more to add to the reasons why Lexica worked at the time. One is lower latency is all you need. So in other words, instead of waiting a minute for your image, you could just search and find stuff that other people have done. That's good. And then two is everyone knew how to search already, but people didn't know how to prompt. So you were the bridge. Sharif: That's true. Yeah. You would get a lot better looking images by typing a one word prompt versus prompting for that one word. Yeah. Swyx: Yeah. That is interesting. [11:00] Lexica's Explosion at Launch Alessio: The numbers kind of speak for themselves, right? Like 24 hours post launch, 51,000 queries, like 2.2 terabytes in bandwidth. Going back to the bandwidth problem that you had before, like you would have definitely run into that. Day two, you doubled that. It's like 111,000 queries, four and a half terabytes in bandwidth, 22 million images served. So it's pretty crazy. Sharif: Yeah. I think we're, we're doing like over 5 billion images served per month now. It's like, yeah, that's, it's pretty crazy how much things have changed since then. Swyx: Yeah. I'm still showing people like today, even today, you know, it's been a few months now. This is where you start to learn image prompting, because they don't know. Sharif: Yeah, it is interesting. And I, it's weird because I didn't really think it would be a company. I thought it would just be like a cool utility or like a cool tool that I would use for myself. And I really was just building it for myself just because I didn't want to use the Discord search bar. But yeah, it was interesting that a lot of other people found it pretty useful as well. [11:00] How Lexica Works Swyx: So there's a lot of things that you release in a short amount of time.
The God mode search was kind of like, obviously the first thing, I guess. Like maybe to talk about some of the underlying technology: you're using CLIP to kind of, you know, go from image to like description and then let people search it. Maybe talk a little bit about what it takes to actually make the search magic happen. Sharif: Yeah. So the original search was just using Postgres' full text search and it would only search the text contents of the prompt. But I was inspired by another website called Same Energy, which is like a visual search engine. It's really cool. Do you know what happened to that guy? I don't. Swyx: He released it and then he disappeared from the internet. Sharif: I don't know what happened to him, but I'm sure he's working on something really cool. He also worked on like Tabnine, which was like the very first version of Copilot, or like even before Copilot was Copilot. But yeah, inspired by that, I thought like being able to search images by their semantics, the contents of the image, was really interesting. So I pretty much decided to create a search index on the CLIP embeddings, the CLIP image embeddings, of all the images. And when you would search it, we would just do KNN search on pretty much the image embedding index. I mean, we had way too many embeddings to store on like a regular database. So we had to end up using FAISS, which is a Facebook library for really fast KNN search and embedding search. That was pretty fun to set up. It actually runs only on CPUs, which is really cool. It's super efficient. You compute the embeddings on GPUs, but like you can serve it all on like an eight core server and it's really, really fast. Once we released the semantic search on the CLIP embeddings, people were using the search way more. And you could do other cool things. You could do like similar image search, where if you found like a specific image you liked, you could upload it and it would show you relevant images as well. Swyx: And then right after that, you raised your seed money from AI Grant, Nat Friedman and Daniel Gross. Sharif: Yeah, we raised about $5 million from Daniel Gross. And then we also participated in AI Grant. That was pretty cool. That was kind of the inflection point. Not much before that point, Lexica was kind of still a side project. And I told myself that I would focus on it full time, or I'd consider focusing on it full time, if we broke like a million users. I was like, oh, that's gonna be like years away for sure. And then we ended up doing that in like the first week and a half. I was like, okay, there's something here. And it was kind of, at that time Debuild was like growing like pretty slowly and like pretty linearly. And then Lexica was just like this thing that just kept going up and up and up. And I was so confused. I was like, man, people really like looking at pictures. This is crazy. Yeah. And then we decided to pivot the entire company and just focus on Lexica full time at that point. And then we raised our seed round. [15:00] Being Chronically Early Swyx: Yeah. So one thing that you casually dropped, a little slip: you said you were working on Lexica before the launch of Stable Diffusion, such that you were able to launch Lexica one day after Stable Diffusion. Sharif: Yeah. Swyx: How did you get so early into Stable Diffusion? Cause I didn't hear about it. Sharif: Oh, that's a good question. I, where did I first hear about Stable Diffusion? I'm not entirely sure. It must've been like somewhere on Twitter or something.
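Before the conversation moves on, here is a minimal sketch of the CLIP-embedding plus FAISS k-NN setup Sharif describes above. The random vectors stand in for real CLIP image embeddings, and the 512-dimension figure and flat index choice are assumptions for illustration, not Lexica's actual configuration.

```python
# Sketch of the search setup described above: embed images (and queries) with
# CLIP, then do k-NN over the embedding index with FAISS, served on CPU.
import faiss
import numpy as np

dim = 512            # CLIP ViT-B/32 image embeddings are 512-dimensional
num_images = 10_000  # stand-in for the scraped prompt/image corpus

# Normally: embeddings = clip_model.encode_image(images); random stand-ins here.
embeddings = np.random.rand(num_images, dim).astype("float32")
faiss.normalize_L2(embeddings)          # L2-normalize so inner product = cosine

index = faiss.IndexFlatIP(dim)          # exact inner-product search
index.add(embeddings)

# A text query would be embedded into the same space with the CLIP text encoder;
# here it is another random vector.
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)
print("top-5 image ids:", ids[0], "scores:", scores[0])
```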
That changed your life. Yeah, it was great. And I got into the discord cause I'd used Dolly too before, but, um, there were a lot of restrictions in place where you can generate human faces at the time. You can do that now. But when I first got access to it, like you couldn't do any faces. It was like, there were like a, the list of adjectives you couldn't use was quite long. Like I had a friend from Pakistan and it can generate anything with the word Pakistan in it for some reason. But Stable Diffusion was like kind of the exact opposite where there were like very, very few rules. So that was really, really fun and interesting, especially seeing the chaos of like a bunch of other people also using it right in front of you. That was just so much fun. And I just wanted to do something with it. I thought it was honestly really fun. Swyx: Oh, well, I was just trying to get tips on how to be early on things. Cause you're pretty consistently early to things, right? You were Stadia before Stadia. Um, and then obviously you were on. Sharif: Well, Stadia is kind of shut down now. So I don't know if being early to that was a good one. Swyx: Um, I think like, you know, just being consistently early to things that, uh, you know, have a lot of potential, like one of them is going to work out and you know, then that's how you got Lexica. [16:00] From Search to Custom ModelsAlessio: How did you decide to go from search to running your own models for a generation? Sharif: That's a good question. So we kind of realized that the way people were using Lexica was they would have Lexica open in one tab and then in another tab, they'd have a Stable Diffusion interface. It would be like either a discord or like a local run interface, like the automatic radio UI, um, or something else. I just, I would watch people use it and they would like all tabs back and forth between Lexica and their other UI. And they would like to scroll through Lexica, click on the prompt, click on an image, copy the prompt, and then paste it and maybe change a word or two. And I was like, this should really kind of just be all within Lexica. Like, it'd be so cool if you could just click a button in Lexica and get an editor and generate your images. And I found myself also doing the all tab thing, or it was really frustrating. I was like, man, this is kind of tedious. Like I really wish it was much simpler. So we just built generations directly within Lexica. Um, so we do, we deployed it on, I don't remember when we first launched, I think it was November, December. And yeah, people love generating directly within it. [17:00] AI Grant LearningsSwyx: I was also thinking that this was coming out of AI grants where, you know, I think, um, yeah, I was like a very special program. I was just wondering if you learned anything from, you know, that special week where everyone was in town. Sharif: Yeah, that was a great week. I loved it. Swyx: Yeah. Bring us, bring us in a little bit. Cause it was awesome. There. Sharif: Oh, sure. Yeah. It's really, really cool. Like all the founders in AI grants are like fantastic people. And so I think the main takeaway from the AI grant was like, you have this massive overhang in compute or in capabilities in terms of like these latest AI models, but to the average person, there's really not that many products that are that cool or useful to them. 
Like the latest one that has hit the zeitgeist was ChatGPT, which used arguably the same GPT-3 model, but with RLHF. And you could have arguably built a decent ChatGPT product just using the original GPT-3 model, but no one really did it. Now, there were some restrictions in place, and OpenAI did slowly relax them over the months and years after they released the original API. But the core premise behind AI Grant is that there are way more capabilities than there are products. So focus on building really compelling products and get people to use them, and focus less on things like hitting state of the art on evals and more on getting users to use something. Swyx: Make something people want. Sharif: Exactly. Host: Yeah, we did an episode on LLM benchmarks, and we talked about how the benchmarks constrain what people work on, because if your model is not going to do well on the well-known benchmarks, it's not going to get as much interest and funding. So going at it from a product lens is cool.

[19:30] The Text to Image Illuminati?
Swyx: My hypothesis, when I was seeing the sequence of events for AI Grant and then for Lexica Aperture, was that you had some kind of magical dinner with Emad and David Holz, and then they taught you the secrets of training your own model. Is that how it happens? Sharif: No, there's no secret dinner. The Illuminati of text to image. We did not have a meeting. I mean, even if we did, I wouldn't tell you. But it really boils down to just having good data. If you think about diffusion models, really the only thing they do is learn a distribution of data. So if you have high quality data, it will learn that high quality distribution; if you have low quality data, it will learn to generate images that look like they're from that distribution. So really it boils down to the data, the amount of data you have, and the quality of that data, which means a lot of the work in training high quality models, at least diffusion models, is not really in the model architecture, but rather in filtering the data in a way that makes sense. So for Lexica, we do a lot of aesthetic scoring on images, and we use the rankings we get from our website, because we get tens of millions of people visiting it every month, so we can capture a lot of rankings. Oh, this person liked this image when they saw this one right next to it; therefore, they probably preferred this one over that. You can do pairwise ranking to rank images and then compute Elo scores. You can also train aesthetic models that learn to classify whether or not someone will like an image, or rank it on a scale of one to ten, for example. So we mostly use a lot of the traffic we get from Lexica, use that to filter our data sets, and use that to train better aesthetic models.
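As a toy illustration of the pairwise-ranking-to-Elo idea Sharif just described, here is a minimal sketch. The K-factor, starting rating, and click-log format are arbitrary choices for illustration, not Lexica's actual pipeline.

```python
# Toy Elo scoring over pairwise image preferences ("user preferred A over B").
# Illustrative only: K-factor, starting rating, and the click log are assumptions.
from collections import defaultdict

K = 32  # how fast ratings move per comparison
ratings = defaultdict(lambda: 1500.0)

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(winner, loser):
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# Hypothetical click log: (image the user clicked, image shown right next to it).
clicks = [("img_a", "img_b"), ("img_a", "img_c"), ("img_c", "img_b")]
for winner, loser in clicks:
    update(winner, loser)

# Keep the highest-rated images when filtering the training set.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```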
[20:30] How to Learn to Train Models
Swyx: You had been a machine learning engineer before. You've been more of an infrastructure guy. At Debuild, you were more of a prompt engineer with a bit of web design. This was the first time that you were basically training your own model. What was the ramp up like? You know, not to give away any secret sauce, but I think a lot of people who are traditional software engineers are feeling a lot of, I don't know, fear when encountering these kinds of domains. Sharif: Yeah, I think it makes a lot of sense. And to be fair, I didn't have much experience training massive models at this scale before I did it. A lot of times it's really just like, in the same way that when you're first learning to program, you would just take the problem you're having, Google it, and go through the Stack Overflow posts, and you figure it out, but ultimately you will get to the answer. It might take you a lot longer than someone who's experienced, but I think there are enough resources out there that it's possible to learn how to do these things, even if it's just reading through GitHub issues for relevant models. Swyx: Oh God. Sharif: Yeah. It's really just like, you might be slower, but it's definitely still possible. And there are really great courses out there. The Fast AI course is fantastic. There's the Deep Learning Book, which is great for fundamentals. And then Andrej Karpathy's online courses are also excellent, especially for language modeling. You might be a bit slower for the first few months, but ultimately I think if you have the programming skills, you'll catch up pretty quickly. It's not like this magical dark science that only three people in the world know how to do well. It probably was like 10 years ago, but now it's becoming much more open. You have open source collectives like Eleuther and LAION, where they share the details of their large scale training runs, so you can learn from a lot of those people. Swyx: Yeah. I think what is different for programmers is having to estimate significant costs upfront before they hit run. Because it's not a thing that you normally consider when you're coding, but yeah, burning through your credits is a fear that people have. Sharif: Yeah, that does make sense. In that case, fine tuning larger models gets you really, really far, even using things like low rank adaptation to fine tune, where you can fine tune much more efficiently on a single GPU. Yeah, I think people are underestimating how far you can really get just using open source models. I mean, before Lexica, I was working on Debuild and we were using the GPT-3 API, but I was also really impressed at how well you could get open source models to run by just using the API, collecting enough samples from real world user feedback or real world user data using your product, and then fine tuning the smaller open source models on those examples. And now you have a model that's pretty much state of the art for your specific domain, whereas the runtime cost is like 10 times or even 100 times cheaper than using an API. Swyx: And was that like GPT-J, or are you talking BERT? Sharif: I remember we tried GPT-J, but I think FLAN-T5 was the best model we were able to use for that use case. FLAN-T5 is awesome. If your prompt is small enough, it's pretty great. And I'm sure there are much better open source models now, like Vicuna, which is like the GPT-4 variant of LLaMA, fine tuned on GPT-4 outputs. Yeah, they're just going to get better, and they're going to get better much, much faster. Swyx: Yeah. We were just talking in a previous episode to the creator of Dolly, Mike Conover, which is actually commercially usable, instead of Vicuna, which is a research project. Sharif: Oh, wow. Yeah, that's pretty cool.
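To make the "collect samples from your product, then LoRA fine tune a small open source model" recipe concrete, here is a rough sketch using Hugging Face transformers and peft. The base model size, LoRA hyperparameters, and the example pair are assumptions, not what Debuild or Lexica actually used.

```python
# Sketch of LoRA fine-tuning FLAN-T5 on (prompt, completion) pairs harvested
# from real product usage. Hyperparameters and data are illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Train only small low-rank adapters instead of all weights, so it fits on one GPU.
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32,
                    lora_dropout=0.05, target_modules=["q", "v"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# One illustrative training pair collected from real-world usage of your product.
inputs = tok("Summarize: the user asked how to reset their password ...",
             return_tensors="pt")
labels = tok("Tell the user to use the 'Forgot password' link.",
             return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # plug this into your usual training loop
loss.backward()
```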
[24:00] Why No Agents?
Alessio: I know you mentioned being early. Obviously, agents are one of the hot things here. In 2021, you had this 'please buy me AirPods' demo that you tweeted using the GPT-3 API. Obviously, one of the things about being early in this space is you can only do one thing at a time, right? And you had one tweet recently where you said you hoped that that demo would open Pandora's box for a bunch of weird GPT agents, but all we got were docs powered by GPT. Can you maybe talk a little bit about things that you wish you would see? In the last few weeks, we've had HuggingGPT, BabyAGI, AutoGPT, all these different agent projects that maybe now are getting closer to the, what did you say, 50% of internet traffic being GPT agents. What are you most excited about with these projects, and what's coming? Sharif: Yeah, so we wanted a way for users to be able to paste in a link for the documentation page for a specific API and then describe how to call that API. The way we would need to do that for Debuild was, we wondered if we could get an agent to browse the docs page, read through it, summarize it, and then maybe even do things like create an API key and register it for that user. To do that, we needed a way for the agent to read the web page and interact with it. So I spent about a day working on that demo, where we just took the web page and serialized it into a more compact form that fit within the 2,048 token limit of GPT-3 at the time, and then decided what action to do. If the page was too long, it would break it down into chunks, and then you would have a sub-prompt decide which chunk had the best action, and at the top node you would just take that action and then run it in a loop. It was really, really expensive. I think that one 60 second demo cost like a hundred bucks or something, so it was wildly impractical. But you could clearly see that agents were going to be a thing, especially ones that could read and write and take actions on the internet. It was just prohibitively expensive at the time, and the context limit was way too small. But yeah, it seems like a lot of people are taking it more seriously now, mostly because GPT-4 is way more capable. The context limit's like four times larger at 8,000 tokens, soon 32,000. And I think the only problem that's left to solve is finding a really good representation for a webpage that allows it to be consumed by a text only model. So some examples are: you could just take all the text and pass it in, but that's probably too long. You could take only the interactive elements like buttons and inputs, but then you miss a lot of the relevant context. There's one interesting approach which I really like: you could run the webpage in a terminal based browser. There are some browsers that run in your terminal, which serialize everything into text. What you can do is just take that frame from the terminal based browser and pass it directly to the model, and it's a really, really good representation of the webpage, because for graphical elements they render it using ASCII blocks, but for text they render it as actual text. So you could just remove all the weird graphical elements and keep all the text, and that works surprisingly well.
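Here is a rough sketch of the chunk-then-act loop Sharif describes from that demo. The llm() callable, prompts, and chunk size are stand-ins, not the original Debuild code.

```python
# Rough sketch of "chunk the page, pick the best chunk, then pick an action".
# llm() is a placeholder for any text completion API; prompts and the chunk
# size are illustrative assumptions.
import textwrap

CHUNK_TOKENS = 1500          # stay well under the model's context window
APPROX_CHARS_PER_TOKEN = 4   # rough heuristic for English text

def chunk(page_text: str):
    return textwrap.wrap(page_text, CHUNK_TOKENS * APPROX_CHARS_PER_TOKEN)

def agent_step(page_text: str, goal: str, llm) -> str:
    chunks = chunk(page_text)
    if len(chunks) > 1:
        # Sub-prompt: ask the model which chunk is most relevant to the goal.
        best = llm(f"Goal: {goal}\nWhich chunk (0-{len(chunks)-1}) is most "
                   f"relevant? Reply with the number only.\n"
                   + "\n---\n".join(chunks))
        page_text = chunks[int(best.strip())]
    # Top-level prompt: decide the next concrete action on that chunk.
    return llm(f"Goal: {goal}\nPage:\n{page_text}\n"
               f"Reply with one action, e.g. CLICK <id> or TYPE <id> <text>.")

# Driver loop (pseudo-ish): serialize the page, pick an action, execute, repeat.
# while not done:
#     action = agent_step(serialize(browser.page()), goal, llm)
#     browser.execute(action)
```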
And then there are other problems to solve, like how do you get the model to take an action? For example, if you have a booking page and there's a calendar with 30 days on it, how do you get it to specify which button to press? It could say 30, and you can do string matching and find the 30. But for example, what if it's a list of friends on Facebook and you're trying to delete a friend? There might be 30 delete buttons. How do you specify which one to click on? The model might say, oh, click on the one for Mark, but then you'd have to figure out the delete button in relation to Mark. And there are some ways to solve this. One is there's a cool Chrome extension called Vimium, which lets you use Vim in your Chrome browser. What it does is, you can press F, and over every interactive element it gives you a character or two characters, and if you type those characters, it presses that button or it focuses on that input. So you could combine a lot of these ideas and get a really good representation of the web browser in text, and also give the model a really, really good way to control the browser as well. And I think those two are the core parts of the problem. The reasoning ability is definitely there. If a model can score in the top 10% on the bar exam, it can definitely browse a web page. It's really just how do you represent the page to the model and how do you get the model to perform actions back on the web page. Really, it's just an engineering problem. Swyx: I have one doubt, which I'd love your thoughts on. How do you get the model to pause when it doesn't have enough information and ask you for additional information, because you under-specified your original request? Sharif: This is interesting. I think the only way to do this is to have a corpus where your training data is these sessions of agents browsing the web, and you have to pretty much figure out where the agents went wrong and replace that with, hey, I need some help. And then if you were to fine tune a larger model on that data set, you would pretty much get it to say, hey, I need help, on the instances where it didn't know what to do next. Or, if you're using a closed source model like GPT-4, you could probably tell it: if you're uncertain about what to do next, ask the user for help. And it probably would be pretty good at that. I've had to write a lot of integration tests in my engineering days and, like, the DOM. Alessio: They might be over. Yeah, I hope so. I hope so. I don't want to deal with that anymore. Yeah, I don't want to write them the old way. But I'm just thinking, you know, we had robots.txt for crawlers. I can definitely see the DOM being reshaped a little bit in terms of accessibility. Like sometimes you have to write XPaths that are so long just to get to a button. There should be a better way to do it, and maybe this will drive the change, you know, making it easier for these models to interact with your website. Sharif: There is the Chrome accessibility tree, which is used by screen readers, but a lot of times it's missing a lot of useful information. In a perfect world, everything would be perfectly annotated for screen readers and we could just use that, but that's not the case.
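As a toy version of the Vimium-style hinting idea above (label every interactive element, show the labels to the model, then map its answer back to a concrete element), here is a sketch over a made-up, simplified element representation.

```python
# Sketch of Vimium-style hinting for an agent: tag each interactive element
# with a short label, render the labels into the page text the model sees,
# then resolve the model's answer back to the element. Elements are simplified
# dicts here, not real DOM nodes.
import itertools
import string

def hint_labels():
    """aa, ab, ac, ... two-letter hints, like Vimium's."""
    for a, b in itertools.product(string.ascii_lowercase, repeat=2):
        yield a + b

def annotate(elements):
    """elements: list of dicts like {'tag': 'button', 'text': 'Delete'}."""
    labels = hint_labels()
    hints = {next(labels): el for el in elements}
    rendered = "\n".join(f"[{h}] <{el['tag']}> {el['text']}"
                         for h, el in hints.items())
    return hints, rendered

elements = [{"tag": "button", "text": "Delete (Mark)"},
            {"tag": "button", "text": "Delete (Alice)"}]
hints, rendered = annotate(elements)
print(rendered)                           # what the model sees
model_answer = "ab"                       # pretend the model replied "press ab"
print("clicking:", hints[model_answer])   # unambiguous element to act on
```

The point of the hints is exactly the "which of the 30 delete buttons" problem: the model never has to describe an element, it just names a label.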
[29:30] GPT4 and Multimodality
Swyx: GPT-4 multimodal: has your buddy Greg come through, and do you think that would solve essentially browser agents or desktop agents? Sharif: Greg has not come through yet, unfortunately. But it would make things a lot easier, especially for graphically heavy web pages. So for example, if you were using Yelp in the map view, it would make a lot of sense to use something like that versus a text based input. How do you serialize a map into text? It's kind of hard to do that. So for more complex web pages, that would make it a lot easier, and you get a lot more context to the model. I mean, it seems like that multimodal input is very dense, in the sense that it can read text and it can read it really, really well. So you could probably give it a PDF and it would be able to extract all the text and summarize it. And if it can do that, it could probably do anything on any webpage. Swyx: Yeah. And given that you have some experience integrating CLIP with language models, how would you describe how different GPT-4 is compared to that stuff? Sharif: Yeah. CLIP is entirely different, in the sense that it's really just good at putting images and text into the same latent space, and really the only thing that's useful for is similarity and clustering. Swyx: Like literally the same energy, right? Sharif: Yeah. Swyx: Yeah. And then there's BLIP and BLIP-2. I don't know if you like those. Sharif: Yeah. BLIP-2 is a lot better. There's actually a new project called, I think, MiniGPT-4. Swyx: Yes. It was just out today. Sharif: Oh, nice. Yeah. It's really cool. It's actually really good. I think that one is based on the LLaMA model, but yeah, that's like another. Host: It's BLIP plus LLaMA, right? So they're running through BLIP and then have LLaMA interpret your questions so that you can do visual QA. Sharif: Oh, that's cool. That's really clever. Yeah. Ensemble models are really useful. Host: Well, so I was trying to articulate, cause there are two things people are talking about today. The moment you wake up, you open Hacker News and go like, all right, what's the new thing today? One is RedPajama, and then the other one is MiniGPT-4. So I was trying to articulate, why is this not GPT-4? What is missing? And my only conclusion was it just doesn't do OCR yet. But I wonder if there's anything core to this concept of multimodality that you have to train these things together. What does one model doing all these things do that is separate from an ensemble of models that you just kind of duct tape together? Sharif: It's a good question. This is pretty related to interpretability. Like, how do we understand why modalities trained within the same model perform better than two models trained separately? I can kind of see why that is the case. It's kind of hard to articulate, but when you have two different models, you get the reasoning abilities of a language model, but also the text or the vision understanding of something like CLIP, whereas CLIP clearly lacks the reasoning abilities. But if you could somehow just put them both in the same model, you get the best of both worlds. There were even cases where I think the vision version of GPT-4 scored higher on some tests than the text only version. So there might even be some additional learning from images as well. Swyx: Oh yeah. Well, the easy answer for that was there was some chart in the test that wasn't translated. When I read that, I was like, oh yeah, okay, that makes sense. Sharif: That makes sense. I thought it'd just be like, it sees more of the world, therefore it has more tokens.
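To make Sharif's earlier contrast concrete (CLIP just scores how well text and an image match in a shared latent space, with no reasoning on top), here is the standard Hugging Face way to do it. The checkpoint, image path, and captions are placeholders.

```python
# Small sketch of what CLIP is good at per Sharif: similarity between an image
# and candidate captions in a shared latent space. Standard Hugging Face usage;
# the image path and captions are made up.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.png")  # placeholder
captions = ["a map of a city", "a plate of food", "a screenshot of code"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))  # which caption fits the image best
```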
Swyx: So my equivalent of this is, I think it's a well-known fact that adding code to a language model's training corpus increases its ability to do language, not just code. The diversity of datasets that represent some kind of internal logic helps, and code is obviously very internally, logically consistent, so it helps the language model learn some internal structure. So, you know, my ultimate test for GPT-4 is to show it the 'this is not a pipe' image and ask it if it's a pipe or not and see what it does. Sharif: Interesting. That is pretty cool. Yeah. Or just give it a screenshot of your VS Code editor and ask it to fix the bug. Yeah, that'd be pretty wild if it could do that. Swyx: That would be adult AGI. That would be the grownup form of AGI.

[33:30] Sharif's Startup Manual
Swyx: On your website, you have this startup manual where you give a bunch of advice. This is fun. One of them was that you should be shipping to production like every two days, every other day. This seems like a great time to do it, because things change every other day. But maybe, yeah, tell some of our listeners a little bit more about how you got to some of these heuristics. You obviously build different projects and you iterate a lot on things. Do you want to reference this? Sharif: Sure. Yeah, I'll take a look at it. Swyx: We'll put this in the show notes, but I just wanted you to have the opportunity to riff on this list, because I think it's a very good list. And which one of them helped you for Lexica, if there's anything interesting. Sharif: So this list is pretty funny. It's mostly just me yelling at myself based on all the mistakes I've made in the past and me trying to not make them again. Yeah. So the first one is, I think, the most important one: when you're building a product, try to build the smallest possible version. And I mean, for Lexica, it was literally one screen in the React app with a Postgres database, and it just showed you images. I don't even know if the first version had search. I think it did, but I'm not sure. I think it was really just a grid of images that were randomized. But yeah, build the absolute smallest thing that can be considered a useful application and ship it. For Lexica, that was: it helps me write better prompts. That's pretty useful. It's not that useful, but it's good enough. Don't fall into the trap of intellectual indulgence with over-engineering. I think that's a pretty important one for myself, and also for anyone working on new things. Oftentimes you fall into the trap of thinking you need to add more and more things, when in reality, the moment it's useful, you should probably get it in the hands of your users, and they'll kind of set the roadmap for you. I know this has been said millions of times before, but I think it's really, really important. And I think if I'd spent two months working on Lexica, adding a bunch of features, it wouldn't have been anywhere near as popular as it was if I had just released the really, really boiled down version alongside the Stable Diffusion release. Yeah. And then there are a few more, like: product development doesn't start until you launch. Think of your initial product as a means to get your users to talk to you. It's also related to the first point, where you really just want people using something as quickly as you can get that to happen. And then a few more are pretty interesting. Create a product people love before you focus on growth.
If your users are spontaneously telling other people to use your product, then you've built something people love. Swyx: So this is pretty, it sounds like you've internalized Paul Graham's stuff a lot. Yeah. Because I think he said stuff like that. Sharif: A lot of these are just probably me taking notes from books I found really interesting, or PG essays that were really relevant at the time, and then trying to not forget them. I should probably read this list again. There's some pretty personalized advice for me here. Oh yeah. One of my favorite ones is: don't worry if what you're building doesn't sound like a business. Nobody thought Facebook would be a $500 billion company. It's easy to come up with a business model once you've made something people want. You can even make pretty web forms and turn that into a 200 person company. And then if you click the link, it's to the LinkedIn for Typeform, which is now, I think, like an 800 person company or something like that. So they've grown quite a bit. There you go. Yeah. Pretty web forms are a pretty good business, even though it doesn't sound like it. Yeah. It's worth a billion dollars.

[38:30] Lexica Aperture V1/2/3
Swyx: One way I would like to tie that to the history of Lexica, which we didn't go over, is to just walk us through Aperture V1, V2, V3, which you just released last week, and how maybe some of those principles helped you in that journey. Sharif: Yeah. So V1 was us trying to create a very photorealistic version of our own Stable Diffusion model. V1 actually didn't turn out to be that popular. It turns out what people loved was not generating. Swyx: Your marketing tweets were popular. Sharif: They were quite popular. So I think at the time you couldn't get Stable Diffusion to generate photorealistic images that were consistent with your prompt that well. It was more like you were sampling from this distribution of images, and you could slightly pick where you sampled from using your prompt. This was mostly because the CLIP text encoder is not the best text encoder. If you use a real language model, like T5, you get much better results. The T5 XXL model is like a hundred times larger than the CLIP text encoder for Stable Diffusion 1.5. So you could kind of steer it in the general direction, but for more complex prompts, it just didn't work. So a lot of our users actually complained that they preferred the Stable Diffusion 1.5 model over the Aperture model, and it was just because a lot of people were using it to create art and really weird abstract looking pictures that didn't really work well with a photorealistic model trained solely on photorealistic images. And then for V2, we took that into consideration and trained it more on a lot of the art images on Lexica. So we took a lot of images on Lexica that were art, used those to train aesthetic models that ranked art really well, and then filtered larger sets to train V2. And then V3 is kind of an improved version of that with much more data. I'm really glad we didn't spend too much time on V1. I think we spent about one month working on it, which is a lot of time, but a lot of the things we learned were useful for training future versions. Swyx: How do you version them? Like, where do you decide, okay, this is V2, this is V3? Sharif: The versions are kind of weird, where you can't really use semantic versions, because if you have a small update, you usually just make that like V2.
Versions are kind of used for different base models, I'd say. So each of the versions was a different base model, but we've done fine tunes of the same version and then just released an update without incrementing the version. But I think when there's a clear change, where running the same prompt on a model gets you a different image, that should probably be a different version.

[40:00] Request for AI Startup - LLM Tools
Alessio: So the startup manual was the things you can actually do today to make it better. And then you have a whole future page that has tips on everything from what the Siri successor is going to be like, to why everyone's genome should be sequenced, to why we need to develop stimulants with shorter half-lives so that we can sleep better. There's a lot of cool stuff in there. Maybe talk a bit about, you know, when you're a founder, you need to be focused, right? So sometimes there's a lot of things you cannot build, and I feel like this page is a bit of a collection of those. Are there any of these things where you're like, if I were not building Lexica today, this is a very interesting thing? Sharif: Oh man. Yeah. There's a ton of things that I want to build. I mean, off the top of my head, the most exciting one would be better tools for language models. And I mean, not tools that help us use language models, but rather tools for the language models themselves. So things like giving them access to browsers, giving them access to things like payments and credit cards, giving them access to real world robots. It'd be cool if you could have a Boston Dynamics Spot powered by a language model reasoning module, and it would do things for you, like go and pick up your order, stuff like that, entirely autonomously, given high level commands. That'd be the number one thing if I wasn't working on Lexica.

[40:00] Sequencing your Genome
And then there are some other interesting things, like genomics, which I find really cool. There's some pretty cool stuff you can do with consumer genomics. You can export your genome from 23andMe as a text file, like literally a text file of your entire genome. And there is another tool called Promethease, I think, where you upload your 23andMe text file genome, and it maps specific SNPs that you have in your genome to studies that have been done on those SNPs. And it tells you really, really useful things about yourself. For example, I have the SNP for this thing called delayed sleep phase disorder, which makes me go to sleep about three hours later than the general population. So I used to always be a night owl and I never knew why, but after using Promethease, it pretty much tells you: oh, you have the specific SNP for DSPS. It's a really tiny percentage of the population, and it's something you should probably know about. And there's a bunch of other things. It tells you your likelihood of getting certain diseases, certain cancers oftentimes, even weird personality traits. There's one for, like, I have one of the SNPs for increased risk taking and optimism, which is pretty weird. That's an actual thing. Like, I don't know how. This is the founder gene. You should sequence everybody. It's pretty cool. And it's like $10 for Promethease and like 70 bucks for 23andMe. And it explains to you how your body works and the things that are different about you from the general population. Wow. Highly recommend everyone do it. If you're concerned about privacy, just purchase a 23andMe kit with a fake name. You don't have to use your real name. I didn't use my real name.
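For the curious, the 23andMe raw export Sharif mentions is just a tab-separated text file (rsid, chromosome, position, genotype, with lines starting with '#' as comments), so poking at it yourself is a few lines of code. The rsIDs and annotations below are placeholders for illustration, not Promethease's actual database and not medical advice.

```python
# Sketch of reading a 23andMe raw-data export and checking a few SNPs of interest.
# Assumes the standard tab-separated export format; rsIDs below are made up.
def load_genome(path):
    genome = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            rsid, chromosome, position, genotype = line.rstrip("\n").split("\t")
            genome[rsid] = genotype
    return genome

# Hypothetical lookup table in the spirit of a Promethease-style report.
SNPS_OF_INTEREST = {
    "rs0000001": "example: circadian-rhythm-related variant",
    "rs0000002": "example: caffeine metabolism variant",
}

genome = load_genome("genome_raw_data.txt")  # your exported file
for rsid, note in SNPS_OF_INTEREST.items():
    print(rsid, genome.get(rsid, "not genotyped"), "-", note)
```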
Swyx: It's just my genes. The worst you can do is clone me. It ties in with what you were talking about, with 'we want the future to be like this,' and people are building uninspired B2B SaaS apps. You and I had an exchange about this.

[42:00] Believe in Doing Great Things
How can we get more people to believe they can do great things? Sharif: That's a good question. A lot of the things I've been working on with GPT-3 have been trying to solve this by getting people to think about more interesting ideas. I don't really know. I think the low effort version of this is just putting out really compelling demos and getting people inspired. And then the higher effort version is actually building the products yourself and getting people to realize this is even possible in the first place. I think the BabyAGI project and the GPT agent projects on GitHub, in practice today, are not super useful, but they're doing an excellent job of getting people incredibly inspired by what can be possible with language models as agents. And also the Stanford paper where they had the mini version of The Sims. Yeah. That one was incredible. That was awesome. Swyx: It was adorable. Did you see the part where they invented day drinking? Sharif: Oh, they did? Swyx: Yeah. You're not supposed to go to these bars in the afternoon, but they were like, we're going to go anyway. Nice. Sharif: That's awesome. Yeah. I think we need more stuff like that. That one paper is probably going to inspire a whole bunch of teams to work on stuff similar to that. Swyx: And that's great. I can't wait for NPCs to actually be something that you talk to in a game and, you know, have their own lives, and you can check in, and they would have their own personalities as well. Sharif: Yeah. This is kind of off topic, but I was playing The Last of Us Part II, and the NPCs in that game are really, really good. If you point a gun at them, they'll beg for their life, like, please, I have a family. And when you kill people in the game, they're like, oh my God, you shot Alice. They're just NPCs, but they refer to each other by their names and they plead for their lives. And this is just using regular conditional rules on NPC behavior. Imagine how much better it'd be if it was like a small GPT-4 agent running in every NPC, and they had the agency to make decisions and plead for their lives. I don't know, you'd feel way more guilty playing that game. Alessio: I'm scared it's going to be too good. I played a lot of hours of Fallout, so I feel like if the NPCs were a lot better, you would spend a lot more time playing the game. Yeah.

[44:30] Lightning Round
Let's jump into the lightning round. First question is your favorite AI product. Sharif: Favorite AI product. The one I use the most is probably ChatGPT. The one I'm most excited about is actually a company in AI Grant. They're working on a version of VS Code that's entirely AI powered: Cursor, yeah.
Cursor, where you give it a prompt and iterate on your code not by writing code, but by describing the changes you want to make. And it's tightly integrated into the editor itself, so it's not just another plugin. Swyx: Would you, as the founder of a low code prompting-to-code company that pivoted, advise them to explore some things or stay away from some things? What's the learning there that you would give to them? Sharif: I would focus on one specific type of user. So if I'm building a low code tool, I would try to not focus too much on appealing to developers, whereas if I was building an alternative to VS Code, I would focus solely on developers. So in that sense, I think they're doing a pretty good job focusing on developers. Swyx: Are you using Cursor right now? Sharif: I've used it a bit. I haven't converted fully, but I really want to. Okay. It's getting better really, really fast. I can see myself switching over sometime this year if they continue improving it. Swyx: Hot tip for ChatGPT: people always say, you know, they love ChatGPT. The biggest upgrade to my life right now is that I forked a menu bar app I found on GitHub, and now I just have it running in a menu bar app. I do command shift G and it pops up as a single use thing, and there's no latency because it's always live. I just type in the thing I want and then it goes away after I'm done. Sharif: Wow. That's cool. Big upgrade. I'm going to install that. That's cool. Alessio: Second question. What is something you thought would take much longer, but it's already here? Like, what's your acceleration update? Sharif: Ooh, something I thought would take much longer, but it's already here. This is your question. Yeah, I know. I wasn't prepared. So I think it would probably be text to video. Swyx: Yeah. What's going on with that? Sharif: I think within this year, by the end of this year, we'll have the jump between the original DALL-E 1 and something like Midjourney; we're going to see that leap in text to video within the span of this year. It's not already here yet, so I guess the thing that surprised me the most was probably the multimodality of GPT-4, in the fact that it can technically see things, which is pretty insane. Swyx: Yeah. Is text to video something that Aperture would be interested in? Sharif: It's something we're thinking about, but it's still pretty early. Swyx: There was one project with hand-drawn animation with human poses; it was also coming out of Facebook. I thought that was a very nice way to accomplish text to video while having a high degree of control. I forget the name of that project. I think it was like 'drawing anything.' Sharif: Yeah. It sounds familiar. Swyx: Well, you already answered 'a year from now, what will people be most surprised by?' And maybe the usual request for a startup: you know, what's one thing you will pay for if someone built it? Sharif: One thing I would pay for if someone built it. So many things, honestly. I really want people to build more tools for language models, like useful tools. Give them access to Chrome. And I want to be able to give it a task, and then it just goes off and spins up a hundred agents that perform that task. And sure, 80 of them might fail, but 20 of them might succeed. That's all you really need. And they're agents.
You can spin up thousands of them. It doesn't really matter. The law of large numbers is on your side. So I would pay a lot of money for that, even if it was capable of only doing really basic tasks, like signing up for a SaaS tool and booking a call or something. And if it could do even more, where it could handle the email thread and get the person on the other end to do something, where I don't even have to book the demo, they just give me access to it, that'd be great. Yeah. More really weird language model tools would be really fun. Swyx: Are ChatGPT plugins a step in the right direction, or are you envisioning something else? Sharif: I think ChatGPT plugins are great, but they seem to only have read-only access right now. I want these theoretical agents to have write access to the world too. So they should be able to perform actions in web browsers, have their own email inbox, and have their own credit card with their own balance. Like, take it, send emails to people that might be useful in achieving their goal, ask them for help, be able to sign up and register for accounts on tools and services, be able to use graphical user interfaces really, really well, and also phone home if they need help. Swyx: You just described virtual employees. You want to give them a Brex card, right? Sharif: I wouldn't be surprised if, a year from now, there was Brex GPT, or like Brex cards for your GPT agents. Swyx: I mean, okay. I'm excited by this. Yeah. I kind of want to build it. Sharif: You should. Yeah. Alessio: Well, just to wrap up, we always have one big takeaway for people, like, you know, to display on a signboard for everyone to see. What is the big message to everybody? Sharif: Yeah. I think the big message to everybody is: you might think that a lot of the time the ideas you have have already been done by someone, and that may be the case, but a lot of the time the ideas you have are actually pretty unique and no one's ever tried them before. So if you have weird and interesting ideas, you should actually go out and just do them, make the thing, and then share that with the world. Cause I feel like we need more people building weird ideas and less people building, like, better GPT search for your documentation. Host: There are like 10 of those in the recent YC batch. Well, thank you so much. You've been hugely inspiring, and we're excited to see where Lexica goes next. Sharif: Appreciate it. Thanks for having me. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space