Open Source Friday: Building Data Pipelines with dlt and Elvis Kahoro
Chapters
Host introduces the episode, mentions Microsoft Build, and urges viewers to register for the digital events.
dlt (data load tool) makes moving data between sources and warehouses feel like writing Python: easy to adopt, highly scriptable, and open source.
Summary
GitHub's Open Source Friday hosts Elvis Kahoro and Thierry Jean (lead AI engineer at dltHub and a dlt maintainer) to spotlight dlt, an open source Python library for building data pipelines. Elvis emphasizes that dlt runs anywhere Python runs, from laptops to Airflow and GitHub Actions, and can load data into popular destinations like Snowflake, BigQuery, and Databricks through a single, consistent API. Thierry explains that dlt's core bet is that pipelines should be code, covering ETL/ELT workflows with incremental loading and schema normalization built in, which eliminates a lot of boilerplate. The guests describe a custom Hugging Face destination that turns exported agent traces into shareable datasets, and they highlight the growing ecosystem (around 10 million monthly downloads and roughly 9,000 companies using the open source library, by Elvis's count). A central theme is developer experience: the SDK is designed to be self-explanatory, with declarative patterns, schema contracts, and decorators that guide AI agents while keeping humans in the loop. They also discuss the paid offering (dltHub Pro) for deployment, observability, and governance, while the open source core stays free and extensible. The conversation touches on safety and governance when using AI agents, recommending scoped, well-defined agents and secure secret management. Throughout, the team demonstrates how a notebook-like, data-centric lens (via tools like Marimo) can visualize and validate pipelines and data quality before production. The session closes with encouragement to contribute to the open source project, join their Slack, and explore first issues to start contributing. Overall, dlt aims to replace brittle, vendor-locked pipelines with a flexible, code-based, extensible data stack that scales with AI-assisted development.
Key Takeaways
- dlt is a Python library that lets you move data from sources to destinations (e.g., Salesforce to BigQuery, Snowflake, or Databricks) using familiar Python code.
- Changing a pipeline's destination is a one-line code change, which makes switching vendors fast and reduces lock-in.
- Custom destinations exist (e.g., Hugging Face), so you can load data into ML and data science platforms as easily as into a warehouse.
- dlt supports incremental loading to avoid re-ingesting the same data, improving performance in production pipelines.
- Schema contracts and decorators give agents and humans clear guardrails for data structure and behavior, enabling safer AI-assisted pipelines.
- dltHub Pro provides deployment, observability, and governance tooling, while the open source core remains freely accessible.
- Data visualization and notebooks (via Marimo) help validate data quality and pipeline outcomes before production.
Who Is This For?
Essential viewing for data engineers, ML/AI practitioners, and developers who want a code-first, open-source approach to building, validating, and deploying data pipelines with AI agents and notebooks.
Notable Quotes
"dlt, data load tool. It's a Python library, so you can install it anywhere: your laptop, a machine in the cloud, a CI pipeline."
—Thierry introduces dlt as a Python library usable anywhere.
"What's cool about dlt is that everything should be code."
—dlt emphasizes a code-first approach for data pipelines.
"Changing the destination is literally just one line of code change."
—Highlighting how easy it is to switch data destinations.
"We launched our dltHub Pro product and we're going to give a quick demo of it."
—Introduction to the pro offering and live demo setup.
"Schema contracts and guardrails help you review the code defined by the agent and decide what to modify."
—Emphasizes human-in-the-loop safety and review.
Questions This Video Answers
- How does dlt convert GitHub issues into a normalized data model without SQL boilerplate?
- Can dlt's one-line destination change really help avoid vendor lock-in in large enterprises?
- What is dltHub Pro and how does it differ from the open source dlt core?
- How do schema contracts work in practice when an AI agent changes a dataset schema?
- What tools exist for validating data pipelines visually before production with dlt (e.g., notebooks, Marimo)?
Full Transcript
Hello everyone. Good morning, good afternoon, good evening, wherever in the world you're joining from. Welcome to Open Source Friday. We made it. We made it to Friday. I don't know how your week has been. I hope it's been a very nice and fruitful one. From my end, I can tell you it's been a very busy week. We are getting ready for Microsoft Build next week, and there are going to be a lot of announcements. There will be some open source announcements as well. So I hope all of you tune in next week.
I'm going to put a link on the screen right now so that you can do an awesome thing for me, and that is go register. Even if you're not going to be there in person, if you register, you will get a notification of the digital events, and there's going to be a ton of stuff. There's going to be not only good educational content, but I think there are going to be some giveaways, and we're going to be doing some live coding sessions. I'm doing one on the 4th, I believe, or the 2nd. I don't know, it's on the schedule, so you have to go and register to see when. The program this year is very focused on developers and developer experience, and it's got so much GitHub in it. I think it's going to be a good one, so I'm very much looking forward to next week. If you are in San Francisco, please come by and say hi. Oh my gosh, my guest today is going to be there. What? Elvis, come by and see me. Okay, please come by and see me.
Happy, happy, happy Open Source Friday, Amy. Thank you for being here. Welcome, Estfano. Welcome, Randy, all the way from South America. Let me know in the chat where you're joining us from. Usman, welcome. Thank you for being here. For those of you who don't know the show, this is Open Source Friday, a show where we highlight the contributors, maintainers, and projects that are making our lives so much easier. We host this program, powered by GitHub, with the objective of giving visibility to these amazing projects, and hopefully also getting them some more contributors, some more participants, and some more stars.
So that's your homework for today. Please go by, and don't forget to register for Build. You'll be doing me a massive, massive favor if you do that. I'm going to put that link there. It's a short link. It's gh.ioicrosoft-build. No spam, just so you can see the programming and tune in digitally to see the presentations. And then we are going to ask you to also please go and star the repository. So let's get started. Today I'm joined by Elvis Kahoro and Thierry Jean, who are from dltHub, and we're going to talk about dltHub.
If you're here already, obviously you care about developer tools, you care about data pipelines. I typically do a little bit of looking ahead at the project. But I'm not even going to lie to y'all. This week has been a week where I did not have a chance to do anything other than survive, create content for you all for next week, and prepare for my coding sessions. So I'm going to have to introduce my guests and ask them to introduce themselves. I will start with you, Elvis. If you could just say who you are and where you're based, then we'll popcorn over to Thierry and we'll get started.
Cool. Yeah. Hey everyone. My name is Elvis. I live in San Francisco. I work at dltHub. I am a developer advocate, and I focus primarily on top-of-funnel marketing: meeting with people and working with partners so that we can advance our partnerships.
Cool. Hi everyone. I'm Thierry. I'm lead AI engineer at dltHub and maintainer of the Python library dlt, which you'll hear about today. dlt is a Python library to do extract and load, to move data around, and the aim is to be very accessible to developers, but also a powerful toolkit for everyone.
I'm also a committee member for Apache Hamilton, another project in the Python data space, and I like to contribute to different projects when time allows. So, excited to be here.
Awesome. Well, I'm super excited to learn about this tool. First of all, we're going to start with a very basic question and then I'm going to pass it on to you all. I know you have a presentation ready and a demo, which our audience always loves, but for people who are new to dlt, can you tell us in plain English: what is this?
Yeah, sure. So, dlt, data load tool. It's a Python library, so you can install it anywhere: your laptop, a machine in the cloud, a CI pipeline. What it allows you to do is move data from a source to a destination. A lot of our users, for example, will move data from a system where they have HR data, or from a production database for their web application, to a data warehouse. So you can think of BigQuery, Snowflake, Databricks, DuckDB, LanceDB, any of these tools.
And what's cool about dlt is that once you're familiar and comfortable with dlt, all of these sources and destinations are supported by the same tool. This falls under the ETL or ELT acronyms, so extract, transform, load or extract, load, transform. If you're a data engineer, this is very familiar to you, but I'm trying to explain it to everyone here in the audience. The bet that we made with dlt is that everything should be code. Other projects or products in this space are subscription services, or you need a big container running constantly on a machine somewhere. But dlt is just Python code: you run python and the name of your script, and it runs. So it makes it very easy to adopt, but also very powerful and easy to integrate into your own platform. I think that's why enterprises like using dlt. I think we're now at six-plus million downloads per month on PyPI.
We have 10,000 enterprise customers that use the open source library, and some paid customers, but the open source library is very much alive, and there's been a lot of growth recently with AI.
Awesome. Awesome. Okay. Well, I think that sounds like a good overview. Elvis, do you want me to start sharing your screen so we can kick off the presentation?
Yeah, let's do it.
All right.
Yeah. So, I think Thierry gave a pretty good intro to the library. One of the benefits of it being a Python SDK is that it can run anywhere Python can run.
And so, you know, running it on your computer, running it in GCP, running it in Airflow. We're just trying to make it as easy as possible for people to actually work with data, build with it, and also just make data useful. These slides are actually a little outdated: it was around 6 million downloads back in early March, and I think this month we hit 10 million monthly downloads.
Wow.
Yeah, and the interesting thing is that a lot of this is just agent driven.
And so one of the nice things about dlt being open source is that if someone is using Claude or Codex or Copilot, the frontier labs' models essentially will recommend dlt as a tool when people are trying to work with data. And so it's nice having both a vetted tool and an SDK that tries to abstract away the things that are difficult about moving data, right? For example, you don't want to have Claude build your schema normalization and incremental loading algorithms, right?
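To make "incremental loading" concrete, here is a minimal standard-library sketch of the general idea: remember the highest cursor value seen so far and only pass along newer rows on the next run. It is an illustration of the technique, not dlt's implementation; the names `incremental`, `run_once`, and the `updated_at` field are assumptions made for this example. dlt exposes its own helper for this (`dlt.sources.incremental`), whose options should be checked in the dlt documentation.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional


def incremental(
    rows: Iterable[Dict[str, Any]], cursor_field: str, last_value: Optional[str]
) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose cursor value is newer than the stored one."""
    for row in rows:
        if last_value is None or row[cursor_field] > last_value:
            yield row


def run_once(
    rows: Iterable[Dict[str, Any]], state: Dict[str, str], cursor_field: str = "updated_at"
) -> List[Dict[str, Any]]:
    """Load new rows and advance the cursor kept in `state` between runs."""
    new_rows = list(incremental(rows, cursor_field, state.get("last_value")))
    if new_rows:
        state["last_value"] = max(r[cursor_field] for r in new_rows)
    return new_rows


# Hypothetical issue records; ISO 8601 timestamps in the same format compare correctly as strings.
issues = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
state: Dict[str, str] = {}
first_run = run_once(issues, state)   # loads both rows
issues.append({"id": 3, "updated_at": "2024-01-03T00:00:00Z"})
second_run = run_once(issues, state)  # loads only the new row
```

The point of putting this behind an SDK is the one made in the conversation: the agent picks a cursor field instead of rewriting this bookkeeping for every pipeline.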
And especially you don't want to do it each time you're working with data. So the hope here is that if we can build a super nice SDK, Claude is essentially just filling in the blanks and turning the levers that we give Claude through decorators and through these declarative, strongly typed representations of data workloads. This helps Claude have super good performance when it's making pipelines. This is also out of date. I think it's a little over 9,000 companies using the open source, and then we have, I think, around 850 shared customers with Snowflake. So a lot of people are using it to load data into Snowflake, or BigQuery, etc.
And so the hope here is that wherever your data lives, whether you want to pull it from, let's say, Salesforce onto your local machine so you can look at it locally, or you have a bunch of data on your machine (think a bunch of CSVs or JSON or Parquet files) and you want to upload it to the cloud, no matter the direction, no matter the device, ideally you have a standardized framework for doing this type of work.
Yeah. Very cool. I have a question though, because you just said something that is really interesting to me: how this is sort of optimizing for agents, I guess for both agents and developers.
Yes.
And you mention the frontier labs. So is it the way that the models are trained, that they're picking up the tool as a tool to suggest and that's what the agent is choosing, or is there some partnership there? Just out of curiosity.
Yeah, I think that's the primary component. One of the interesting things there is that the frontier labs actually have different behavior. Codex is way more likely to do web search than Claude, in our experience. So if you look at our web traffic, the majority of the agent traffic is coming from Codex. It's actually interesting seeing how the labs are differing or diverging in terms of their nuances or idiosyncrasies.
And so in the Claude case, we end up having to compensate more, for example by making it really easy for Claude to download skills via the dlt CLI if dlt is already installed. So it's been interesting asking how we optimize performance for Copilot versus Codex versus Claude. That way, no matter what you're using, you're still as productive, and people don't feel like dlt works best with one particular agent. Ideally it works great with all of them. So that was a great point.
Yeah, I love that. And not only for the agents: you're allowing people to pull data from a million different places, which is pretty cool.
If I can add something, one thing that we found to be an interesting inflection point: we would regularly just ask OpenAI what the popular ETL tools are, the tools in our category, or data pipelining tools, and you can see that at a certain cutoff, dlt started being recommended. So you can track how aware the different LLM providers are of your project.
And for us, what we've seen when we talk about adoption is that people are building more and more custom connectors, which was not possible before, because LLMs are so good at writing Python code now. I guess it will make sense as we move into the demo. But I've been building Python SDKs and different libraries for three or four years now, full-time, and I'm obsessed with developer experience. I want things to be obvious, to be smooth, to be self-explanatory, to have good error messages, good tests, well-defined behavior, docs. And if you're able to do this for humans, LLMs pick it up very easily.
It's not a separate skill set. It's not SEO. It's not optimizing for robots. You actually optimize it for humans, and then you get the benefit of having this helper alongside, like Copilot.
Yeah, I'd say that good APIs are good for both people and agents. I think where agents start to differ is in their access patterns. A human is not going to run a thousand queries in one minute, right? But an agent actually could. So a lot of the changes and accommodations we have to make for agents are actually more on the infrastructure side, less so in the SDK.
Which is interesting.
I love that. Okay, before I take it back to y'all, and I promise I'll be quiet: before there was dlt, what were people using, and where did people get stuck when they were building these pipelines? Without a tool like this, how do you even do it? Can you even do it?
Yeah, so I think dlt came from a very organic need. One of the co-founders at dltHub was a data engineer, an independent contractor in Berlin. He was very successful and had a lot of contracts, but he got tired of always building the same things for all of his customers, and he wanted to put this in a toolkit, or package it in something that was reusable.
So I think there were a lot of services or paid offerings, but they only covered a fixed set of connectors or destinations, so loading data from the logos that you see on the slide. But in a company, you probably have a custom API, a thousand places that you would like to pull data from. So you always have a data scientist or data engineer who creates a Python script; it works on a Monday and it doesn't work on a Friday, and then it becomes a lot of maintenance. That's where I think people started to adopt dlt when it was introduced.
It made this long tail, these custom sources, manageable, and this creates a new market. Businesses were looking at maybe 20 sources of data, but now that they can create custom sources they might want to look at 40 sources of data, and now with agents maybe they look at 100, because there's a scalability problem that's being solved. It's at the cutting edge; right now everyone's figuring this out.
Yeah, and then the library provides standardization, so that you have the same design patterns for all these pipelines.
That way the organization can guarantee that these data pipelines all have a particular shape and look a certain way. So even if we have this bespoke source, we don't really have to worry later about understanding how we collected that data, because the actual pipelines look similar; we're just changing the underlying source. And actually, one thing that will be interesting for you is that a lot of these pipelines are primarily run in GitHub Actions. Even the most popular source, I think, is actually Python dictionaries, right? People have these custom data types that they want to load into their database, and especially if you're a solo developer, you're not trying to spin up a GCP VM that you pay for constantly if your pipeline only runs once a day.
And so GitHub Actions is actually a place where a lot of this compute lives, especially for newcomers, like new grads and these Python-oriented AI engineers. They don't even know what a lot of the managed platforms are, because they just came out of school or just started work. So it's been fun being able to build for those people and seeing how they think about tools differently than in the prior world.
I love this.
And what a nicer developer experience for them.
Yeah, I know.
That's awesome. That's awesome.
Cool. So I'm going to go to the next slide. Another thing I wanted to call out is that we have custom destinations, so you can actually build a custom destination with your own logic. Me, Thierry, and Quinton (Quinton is from Hugging Face) actually built out a Hugging Face destination a few months ago, probably two months ago actually. So in addition to loading data into your data warehouse, you can actually just load it into Hugging Face as well.
And so we have people who are, for example, taking agent traces, exporting all of them in bulk, and then eventually sharing them as a dataset on Hugging Face, which is also really nice. And because dlt is standardized, changing the destination is just changing one line of code. If I want to switch from Snowflake to Databricks, it's literally just a one-line code change. This is really good for large enterprises, because you can imagine a company saying: we built our data platform on dlt, and if in the future our current provider spikes their prices, we don't have to take six months to migrate to a new platform because it's too expensive. We can literally just go into the code, make a one-line change for all these pipelines, and switch the destination. So that helps the consumer and the user be protected from vendor lock-in.
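The "one line" claim can be sketched as follows. This is a hedged illustration rather than the code shown in the session: it assumes the third-party `dlt` package is installed (with the extras for the chosen destination) and uses `dlt.pipeline(...)` and `pipeline.run(...)` as I understand them from dlt's public documentation. The pipeline name, dataset name, and sample rows are made up for the example, and the import is kept inside the function so the file can be read and imported without dlt present.

```python
from typing import Any, Dict, List


def sample_rows() -> List[Dict[str, Any]]:
    """Hypothetical records standing in for data pulled from a real source."""
    return [
        {"id": 1, "title": "Fix typo in docs", "state": "open"},
        {"id": 2, "title": "Add retry to loader", "state": "closed"},
    ]


def run_pipeline(destination: str = "duckdb"):
    """Load the sample rows; only the `destination` argument differs per warehouse."""
    import dlt  # third-party: e.g. pip install "dlt[duckdb]" (assumption: extras name)

    pipeline = dlt.pipeline(
        pipeline_name="issues_demo",   # hypothetical name
        destination=destination,       # the one line that changes, e.g. "snowflake" or "databricks"
        dataset_name="github_demo",    # hypothetical dataset/schema name
    )
    return pipeline.run(sample_rows(), table_name="issues")
```

Calling `run_pipeline()` writes to a local DuckDB file, while `run_pipeline("snowflake")` would target Snowflake, provided credentials for that destination are configured; the source and table definitions stay the same.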
I love that. I love that connection there. And I take it that's something you probably help do really natively if you're working with Hugging Face, as far as which file formats you need, packaging choices, etc., to be able to create those datasets. That's very cool. I did not know that was something that you could do.
Okay, cool. Yeah, we'll go to the next slide. Do you want to talk about this, Thierry? This is a lot of your... Yeah.
I just want to give you a sense of what companies are doing. I'm lead AI engineer by title.
I do a lot of things, but what it allows me to do as my full-time job is engage with people at the frontier labs, people working with AI. Everyone's running agents right now, and people are trying to make sense of these agents. First, how much do they cost? Second, are they good? Are they solving user problems? And how do we prioritize our next project, does it make sense to ship this as a feature? dlt is useful there because there are no standard ways to connect to the agents, if you will. So agent traces, telemetry: for example, you're using Copilot, it produces a history of the things it did, and you can ingest that, because dlt can move any type of data. Right now I'm spending a lot of time working with, I think the slide here is referencing Distill Labs, which is a company that takes these traces and then builds optimized models on top of them, and they use dlt to move data around. Yeah, that's it. Maybe, Elvis, you had other comments.
Yeah. So for me, one of the questions that I'm looking forward to people being able to answer through Thierry's project is this: there's going to come a time when Anthropic and Copilot stop subsidizing your tokens as much as they are now.
Oh, it has come, my friend.
Yeah, I know.
Yeah, we are there.
And so the question is: okay, we're spending this amount of money on all these tokens, where are we actually getting the most value as it relates to this token spend? And people are using AI across many different mediums, right? Not even just different agents: it might be in the CLI, it might be in a Slack app. So how can you consolidate them, how can you bring them to one place? And ideally this works regardless of your actual tracing provider, right?
So, someone might be using Braintrust, another team might be using a different tracing provider, and ideally you can export from all of these providers into one place and then get analytics on your whole organization's AI spend. The hope is just for people to be able to make decisions about their own data and about their own spend, and eventually do what Thierry is saying, where you can train a small open source model and actually undercut the frontier labs on specific domain tasks, which is also pretty exciting.
Yeah, I'll go to the next one. And so we actually launched our dltHub Pro product. This is super nice, and we're going to give a quick demo of it. The whole idea here is that agents are pretty good at building pipelines, but ideally you can also give them, as well as the developer, a CLI to deploy these pipelines. So instead of having to switch into a different tool, or do the thing we were talking about earlier where you set up a small pod or a VM and run your data pipelines there, you should be able to just run a different command in your terminal and then things just work.
And so I'm going to switch over to my terminal and show people the developer experience around this. The main command we use for this is uvx dlthub hyphen start. And then I'm actually going to add this latest annotation, just because we make updates to this every day. In this case I might say we have a workspace called GitHub. Then we can choose between starter or minimal. Starter is going to load essentially more skills, and also load, for example, an MCP server that you can run locally, so that once you do have data the agent can actually inspect it. Then I can choose my agent, right? I'm going to add skills for all the agents, and then I can install the workspace dependencies.
Yeah. And so when this is done running, I'll see that we're going to kick off the project with a starter pipeline, a starter transformation, and also some starter data quality checks: essentially functions that help the agent say, okay, this is what the data looks like, how can I make sure I run certain checks against the data? Now that we have that, I can cd into this directory. So I can see we have our agents folder.
And I can see that we have a bunch of skills in here. The hope is that all of these skills can get grouped into certain jobs to be done. So a job might be exploring data and validating it, and ideally we can compose that job from, or break it down into, a set of skills. That way the agent can do most of it and is really just asking you for confirmation, versus asking you how to do it, right? So hopefully we can streamline this. Another nice thing about this skills format: let's say you're a consulting shop and you do data projects for large companies monthly.
Ideally you could actually embed your company's practices and ideologies into the skills themselves. That way any consultant who's starting a new project can just pull down the skills for your particular consulting shop. So the hope here is that people can also standardize the code they produce for clients this way as well. And so I'm actually going to make a git directory. Okay, so let's track this. Cool. So we have a bunch of files here. Now that I'm here, I have another set of commands.
So, for example (let me actually close that pane), I can say dlt AI toolkit list, and this will show me some of the other toolkits I have, right? We have one for working with file systems, one for working with REST APIs, one for transformations. Installing these is super easy: I can just do AI toolkit, then the toolkit name, and install it. So I might install this toolkit, for example. And when I jump into Claude, for example, I can go to skills and see that I have a bunch of these skills working.
The skill that I might want to use is find source, and then I might say: GitHub, I want to download the issues for our project, and I can give it dlt-hub/dlt. Then the agent can just go off and use this find source skill. The intuition around find source is that we have a context server, so the agent can say: my user is telling me to work with this particular API.
Do you actually have context on that API, right? We have this workload that we run weekly where we fetch new context for all of these different APIs. I'll show you what that looks like. I like showing the 1Password one because a lot of people use 1Password. So I can say what types of data I can load from the API, we can talk about the different methods for that API, and we can also give the agent help on how to authenticate with the API and how to set up and run the pipeline. And this is in combination with an SDK that's declarative, right? The agent can say, for example, we have 1Password, I just tell dlt that it's a source, and then the agent is pretty much just filling in this template: this is the base URL, this is the authentication type, these are the different resources that I'm using. Having this nice SDK primitive plus the context makes it so that the agent can pretty much one-shot any pipeline, right?
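The "template" the agent fills in (base URL, authentication type, resources) can be pictured as a plain configuration dictionary. The sketch below follows the shape of dlt's declarative REST API source as I understand it from its documentation (`rest_api_source` in `dlt.sources.rest_api`, with `client` and `resources` keys); treat the exact keys as assumptions to verify there. The repository path and the `state` parameter are illustrative choices, not details confirmed in the session, and the dlt import is deferred so the config itself can be inspected without dlt installed.

```python
from typing import Any, Dict


def github_issues_config(token: str) -> Dict[str, Any]:
    """Declarative description of a REST source: where, how to authenticate, what to fetch."""
    return {
        "client": {
            # Base URL of the API (illustrative repository path).
            "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
            # Authentication type and credential.
            "auth": {"type": "bearer", "token": token},
        },
        "resources": [
            {
                "name": "issues",
                "endpoint": {"path": "issues", "params": {"state": "open"}},
            },
        ],
    }


def build_source(token: str):
    """Turn the config into a dlt source (requires the third-party dlt package)."""
    from dlt.sources.rest_api import rest_api_source  # assumption: available in recent dlt releases

    return rest_api_source(github_issues_config(token))
```

Because the whole pipeline definition is data, a reviewer can read it in a few seconds, which is part of what makes agent-written pipelines easier to check.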
And so if I close this pane... Yeah. So the whole idea here is that we have a declarative, I guess, UI for the agent, and then we have context, and it can just one-shot. Another thing we do to help increase agent performance is we try to let the agent express logic through decorators. The hope is that if I have a source or a resource, the agent can say: okay, the primary key is this, and whenever we want to write new data we're going to merge it, versus replacing the original table, versus, you know, upserting.
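The choices the agent makes through decorators (a primary key plus how new data is written) can be illustrated with a small standard-library sketch. This shows only the general semantics of append, replace, and merge on an in-memory table; it is not dlt's implementation, and the `load` function is a name invented for this example. In dlt these choices are, to my understanding, expressed as arguments such as `primary_key` and `write_disposition` on `@dlt.resource`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def load(table: List[Row], rows: List[Row], write_disposition: str = "append",
         primary_key: str = "id") -> List[Row]:
    """Return the table contents after loading `rows` with the chosen disposition."""
    if write_disposition == "replace":
        # Drop what was there and keep only the new batch.
        return list(rows)
    if write_disposition == "append":
        # Keep everything, including duplicates of the same key.
        return table + list(rows)
    if write_disposition == "merge":
        # Upsert: a new row with an existing key overwrites the old row.
        merged = {row[primary_key]: row for row in table}
        for row in rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write disposition: {write_disposition}")


existing = [{"id": 1, "state": "open"}, {"id": 2, "state": "open"}]
batch = [{"id": 2, "state": "closed"}, {"id": 3, "state": "open"}]

merged_table = load(existing, batch, "merge")
appended_table = load(existing, batch, "append")
replaced_table = load(existing, batch, "replace")
```

With `merge`, issue 2 is updated in place and issue 3 is added; `append` would keep both versions of issue 2, and `replace` would drop issue 1 entirely.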
That way the agent can essentially say: we have these design patterns, I'm just going to choose the design pattern I want. The agent can also say, for example, that this particular column is PII, so whenever we load it we might want to anonymize it, or hide it, or drop that column. We can also have schema contracts. And so this is like saying, if the underlying API... So, use a GitHub token? Yes, let's use it. Okay, I will add a GitHub token into the session as an env var.
And then let's do open only, exclude PRs, and submit answers. Cool. So, if I go back to this: we have schema contracts. An agent can say, oh, the schema for this underlying API changed, what do I do? Do I stop the pipeline? Should I accept the new rows and add the new columns? People are going to have particular patterns, and our idea is just to expose those in the SDK. That way the agent can just ask what to do for this pattern, right?
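The schema contract idea (what should happen when incoming data has columns the table does not know about) can be sketched in plain Python. The four mode names below mirror the ones dlt documents for schema contracts ("evolve", "freeze", "discard_row", "discard_value") as far as I am aware; the function itself is a simplified illustration written for this transcript, limited to new columns, and is not how dlt implements contracts.

```python
from typing import Any, Dict, Optional, Set

Row = Dict[str, Any]


def apply_contract(row: Row, known_columns: Set[str], mode: str = "evolve") -> Optional[Row]:
    """Decide what to do with a row that may contain columns outside the known schema."""
    new_columns = [name for name in row if name not in known_columns]
    if not new_columns:
        return row
    if mode == "evolve":
        # Accept the row and extend the schema with the new columns.
        known_columns.update(new_columns)
        return row
    if mode == "freeze":
        # Stop the pipeline: the schema changed and nobody approved it.
        raise ValueError(f"schema is frozen, unexpected columns: {new_columns}")
    if mode == "discard_row":
        # Skip the whole row.
        return None
    if mode == "discard_value":
        # Keep the row but drop the values for unknown columns.
        return {name: value for name, value in row.items() if name in known_columns}
    raise ValueError(f"unknown contract mode: {mode}")


schema = {"id", "title"}
incoming = {"id": 7, "title": "Upstream API added a field", "labels": "bug"}

trimmed = apply_contract(dict(incoming), set(schema), "discard_value")
skipped = apply_contract(dict(incoming), set(schema), "discard_row")
evolving_schema = set(schema)
accepted = apply_contract(dict(incoming), evolving_schema, "evolve")
```

Choosing one of these modes is the kind of decision the speakers describe exposing in the SDK, so the agent asks which pattern applies rather than inventing its own handling.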
Okay, cool. So it's writing our pipeline for us. Nice.
Yeah. While it's running, can I ask you about guardrails? Because I've seen so many horror stories about agents that are able to work...
Actually, Thierry should talk about this. He's done this.
Yeah, because now, as teams are bringing agents into data workflows, it does get scary. I've seen a couple of really terrifying stories online. So what's your approach to safety there?
Yeah. I will answer this in two parts. One: yes, the risks are real, and you should be very careful and read about the different things you allow your agent to do.
I know when you're using cloud it can be frustrating like or copilot or any tool can be frustrating to read every line that it wants to do because it spams you with commands it wants to run but either you have a very restrictive list and you give it permission to run everything either you take the time to read what it says because it can uh I don't know delete your files look at your secrets so be careful uh so that's the first thing like acknowledging like be careful with agents on your laptop or any device.
Uh second, uh what I'm excited about with DT is that um like I'm building these features more integrated with our platform, but this applies like for anyone using DT with AI. Um DT is um it's an SDK, right? So your the main thing your agent is doing is not running like crazy code uh on your machine. It's just defining DT pipelines like a human would define DT pipelines. So the only like time that you have to be careful is when you run the pipeline like before running the pipeline you should take the time to read the pipeline that it created.
Um, but then for example, like Elvis mentioned with the secrets, DT has different methods for handling secrets. You can use environment variables, you can use files, you can have a provider like an Azure credential store. And so, when you're running your DT pipeline, you can just have it load the secrets from a secured place for your enterprise. Um, but I think in general, for example, Elvis showed the context. We give curated context to the LLM. So it reads from files that we know about, like the instructions we're careful about.
And then it doesn't need to search the web. You can allow it to search the web. Maybe you try to ingest from GitHub, you have an HTTP error, and maybe you want to search the web to figure out how to solve that error, but it should be able to work without the internet, which is where you might see dangerous things for your agents. So it works on your laptop. And now the key thing with DT: because it's just Python and it runs on any machine, you can write to a local file, or a local DuckDB, or a local SQLite, with the data, and make sure that the pipeline works and it doesn't destroy data or do anything crazy.
And once you're happy with what the agent built, you can change where the pipeline writes to. As Elvis showed before, it's a one-line change. Instead of saying filesystem, you can say BigQuery or Microsoft Fabric. And now your pipeline that you carefully reviewed writes to the place you care about, your data warehouse or your business destination. Nice. So I might need a new token here. Uh, let's use the... This is live TV, friends. Yeah, one thing I learned is there's a lot of endpoints that you don't need a secret for.
So if you just look at the code, or you tell it, like, to ingest issues, you don't need credentials for issues. I realized that this week. All right, cool. If you tell it explicitly that you don't need credentials to ingest from the issues endpoint, it will stop. I think we'll just use these lower rate limits. Yeah. Cool. Yeah. So we're just going to let this keep cooking. Yeah. So, uh, yeah, the decorator is really nice because it kind of, you know, makes it easy for the agents to focus on the actual task at hand and not go into these rabbit holes trying to solve some very common data engineering thing that's, you know, actually solved, and ideally solved and embedded into the library.
Um, yeah, and so another nice thing is that we try to make all of these different sources very ergonomic. And so a good example here is I can add this processing steps field to my REST API source, and then I can tell DT, for this we're going to apply this function to a particular object, right? And so in this case I have flatten guests, and this is just a Python function that will unnest all of the sub-JSON, right?
That way when this is done loading uh we end up having a flat table instead of a table with a bunch of columns that have just nested JSONs in them. Um we also have uh just regular resources and so we can think of a resource as a table in a database and so in the case of GitHub a resource might be like an issue it might be a pull request might be a repo and so you can essentially just have a regular Python function and then have the function yield data. It might yield data after you know calling some API and then you can just say like this is a resource and then eventually once you have you know multiple resources you can group them into one source and so uh quota exhausted wait for the reset um yeah and so in this case I have like let's say a resource for events a resource for guests and then I can put them into one source and this is just like you know returning a list of resources and then when I want to run the pipeline I pretty much just pass in this source into this pipeline.
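A flattening step like the "flatten guests" function described above could look roughly like this. It is a minimal sketch in plain Python: the function name `flatten`, the `__` separator, and the sample record are assumptions for illustration, not the code shown on screen.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "__") -> Dict[str, Any]:
    """Unnest nested dictionaries into one flat row with compound column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that arbitrarily deep sub-objects become columns.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical guest record with two levels of nesting.
guest = {
    "id": 7,
    "name": "Ada",
    "company": {"name": "Acme", "address": {"city": "Berlin"}},
}
flat_guest = flatten(guest)
```

Applied per record before loading, a function of this shape yields one flat table instead of columns holding nested JSON.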
Um, so I say like we have a pipeline. I'm going to run it and I'm going to run it with the source. And so typically a pipeline is going to be just like a destination and a name and you know credentials. And so you can pass in multiple sources into that same pipeline because it's like a pipeline to snowflake or pipeline to you know data bricks. Yeah. And I can also and because this is all regular Python, we can use like closures. And so instead of me having these resources as individual functions and then I put them into another function that is a source, I can also just embed them under like one main function.
And this is helpful because you can now share like the same client across all of these resources since they're hitting the same underlying source. Uh we also try to make it super easy to support new destinations. So again like changing the destination should just be one line of code. You literally just swap out this destination field and then we also make it easy to load other types of data. So like we have you know data frames you can pass in a data frame directly. So you could be you know a data scientist working in Jupyter notebook and when you're finally done you're like all right now I want to upload this and then you can just pass in that data frame directly.
Uh, we also make it really easy to load things from your file system. And so I can essentially point to a folder and say, you know, I want you to parse all the CSVs in that folder. And then I essentially just say, these are my files, this is the reader that I'm going to use for that subset of files. And then I just pass this in as a source. Um, and this makes it really nice for uploading JSONs in bulk and things. And then, uh, this is back to the schema contracts.
And so you have different you know again design patterns. You can like evolve the schema, you can freeze it, you can also discard the rows that you know break the schema. And so we just try to make it easy for people to like think what's the trade-off that I want to have and then how can I just select the trade-off versus like trying to build the trade-off. I think going back to um the earlier question about like guard rails and with agents um what's being like a challenge in enterprise with agent is okay we're getting really good at producing a lot of code but like nobody wants to maintain it that's that's the first and then are we sure it's secure and because DT was built to describe data or how data moves it's very easy to review the code defined by the agent for example Looking at the the snippet that Elvis has here, it just like it it reads almost like plain English.
It says, okay, the tables you can evolve, the tables can change. This means you can have new tables, you can rename tables. But it also wrote columns freeze. So the columns cannot change: if you have a column that's there, it's expected to always be there, and the data type is also expected to not change. So even if you have Claude, Copilot, or Codex writing a pipeline, it's rather simple for the human to review this code and agree or disagree and tell the agent to modify the code. Whereas if you don't use an SDK or a library like DT, you just have a bunch of functions and Claude is telling you, trust me, just press enter and run the code.
Here you have a bit more control over what you're running on your machine. Thank you for adding that. So, I mean, contracts are even more important with agents, but going back to your earlier point about it being kind of annoying, thinking of iteration: how do you make it, I guess, strict enough to where it's safe, but then also so it's not annoying as you continue to grow in production? Yeah, I think it's something that everyone is facing, like me working on the product, or what we think is the ideal way to run agents. The practical tip I can give is: Claude Code, Copilot, Codex, they're generalist agents. You run them on your machine and they need to be able to do everything. That's why they're so helpful, right? They can look at your files, they can search the web, they can ask you questions. But when you're getting closer to the data or your infrastructure, you want agents that are more constrained, like they're able to do one thing.
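The three contract behaviours discussed here (evolve the schema, freeze it, or discard rows that break it) can be illustrated for columns with a small stand-alone sketch. The names `apply_column_contract` and `SchemaFrozenError` are hypothetical, and real enforcement in the library also covers tables and data types; this only shows the trade-off being selected rather than built.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


class SchemaFrozenError(Exception):
    """Raised when a row brings a new column while the contract is 'freeze'."""


def apply_column_contract(
    rows: Iterable[Dict[str, Any]], known_columns: Iterable[str], mode: str
) -> Tuple[List[Dict[str, Any]], Set[str]]:
    """Return (accepted rows, resulting columns) under the chosen contract mode."""
    columns = set(known_columns)  # copy, so the caller's schema is not mutated
    accepted = []
    for row in rows:
        new_columns = set(row) - columns
        if not new_columns:
            accepted.append(row)
        elif mode == "evolve":
            columns |= new_columns  # accept the row and widen the schema
            accepted.append(row)
        elif mode == "discard_row":
            continue  # silently drop rows that do not fit
        elif mode == "freeze":
            raise SchemaFrozenError(f"unexpected columns: {sorted(new_columns)}")
        else:
            raise ValueError(f"unknown contract mode: {mode}")
    return accepted, columns


known = {"id", "title"}
rows = [{"id": 1, "title": "a"}, {"id": 2, "title": "b", "milestone": "v1"}]

evolved_rows, evolved_columns = apply_column_contract(rows, known, "evolve")
kept_rows, kept_columns = apply_column_contract(rows, known, "discard_row")
```

Because the mode is a single word, a reviewer can see at a glance what will happen when the upstream API changes.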
So, for example, for us, we have an agent that can write the pipeline. It can just like write the pipeline, cannot run the pipeline, it cannot search the web. It just like creates Python code to define a pipeline. We have another agent that's able to look at the data set. So it's able to like look at the data, look at like what the columns uh that are present, their type, maybe like do some SQL queries, but it's not able to do things outside of that. And once you have a lot of these agents that are well scoped, Elvis can be on his machine and say like, okay, I'm running the agent to create the pipeline.
Okay, he runs the pipeline. Okay, now I'm running the agent that inspects the data set. Okay, now I'm running the agent that adds human-readable descriptions to columns and tables, so like the semantic layer. And we separate these things so you keep the human in control. You can still make sense of what's going on in your company and you can move with confidence with these pipelines. Yes, that's excellent advice, friends. Excellent advice. Awesome. While Thierry was talking, I made a new API key. So we're, like, loading all of it now.
Yeah. And so another nice thing about these schema contracts is that you can just express them as base models, right? So we can just make this a Pydantic base model, and then I can essentially just, you know, define the schema that I want to use. And there are also design patterns around how you apply these schema contracts. And so the way that you might pass it in and say, actually apply the schema, is in the pipeline run, right? We have a resource.
So this is like the data this might be GitHub issues or something and then the resource has a name and what I do is I can add this pipe and then I can say make sure you run this resource against this contract right so the actual contract or I guess the schema you pass in here and then gatekeeper is a type of schema contract pattern and so there are multiple types of contract patterns and gatekeeper for example just say like apply this to every row and then for example like prevent this from running um if the columns change, right?
And we know this because we have the columns freeze here. Yeah. And so this is also very ergonomic, having this pipe annotation, because if I wanted to, for example, temporarily relieve the schema contract enforcement, again, it would just be one line of code. I just drop this pipe and I drop this here, and then I can keep the pipeline running if, for example, we have some SE or, you know, we don't want to drop any new rows. Yeah. So, yeah. And so the next slide is kind of a quick summary of what DT gave us, and it'd be helpful if you could do this, Thierry, and then I'll talk to Claude.
Um yeah, I was looking for the unmute button. Okay, cool. Um yeah, so we we showed like a lot of things on screen, but the the key thing we want people to remember is that DT is just Python code. You can pip install it. It's free. It's open source. Uh it's Apache 2 license. Um and it it's like a toolbox. It allows you to build things and the DT tries to solve like some of the difficult data engineering problems and you don't have to solve them every time you start a new project. So those are some of the problems that DT solves.
Schema inference if you did data engineering before often the first thing you do is like okay what are all the tables that I'm working with and then you need to define your tables in BigQuery or define your tables in Snowflake. DT does that automatically for you. For example, now we ingested issues. We never wrote any SQL code, right? We we automatically created the tables on the destination. And then normalization. Um if you load an issue in GitHub, there's a lot of like sub fields. So like who's the person that opened the issue? Are there comments?
What repo is it related to? And this is like a big JSON blob. DT does a good job of like creating separate tables for these different things. It just like looks at what's a list, what's a dictionary, and it like expands it into multiple tables. This creates like databases that are nice to work with instead of like a massive table. Incremental loading just means like you don't want to be ingesting the same data twice. I think GitHub will also be happy if the agents stop loading all of the issues every day, every second. This is a problem I think for a lot of companies right now.
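The normalization step described above, where a nested issue is expanded into several tables, can be sketched as follows. This is a simplified illustration, not the library's algorithm: the `_parent_id` link column and the assumption that every top-level record carries an `id` are choices made for the example (the real library generates its own row identifiers), though the `parent__child` table naming follows the same double-underscore idea.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Optional, Tuple

Tables = DefaultDict[str, List[Dict[str, Any]]]


def normalize(table: str, record: Dict[str, Any], tables: Tables,
              parent_id: Optional[Any] = None) -> None:
    """Split one nested record into a parent row plus rows in child tables."""
    row: Dict[str, Any] = {} if parent_id is None else {"_parent_id": parent_id}
    pending: List[Tuple[str, Dict[str, Any]]] = []

    def walk(prefix: str, obj: Dict[str, Any]) -> None:
        for key, value in obj.items():
            name = f"{prefix}__{key}" if prefix else key
            if isinstance(value, dict):
                walk(name, value)  # dictionaries become flattened columns
            elif isinstance(value, list):
                for item in value:  # lists become rows in a child table
                    child = item if isinstance(item, dict) else {"value": item}
                    pending.append((f"{table}__{name}", child))
            else:
                row[name] = value

    walk("", record)
    tables[table].append(row)
    for child_table, child in pending:
        normalize(child_table, child, tables, parent_id=record.get("id"))


issue = {
    "id": 42,
    "title": "Bug",
    "user": {"login": "octocat"},
    "labels": [{"name": "bug"}, {"name": "enhancement"}],
}
tables: Tables = defaultdict(list)
normalize("issues", issue, tables)
```

The result is an `issues` table with scalar columns and an `issues__labels` table that links back to its parent, which is easier to query than one wide table of JSON blobs.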
They want to provide data access, but the agents are like very aggressive in how they load data. Incremental loading allows the agent to be polite. Like it will only request the data it hasn't loaded yet. Um, secret management DT provides like some very simple way to pass your secrets safely to your pipeline. This is mostly useful when you deploy the pipeline. Um, because your your like security team or your ops team uh might have a specific way they want you to pass secrets and DT supports a lot of them. And then the other things like data set API I think we we haven't shown it uh here but we showed the pipeline which is how you write data with DT but there's a whole set of things that we call the data set which is how you read data.
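Incremental loading, as described above, boils down to remembering a cursor and asking the source only for newer records. Below is a minimal sketch under stated assumptions: `FAKE_API` and `fetch_issues` stand in for a real endpoint with a "since" filter, and the `state` dictionary stands in for whatever the pipeline persists between runs.

```python
from typing import Any, Dict, List, Optional

# Stand-in for a remote API; timestamps are ISO 8601 strings, which sort correctly as text.
FAKE_API: List[Dict[str, Any]] = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-02-01T00:00:00Z"},
]


def fetch_issues(since: Optional[str]) -> List[Dict[str, Any]]:
    """Hypothetical API call that returns only records updated after `since`."""
    return [r for r in FAKE_API if since is None or r["updated_at"] > since]


def incremental_load(state: Dict[str, str], destination: List[Dict[str, Any]]) -> int:
    """Load only unseen records and advance the cursor; return how many were loaded."""
    new_rows = fetch_issues(state.get("last_updated_at"))
    destination.extend(new_rows)
    if new_rows:
        state["last_updated_at"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)


state: Dict[str, str] = {}
warehouse: List[Dict[str, Any]] = []

first = incremental_load(state, warehouse)   # initial run loads everything
second = incremental_load(state, warehouse)  # nothing new, so nothing is requested again
FAKE_API.append({"id": 3, "updated_at": "2024-03-01T00:00:00Z"})
third = incremental_load(state, warehouse)   # only the new record is loaded
```

This is what lets an agent be "polite": repeated runs cost the source almost nothing when there is no new data.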
So if you have data on S3 as files, or on BigQuery in a table, or on a local machine in DuckDB, it's all the same for DT, and you can just write your things for creating your reports or reading data for a web application with a consistent API. So agents also love that, right? They don't need to know about the specificities of all the places where you have data. It always looks the same to the agent. That's it. Thank you. That's very helpful. I was actually going to ask you to do a bit of a mapping of the product as a whole, so that people understand what the open source part of it is, what lives with the SDK, what lives with the CLI. And then I do want you, at the end, if you don't mind, to share a little bit about the pro, I guess what you're offering, because I think, you know, it's fair to share it when teams need more operational support and they want to move into that pro layer. Like, we need to keep the lights on in open source.
However, sure, I would share that. Wait, Thierry, can you do that? And then I'll get this notebook set up so I can show them what the notebook looks like. Okay, cool. Um, yeah. So, okay. I've been working on open source for a long time, like it's my full-time job to write code that is free for people to use. And I love open source, and we're dedicated to open source because everything that we build as a product is built on top of open source. So we have people in big enterprises that use DT open source, and we're happy to have them as users because they contribute back code.
They help us improve what DT is, and we believe that's why we see so much growth and success with the LLMs and with the agents. It's because we're a community and we're working together on this. Um, so the open source part is all of the DT code; all of the Python code is open source. What Elvis has been showing on the AI workflow, so the AI helper, is part of the DT Pro product. So it just helps people use DT. And we have a DT platform, which is part of the pro offering.
So if you want to deploy your pipeline in a very simple way, something that's compliant, something that's high quality, you can. It's very simple with DT Pro, and it's also intended that your agent can deploy your pipeline in a safe way. Um, if you don't want to use DT Pro, you can run your DT pipeline, your Apache 2 code, on GitHub Actions. It runs for free. You can run it in any cloud that you might be using, any orchestrator. We have good support for Airflow. So that's the separation, but all of the Python code that you've seen today is open source.
Thank you. Yeah. And then the thing that I also wanted to add is that typically, I guess, in software development, we validate whether or not the thing that we built is correct with a pull request, right? But in the case of data, a data engineer can't look at just the code for a data pipeline to know whether the output of that pipeline is correct. And so the heuristic that actually gives you much more signal and has a much higher ROI is actually a notebook. Ideally I can get a notebook, and the notebook can essentially give me info as to whether or not this is good data.
Uh, because I want to see charts. I want to see statistics in the notebook. And so I'm going to show you guys what that looks like. Before I do, I can say, help me deploy this pipeline to DT Pro using the CLI. Yeah. And so I'm going to make a new tab here. I'm going to, you know, activate my virtual environment. And then I'm actually going to open up this notebook that we just built. And so I'm going to bring this over to this screen. And then I'm going to press.
So this is a tool called Marimo. It's another open source project. We like it a lot, and we use it and we show it because, yeah, we work to be interoperable, as Elvis said earlier, and we think it's a great interface to look at your data. I love that. If you have the link to that, would you drop it in the chat for me, please? Yeah, totally. Um, cool. Yeah. So we have this notebook, and the nice thing about this notebook is that, again, it's code, and so Claude, Copilot, they can actually build the notebook for us. And so in this case what we can do is we can say, I'm going to attach to the pipeline that I built, right? So we built this git issues pipeline. I can attach to it, and after I've attached to it, I can actually start looking at the data that we loaded. And so we have this data set object, and once I have a data set I can, for example, see the tables. So we have issues, we have issue labels, issue assignees. I can see the schema as Mermaid.
I can also export the schema as JSON. And then once I have my schema, I might, for example, want to look at various charts, right? And so, you know, we have a chart here that just shows me, you know, how many issues were opened over time. And so, I think, uh, if I actually jump to the end of this, we can see that, you know, there are a lot more issues that were opened, you know, in 2026 as opposed to when the project first started, you know, back in 2022, right? So I can also scroll down.
I can get, I guess, a line chart of this, right? So we can see that, you know, new issues created over time are kind of just generally increasing. Uh, I can see the top issue reporters. And so we have Marcin, he's our CTO. We have Thierry, and then, you know, I'm over here at six. I'm very, very behind Marcin and Thierry. Um, yeah, we can also, you know, get this as a graph. We can see the most common labels, and so a lot of these labels are going to be enhancement, quality of life.
Yeah. And so the hope here is that we can just very quickly go from an idea to a pipeline to then, you know, having a dashboard that actually lets me introspect on whether my idea was good. And one of the projects that Thierry's been working on that's super exciting is data quality checks. And so the idea here is that we have these checks that we want to run against our data. And so a good check might be, you know, make sure the ID is never null. Make sure every issue has a number.
And then I might I might, for example, want to know like how many issues, you know, are are considered open versus like closed, right? And so I can actually run this against all of my data and then we can see for every particular row, you know, whether it passed that check. I can also like, you know, change this to table. And then if I change the table I can see like we have 285 uh open issues and then I can see for example how many like pass each check right and so in this case right like let's say I do this is closed right this should be zero right um and that's because we only downloaded the open issues right and so these quality checks are really nice because one uh they're written as Python but they actually get run as SQL expressions and so since the data has already been loaded it into my data warehouse.
We can run these checks as native SQL. And then this is helpful because that's really performant and we don't have to pull data out of the data warehouse to run these checks. And then eventually what we can do is we can have another DT pipeline. And so the output so this this like data set object can actually be passed in as a source or as input into another pipeline. So I can chain together pipelines. And so in this case I can say like this is our raw data pipeline and then I can say run these checks against my raw pipeline and then take the output of those checks and you know load those into my staging table right and so we can essentially bring like this CI/CD type workflow to data pipelines where I can say only only load the only load the data that passes these checks into staging right that way we can kind of have these like better guarantees as to the data and the data that we're loading and And this is super helpful because we also have this concept of a transformation.
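The pattern described here, checks declared in Python but executed as SQL where the data already lives, with only passing rows promoted to staging, can be sketched with the standard library's `sqlite3` as a stand-in warehouse. Everything named below (`CHECKS`, `run_checks`, the `issues` and `staging_issues` tables, the sample rows) is an assumption for illustration and not the DT data quality API.

```python
import sqlite3
from typing import Dict, Tuple

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE issues (id INTEGER, number INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?)",
    [(1, 101, "open"), (2, None, "open"), (3, 103, "closed")],
)

# Each check is a named SQL boolean expression, so it runs inside the database.
CHECKS: Dict[str, str] = {
    "id_not_null": "id IS NOT NULL",
    "has_number": "number IS NOT NULL",
    "is_open": "state = 'open'",
}


def run_checks(conn: sqlite3.Connection, table: str,
               checks: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Return {check name: (rows passing, total rows)} without pulling rows out."""
    results = {}
    for name, expr in checks.items():
        query = (
            f"SELECT COALESCE(SUM(CASE WHEN {expr} THEN 1 ELSE 0 END), 0), COUNT(*) "
            f"FROM {table}"
        )
        passed, total = conn.execute(query).fetchone()
        results[name] = (passed, total)
    return results


results = run_checks(conn, "issues", CHECKS)

# Promote only rows that pass every check into a staging table.
all_pass = " AND ".join(f"({expr})" for expr in CHECKS.values())
conn.execute(f"CREATE TABLE staging_issues AS SELECT * FROM issues WHERE {all_pass}")
staged = conn.execute("SELECT id FROM staging_issues ORDER BY id").fetchall()
```

Only aggregate counts leave the database, which is why this style of check stays fast on large tables.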
And so transformations, let's see if I can actually show you guys the code for this. So transformation is essentially like uh it's here. So it's pretty similar to that resources decorator where I can say, you know, we have a function, this function has an input, this function has an output, and then we add a decorator to tell DT what to think of it. But in the case of a transformation, the input is always going to be a data set. And so this is the representation of the data I loaded with another pipeline. And I can for example say like okay we're going to pull out the data from this table.
Uh and then I can do like a SQL function right or a SQL expression against this data set. And then what I can do is I can make multiple of these transformations and then I can essentially return these as a source. And so this is super helpful because this data set, it might be for example all of the failing rows of a particular check. So I might say like all of the rows that fail this check, I want you to run these transformations against just those rows that failed that check, right? And so I actually use this for our CRM.
Like we might have like data where we're missing like someone's email or missing someone's LinkedIn. And then what I can do is I can you know have a transformation that you know let's say even calls like Python for example if I wanted to to like go fetch like that person's email with some you know enrichment tool and so this is super ergonomic and it gives you just like a lot of power. Yeah. And so uh let's see if Yeah. As you see we're very excited about what we're building. I love it though. That is beautiful.
Um what a cool tool you showed us as well. And very efficient like Amy says. Thank you for watching, Amy. Let me see if I can I'm in the wrong directory. Yeah, I think Elvis, we're getting close to the the deadline. All right, cool. I'll just jump back to the slides then because the slides kind of have a screenshot of this. Uh, okay. The slides are over here. Yeah. Yeah. So, that's a quick demo of DT. This is a quick summary. So like you know we want to have all these different skills and all these different toolkits for different parts of your data job.
And so whether you're deploying, or validating data, or transforming it, ideally we can help you and the agent do this very effectively. Um, here's a quick summary of some of the things that agents struggle with as it relates to modern infrastructure. And so we are kind of trying to coin this concept called a builder stack. And so we used to have what's called a modern data stack. And we just argue that because of the way some of these products were designed, they're actually not cut out for agents, and we kind of have to build some of these new data tools specifically centering agents, because agents just work very differently than people, right?
And so and one of the big one of the big I guess examples of this is like if everything can be code then it makes it so that agents can actually introspect on how things work and the agent essentially doesn't have to like assume or it doesn't have to like you know build something based on an abstraction that actually turns out to not be true and it can't even check the abstraction because the abstraction is like in some managed platform that it can't introspect. Another thing and that leads to like you know errors compounding over time right.
Uh another thing is that there's a lot of implicit state when it comes to these like you know managed platforms and so a pipeline might fail or it might fail partially and the agent can't like figure out like okay did all of the rows run did some of them run etc and so we also try to expose that to the agent. Uh another thing that happens is that uh a lot of the modern data tools or data platforms they're very rigid as it relates to compute. So they weren't designed to be serverless and so an agent for example again could like run 10,000 queries in one hour and then not do anything for the next 20 hours and then run like 1,000 queries again.
So it's like, how do you design for this very serverless, elastic access pattern? And then the last point that I want to make is that, in this new world, us, Marimo, Ibis, Arrow, we're all open source, and this actually enables us to build product experiences that center the developer, because we don't have to submit a ticket to this other company and then wait for them to fix it. We can actually just send them a pull request that adds the thing that we want from them, and vice versa.
And so because of the rigidity of prior tools, people ended up building an internal data platform anyway, just to connect the tools together, because the tools were not composable and they didn't actually connect together well. And so, because I think Thierry's contributed maybe over 100 hours to Marimo. That's one of the things that I enjoy the most about my role. I'm able to spend time at work working on other projects that I like, and I got my first job in this space being an open source contributor on my own time. And I think if you want to get involved, just go on an issue. I think it's best when it's something you care about.
So if you use DT and you find a problem like we will answer your questions. We will help you get started and that's what's cool about open source. You can impact the tools that that you like and when you make something better it benefits thousands if not hundreds of thousands of people if not more. So yeah. Yeah. Would you bring me to my next comment I wanted to make or a comment question like well for folks who watch this and they're like all in they they want to contribute to the project like how can they do that?
What kind of contributions are you looking for besides the ones that you just mentioned? Yeah. I think the first thing is always, make sure that you discuss with the maintainers so you don't work for nothing. We had some very good contributions, but sometimes it's out of scope. So there's a set of open issues if you visit GitHub, and there are some labels, so filters that you can apply. I think two good sets of labels are good first issue and the quality of life label, which are usually both easier or smaller in scope. But I'm giving you a warning: DT is a very big codebase, and so sometimes it looks like a simple problem because the area of the problem is small, but it goes very deep and it becomes a rabbit hole.
So uh that's why it's good to talk to people who know the codebase well before and they will be able to help. They will send you in the right direction or they will like warn you that this is a difficult problem. But like I said, um I think it's best whether it's like with DT or any projects if you want to get involved in open source like get involved in projects that you use because then you really care about having the issue fixed like if it's something that like irks you like oh this is a bad error message like go open an issue and if you want to go further like try to make a better error message right open the codebase see where the error message is and open a PR and like I I've never heard of a uh project that's unhappy about receiving like human generated issues.
I think what we've been hearing a lot is that it's difficult to manage AI, but whenever someone shares some thoughts and writes down what they find difficult and what they want to improve, I will spend time replying and helping them. So, yeah, cool. I'm actually going to go to this, this is our conclusion slide. Um... Oh, please. Yes. Yeah. So, yeah, agents are already pretty good at writing DT because it's open source. Um, because, you know, it's already in these model weights. Uh, we can use skills to do higher-level tasks like data validation, building notebooks, etc.
Also deploying with DT Pro. Um, but in terms of where we're trying to go next: so we have an agent making a pipeline. Ideally the agent can also run it, right? It can actually see if the pipeline works, see if the pipeline actually loads data. Uh, but the next project that we're working on is making this sandbox concept, but for data in addition to compute. And so ideally we can also give an agent a temporary database. So this is very similar to Neon, but in this case we're doing OLAP, because we're doing analytics.
Um so ideally we can give it a temporary database and the agent can actually run the pipeline and load data into the temporary database and then it can then run queries through like an MCP server uh to actually see the data that it loaded. That way whenever I give an agent a prompt uh it doesn't just give me back code and then I hope that the code works. the agent's like, "Oh, I I'm giving you back code that I actually already ran and I actually actually already looked at the data that the code loads and I validated that the data is good." Right?
And so the hope for us is, how can we increase the surface area that agents have access to? That way the data engineer can kind of give this agent a problem and then have very high confidence that what it gets back is actually decent. And then the next thing, and Thierry talked about this earlier, is, you know, making it easy for people and for us to have these self-improvement loops, where the agent just did all this cool work, but ideally the whole time we also exported the traces, and then a second agent can introspect on the traces and come up with ways to improve that whole process, right? So this is like a meta feedback loop. And then the third thing that we've been thinking a lot about is how to bootstrap an ontology. So the idea here is that DT is the tool that connects to your underlying source.
And so again, if the schema changes, we know. And so ideally it's like, how can we take the fact that we know the schema, we know what people are pulling from these sources, and then create metadata that we can give to other tools? So you might have an analytical dashboarding tool. It's like, how can DT give that tool metadata on the actual pipelines? That way that tool can, for example, have their AI work well because they know what the schema looks like. They know what we're trying to do, because the pipeline kind of embeds the intent of the data engineer in the first place.
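One way to picture "if the schema changes, we know" and the metadata hand-off to other tools is the small sketch below. It assumes nothing about DT's internals: `infer_schema`, `diff_schemas`, the sample records, and the shape of the exported metadata are all hypothetical, chosen only to show a schema being inferred, compared, and serialized for a downstream tool.

```python
import json


def infer_schema(rows):
    """Infer a column -> type-name mapping from a list of dict records."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:
                # Keep the first non-null type seen for each column.
                schema.setdefault(column, type(value).__name__)
    return schema


def diff_schemas(old, new):
    """Report columns that were added, removed, or changed type."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(c for c in shared if old[c] != new[c]),
    }


# Hypothetical records from two runs against the same source.
yesterday = [{"id": 1, "title": "Fix schema drift"}]
today = [{"id": "1", "title": "Fix schema drift", "labels": "bug"}]

old_schema = infer_schema(yesterday)
new_schema = infer_schema(today)
changes = diff_schemas(old_schema, new_schema)

# Metadata a downstream tool (for example a dashboard) could consume.
metadata = {"table": "issues", "columns": new_schema, "changes": changes}
print(json.dumps(metadata, indent=2))
```

Here the `id` column switching from an integer to a string and the new `labels` column both show up in `changes`, which is the kind of signal a dashboarding tool could use.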
And we're hoping that this can also be bidirectional. Like, maybe from this data visualization tool we might, for example, get query patterns: what are the dashboards people look at the most? Whenever they look at a dashboard, do they change it? And if so, how can we, for example, parse these changes and then change the underlying data pipeline? That way you might one day never even write a data pipeline. You might just have these dashboards, and then you change the dashboard, and then an agent automatically builds a pipeline for how the dashboard needs to be changed, right?
So, we're hoping this can be bidirectional. And then the last slide is that once you do deploy a pipeline with DT Pro, you get a really nice, I guess, dashboard into the observability of all your pipelines. So we have people who run, I think, 450 pipelines a day or something like that, unique pipelines, not necessarily pipeline runs. And so you need one place to see how all of your pipelines are doing, how many rows each pipeline loaded, and when these pipelines ran. And so that's what we're trying to bring with DT Pro. And so this is kind of what it looks like. We have these charts, we have the number of pipeline runs, and then I can actually click into all of these, and then I can see even more metadata for those particular pipelines.
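The questions this dashboard answers (how many runs, how many rows each pipeline loaded, when it last ran) amount to an aggregation over run records. The sketch below is not DT Pro's data model; the `RUNS` records, their field names, and `summarize_runs` are hypothetical and exist only to make the aggregation concrete.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical run records; a real system would read these from run logs.
RUNS = [
    {"pipeline": "github_issues", "rows": 120, "finished_at": "2024-01-01T08:00:00", "ok": True},
    {"pipeline": "github_issues", "rows": 130, "finished_at": "2024-01-02T08:00:00", "ok": True},
    {"pipeline": "crm_contacts", "rows": 0, "finished_at": "2024-01-02T09:30:00", "ok": False},
]


def summarize_runs(runs):
    """Aggregate run count, rows loaded, failures, and last run time per pipeline."""
    summary = defaultdict(
        lambda: {"runs": 0, "rows": 0, "failures": 0, "last_run": None}
    )
    for run in runs:
        stats = summary[run["pipeline"]]
        stats["runs"] += 1
        stats["rows"] += run["rows"]
        stats["failures"] += 0 if run["ok"] else 1
        finished = datetime.fromisoformat(run["finished_at"])
        if stats["last_run"] is None or finished > stats["last_run"]:
            stats["last_run"] = finished
    return dict(summary)


summary = summarize_runs(RUNS)
for name, stats in sorted(summary.items()):
    print(name, stats["runs"], stats["rows"], stats["failures"], stats["last_run"])
```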
And then that same notebook that we built earlier with Claude, we can also run that in the cloud. Instead of it pointing to our local database, it might switch out to your actual production data warehouse. Yeah, that's it for me. I don't know if Thierry has anything. That's amazing. I think I saw... Did I see you had a slide with the QR code for your Slack to join your community? Am I making this up? Yeah. I think I did. I had one earlier, but I think I deleted it so I can move it here.
I'll find the link. But it's definitely on the repo. So folks, you can just go to the repo and star it. Yeah. And then if you want to join their Slack community... Oh, it's actually super easy. It's your website plus /community. So yeah, and this is for sure. Yeah. So I actually had Claude run the pipeline that we did. So... Oh no, don't worry. Yeah. Yeah. So this is it running in the cloud, and so I can click into this and then I can, for example, see these are tables, these are resources. We loaded 285 rows because we have 285 open issues.
And again, this isn't connected to my local DuckDB, but it's connected to my data warehouse. And so DT ran this pipeline in the cloud and loaded it into my production data warehouse. And I can also see the server logs, and so I can see the actual pipeline running and things. And so the hope here is that going from remote to local and vice versa requires no code changes and is literally just changing the command you run in the CLI. Yeah. But that's it for us.
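The "no code changes, just a different command" idea can be illustrated with a generic command-line switch. This is not DT's CLI: the `--target` flag, the `TARGETS` table, and the placeholder locations are assumptions made up for the sketch, which only shows the pipeline logic staying identical while the destination is resolved from the command.

```python
import argparse

# Hypothetical destination settings keyed by target name.
TARGETS = {
    "local": {"destination": "duckdb", "location": "local.duckdb"},
    "remote": {"destination": "warehouse", "location": "env:WAREHOUSE_URL"},
}


def resolve_target(argv):
    """Pick destination settings from command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Run the same pipeline locally or remotely."
    )
    parser.add_argument("--target", choices=sorted(TARGETS), default="local")
    args = parser.parse_args(argv)
    return TARGETS[args.target]


def run(argv):
    """The pipeline body is the same for every target; only the settings differ."""
    target = resolve_target(argv)
    return f"loading into {target['destination']} at {target['location']}"


print(run([]))                      # default: local target
print(run(["--target", "remote"]))  # same code, different command
```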
Very, very cool. Thank you so much. I learned a ton. Yeah. No, definitely, I learned more about ETL today than I had before, so I appreciate it. Thank you for sharing, of course, the other project, and we also give them their flowers. But folks, don't forget to go by the DT repository and take a look. I'm just barely clicking through, and there are some good first issues and the quality of life label, so that will be a good starting point. But as Thierry mentioned, make sure you have a conversation before you start digging into the code, so that you're not working a lot on maybe something that you don't know about, or that might be a bigger problem than one person can handle.
For both of you, where can people reach you online if they want to follow you? And then we'll wrap it up. I am active on LinkedIn. I think it's a good place to follow data news, but it's also very noisy. So otherwise Slack is where we have the most active community, and we have our blog that we post on regularly. I'll share the links in a second. Perfect. Yeah. The only social media I have is also LinkedIn. So that's good for the both of you. I'm so happy for you. GitHub counts.
GitHub kind of counts. Yeah, it does. Yeah, GitHub is probably where I'll reply the fastest, actually. Yeah, honestly. Yeah. If we get time to GitHub, that's high priority. All right, that's awesome. Thank you both so much. I'm going to put the link here to the blog. It sounds like there'll be some really interesting content there. Thanks for keeping open source awesome, and you'll have to come back and teach us more. I haven't had many data-related projects. So this is very interesting, especially now more than ever, honestly, where you say it's so easy for agents to write their pipelines. Like, before you had to have the one person that knew how to do it, and now you can all kind of do it. At some point everything starts looking like a data pipeline. You're like, oh, me moving data even from my mind to this other tool, or my mind to Google Docs, is in a way a data pipeline. And so now you will not unsee it. There are just data pipelines everywhere.
But that is so real. And listen, I love the connection with Hugging Face and being able to turn that data into actual datasets that you can do things with. That's so cool. So hey, bright future for data, friends. Put your money on it. Thank you both so much for coming. Yeah. No, thank you both so much for coming to Open Source Friday, and I wish you many stars, and I can't wait to catch up. Elvis, come by on to the things. Come by and see. Yeah. And Thierry will be in town too.
So... Oh, sweet. Okay, come by. Come by. We have a lot of fun things happening that week. Believe it or not, it's going to be a fun one. And then there are a couple of events at GitHub headquarters, including an OpenClaw meetup we're doing on Thursday. So come to that. Cool. All right. Thank you, friends. We'll catch you on the next Open Source Friday. Don't forget to go by the repository and star it. Everyone who joined, we have people joining from all over the world. Thank you so much. I'm so grateful you chose to spend your Friday afternoon with me.
We'll see you next week. Take care.