Open Source Friday with Gunnar Morling: Hardwood
Chapters
Bruno and Gunnar introduce the Open Source Friday session and set the stage for discussing Hardwood, a Java Parquet reader built with AI-assisted workflows.
Gunnar Morling showcases Hardwood, a lean, multi-threaded Apache Parquet reader in Java, built to be dependency-light and AI-assisted in development and testing.
Summary
In this Open Source Friday conversation, Gunnar Morling and Bruno Borges dive into Hardwood, a Java-based Apache Parquet parser created to be minimal, fast, and easy to integrate. Morling explains the motivation: existing Parquet implementations carry heavy dependencies from the Hadoop ecosystem, which Hardwood sidesteps with zero mandatory dependencies and true multi-threading. The project centers on efficient columnar access, supporting both row-based and columnar APIs, with additional features like predicate push-down and S3-compatible remote access. A core theme is using performance tooling to push the design toward scalable analytics, including JDK Flight Recorder, Mission Control, and async-profiler to locate bottlenecks and guide optimizations. The discussion also highlights the practical workflow of building and testing with AI agents: design first, implement second, and iterate with a strong emphasis on public API stability and test-driven confidence. Hardwood's CLI, a separate native image built with GraalVM, showcases how a JVM library can yield a fast, platform-native tool for working with Parquet files. Morling presents the design philosophy in the repo: design docs, API boundaries, and a clear, test-driven approach to evolving the project while keeping maintenance approachable. The conversation ends with a tour of the repository's API surface (row reader, column reader, predicate filtering, and S3 access) and an invitation to contribute, star, and engage with the project on hardwood.dev. The session underlines a broader message: AI-assisted coding can accelerate thoughtful design and iteration, provided humans maintain ownership of architecture, testing, and documentation.
Key Takeaways
- Hardwood aims to parse Apache Parquet with minimal dependencies, avoiding the heavy Hadoop stack that plagues many implementations.
- Hardwood's core APIs include a Row Reader for record-by-record access and a Column Reader for batched, vectorized processing of a single column.
- Predicate push-down and projection capabilities enable efficient querying against large Parquet datasets stored remotely (e.g., S3) without downloading entire files.
- Performance tools like JDK Flight Recorder, Mission Control, and Async Profiler are used to identify bottlenecks and guide code improvements.
- AI-assisted development is used as a design and iteration partner, but Gunnar keeps API design, testing, and long-term maintenance under human control.
- An optional native CLI for Hardwood is built as a GraalVM native image, providing fast startup and cross-platform access to Parquet tooling.
- Contributing to Hardwood follows a design-led flow: draft designs in repo docs, implement in code, and validate against Apache Parquet test files.
Who Is This For?
Essential viewing for Java developers and data engineers who work with Apache Parquet and seek lightweight, high-performance parsing libraries. It’s especially valuable for maintainers exploring AI-assisted development workflows and open-source project governance.
Notable Quotes
"The vast majority of the code in this project is implemented using Claude Code."
—Gunnar comments on the extent of AI-assisted generation in Hardwood’s codebase.
"There is a design-first process... we would have a design document and then iterate on it."
—Describes the disciplined flow for using AI tools in development.
"This is a readymade CLI... a Swiss Army knife for dealing with Parquet files."
—Discusses the CLI utility added to the Hardwood project.
Questions This Video Answers
- How does Apache Parquet differ from CSV for big data analytics?
- Why would a Java Parquet parser avoid Hadoop dependencies, and what are the trade-offs?
- What are predicate push-down and columnar storage, and how do they speed up analytics on Parquet files?
- How can AI agents assist in building open-source Java libraries without compromising API stability?
- What tools show you why a Java Parquet reader is slow and how to fix it (JFR, Mission Control, Async Profiler)?
Full Transcript
All right, welcome everybody. Welcome to Open Source Friday. I am Bruno Borges. I'm a developer advocate at GitHub, Microsoft. And today we're going to talk about a project from the Java ecosystem that has been developed in collaboration with AI agents, but not solely. We're going to talk about what the project does, how the developer I'm going to welcome on stage built it, the principles used, and then learn how other developers and open source maintainers can use the techniques shared in today's video. So let's welcome on stage Gunnar. Hello! Hey there, hello. Thanks so much for having me, super excited. Yeah, my pleasure. Thank you for being here with us. Just for folks to understand: you are in Germany right now?
Yes, that's right. So, 7 p.m. Friday evening, you know, best time of the day to hop on a live stream with you and talk about this project. Awesome. I'm from Montreal, I live in Montreal, and it's amazing to have this connection, because, you know, you're in Germany, I'm in Montreal, and I'm sure we have viewers from all around the world on today's session of Open Source Friday. So tell us a little bit about yourself, and what are you working on these days?
Right. Yeah, so as you mentioned, I work as a software engineer and technologist at Confluent. That's the company which originally created Apache Kafka, and this role is, you know, quite multifaceted. So I spend a large share of my time just understanding where the market in the data streaming space is heading. What are the new technologies? How does AI, for instance, play into all that? I try to stay on top of all those projects like Kafka, Apache Flink, Debezium change data capture, which I used to work on for a long time, and then, for instance, to understand: is this relevant for our company, and how is it relevant?
So all this kind of stuff, let's say, you know, understanding the market and competitive analysis, that's part of it. But then I also spend a fair share of time writing on my blog, going out to conferences, thought leadership in the widest sense, I guess. And then I'm also, of course, working hands-on. I do open source projects like the one we're going to talk about today, and I'm helping the company with prototypes for new project ideas. So, all those kinds of things. All right.
So your blog, I think it's morling.dev, right? My last name, yeah. I write a lot about Java. So, you know, I've been part of the Java community for a long time. Maybe people have heard about the One Billion Row Challenge, which was a coding challenge I did like two years ago. So yeah, you know, I try to do fun stuff like that, but generally working with Java in the data space, in the data streaming space, that's sort of my sweet spot. So for folks who don't know, there was a challenge a couple years ago called the One Billion Row Challenge, which was how to process a file, or a bunch of files, not just single files.
Well, one file, one file with one billion rows. That was the idea. Yes, 1 billion rows in a single file. I'm pulling up here the link to that project on your GitHub. There you go. So it's github.com/gunnarmorling/1brc. What was the deal with that project? Like, why did you come up with it? Yeah, people keep asking. It was kind of random, to be honest. It was like two years ago, between Christmas and New Year's Eve; I had some free time over the holidays and I was tinkering around a bit with stuff, and then I felt, okay, let's see whether we can do this kind of coding challenge. Let's just see how far we can take Java to process that 1 billion rows, which by the way is like 13 gigabytes or so of data. And, well, then it kind of went viral. People from the Java community but also beyond joined. Still, every now and then I get notifications; people still open issues or try to contribute. I mean, it has been finished for two years now, but still people are talking about it. And also people ask, will there be another challenge? And yeah, maybe at some point. What about two billion records? Yeah, you could, or, you know, people say, let's have a one trillion row challenge or whatever. But no, I think it would have to be something different. It would have to be, you know, as easy to explain, like everybody understands that problem, but then it also needs to provide the potential to tinker with it for a couple of weeks, right? And so that's the challenge: to find another problem like that. So, I'll take one of the challenges of reading large files, and I'm going to ask you a bunch of stupid questions. Sure. So please do feel free to call me stupid. But the idea is to learn, right?
And to me, when I think about a challenge like reading a large file: one, if I know the data that I want, how can I access that data directly? Two, if the file is not structured in a way that I can access the data directly, how can I load that file in memory and then scan the data? But if the file is too big, what are the strategies that I have to use to find the data that I'm looking for? Were those some of the things that that challenge helped with?
Absolutely. And totally, that challenge was a lot about parallelizing the problem. Just to give some context: the file had 1 billion rows, with random temperature measurements keyed by weather station ids, and the task was to find the minimum, the maximum, and the average temperature value per weather station. So you had to scan all those 1 billion records, parse them, and essentially aggregate them, right, and find min, max, and average. So yes, it's a lot about parallelizing that problem.
I mean, it lends itself very well to being processed with multiple cores, right? So yes, you probably want to load it into native memory to work, you know, not on the heap, and you want to deal as much as possible with primitive data types, all those kinds of things. Yes. So, my recollection of that challenge was that the file was a CSV file. Yes, it was a semicolon-separated file, actually. Sort of CSV. Okay, not standardized CSV.
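To make the task concrete, here is a minimal, single-threaded sketch of the aggregation described above. The class and method names are made up for illustration; the actual 1BRC entries are far more optimized (memory-mapped I/O, parallelism, custom parsing).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the 1BRC task: for each "station;temperature" line,
// track min, max, sum, and count per station, then derive the average.
public class Aggregator {
    record Stats(double min, double max, double sum, long count) {
        Stats add(double v) {
            return new Stats(Math.min(min, v), Math.max(max, v), sum + v, count + 1);
        }
        double mean() { return sum / count; }
    }

    static Map<String, Stats> aggregate(String[] lines) {
        Map<String, Stats> result = new HashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(';');
            String station = line.substring(0, sep);
            double temp = Double.parseDouble(line.substring(sep + 1));
            // First occurrence seeds the stats; later ones fold into it.
            result.merge(station, new Stats(temp, temp, temp, 1),
                         (old, ignored) -> old.add(temp));
        }
        return result;
    }

    public static void main(String[] args) {
        var stats = aggregate(new String[] {"Hamburg;12.0", "Hamburg;8.0", "Montreal;-5.0"});
        Stats hh = stats.get("Hamburg");
        System.out.println(hh.min() + "/" + hh.mean() + "/" + hh.max()); // prints 8.0/10.0/12.0
    }
}
```

The fast entries avoid exactly what this sketch does naively: per-line `String` allocation, boxed map operations, and `Double.parseDouble` on the hot path.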
Yeah. And I mean, that was the thing, it all was a little bit random, because I didn't really expect that many people to participate. So, for instance, what I had done is, the output which I showed was just what Java's HashMap toString function would happen to give, and so that was kind of the output contract. People then tried to solve the challenge with all kinds of technologies, also for instance relational databases, and then they would go to great lengths to emulate that Java HashMap toString output with, like, a Postgres or whatever database, right? And it was just, yeah, I was kind of caught by surprise that that many people would participate. Had I known, of course, I would have been much more prescriptive about the constraints. But so we had to kind of build the plane as we were flying.
Was the SQL database implementation the most curious one, or was there another one where you felt like, okay, I never thought somebody would solve this problem this way? Oh yeah. I mean, there was, you know, COBOL, of course all the databases like Snowflake, all kinds of languages, of course Java. It was meant for the Java community, but then people also did it in all kinds of languages you could imagine. Tools, too; people would do it in awk on the CLI. So every possible tool you could imagine for solving that problem, people did use.
Have you tried since then? Because it's been a while, it's been a couple years, right, since that challenge went live. Have you thought about throwing coding agents at the problem and seeing what they come up with? Yeah, it would be interesting. I mean, back then, actually, there was one entry by Antonio Goncalves. He's a colleague of yours, right? Yeah. Back then, I don't know what he used. Was it Copilot? I'm not sure. It was pretty early for those coding agents, right? And back then he came up with something which solved the problem, but it was kind of slow.
I mean, to give people an idea: the top implementation would run on my machine with all the CPU cores, and this was a big machine, 32 cores; it would crank through that data set in 300 milliseconds. So it was insanely fast, but also super bespoke and optimized. And I think the implementation which he did took a couple of minutes. So, you know, definitely not up there. But now, of course, quite some time has passed and it would be interesting. I mean, all the agents have been trained on that repo, right? So I wonder if they would just pick the fastest implementation and have a variation of it, you know. So, when we talk about data sets: I mean, there's structured data in relational databases, and we have unstructured data in, right, NoSQL databases or data lakes and whatever.
But at some point the data is inside a file, right? And there are different file formats for hosting data. Some file formats are text-based, plain text, like, God forbid, YAML, but XML and JSON and TOML and, yes, whatever, right? CSV, for example. When it comes to binary formats for storing data, what are some of the options today? Right, so I think, as so often, it depends on the specific use case and what exactly you want to do with that data. This determines how you would want to store it.
Now, you mentioned CSV. I mean, that's a row-based format, right? Everybody has seen it: one record of your data, be it customers or temperature measurements or whatever it is, is one line in that file. And now if you want to get the entire data set, well, you can iterate through it. And if you want to have one entire record, well, you know it's a line at some offset in that file, so you can just go there and get that entire record. So that is useful if you want to access entire data records, right?
And you would probably add something like an index so you can do it efficiently, so you know where to go in that file. Typically, that's what you would consider an operational workload. So this is kind of how databases work. Well, they don't use CSV files, obviously, but they work with that row-based data representation. Now, there's another world where you want to do large-scale analytics queries on troves of data. So you don't want to access a single customer record or a single purchase order; instead, you're interested in processing the entire data set. You want to know: what's the average order value per customer category per month? For this kind of query, well, you need to look at large volumes of your data, right?
And then you aggregate it: you sum it up, you take the average, you find the maximum, you do maybe window queries, whatever it is, right? Now, for doing this sort of analytical query, storing the data in that row-based format is not good, because, let's say, just for the sake of the example, you want to find the maximum temperature value in that data set. Well, what would you have to do? You would have to go through all those rows. Then you would have to find where the specific column you're interested in is.
So you would have to navigate to that. Then you would have to go to the next row, and the next row, and so on. Right? It would be pretty inefficient. So this is where those so-called columnar file formats come into the picture, because they essentially pivot the data. They store the data in a columnar way. This means all the values of a specific column are stored consecutively. So all our temperature values would be stored one after the other, and then all the values from the next column would be stored one after the other.
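A tiny in-memory sketch of the difference (hypothetical code, not Hardwood's API): in a row layout you touch every record object just to read one field, while in a columnar layout the temperature column is one contiguous array you can scan tightly.

```java
// Hypothetical illustration of row-oriented vs. column-oriented access.
public class ColumnarScan {
    record Row(String station, long timestamp, double temperature) {}

    // Row-oriented: every Row must be visited just to read one field.
    static double maxRowOriented(Row[] rows) {
        double max = Double.NEGATIVE_INFINITY;
        for (Row row : rows) {
            max = Math.max(max, row.temperature());
        }
        return max;
    }

    // Column-oriented: the temperature column is a contiguous array,
    // which is cache-friendly and amenable to vectorized processing.
    static double maxColumnar(double[] temperatures) {
        double max = Double.NEGATIVE_INFINITY;
        for (double t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }
}
```

Both loops compute the same answer; the columnar one simply reads far less memory per value, which is the point Gunnar makes about cache friendliness.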
And now, if I want to go and find my maximum temperature, well, I can iterate very quickly through that chunk of temperature values, just by scanning that specific section of the file. So, you know, it's very caching-friendly. I can do things like vectorized operations on that data. And this is typically how you would store data for those analytical use cases. So, for folks to kind of visualize, and correct me if I'm wrong: when you have a traditional row-based file format or data structure, it's like an Excel spreadsheet, right?
But what you're doing is kind of turning the table to the side, like a 90-degree turn, and then now you have: okay, row one has column A and all the values for all the records. Right. Exactly. That's exactly right. And now this opens up a few interesting opportunities. So, for instance, let's say you do what's called a projection. You are just interested in that single column. Well, you just have to read that part of the file, right?
As you say, that wouldn't be a row, but, you know, that's a consecutive section of the file. So you just have to read that, or if you want to process it, well, you can go through it just by reading that section of the file. And also, it actually tends to be very efficient to store, depending on the kind of data. So let's say your values are ordered timestamps; maybe it's measurements, and each measurement has a timestamp, and this data is ordered.
Well, what you could do now is, instead of storing each of those full timestamps, you just store the very first one, and then you just store the delta to the next one. So maybe that's just two milliseconds after the first one; you just have to store a two instead of the full thing, right? This is what's called delta encoding, which is very space-efficient. You can compress the data very well. So those are, you know, all the reasons why you would store data this way if you have this kind of use case. And you can do that with anything, right? Not just timestamps; you can do that with financial values, with numbers. Absolutely. And exactly, there's all kinds of encodings.
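The delta-encoding idea above can be sketched in a few lines (a simplified illustration; Parquet's actual delta encodings additionally bit-pack the deltas into blocks):

```java
import java.util.Arrays;

// Simplified sketch of delta encoding: store the first value in full,
// then only the (typically small) difference to the previous value.
public class DeltaEncoding {
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        deltas[0] = values[0];                      // first value stored as-is
        for (int i = 1; i < values.length; i++) {
            deltas[i] = values[i] - values[i - 1];  // delta to predecessor
        }
        return deltas;
    }

    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        values[0] = deltas[0];
        for (int i = 1; i < deltas.length; i++) {
            values[i] = values[i - 1] + deltas[i];  // running sum restores originals
        }
        return values;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_700_000_000_000L, 1_700_000_000_002L, 1_700_000_000_005L};
        // Only the first entry is large; the rest are tiny and compress well.
        System.out.println(Arrays.toString(encode(timestamps))); // [1700000000000, 2, 3]
    }
}
```
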
So, for instance, there's what's called dictionary encoding. Let's say you do have strings, but it's a known set; maybe it's a hundred different city names. So instead of spelling out those names, you could create a dictionary. You'd say, okay, Montreal, that's one; Hamburg, that's two; and so on. And then in your file you would just store the dictionary at some place, right, and in the actual column values you would just store those ids. Then, when you parse the file, you would go to the dictionary and be able to resolve those values. So it kind of is compressed by design, right? Like, the file is very compression-friendly. Yes, exactly, because if you have, let's say, just those numbers one after the other, this can be compressed very efficiently.
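Dictionary encoding as described can be sketched like this (an illustrative toy, not Parquet's on-disk representation): each distinct string is assigned a small integer id, the dictionary is stored once, and the column stores only ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of dictionary encoding for a low-cardinality string column.
public class DictionaryEncoding {
    final List<String> dictionary = new ArrayList<>();       // id -> string
    final Map<String, Integer> ids = new HashMap<>();        // string -> id

    int[] encode(List<String> values) {
        int[] encoded = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            // Assign the next free id the first time a value is seen.
            encoded[i] = ids.computeIfAbsent(values.get(i), v -> {
                dictionary.add(v);
                return dictionary.size() - 1;
            });
        }
        return encoded;
    }

    String decode(int id) {
        return dictionary.get(id);
    }
}
```

With a hundred city names repeated millions of times, the column shrinks to small integers plus one dictionary, and those repetitive integers then compress very well, which is the "compressed by design" point.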
So what options do developers have today if they come to a point where, hey, I need a columnar file format and I need to transpose my data from my database into this file so I can run analytical processing on it? What are the options today? Right, great question. I would say most of the time people would maybe even do that already, maybe not even knowingly, because if you are using what's called a data lake, so something like Apache Iceberg, this is exactly doing that. Those kinds of so-called open table formats manage those columnar files. And now there are different options, but Apache Parquet, that's the most prominent one. It has been around for quite a while, and yeah, that's definitely the de facto standard for those columnar files.
So Apache Parquet is this one here, right? Exactly. Right, that I put on the website. So Apache Parquet is a project that defines the file format. Right. So they define the file format. There's a very well-written specification. There's an implementation, of course, so there's the Parquet Java parser and writer project. And then also, very importantly, there's a canonical suite of test files. They have, you know, a couple hundred Parquet files which show all the different encodings, different compression algorithms, all the different features, different data types, and so on.
And now, if you want to build a new implementation of that format, well, having that specification and also having those test files, that's super valuable. Yeah, I'm looking at the documentation here, and interesting. Right. So yeah, you know, using Thrift they give you the definition of all the different blocks of the file. There's what's called column chunks, which are organized in row groups, and using those schema definitions here, you have a very precise understanding of how that is structured.
What is the magic number used for? It's just there to tell you, you know, that this is a Parquet file. So a little bit like every Java class file, which starts with CAFEBABE. Right. So that's the same thing here. Okay. Gotcha. Cool. So, when it comes to implementations, there are a few options here, I see. I see one is missing. I should say, there is. Yeah, let's talk about that. So, I mean, clearly there's a bunch of implementations. I see three implementations in C++.
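For illustration: per the Parquet format spec, the magic number is the four ASCII bytes `PAR1` at the start (and end) of the file, analogous to the `0xCAFEBABE` prefix of a Java class file. A hypothetical sniffing check might look like this:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: does a byte stream start with Parquet's "PAR1" magic?
public class MagicNumber {
    static boolean looksLikeParquet(byte[] header) {
        return header.length >= 4
                && header[0] == 'P' && header[1] == 'A'
                && header[2] == 'R' && header[3] == '1';
    }

    static boolean looksLikeParquet(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeParquet(in.readNBytes(4));  // inspect only the first 4 bytes
        }
    }
}
```

A real reader does more than sniff the header: it also uses the trailing `PAR1` plus the footer length stored just before it to locate the file metadata at the end of the file.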
One of them, I don't know what cuDF C++ is. Do you know? No, I'm not sure. I have no idea. But I see two in Rust, of course one in Go, and one in JavaScript. I'm sure there are others in other languages. It's widespread. But do you want to talk about why you came up with a new implementation? Right. So yes, as you see here, there is an implementation in Java which is widely used, and people use it successfully, but I think there were two reasons for me to build a new implementation. And we should say, that's what we're talking about, right: this new project, Hardwood, is a new implementation of a Parquet parser in Java. So let me put it up here on the screen, the project Hardwood, this one here.
Okay. Right. Exactly. So, you know, the idea is: parquet, hardwood, different kinds of flooring. That's where the name is coming from. And so, why a new implementation? Well, a couple of reasons. First of all, the existing one has been around for quite some time, and in particular it has quite some heavy dependency baggage. So essentially, you just want to parse a Parquet file, but you pull in the entire Hadoop stack of dependencies. That's dozens of jars on your classpath.
It's not a fully standalone parsing library, which made sense because it came from that context, and this is how the project came to be. But that's a huge problem, right? You have this huge dependency footprint, and it opens all kinds of questions: having different versions of your jars on your classpath, whether all those transitive dependencies are secure and up to date, whether they conflict with each other. So it's kind of a hassle, I would say. That's the dependency situation. And then also, it's single-threaded, and this problem lends itself very well to being parallelized, as we learned during the One Billion Row Challenge.
And so I thought, okay, it would be nice to build a new parser, and, well, hopefully in the future also a writer, for Apache Parquet which has minimal dependencies. So that's one of the leading paradigms: there are no mandatory dependencies. There are some optional dependencies, and we can talk about those later on, but no mandatory ones. If you pull the Hardwood core jar into your project, it has no mandatory transitive dependencies. So that's the one thing, and then, well, it's fully multi-threaded.
So it tries to take advantage of all the CPU cores you have. That's the idea. So, you know, I felt there's just a need for building a new parser with those ideas in mind. And then also, I saw it as a nice opportunity to put those agentic coding tools to work and see how good they are for building such a relatively low-level thing, right? I'm not building a CRUD UI React web application here; it's relatively low-level. And I was curious: okay, how can I do this with those coding agents? And I think, yeah, I'm going to open up your blog here, because when you announced the project back in February, there was something really cool, right?
I love this section here. I mean, for folks who want to go deeper into what Gunnar just said about the why of the project, his blog post on morling.dev covers it in detail, right? And there is the part on parsing performance, and you just mentioned how the One Billion Row Challenge helped come up with some design principles for this new implementation. I think that's fascinating. I wonder what other stories we have from people who solved Advent of Code and then came up with ideas for new libraries, you know? Right, yeah, it's kind of funny, because back then people also asked, hey, why are you doing this challenge? And there were different ideas, let's say conspiracy theories.
So people said, oh yeah, you know, because GraalVM native binaries did really well in the challenge, they said this is a hidden advertisement for GraalVM, which it was not. And some people said, okay, he just had this problem at his workplace and couldn't figure out how to solve it himself, so he kind of farmed it out to the community. Which also wasn't the case. I just thought it's a fun challenge. But now, actually, yes, I'm taking some of those learnings and applying them to building this production-worthy Parquet implementation.
Well, I forgot to ask you this in the beginning, because you said, "Oh, it was at the end of the year and you decided to do the One Billion Row Challenge, right?" Was that because you got bored, because you finished Advent of Code? And, you know, you always have some slow time "between the years," as we say in German, right? And so I just leaned into that, and this year, actually, or let's say last Christmas, well, this was when Hardwood was born. Awesome. Awesome. So, back to your blog.
So there is a performance section. And, I mean, for folks from outside the Java ecosystem: the HotSpot JVM, the runtime of the Java ecosystem, has amazing capabilities for analyzing performance, right? And I think a lot of folks from outside the ecosystem, and even some folks inside it, don't know these capabilities. So can you talk a little bit about what we're looking at here, and a little bit about the tool set that you used for performance analysis? Absolutely. I mean, what we see here is a tool which is called JDK Flight Recorder.
So this has been around in Java for quite some time, but initially it was not open source and there were some licensing questions around it; since a couple of versions, though, it's open source and you can use it freely. And essentially it's exactly that: a recorder for all kinds of events, which, again, are stored in a very efficient file format, actually. The idea is to have a place to store all kinds of diagnostic information: about garbage collection, about class loading, about method profiling, about locking, about I/O, all kinds of metrics. And they keep adding more and more event types in each Java version.
So they are stored in this Flight Recorder file format, and here we see this tool, Mission Control; that's the client for analyzing and examining those recordings. So yes, you can use those tools to get really deep insight into the performance characteristics of your Java application. Let's say you have, I don't know, high tail latency; you would be able to understand, okay, this is related, I don't know, to garbage collection at certain points in time, and why are we creating that many objects? You can see what your allocation hotspots are and what we can do about it. There are tons of built-in metrics, but, and I'm a big fan of this, there's also the ability to have custom Flight Recorder events, and that's what I'm using in the Hardwood project. There are event types which will be emitted, I don't know, if we have finished a row group, or if our consumer is blocking, waiting for I/O; we would emit such an event, and then you can use Mission Control to gain those performance insights. So that's a big one. And then async-profiler, that's, I would say, the standard profiler in Java for identifying your bottlenecks in terms of CPU time, wall-clock time, and, again, also memory allocation and so on.
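For readers who haven't seen the custom-event API: a JFR event is just a class extending `jdk.jfr.Event`. The event name and fields below are made up for illustration and are not Hardwood's actual event types.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical custom JFR event, similar in spirit to the row-group events
// Gunnar describes (names and fields are invented for this sketch).
@Name("example.RowGroupRead")
@Label("Row Group Read")
class RowGroupReadEvent extends Event {
    @Label("Row Group Index")
    int rowGroupIndex;

    @Label("Row Count")
    long rowCount;
}

public class JfrExample {
    public static void main(String[] args) {
        RowGroupReadEvent event = new RowGroupReadEvent();
        event.begin();               // start timing the operation
        // ... read the row group here ...
        event.rowGroupIndex = 0;
        event.rowCount = 1_000_000;
        event.commit();              // record it; cheap no-op unless JFR is recording
    }
}
```

Recording can then be enabled with `-XX:StartFlightRecording=filename=rec.jfr`, and the emitted events show up in JDK Mission Control alongside the built-in ones.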
So we have two tools here that can be used by not just Java developers but any JVM-based language developer using the HotSpot JVM. So that is Java, Scala, Kotlin, Clojure, even Ruby, like if somebody's using JRuby, right, the Ruby implementation for the JVM. They can use these tools for performance analysis of their applications, their libraries, and so on. It's really cool. So I put here in the stream the link to dev.java. It's a website where developers can learn about Java, and there's this page for JFR, the JDK Flight Recorder.
So you can learn more. And yeah, that's great. I mean, as you say, it's still a bit underutilized; not too many people know about it. So definitely, Flight Recorder is great, and yes, async-profiler, that's the profiler which I, but also most people in the Java community, use heavily. You know what's fun? About two months ago, I decided to use AI agents to build an SVG-to-PNG converter tool, or library, in Java. And after having some design ideas for the API, I actually told the agent: hey, use CairoSVG, which is a Python library, as the API design, to expose the same approach for how the developer would interact with the library, with the API.
But the internal implementation is completely different, because it relies on the Java 2D graphics API. But then eventually I was like, hey agent, now let's do performance testing, right? And I said: go do some flame graphs, go do some analysis of where the bottleneck is. And the agent decided by itself to use async-profiler. I was not explicit about async-profiler; the agent found out, oh, I can use async-profiler for that. Right. Yeah. And it can actually also analyze those flame graphs.
So it's all coming together. For folks who want to learn more about async-profiler, which produces those flame graphs, here's the URL: github.com/async-profiler. Right. Cool. So, back to your blog. The performance analysis helped a lot with making sure you tackle every bottleneck, and after you tackle one bottleneck you go to the next one, and the next one, while making sure the design stays clean at the same time, so you can add capabilities in the future. But the section that I love most is this one here: "built with AI, not by AI". Can you tell us about this? Right, absolutely. As I mentioned, one idea here also was: let me use this project as some sort of real-world test bed to see how far we can get with the modern agentic coding tools. And I would say they are doing a really good job. The vast majority of the code in this project (I'm using Claude Code, I should say) is built by Claude. But I'm not vibe coding, and I also just think that's not a good idea, generally speaking. So yes, I use those tools a lot to generate code, but then I want to make sure I understand it and its structure, and I'm actually very deliberate and explicit about APIs: what is public, what is internal to our code base, how the general architecture looks. I work with a relatively prescriptive design process, and you can actually see that in the repo: we have those design documents.
So typically that would be the first phase. Say we want to add support for a certain feature, I don't know, bloom filters. We would first have the coding agent do a design for that, and then we would iterate on it: try to nail down corner cases, think about testing, think about what the user-facing API of that is. And then we would have it actually implemented. So yes, that's exactly why it's called "built with AI": I'm heavily using this, also for reviewing contributions and so on, but it's not AI in the driver's seat. I would see it as one of the tools we are using. Right, and I feel like this is the reason why I like to advocate for AI agents to be part of the flow, not an isolated flow, like when I did the SVG library.
I was interacting heavily with it, but at some point I felt like some tasks I could delegate and then just review afterwards. Right. Absolutely. I mean, sometimes what it gives you is also just not very good, right? And so you've got to guide it in a different direction. And we have all seen that, right? When it tells you, "oh yeah, that's a great idea" and "you're so right". Sometimes it can also become a bit annoying. Absolutely right.
"You're absolutely right." And so why did you not think about it in the first place, then? But yeah, I think in no way would I have made that much progress in that short a period of time without using those tools. But there are also limits. In terms of understanding the code: as I mentioned, I think it's very important to really be on top of it. But the way I think about it is, there are certain pockets of the code which I don't fully understand down to the last line. Let's say I haven't looked closely at one out of the five different encodings, because they can be tested very well.
They are very well defined in the specification. And so for those, I'm fine with knowing, okay, there's this implementation, it passes all those test files which the Parquet community provides, and then I'm good. So I'm not looking super closely at those parts of the code, because it's the abstraction principle, right? Behind the interface, you don't need to care that much how it looks in there. But what I'm really keen on is the overall architecture: the IO flow, how we parallelize, what is public API and what is not. I'm really on top of those things, because they are harder to change, right? And I think that's true with everything: you want to pay the most attention to the things which are hard or costly to change. It's almost like, in the same way that we interact with LLMs without exactly understanding the implementation details of the LLM...
Libraries are similar to that. For the end user who's using the library, they don't care about the implementation details, to some degree, as long as it's well documented and they understand: oh, this method is thread-safe, this one is not, things like that. But the outside, the public API, as you said, is the part you want to have strict control over. And once you have that, it allows you to make changes internally, using AI or not, without affecting the end user.
Yes, absolutely. That's also why I feel those end-to-end tests work just really well for this kind of implementation approach. Because in the end, I have a set of test files which I get from the Parquet community (they provide them), and I can compare the Hardwood parsing result to what the upstream parquet-java parser emits. And if there's a difference, well, then probably there's a bug in our implementation, right?
And so I can iterate on those things very easily. Initially we didn't pass all those test files, and this was a sweet spot for the AI tools. I could say: okay, there's this one test file, we don't support it yet, the test is failing, or we have a different result than the other parser; please go and make it work. And this works perfectly with those tools. I used a similar approach when I needed more test cases for the SVG to PNG library.
I just said: hey AI, go build like 50 SVG use cases that cover all the capabilities of the SVG 1.1 format, so that I could have unit tests for every single capability that was implemented, but then also a mix of them: how different features of the SVG format work together, and whether the library would render them properly. Absolutely. Yes. So, would you mind showing us your library, talking about it while showing the code, and maybe running some examples, just so people can get a feeling for what the code looks like? Right, yeah, absolutely. I'm also interested in you showing your flow, how you interact with AI coding agents. You mentioned that you use Claude, and that's totally fine; we are happy to show Claude here on Open Source Friday. What is really interesting, and what I think people will actually enjoy, is seeing how you as the maintainer of a library use AI coding agents. Whether that's Claude, Gemini, Codex, or GitHub Copilot doesn't matter. What matters is: how can other maintainers learn from your experience using AI coding agents on an open source project?
Yeah, I should say, I feel like I'm still super early on my own journey, and I'm not sure how much people can learn from me. I feel like I'm learning every day myself. But yes, of course I'm happy to share; we can take a look. So, you want to bring my screen online? Yeah, sorry about that. There you go. Right. Okay. So here we have the Hardwood repo on GitHub. By the way, you totally should give it a star. We should have many more stars.
That would be awesome. So yeah, here's the code. We have this website which shows you how to get started, what the Maven coordinates are, and so on. That's hardwood.dev. So let's start with the API. It's, I would say, relatively straightforward. There's this thing called a Parquet file reader. By the way, is this big enough? Should we make it bigger? A little bit? All right. So there's what's called a Parquet file reader.
You can point it to a Parquet file and then retrieve the values in two flavors. There's what we call the row reader API, and this essentially gives you a row-based representation. We go through all the records, which are stored in a columnar way in the file, but here we access them as records, right? So I have this API where I iterate through the records in the file, and we get all the values from those column chunks: give me a long for the ID field, give me a string for the name field, and so on.
So that's the most basic way, and there is also what's called the columnar API. Because the row reader, as you see, kind of goes against the grain of dealing with columnar data: say you want to aggregate all the values of, I don't know, the balance column. Doing that row by row doesn't seem very efficient, right? So this is why there's also what we call the column API. There's a column reader, and this column reader gives us the values from one specific column. And there's also a multi-column reader and things like that.
But here in this case, let's say I have this fare amount column. Maybe people even know this: it's the NYC taxi ride dataset, a pretty popular dataset of Parquet files which you can get from the New York transit authority, with all the taxi rides in New York City. So say we are interested in those fare amount values; here we actually get batches, essentially arrays of values, from that column. And you can process those very efficiently, because it's arrays of consecutive values; you could even think about threading this out and processing multiple columns in parallel. So we have those two flavors, row-based and column-based, and that's how you generally deal with the data. Now there are many more things you can do, for instance filtering, if I'm interested only in a specific subset of the data.
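The difference between the two flavors can be illustrated with a self-contained sketch. This is not Hardwood's API; the class and record names are invented. It just shows the shape of the two access patterns, summing a column through per-row objects versus through primitive batches:

```java
import java.util.List;

// Illustration only: row-based vs. columnar access over the same logical data.
public class AccessPatterns {

    // Row-based: each record is materialized as an object.
    public record Ride(long id, double fareAmount) {}

    public static double sumFaresRowBased(List<Ride> rows) {
        double total = 0;
        for (Ride ride : rows) {       // one object dereference per record
            total += ride.fareAmount();
        }
        return total;
    }

    // Columnar: the reader hands back batches of consecutive primitive values,
    // which the JIT can turn into a tight (potentially vectorized) loop.
    public static double sumFaresColumnar(List<double[]> batches) {
        double total = 0;
        for (double[] batch : batches) {
            for (double fare : batch) {
                total += fare;
            }
        }
        return total;
    }
}
```

For an aggregation over one column, the columnar path touches only that column's values and never materializes whole-row objects, which is why it scales better for analytics.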
Say I have this kind of HR data, and I'm only interested in those records where the salary is greater than 50,000. Here we do what's called a predicate push-down, because very commonly those files are not on my own local disk but are stored on object storage, let's say S3. And, I don't know what the counterpart in Azure is, Bruno, you need to help me out. Well, we have Azure Storage, that would be it. Azure Storage, right. So those files would be stored there, and if you think about it, you don't want to download the entire file if you can avoid it, right? Instead, you would only fetch those parts of the file which contain the records you're interested in, and Parquet supports that, for instance, with statistics. There's a way it tells you: okay, in this chunk of that column, from this offset to that offset, you would only find records with salaries between 70 and 80, for the sake of the example. So now if I run this sort of query, say I'm only interested in anything below 50, well, then I don't even have to fetch that entire chunk of the file. That's what's called predicate push-down: we only get those sections of the file which match those filters.
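The chunk-skipping decision described above can be sketched as follows. The `ChunkStats` record and the salary example are invented for illustration; real Parquet metadata is considerably richer, but the principle is the same: decide from min/max statistics alone which byte ranges are worth fetching at all.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of statistics-based predicate push-down, not Hardwood's API.
public class PredicatePushDown {

    // Per-chunk statistics: byte range in the file plus min/max of the values.
    public record ChunkStats(long startOffset, long endOffset, double min, double max) {}

    // From metadata alone, keep only the chunks that can possibly contain a
    // value satisfying "value < threshold"; everything else is never fetched.
    public static List<ChunkStats> chunksToFetchForLessThan(
            List<ChunkStats> chunks, double threshold) {
        List<ChunkStats> needed = new ArrayList<>();
        for (ChunkStats chunk : chunks) {
            if (chunk.min() < threshold) { // some value in [min, max] may match
                needed.add(chunk);
            }
            // else: every value in the chunk is >= threshold, skip its byte range
        }
        return needed;
    }
}
```

Over remote object storage this translates directly into fewer (and smaller) ranged GET requests.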
So we have that. And of course I can do what's called projections, so I only get specific columns; I don't need to fetch the entire thing. So yeah, there's all that. Let's see what else is there. There's this S3 API I mentioned already: instead of reading files locally, I can get them from remote object storage. And this is actually an interesting thing which I realized while I was working on it. If I'm using object storage, very often I would use the S3 SDK for doing that. AWS gives you an SDK for accessing object storage, which works great, but again, it's pretty dependency-heavy. So if you look at that SDK: do I want to pull in this dependency and everything it pulls in itself?
So what I did here instead is I implemented pretty much everything from scratch. Because at the end of the day, this is just HTTP requests, right? It's a REST API. So we can use the built-in HTTP client in Java to fetch those files. Now, in the case of S3, those requests need to be signed, and there's a bespoke signing algorithm; AWS documents how you have to sign those requests. Back in the day, this would not have been something I would have considered implementing myself, because I feel I could mess it up.
I wouldn't want to maintain that code; I would just take the dependency. But here, I think the consideration is actually very different, because they define that signing algorithm very well, and they give you a test suite of all kinds of edge cases for how the signing has to look. So sending off a coding agent to implement this works very well, and so I have my own custom request signer. It passes the test suite, so I know it's guaranteed to work with S3, and I don't have to pull in an external dependency.
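The signing algorithm in question, AWS Signature Version 4, derives its signing key through a documented chain of HMAC-SHA256 steps over the date, region, and service. Here is a minimal sketch of just that derivation step (not a full request signer), using only the JDK's `javax.crypto`; the class name is ours, not Hardwood's:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch of the AWS SigV4 signing-key derivation, per the AWS documentation.
// Key derivation only; canonical-request construction is out of scope here.
public class SigV4Sketch {

    static byte[] hmacSha256(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    // kSigning = HMAC(HMAC(HMAC(HMAC("AWS4" + secret, date), region), service), "aws4_request")
    public static byte[] signingKey(String secretKey, String date, String region, String service) {
        byte[] kDate = hmacSha256(("AWS4" + secretKey).getBytes(StandardCharsets.UTF_8), date);
        byte[] kRegion = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }
}
```

Because AWS publishes test vectors for exactly this derivation, an agent-written signer can be verified mechanically, which is the point being made here about when taking ownership of such code becomes reasonable.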
So this entire make-versus-buy consideration (well, not buy, because I'm not buying those dependencies, but let's call it make versus buy) gets shifted dramatically with those tools. You can actually take ownership of code and functionality which you typically would not have in the past. Now, you mentioned something interesting here: you asked the coding agent to go and implement an S3 client component within your library, basically. But because you have such a specific use case, it didn't have to implement all the capabilities that the S3 SDK has. Exactly. There is a question, though: if the S3 SDK were not open source, how would the coding agent's result have turned out? I tend to believe it would not be that bad either, because S3 does have a REST API; the API is documented, right? So I think it would work well. But the thing is, you need to be cautious about where you draw the line, because there's actually one small dependency which we do have here, and it's related to how you authenticate. If you have used S3, you will know there's a gazillion different ways you could authenticate: machine identities, IAM roles, local access keys, and so on. I felt I didn't want to implement all those authentication mechanisms myself.
So this is where I pull something in. But yeah, the needle shifts quite a bit. Interesting. So we got a question here on the stream: can we show the CLI? Absolutely. Yes. For context: this started as a library, so you would use it in your JVM-based applications to parse Parquet files. But then at some point I had this idea: this could also be a ready-made tool, right? It's a little bit like curl and libcurl; often you have those CLI tools which expose the functionality of a given library. But I thought this was something I wanted to tackle later on; right now I'm focusing on the parsing aspect, that's what I want to nail down for 1.0. But then, you know, it's open source, and if people are interested, they work on the stuff they are interested in and what they need.
So actually a contributor, Brandon Brown, came and said: okay, I want to work on that CLI, and I want to work on it right now. And that's why we have the CLI, and I have a little demo here. Let me make this bigger. The idea is, it's like a Swiss army knife for dealing with Parquet files. I don't want to reimplement DuckDB or a full-fledged database client, but it's a Swiss army knife for Parquet files, for instance, to show me the schema.
So this is something I can do. Or to get some metadata: I can see, okay, this is the kind of data that's in this file; these are a couple of rows from the file. I can take a look at those dictionaries which we mentioned, the encodings and so on, and I can convert data. So that's the idea: a Swiss army knife for Parquet files. And now the interesting thing, to me anyway, is that this CLI is also written in Java, and it's very snappy. In the past, people often felt that Java is not great for CLI tools because it's too slow to start up, that kind of thing. But this is actually a native binary which we distribute for Windows, Mac, and Linux, built with GraalVM. It's a self-contained native image which starts up super fast, so you can also use Java to build those CLIs.
So this is what it does, and if people want to play with the CLI, you can get it from the releases page. Everything is on GitHub, and as I mentioned, we have those ready-made CLI binaries for all the different platforms, both Intel and ARM. Okay, we got another question here: how does this compare to Polars? I don't know what Polars is. Yeah, that's one I should know. I'm not 100% sure either; is Polars not like a data frame library or something?
So before I say something stupid, can I ask Claude about it? Let's see what it says. Yeah, let's do that. "How does Hardwood compare to Polars?" I have no idea; I'm curious. Let's see. For folks tuning in on the stream, we are talking about the project Hardwood, which is a Java-based implementation of a library to parse and read Parquet files, right? 100%, exactly right. Oh yes, so it says: a very different category. Hardwood, a Java library and CLI for parsing Parquet files, minimal dependencies, all the things we spoke about. And Polars, a Rust... oh, I see, so this is based on the Apache Arrow memory format. Yes, okay. Apache Arrow is an in-memory format, not a persistent file format, but an in-memory format also dealing with columnar data. So there's a relationship between storing columnar data in files with Parquet and then reading it into Apache Arrow in-memory vectors.
So yeah, people can see it here, right? Different tools for different use cases, I would say. There is some overlap, as I see it here: if you want to analyze data, filter, aggregate, join, then Hardwood would give you at least some filtering capabilities, but for instance not joining. So I suppose Polars is on another level, let's say. So, now you've explained the tools, you've explained the library and the API, and you've explained the CLI.
Can you walk us through a little bit: if you go to the GitHub issues right now on the Hardwood project and you decide, you know, I'm going to tackle this issue here, what does your flow look like? As a contributor, or as a maintainer? As the maintainer of the project. All right. Okay: this issue here is bugging me as well, it was filed by another person, or maybe by you, a few weeks ago, and you say, okay, I got tired.
I already solved all of Advent of Code, I already did the One Billion Row Challenge again, and I have nothing else to do; I'm going to fix another issue on Hardwood. Right. Yeah. To be honest, I would point my coding agent to it, and then it depends a little bit on the complexity. Either this warrants a design document, and I want to say: okay, give me a design document. By the way, we have those all in the repo as well, so people can see them; let me find them.
There you go. So I would ask it to create something like that, if it's a complex thing, like, okay, let's talk about predicate push-down. We have that by now, but let's say we wanted it. I would ask it to create something like that, and then we would iterate, we would talk about it: oh, I feel like something's missing here, and so on. And at some point I would feel, okay, this is ready to be implemented. So that could happen, or it could be, say, a bug fix, like we spoke about.
For instance, earlier today there was actually an issue where somebody asked: hey, these inequality filters and how they work with null values, that's not sufficiently documented. This is something I can just point Claude, or whatever tool, at: okay, please make the doc updates for that. Or maybe: explain that issue to me first of all; maybe I'm a bit dense and I don't understand, so walk me through that issue first. So that's what I would typically do. You know, I love the idea of the designs folder, because at the end of the day, that's what spec-driven development is about, right? You first come up with a specification; whether you use AI or not is irrelevant. The point is you have a specification first: well thought out, ideally reviewed by a human, ideally even reviewed before a human by other agents, maybe using different models. The idea is: here's my concept, I have a challenge, I have a problem I need to solve, and here's a way to solve it. Absolutely. One thing that maintainers can consider; you mentioned evaluating the effort of fixing a bug or implementing a feature.
There is something really cool on GitHub now called GitHub Agentic Workflows. This capability allows developers, and especially maintainers, to create agentic workflows, like GitHub Actions but agentic, so they trigger a coding agent. You can do things like: if an issue is filed, trigger the workflow automatically to do an analysis of that issue and provide maybe an effort estimate, identify duplicates, identify scope or context, add relevant data that is missing from the issue, or maybe reformat the issue submitted by the developer. I've seen a few projects using it now, especially of course at Microsoft and GitHub, for exactly that purpose. Right, yeah, absolutely. The sky's the limit. I actually have it hooked up with the GitHub CLI as well. Well, no, not authenticated right now, but usually, if I were authenticated, it would be able to interact with my issue tracker.
Sometimes I ask it: hey, can you quickly log a follow-up issue for that particular corner case? I ask it to grab issues from the tracker, and so on; I do that a lot. And actually, I would like to get to a state (right now all this happens here in my terminal) where I can also have this planning phase on the designs happen through GitHub issues. I would like to have a conversation there: it would maybe come up with a draft in the GitHub issue, and then I would read it and maybe reply by email.
"Okay, this sounds great, but can you also think about that other thing?" That would be great. That's something you can do right now. You can go and create an issue, propose an idea, and then assign it to Copilot with an additional prompt: hey, come up with, like, two different proposals for this problem. Then you have a pull request created, associated with that issue, with the plans, and you can instruct it to drop those plan files in your underscore-designs folder. Absolutely, yeah. But for me it's always very important to have this human-in-the-loop step: I want to review the stuff. Because at the end of the day, when people send a PR and it's crap, you cannot say "AI did it", right? You have the ownership. The moment you propose something, you have to own it, and if you propose crap because AI told you so, that's totally on you. You have to be on top of it. I think, when I did a contribution to the project recently, I went through a few iterations on it.
So it's a pull request that has lots of commits, not a single commit, like "oh, here's a problem, go solve it". There's a bunch of back and forth, and I think that's a sign. So, developers who used to squash-merge and put a pull request into a single commit: I actually kind of feel the other way around now. I think pull requests should have as many commits as are required, as part of the flow of developing and iterating, showing that the developer cares about what they're proposing. Totally, and you can rework them along the way, by the way. And in the end you can squash them, right? You don't need that intermediate history; when you merge the PR, you can squash it. But I think sending a pull request with a single commit...
I think that can show that the developer did not care and just used AI to single-shot the whole thing. Yeah, totally right. Cool. All right, we are at the end of the stream. So how can folks reach out to you, Gunnar, if they have questions about Hardwood, Confluent, Kafka, Parquet? Yeah, I'm in all kinds of places, let me see. People can find me on my blog, on my website; you can find me on Twitter, you can find me on LinkedIn.
So yeah, you can reach out to me on all those channels. If you want to contribute to the project (by now a dozen or so people have contributed), you're very welcome to do so. Go to hardwood.dev, that's the starting point. Give it a GitHub star if you haven't done so; that would be amazing. And yeah, I'm super excited about all of that. And you're also on Twitter, Bluesky, all the social networks, I guess? Oh, right. Yes. I mean, I'm not sure.
Is Bluesky still a thing? I don't know, but I'm there, and I'm on LinkedIn, on Twitter, and obviously on GitHub, and you can send me an email. Any of those is great. Awesome. All right, Gunnar, thank you so much for this. This was a great session; appreciate you sharing the project and the work you've been doing. For the next Open Source Friday, we're going to have Tesa from Kong, who will explain API gateways, AI connectivity, and also another project called OpenMeter.
Have you heard about OpenMeter? No, I don't think so. Okay. Well, I've heard about Kong. Yeah, maybe you should tune in to Open Source Friday next week. That's a good point. Awesome, Bruno. This was fun. Thank you so much for having me. This was great, man, really great. Thank you so much. And for everybody watching, thank you for following, and see you next Friday. Bye-bye.