LLMs to Transform Data

Metadata

Author: Javi Santana
Full Title: LLMs to Transform Data
URL: https://failingwithdata.substack.com/p/llms-to-transform-data

Highlights

I transform data every day, and I usually do two kinds of transformations: changing the data format so I can use it in a tool (CSV to Parquet), or changing the shape, like running an aggregation so I can understand it. I’m using LLMs more and more for this because it saves me a lot of time (and it’s cool). (View Highlight)
If you don’t know what the llm command is, please go check out the fantastic llm CLI tool from Simon Willison. The second one, having the LLM generate the transformation code instead of transforming the data itself, has many benefits:
• The code will run way faster; LLMs are still slow compared to regular CPUs
• The transformation can be audited and fixed
(View Highlight)
Let’s test it. I have a file with NMEA records from a GPS. NMEA, according to Wikipedia, “is a combined electrical and data specification for communication between marine electronics such as echo sounder, sonars, anemometer, gyrocompass, autopilot, GPS receivers and many other types of instruments”. If NMEA were invented today it would have been NDJSON, but at that time machines were sending data over a 9600 baud comm line, so they needed to optimize. Parsing is also super easy (they probably couldn’t afford to spend a lot of code on parsing), but let’s get back to the transformation thing. (View Highlight)
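To illustrate why parsing is easy, here is a minimal sketch of validating and splitting one NMEA sentence. The function names are hypothetical, and the sample is the widely circulated textbook GPRMC sentence, not data from the author's car. It assumes the usual NMEA 0183 framing: a leading `$`, comma-separated fields, and a `*` followed by the XOR of the body as two hex digits.

```python
from functools import reduce


def nmea_checksum(body: str) -> str:
    """XOR of every character between '$' and '*', as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), body, 0):02X}"


def split_sentence(sentence: str) -> list[str]:
    """Validate the checksum and return the comma-separated fields."""
    body, _, checksum = sentence.strip().lstrip("$").partition("*")
    if checksum.upper() != nmea_checksum(body):
        raise ValueError(f"bad checksum in {sentence!r}")
    return body.split(",")


# Textbook sample sentence (an assumption, not the author's data).
sample = "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A"
fields = split_sentence(sample)
print(fields[0], fields[3:7])
```

The whole format fits in a split on commas plus a one-line checksum, which is roughly what a small embedded device could afford.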
I have some data from my car’s GPS (which still sends the info using NMEA these days) in a file. I grep the GPRMC sentences (the ones that have the coordinates) and pipe them into the llm command (using gemini-2.0 with code execution). This would be the command (I shortened it for clarity): (View Highlight)
It looks like it did the right transformation (indeed, checking the data, it’s accurate). In case you are checking the data carefully: the speed attribute looks too high, but it’s a car on a race track, so that’s expected. (View Highlight)
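For reference, a transform of this kind might look like the sketch below. It is not the code the LLM produced, which is not part of these highlights. The output field names, the conversion of speed from knots to km/h, and the sample sentence are all assumptions made for illustration.

```python
import json

KNOTS_TO_KMH = 1.852  # GPRMC reports speed over ground in knots


def to_decimal_degrees(value: str, hemisphere: str) -> float:
    """Convert NMEA (d)ddmm.mmm plus N/S/E/W into signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)
    minutes = raw - degrees * 100
    decimal = degrees + minutes / 60
    return -decimal if hemisphere in ("S", "W") else decimal


def gprmc_to_record(sentence: str) -> dict:
    """Map one GPRMC sentence to a flat dict (hypothetical schema)."""
    fields = sentence.strip().split("*")[0].split(",")
    return {
        "time": fields[1],
        "valid": fields[2] == "A",
        "lat": round(to_decimal_degrees(fields[3], fields[4]), 6),
        "lon": round(to_decimal_degrees(fields[5], fields[6]), 6),
        "speed_kmh": round(float(fields[7]) * KNOTS_TO_KMH, 2),
        "course": float(fields[8]),
        "date": fields[9],
    }


# Textbook sample sentence (an assumption, not the author's data).
lines = ["$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A"]
for line in lines:
    # One JSON object per line, i.e. NDJSON.
    print(json.dumps(gprmc_to_record(line)))
```

Because this is ordinary code, it can be reviewed and fixed like any other function. The checksum is not verified in this sketch.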
But how can we make sure it’s doing it right? I wouldn’t trust the transformed data right away, but I can use what we have been using in software development for years: tests. So let’s ask the LLM to generate not just the transform, but also a test with the backwards transformation: (View Highlight)
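The round-trip idea can be sketched as follows: decode a sentence, encode it back, and assert that the result matches the input. This is a hand-written illustration under stated assumptions, not the LLM's output. The `decode`/`encode` names and the record layout are hypothetical, and the encoder assumes three decimals for minutes and one decimal for speed, course, and magnetic variation, which matches the sample but not necessarily every receiver.

```python
from functools import reduce


def checksum(body: str) -> str:
    """XOR of all body characters as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), body, 0):02X}"


def to_decimal(value: str, hemi: str) -> float:
    """NMEA (d)ddmm.mmm plus hemisphere -> signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)
    decimal = degrees + (raw - degrees * 100) / 60
    return -decimal if hemi in ("S", "W") else decimal


def to_nmea(decimal: float, is_lat: bool) -> tuple[str, str]:
    """Signed decimal degrees -> NMEA (d)ddmm.mmm plus hemisphere."""
    if is_lat:
        hemi = "N" if decimal >= 0 else "S"
    else:
        hemi = "E" if decimal >= 0 else "W"
    degrees = int(abs(decimal))
    minutes = (abs(decimal) - degrees) * 60
    width = 2 if is_lat else 3
    return f"{degrees:0{width}d}{minutes:06.3f}", hemi


def decode(sentence: str) -> dict:
    f = sentence.strip().lstrip("$").split("*")[0].split(",")
    return {
        "time": f[1], "status": f[2],
        "lat": to_decimal(f[3], f[4]), "lon": to_decimal(f[5], f[6]),
        "speed_knots": float(f[7]), "course": float(f[8]), "date": f[9],
        "mag_var": float(f[10]), "mag_dir": f[11],
    }


def encode(r: dict) -> str:
    lat, lat_h = to_nmea(r["lat"], True)
    lon, lon_h = to_nmea(r["lon"], False)
    body = ",".join([
        "GPRMC", r["time"], r["status"], lat, lat_h, lon, lon_h,
        f"{r['speed_knots']:05.1f}", f"{r['course']:05.1f}", r["date"],
        f"{r['mag_var']:05.1f}", r["mag_dir"],
    ])
    return f"${body}*{checksum(body)}"


def self_test(sentences: list[str]) -> None:
    """Fail loudly if the backwards transform does not reproduce the input."""
    for s in sentences:
        assert encode(decode(s)) == s, f"round trip changed {s!r}"


# Textbook sample sentence (an assumption, not the author's data).
sample = "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A"
self_test([sample])
print("round trip ok")
```

A passing round trip does not prove the transform is right (a bug mirrored in both directions would go unnoticed), but it catches dropped fields, unit mix-ups, and sign errors cheaply.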
It fails to run because of the pynmea2 dependency, but if you run it locally it works. Running that self-test gives me some confidence in the transformation function, and I’d trust it enough to put it in a pull request. (View Highlight)
