Building a Scalable ETL Pipeline with AWS Glue (CSV to Parquet + Partitioning)
- Habib Abdurrasyid
- 08 Apr, 2026
Why I Started Looking Into AWS Glue
In my current role, I work with data at a scale I hadn’t really dealt with before: we’re talking terabytes, with more coming in every day. I can’t go into specifics, but the nature of the industry means there’s a lot of it, and keeping that data queryable and manageable is a real engineering challenge.
That exposure pushed me to go deeper into how large-scale data pipelines are actually built. AWS Glue kept coming up as the go-to tool for this kind of work, so I decided to properly sit down with it and build something from scratch: not just read the docs.
I put together a pipeline using a sales dataset from Kaggle (Chocolate Sales Dataset 2023-2024) to recreate the core patterns I see at work in a simpler, shareable context. This post is that walkthrough.
The Architecture
The idea was pretty straightforward: take raw CSV files, process them into a more query-friendly format, and make them accessible through Athena. Here’s how it flows:
graph LR
A[S3 raw CSV] --> B[Glue Crawler]
B --> C[Data Catalog]
C --> D[Glue Job PySpark]
D --> E[S3 Parquet]
E --> F[Athena]
Nothing groundbreaking, but this is pretty much the pattern you’d see in a real data lake setup.
Step 1: Getting the Raw Data Into S3
First thing: upload the CSV files to S3. I kept the folder structure clean from the start since I learned early on that your S3 layout basically is your data model. If it is messy from the beginning, everything downstream gets messy too.
raw/├── sales/├── products/├── stores/├── customers/└── calendar/Each folder holds one CSV file. Simple. One folder, one schema, do not mix files with different structures in the same folder. The Glue Crawler will thank you for it.
Step 2: Let Glue Crawler Figure Out the Schema
Instead of defining everything by hand upfront, I let Glue Crawler scan the S3 folders and auto-detect the schema. Took a few minutes to run, and it registered all the tables into the Data Catalog.
My Crawler config was:
- Data source:
s3://<bucket>/raw/ - Crawl subfolders: yes
- Database:
choco_db - Table prefix:
raw_
After the crawl, I had these tables registered:
raw_salesraw_productsraw_storesraw_customersraw_calendarThis makes the raw data immediately queryable through Athena, which is handy for sanity-checking before you do any transformation.
One thing I noticed: the Crawler sometimes gets the data types wrong, especially for columns that look like numbers but are actually IDs, or for date columns. Worth double-checking before you rely on the inferred schema for anything serious.
Step 3: The Actual ETL: PySpark on Glue
This is the core part. The Glue Job reads from the Data Catalog, does some light transformations, and writes back to S3 in Parquet format with partitioning by year and month.
The code is cleaner than I expected for PySpark:
from awsglue.context import GlueContextfrom pyspark.context import SparkContextfrom pyspark.sql.functions import col, to_date, year, month
sc = SparkContext()glueContext = GlueContext(sc)spark = glueContext.spark_session
# Read from Data Catalogdatasource = glueContext.create_dynamic_frame.from_catalog( database="choco_db", table_name="raw_sales")
df = datasource.toDF()
# Parse dates and extract partition columnsdf = df.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))df = df.withColumn("year", year(col("order_date")))df = df.withColumn("month", month(col("order_date")))
# Write as partitioned Parquetdf.write \ .mode("overwrite") \ .partitionBy("year", "month") \ .parquet("s3://<your-bucket>/processed/sales_parquet/")The .partitionBy("year", "month") line is doing most of the work here. Glue (via Spark under the hood) automatically organizes the output into subfolders based on those column values.
Step 4: What the Output Looks Like in S3
After the job finishes, the folder structure in S3 becomes:
processed/sales_parquet/├── year=2023/│ ├── month=1/│ ├── month=2/│ └── ...├── year=2024/This is what makes partitioning so useful. When you query “give me all sales from January 2023,” Athena does not scan the entire dataset. It only reads the year=2023/month=1/ folder. For a large dataset, this can cut query time and cost a lot.
It is one of those things that seems obvious in hindsight but easy to miss when you are just starting out with data engineering.
Step 5: Manually Defining the Schema in Athena
Even though the Crawler can detect schemas, I prefer defining them manually for the processed tables. It gives you explicit control, avoids surprises with type mismatches, and makes the table easier to document.
CREATE EXTERNAL TABLE processed_sales_parquet ( order_id STRING, order_date DATE, product_id STRING, store_id STRING, customer_id STRING, quantity BIGINT, unit_price DOUBLE, discount DOUBLE, revenue DOUBLE, cost DOUBLE, profit DOUBLE)PARTITIONED BY ( year INT, month INT)STORED AS PARQUETLOCATION 's3://<your-bucket>/processed/sales_parquet/';Then you need to run this to get Athena to recognize the existing partitions:
MSCK REPAIR TABLE processed_sales_parquet;Without that, Athena sees an empty table even though the data is sitting right there in S3.
Step 6: Querying With Athena
Basic count check to confirm everything loaded correctly:
SELECT COUNT(*) FROM processed_sales_parquet;Then a partition-aware query to verify the filtering actually works:
SELECT COUNT(*)FROM processed_sales_parquetWHERE year = 2023 AND month = 1;If those two return different numbers, partitioning is doing its job.
What I Took Away From This
Auto schema detection is helpful, but fragile. The Crawler is great for exploring new data quickly, but I would not rely on it for production schemas. It tends to infer partition columns as strings when they should be integers, which breaks partition pruning.
Your S3 folder structure is your query strategy. I did not fully appreciate this until I saw how directly it affected query performance. Think about how you will query the data before you decide how to partition it.
Parquet is a big deal. I knew Parquet was better than CSV for analytics, but seeing it in practice, faster queries and smaller storage footprint, made it feel more real. If you are building anything analytics-related, there is really no reason to stay on CSV.
Glue Jobs are expensive if not configured right. DPUs add up quickly. For this small project it was fine, but in a real workload you would want to tune the number of workers and consider whether to use G.1X or G.2X workers depending on the job.
What’s Next
This is just the foundation. The interesting part comes next: joining all five tables together: sales, products, customers, stores, calendar: to build a proper enriched analytics dataset. That’s where you start getting into the models that can actually power a dashboard or feed into a machine learning pipeline.
I’ll write that up in the next post.
Repository
Add your GitHub repository link here