Type something to search...

Building a Scalable ETL Pipeline with AWS Glue (CSV to Parquet + Partitioning)

AWS Glue ETL Data Engineering

Why I Started Looking Into AWS Glue

In my current role, I work with data at a scale I hadn’t really dealt with before: we’re talking terabytes, with more coming in every day. I can’t go into specifics, but the nature of the industry means there’s a lot of it, and keeping that data queryable and manageable is a real engineering challenge.

That exposure pushed me to go deeper into how large-scale data pipelines are actually built. AWS Glue kept coming up as the go-to tool for this kind of work, so I decided to properly sit down with it and build something from scratch: not just read the docs.

I put together a pipeline using a sales dataset from Kaggle (Chocolate Sales Dataset 2023-2024) to recreate the core patterns I see at work in a simpler, shareable context. This post is that walkthrough.


The Architecture

The idea was pretty straightforward: take raw CSV files, process them into a more query-friendly format, and make them accessible through Athena. Here’s how it flows:

graph LR
    A[S3 raw CSV] --> B[Glue Crawler]
    B --> C[Data Catalog]
    C --> D[Glue Job PySpark]
    D --> E[S3 Parquet]
    E --> F[Athena]

Nothing groundbreaking, but this is pretty much the pattern you’d see in a real data lake setup.


Step 1: Getting the Raw Data Into S3

First thing: upload the CSV files to S3. I kept the folder structure clean from the start since I learned early on that your S3 layout basically is your data model. If it is messy from the beginning, everything downstream gets messy too.

raw/
├── sales/
├── products/
├── stores/
├── customers/
└── calendar/

Each folder holds one CSV file. Simple. One folder, one schema, do not mix files with different structures in the same folder. The Glue Crawler will thank you for it.


Step 2: Let Glue Crawler Figure Out the Schema

Instead of defining everything by hand upfront, I let Glue Crawler scan the S3 folders and auto-detect the schema. Took a few minutes to run, and it registered all the tables into the Data Catalog.

My Crawler config was:

  • Data source: s3://<bucket>/raw/
  • Crawl subfolders: yes
  • Database: choco_db
  • Table prefix: raw_

After the crawl, I had these tables registered:

raw_sales
raw_products
raw_stores
raw_customers
raw_calendar

This makes the raw data immediately queryable through Athena, which is handy for sanity-checking before you do any transformation.

One thing I noticed: the Crawler sometimes gets the data types wrong, especially for columns that look like numbers but are actually IDs, or for date columns. Worth double-checking before you rely on the inferred schema for anything serious.


Step 3: The Actual ETL: PySpark on Glue

This is the core part. The Glue Job reads from the Data Catalog, does some light transformations, and writes back to S3 in Parquet format with partitioning by year and month.

The code is cleaner than I expected for PySpark:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_date, year, month
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Read from Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
database="choco_db",
table_name="raw_sales"
)
df = datasource.toDF()
# Parse dates and extract partition columns
df = df.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
df = df.withColumn("year", year(col("order_date")))
df = df.withColumn("month", month(col("order_date")))
# Write as partitioned Parquet
df.write \
.mode("overwrite") \
.partitionBy("year", "month") \
.parquet("s3://<your-bucket>/processed/sales_parquet/")

The .partitionBy("year", "month") line is doing most of the work here. Glue (via Spark under the hood) automatically organizes the output into subfolders based on those column values.


Step 4: What the Output Looks Like in S3

After the job finishes, the folder structure in S3 becomes:

processed/sales_parquet/
├── year=2023/
│ ├── month=1/
│ ├── month=2/
│ └── ...
├── year=2024/

This is what makes partitioning so useful. When you query “give me all sales from January 2023,” Athena does not scan the entire dataset. It only reads the year=2023/month=1/ folder. For a large dataset, this can cut query time and cost a lot.

It is one of those things that seems obvious in hindsight but easy to miss when you are just starting out with data engineering.


Step 5: Manually Defining the Schema in Athena

Even though the Crawler can detect schemas, I prefer defining them manually for the processed tables. It gives you explicit control, avoids surprises with type mismatches, and makes the table easier to document.

CREATE EXTERNAL TABLE processed_sales_parquet (
order_id STRING,
order_date DATE,
product_id STRING,
store_id STRING,
customer_id STRING,
quantity BIGINT,
unit_price DOUBLE,
discount DOUBLE,
revenue DOUBLE,
cost DOUBLE,
profit DOUBLE
)
PARTITIONED BY (
year INT,
month INT
)
STORED AS PARQUET
LOCATION 's3://<your-bucket>/processed/sales_parquet/';

Then you need to run this to get Athena to recognize the existing partitions:

MSCK REPAIR TABLE processed_sales_parquet;

Without that, Athena sees an empty table even though the data is sitting right there in S3.


Step 6: Querying With Athena

Basic count check to confirm everything loaded correctly:

SELECT COUNT(*) FROM processed_sales_parquet;

Then a partition-aware query to verify the filtering actually works:

SELECT COUNT(*)
FROM processed_sales_parquet
WHERE year = 2023 AND month = 1;

If those two return different numbers, partitioning is doing its job.


What I Took Away From This

Auto schema detection is helpful, but fragile. The Crawler is great for exploring new data quickly, but I would not rely on it for production schemas. It tends to infer partition columns as strings when they should be integers, which breaks partition pruning.

Your S3 folder structure is your query strategy. I did not fully appreciate this until I saw how directly it affected query performance. Think about how you will query the data before you decide how to partition it.

Parquet is a big deal. I knew Parquet was better than CSV for analytics, but seeing it in practice, faster queries and smaller storage footprint, made it feel more real. If you are building anything analytics-related, there is really no reason to stay on CSV.

Glue Jobs are expensive if not configured right. DPUs add up quickly. For this small project it was fine, but in a real workload you would want to tune the number of workers and consider whether to use G.1X or G.2X workers depending on the job.


What’s Next

This is just the foundation. The interesting part comes next: joining all five tables together: sales, products, customers, stores, calendar: to build a proper enriched analytics dataset. That’s where you start getting into the models that can actually power a dashboard or feed into a machine learning pipeline.

I’ll write that up in the next post.


Repository

Add your GitHub repository link here

Share:

Related Posts

Article

Creating Learner Progress Monitoring Using Python, Pandas and Streamlit

Python Pandas Streamlit

How I built an automated learner progress monitoring dashboard using Python, Pandas, and Streamlit to track assessment scores and participation across cohorts in Canvas LMS.

4 min read 04 Apr, 2022
Read