Data Pipeline

Build a CSV data processing pipeline that reads sales data, transforms it with chained collection operations, and prints a summary report. This is the exercise where Kotlin’s collection operators start to feel like a small query language: map/filter/groupBy chains replace the array.reduce calls or hand-written for loops you’d reach for in TypeScript or Go.

  • Chaining collection operations: map, filter, groupBy, sortedBy, sumOf
  • Sequences for efficient large-data processing
  • Scope functions (let, also, run)
  • Destructuring in loops and lambdas
  • buildList / buildMap for constructing results

The program reads CSV rows with the columns date,region,product,quantity,unit_price, parses them, and prints a report with five sections (steps 2 to 6 below).

  1. Parse each CSV line into a Sale data class (skip the header, ignore malformed rows).
  2. Print the totals: total revenue, total order count, average order value.
  3. Break revenue down by region, sorted by revenue descending.
  4. Rank the top products by revenue, with their total units.
  5. Show the daily trend — revenue per date, in date order.
  6. List high-value orders (revenue over $100).
date,region,product,quantity,unit_price
2024-01-15,North,Widget,10,9.99
2024-01-15,South,Gadget,5,24.99
2024-01-16,North,Gadget,3,24.99
2024-01-16,East,Widget,20,9.99
2024-01-17,South,Widget,15,9.99
2024-01-17,North,Doohickey,8,4.99
2024-01-18,East,Gadget,12,24.99
2024-01-18,West,Widget,7,9.99
2024-01-19,North,Widget,25,9.99
2024-01-19,South,Doohickey,30,4.99
=== Sales Report ===
Total Revenue: $1,458.65
Total Orders: 10
Average Order Value: $145.87
--- Revenue by Region ---
East: $499.68 (2 orders)
North: $464.54 (4 orders)
South: $424.50 (3 orders)
West: $69.93 (1 orders)
--- Top Products ---
1. Widget - $769.23 (77 units)
2. Gadget - $499.80 (20 units)
3. Doohickey - $189.62 (38 units)
--- Daily Trend ---
2024-01-15: $224.85
2024-01-16: $274.77
2024-01-17: $189.77
2024-01-18: $369.81
2024-01-19: $399.45
--- High Value Orders (> $100) ---
2024-01-18, East: 12x Gadget = $299.88
2024-01-19, North: 25x Widget = $249.75
2024-01-16, East: 20x Widget = $199.80
2024-01-17, South: 15x Widget = $149.85
2024-01-19, South: 30x Doohickey = $149.70
2024-01-15, South: 5x Gadget = $124.95

A single-module Gradle project with one Kotlin file. No serialization library this time — the data is plain CSV, parsed by hand.

  • data-pipeline/
    • build.gradle.kts deps + build config
    • settings.gradle.kts project name
    • src/
      • main/
        • kotlin/com/example/datapipeline/
          • Main.kt the whole pipeline
        • resources/
          • sales.csv sample data for the stretch goal

The simplest possible build: the Kotlin JVM plugin to compile and the application plugin so ./gradlew run works. The only dependency is kotlin("test") for the test source set; there’s no runtime dependency because the parsing is all standard-library String operations.

build.gradle.kts
plugins {
    kotlin("jvm") version "2.1.0"
    application
}

group = "com.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

dependencies {
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

application {
    mainClass.set("com.example.datapipeline.MainKt")
}
settings.gradle.kts
rootProject.name = "data-pipeline"

Three small data classes model the domain. Sale is the parsed row; the two summary types hold the aggregated results. The interesting line is the computed property revenue — it’s not a stored field, it’s recalculated on every access (get()), the Kotlin equivalent of a TypeScript getter.

src/main/kotlin/com/example/datapipeline/Main.kt
package com.example.datapipeline

data class Sale(
    val date: String,
    val region: String,
    val product: String,
    val quantity: Int,
    val unitPrice: Double
) {
    val revenue: Double get() = quantity * unitPrice
}

data class RegionSummary(
    val region: String,
    val totalRevenue: Double,
    val orderCount: Int
)

data class ProductSummary(
    val product: String,
    val totalRevenue: Double,
    val totalUnits: Int
)
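
To see what "computed, not stored" means in practice, here is a minimal sketch using a hypothetical LineItem class (not part of the project) with the same kind of getter. Because revenue is declared in the class body rather than in the primary constructor, it plays no part in the generated toString, equals, or copy.

```kotlin
// Hypothetical stand-in for Sale, trimmed to the two fields the getter needs.
data class LineItem(val quantity: Int, val unitPrice: Double) {
    // No backing field: evaluated each time the property is read.
    val revenue: Double get() = quantity * unitPrice
}

fun main() {
    val item = LineItem(quantity = 3, unitPrice = 2.5)
    println(item.revenue)                    // 7.5
    println(item)                            // LineItem(quantity=3, unitPrice=2.5)
    println(item.copy(quantity = 4).revenue) // 10.0
}
```

The copy call shows the practical upside: there is no stale revenue to keep in sync when a field changes.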

Two functions, both built from collection operators. parseCsvLine returns Sale? — a nullable — and uses the Elvis operator ?: return null so a bad number short-circuits the whole row to null. parseCsv then chains the cleanup: drop(1) skips the header, filter removes blank lines, and mapNotNull parses each line and drops any that came back null in one step. That mapNotNull is the idiom to reach for whenever a TS dev would write .map(...).filter(Boolean).

src/main/kotlin/com/example/datapipeline/Main.kt
fun parseCsvLine(line: String): Sale? {
    val parts = line.split(",")
    if (parts.size != 5) return null
    return Sale(
        date = parts[0].trim(),
        region = parts[1].trim(),
        product = parts[2].trim(),
        quantity = parts[3].trim().toIntOrNull() ?: return null,
        unitPrice = parts[4].trim().toDoubleOrNull() ?: return null
    )
}

fun parseCsv(csv: String): List<Sale> {
    return csv.lines()
        .drop(1) // skip header
        .filter { it.isNotBlank() }
        .mapNotNull { parseCsvLine(it) }
}
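
For larger inputs, the same chain can run lazily over a Sequence. The sketch below is one possible variant, not the project's code: Row, parseRow, and parseRowsLazily are hypothetical names, and the record is cut down to two columns to keep it short. It swaps lines() for lineSequence(), so each line passes through drop, filter, and mapNotNull one at a time, and nothing is evaluated until the terminal toList().

```kotlin
// Hypothetical two-column record, used only to keep the sketch short.
data class Row(val product: String, val quantity: Int)

// Same Elvis early-return idiom as parseCsvLine: a bad number nulls the whole row.
fun parseRow(line: String): Row? {
    val parts = line.split(",")
    if (parts.size != 2) return null
    return Row(
        product = parts[0].trim(),
        quantity = parts[1].trim().toIntOrNull() ?: return null
    )
}

// Lazy variant: no intermediate lists are built between the steps.
fun parseRowsLazily(csv: String): List<Row> =
    csv.lineSequence()
        .drop(1)                      // skip header
        .filter { it.isNotBlank() }
        .mapNotNull { parseRow(it) }  // parse and drop nulls in one step
        .toList()                     // terminal operation triggers the work

fun main() {
    val csv = "product,quantity\nWidget,10\n\nGadget,abc\nDoohickey,8"
    println(parseRowsLazily(csv))
    // [Row(product=Widget, quantity=10), Row(product=Doohickey, quantity=8)]
}
```

For the ten-row sample the eager List version is perfectly adequate; the lazy form mainly pays off when the input is large or read incrementally from a file.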

This is the heart of the exercise — five independent collection pipelines, each one reading top-to-bottom like a description of what you want, not how to loop.

src/main/kotlin/com/example/datapipeline/Main.kt
fun generateReport(sales: List<Sale>) {
    val totalRevenue = sales.sumOf { it.revenue }
    val totalOrders = sales.size
    val avgOrderValue = if (totalOrders > 0) totalRevenue / totalOrders else 0.0

    println("=== Sales Report ===")
    println()
    println("Total Revenue: $${"%,.2f".format(totalRevenue)}")
    println("Total Orders: $totalOrders")
    println("Average Order Value: $${"%,.2f".format(avgOrderValue)}")

    // Revenue by region
    println()
    println("--- Revenue by Region ---")
    sales
        .groupBy { it.region }
        .map { (region, regionSales) ->
            RegionSummary(
                region = region,
                totalRevenue = regionSales.sumOf { it.revenue },
                orderCount = regionSales.size
            )
        }
        .sortedByDescending { it.totalRevenue }
        .forEach { (region, revenue, count) ->
            println(" %-6s: $%,.2f (%d orders)".format(region, revenue, count))
        }

    // Top products
    println()
    println("--- Top Products ---")
    sales
        .groupBy { it.product }
        .map { (product, productSales) ->
            ProductSummary(
                product = product,
                totalRevenue = productSales.sumOf { it.revenue },
                totalUnits = productSales.sumOf { it.quantity }
            )
        }
        .sortedByDescending { it.totalRevenue }
        .forEachIndexed { index, (product, revenue, units) ->
            println(" ${index + 1}. %-10s - $%,.2f (%d units)".format(product, revenue, units))
        }

    // Daily trend
    println()
    println("--- Daily Trend ---")
    sales
        .groupBy { it.date }
        .mapValues { (_, daySales) -> daySales.sumOf { it.revenue } }
        .toSortedMap()
        .forEach { (date, revenue) ->
            println(" $date: $${"%,.2f".format(revenue)}")
        }

    // High value orders
    println()
    println("--- High Value Orders (> \$100) ---")
    sales
        .filter { it.revenue > 100.0 }
        .sortedByDescending { it.revenue }
        .forEach { sale ->
            println(" ${sale.date}, ${sale.region}: ${sale.quantity}x ${sale.product} = $${"%,.2f".format(sale.revenue)}")
        }
}
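
The two shapes that recur in generateReport (group, aggregate, sort; and group, mapValues, toSortedMap) can be tried in isolation. This is a small sketch with made-up data: Order, totalsByRegion, and totalsByDate are hypothetical names, and amounts are whole numbers so the printed results are easy to check by hand.

```kotlin
// Hypothetical minimal record for the demo.
data class Order(val date: String, val region: String, val amount: Int)

// group -> aggregate -> sort: the same shape as the region breakdown.
fun totalsByRegion(orders: List<Order>): List<Pair<String, Int>> =
    orders
        .groupBy { it.region }                                        // Map<String, List<Order>>
        .map { (region, group) -> region to group.sumOf { it.amount } }
        .sortedByDescending { (_, total) -> total }

// group -> mapValues -> toSortedMap: the same shape as the daily trend.
fun totalsByDate(orders: List<Order>): Map<String, Int> =
    orders
        .groupBy { it.date }
        .mapValues { (_, group) -> group.sumOf { it.amount } }
        .toSortedMap()                                                // keys in ascending order

fun main() {
    val orders = listOf(
        Order("2024-01-02", "North", 30),
        Order("2024-01-01", "South", 20),
        Order("2024-01-01", "North", 50)
    )
    println(totalsByRegion(orders)) // [(North, 80), (South, 20)]
    println(totalsByDate(orders))   // {2024-01-01=70, 2024-01-02=30}
}
```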

A few things to notice if you’re coming from TS/Go:

  • sumOf { it.revenue } adds up a projection in one call — no accumulator variable, no reduce with a seed. There are typed overloads (sumOf over Int vs Double), so the result type matches what your lambda returns.
  • groupBy { it.region } returns a Map<String, List<Sale>>. From there .map transforms each entry into a summary object, exactly the group-then-aggregate move you’d hand-roll with a map[string][]Sale in Go.
  • Destructuring in the lambda: .forEach { (region, revenue, count) -> … } unpacks a RegionSummary straight into three named parameters because it’s a data class. .map { (region, regionSales) -> … } does the same for a Map.Entry.
  • mapValues rewrites only the values of a map, leaving the keys; toSortedMap reorders by key so the daily trend comes out in date order for free (ISO dates sort lexicographically).
  • forEachIndexed hands you the position alongside the element — that’s where the 1., 2., 3. ranking comes from.
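  • The format strings (%-6s, %,.2f, %d) are Java’s String.format reached via the .format extension: %-6s left-justifies the text in a six-character column (the padding goes on the right), %,.2f adds thousands separators and fixes two decimals, and %d prints an integer. The separators follow the JVM’s default locale.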
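
The format flags are easier to read with the padding made visible. The sketch below also passes an explicit Locale, since the grouping and decimal separators otherwise follow the JVM default; money is a hypothetical helper, not something the project defines.

```kotlin
import java.util.Locale

// Hypothetical helper: pinning the locale keeps separators stable across machines.
fun money(amount: Double, locale: Locale = Locale.US): String =
    "%,.2f".format(locale, amount)

fun main() {
    println("[" + "%-6s".format("East") + "]") // [East  ]  '-' pads on the right
    println("[" + "%6s".format("East") + "]")  // [  East]  default pads on the left
    println(money(1458.65))                    // 1,458.65
    println(money(1458.65, Locale.GERMANY))    // 1.458,65
}
```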

main holds the sample CSV in a trimIndent-ed raw string, parses it, and reports. The one idiom worth calling out is .also { … }: it runs a side effect (the “Parsed N records” log) and returns the original value untouched, so it slots into the middle of an expression without breaking the chain.

src/main/kotlin/com/example/datapipeline/Main.kt
fun main() {
    val csvData = """
        date,region,product,quantity,unit_price
        2024-01-15,North,Widget,10,9.99
        2024-01-15,South,Gadget,5,24.99
        2024-01-16,North,Gadget,3,24.99
        2024-01-16,East,Widget,20,9.99
        2024-01-17,South,Widget,15,9.99
        2024-01-17,North,Doohickey,8,4.99
        2024-01-18,East,Gadget,12,24.99
        2024-01-18,West,Widget,7,9.99
        2024-01-19,North,Widget,25,9.99
        2024-01-19,South,Doohickey,30,4.99
    """.trimIndent()

    val sales = parseCsv(csvData)
        .also { println("Parsed ${it.size} sales records\n") }

    generateReport(sales)
}
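
For comparison, here is a rough sketch of also next to the two other scope functions this exercise lists, let and run. The helper names are hypothetical, and the log is a plain list rather than println so the side effect can be inspected.

```kotlin
// also: run a side effect and hand back the receiver unchanged.
fun sortedWithLog(input: List<Int>, log: MutableList<String>): List<Int> =
    input
        .also { log += "Got ${it.size} numbers" }
        .sorted()

// let: transform the receiver; the lambda's result is what comes out.
// Combined with ?. it only runs when the value is non-null.
fun maxLabel(numbers: List<Int>): String =
    numbers.maxOrNull()?.let { "max=$it" } ?: "empty"

// run: like let, but the receiver is `this` inside the block.
fun summary(numbers: List<Int>): String =
    numbers.run { "count=$size, sum=${sum()}" }

fun main() {
    val log = mutableListOf<String>()
    val numbers = sortedWithLog(listOf(3, 1, 2), log)
    println(log)                   // [Got 3 numbers]
    println(numbers)               // [1, 2, 3]
    println(maxLabel(numbers))     // max=3
    println(maxLabel(emptyList())) // empty
    println(summary(numbers))      // count=3, sum=6
}
```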
  1. Build the project:

    Terminal window
    ./gradlew build
  2. Run it — the sample data is embedded, so no input needed:

    Terminal window
    ./gradlew run