Data Pipeline

Goal

Build a CSV data processing pipeline that reads sales data, transforms it with chained collection operations, and prints a summary report. This is the exercise where Kotlin’s collection operators start to feel like a small query language: the map/filter/groupBy chaining covers the jobs where you’d reach for array.reduce in TypeScript or a for loop in Go.

What you’ll practice

Chaining collection operations: map, filter, groupBy, sortedBy, sumOf
Sequences for efficient large-data processing
Scope functions (let, also, run)
Destructuring in loops and lambdas
buildList / buildMap for constructing results

Requirements

The program reads CSV rows with the columns date,region,product,quantity,unit_price and produces a report with five sections.

Parse each CSV line into a Sale data class (skip the header, ignore malformed rows).
Print the totals: total revenue, total order count, average order value.
Break revenue down by region, sorted by revenue descending.
Rank the top products by revenue, with their total units.
Show the daily trend — revenue per date, in date order.
List high-value orders (revenue over $100).

Example input

date,region,product,quantity,unit_price
2024-01-15,North,Widget,10,9.99
2024-01-15,South,Gadget,5,24.99
2024-01-16,North,Gadget,3,24.99
2024-01-16,East,Widget,20,9.99
2024-01-17,South,Widget,15,9.99
2024-01-17,North,Doohickey,8,4.99
2024-01-18,East,Gadget,12,24.99
2024-01-18,West,Widget,7,9.99
2024-01-19,North,Widget,25,9.99
2024-01-19,South,Doohickey,30,4.99

Expected output

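=== Sales Report ===

Total Revenue: $1,458.65
Total Orders: 10
Average Order Value: $145.87

--- Revenue by Region ---
  East  : $499.68 (2 orders)
  North : $464.54 (4 orders)
  South : $424.50 (3 orders)
  West  : $69.93 (1 orders)

--- Top Products ---
  1. Widget     - $769.23 (77 units)
  2. Gadget     - $499.80 (20 units)
  3. Doohickey  - $189.62 (38 units)

--- Daily Trend ---
  2024-01-15: $224.85
  2024-01-16: $274.77
  2024-01-17: $189.77
  2024-01-18: $369.81
  2024-01-19: $399.45

--- High Value Orders (> $100) ---
  2024-01-18, East: 12x Gadget = $299.88
  2024-01-19, North: 25x Widget = $249.75
  2024-01-16, East: 20x Widget = $199.80
  2024-01-17, South: 15x Widget = $149.85
  2024-01-19, South: 30x Doohickey = $149.70
  2024-01-15, South: 5x Gadget = $124.95

Every figure follows from the sample input, where an order’s revenue is quantity × unit_price. Regions and high-value orders are listed by descending revenue, as produced by the sortedByDescending calls in the worked solution. The average is 1,458.65 / 10 = 145.865, shown rounded to cents; since the solution accumulates Double values, a tie like this one can land on either neighbouring cent.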

The worked solution

A single-module Gradle project with one Kotlin file. No serialization library this time — the data is plain CSV, parsed by hand.

data-pipeline/
- build.gradle.kts (deps + build config)
- settings.gradle.kts (project name)
- src/
  - main/
    - kotlin/com/example/datapipeline/
      - Main.kt (the whole pipeline)
    - resources/
      - sales.csv (sample data for the stretch goal)

build.gradle.kts

The simplest possible build: the Kotlin JVM plugin to compile, and the application plugin so ./gradlew run works. The only declared dependency is kotlin("test") for tests; there’s no runtime dependency because the parsing is all standard-library String operations.

plugins {
    kotlin("jvm") version "2.1.0"
    application
}

group = "com.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

dependencies {
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

application {
    mainClass.set("com.example.datapipeline.MainKt")
}

settings.gradle.kts

rootProject.name = "data-pipeline"

The data classes

Three small data classes model the domain. Sale is the parsed row; the two summary types hold the aggregated results. The interesting line is the computed property revenue — it’s not a stored field, it’s recalculated on every access (get()), the Kotlin equivalent of a TypeScript getter.

package com.example.datapipeline

data class Sale(
    val date: String,
    val region: String,
    val product: String,
    val quantity: Int,
    val unitPrice: Double
) {
    val revenue: Double get() = quantity * unitPrice
}

data class RegionSummary(
    val region: String,
    val totalRevenue: Double,
    val orderCount: Int
)

data class ProductSummary(
    val product: String,
    val totalRevenue: Double,
    val totalUnits: Int
)
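
To see what "not a stored field" means in practice, here is a minimal sketch. It uses a cut-down, hypothetical LineItem class rather than the full Sale, so it can be run on its own. Because revenue is declared in the class body and not in the primary constructor, the generated toString and copy work only from quantity and unitPrice, and revenue is recomputed from them on each read.

```kotlin
// LineItem is a hypothetical, cut-down stand-in for Sale, used only for this illustration.
data class LineItem(val quantity: Int, val unitPrice: Double) {
    // No backing field: the getter body runs on every read.
    val revenue: Double get() = quantity * unitPrice
}

fun main() {
    val item = LineItem(quantity = 4, unitPrice = 2.5)
    println(item.revenue)   // 10.0

    // toString only lists primary-constructor properties, so revenue is absent.
    println(item)           // LineItem(quantity=4, unitPrice=2.5)

    // copy changes a constructor property; revenue follows because it is derived.
    val bigger = item.copy(quantity = 8)
    println(bigger.revenue) // 20.0
}
```

The same holds for equals and hashCode: two Sale values are equal when their five constructor properties match, and revenue never needs to be compared separately.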

Parsing the CSV

Two functions, both built from collection operators. parseCsvLine returns Sale? — a nullable — and uses the Elvis operator ?: return null so a bad number short-circuits the whole row to null. parseCsv then chains the cleanup: drop(1) skips the header, filter removes blank lines, and mapNotNull parses each line and drops any that came back null in one step. That mapNotNull is the idiom to reach for whenever a TS dev would write .map(...).filter(Boolean).

fun parseCsvLine(line: String): Sale? {
    val parts = line.split(",")
    if (parts.size != 5) return null
    return Sale(
        date = parts[0].trim(),
        region = parts[1].trim(),
        product = parts[2].trim(),
        quantity = parts[3].trim().toIntOrNull() ?: return null,
        unitPrice = parts[4].trim().toDoubleOrNull() ?: return null
    )
}

fun parseCsv(csv: String): List<Sale> {
    return csv.lines()
        .drop(1) // skip header
        .filter { it.isNotBlank() }
        .mapNotNull { parseCsvLine(it) }
}
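
The practice list mentions sequences, and parseCsv above builds a new intermediate list at each step (lines, drop, filter, mapNotNull). As a sketch of the lazy alternative, the variant below swaps lines() for lineSequence() so each line flows through the whole chain one at a time. The name parseCsvLazily is hypothetical and not part of the worked solution; Sale and parseCsvLine are repeated only so the block compiles standalone.

```kotlin
// Sale and parseCsvLine repeat the definitions above so this compiles on its own.
data class Sale(
    val date: String,
    val region: String,
    val product: String,
    val quantity: Int,
    val unitPrice: Double
)

fun parseCsvLine(line: String): Sale? {
    val parts = line.split(",")
    if (parts.size != 5) return null
    return Sale(
        date = parts[0].trim(),
        region = parts[1].trim(),
        product = parts[2].trim(),
        quantity = parts[3].trim().toIntOrNull() ?: return null,
        unitPrice = parts[4].trim().toDoubleOrNull() ?: return null
    )
}

// Hypothetical lazy variant of parseCsv: nothing is evaluated until toList(),
// and only that final call allocates a list.
fun parseCsvLazily(csv: String): List<Sale> =
    csv.lineSequence()
        .drop(1) // skip header
        .filter { it.isNotBlank() }
        .mapNotNull { parseCsvLine(it) }
        .toList()

fun main() {
    val csv = """
        date,region,product,quantity,unit_price
        2024-01-15,North,Widget,10,9.99

        2024-01-15,South,Gadget,five,24.99
        2024-01-16,North,Gadget,3
    """.trimIndent()

    // The blank line, the non-numeric quantity, and the short row are all dropped.
    println(parseCsvLazily(csv).size) // 1
}
```

For ten rows the difference is not measurable; the lazy form starts to matter when the input is large enough that the intermediate lists are worth avoiding.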

Generating the report

This is the heart of the exercise — five independent collection pipelines, each one reading top-to-bottom like a description of what you want, not how to loop.

fun generateReport(sales: List<Sale>) {
    val totalRevenue = sales.sumOf { it.revenue }
    val totalOrders = sales.size
    val avgOrderValue = if (totalOrders > 0) totalRevenue / totalOrders else 0.0

    println("=== Sales Report ===")
    println()
    println("Total Revenue: $${"%,.2f".format(totalRevenue)}")
    println("Total Orders: $totalOrders")
    println("Average Order Value: $${"%,.2f".format(avgOrderValue)}")

    // Revenue by region
    println()
    println("--- Revenue by Region ---")
    sales
        .groupBy { it.region }
        .map { (region, regionSales) ->
            RegionSummary(
                region = region,
                totalRevenue = regionSales.sumOf { it.revenue },
                orderCount = regionSales.size
            )
        }
        .sortedByDescending { it.totalRevenue }
        .forEach { (region, revenue, count) ->
            println("  %-6s: $%,.2f (%d orders)".format(region, revenue, count))
        }

    // Top products
    println()
    println("--- Top Products ---")
    sales
        .groupBy { it.product }
        .map { (product, productSales) ->
            ProductSummary(
                product = product,
                totalRevenue = productSales.sumOf { it.revenue },
                totalUnits = productSales.sumOf { it.quantity }
            )
        }
        .sortedByDescending { it.totalRevenue }
        .forEachIndexed { index, (product, revenue, units) ->
            println("  ${index + 1}. %-10s - $%,.2f (%d units)".format(product, revenue, units))
        }

    // Daily trend
    println()
    println("--- Daily Trend ---")
    sales
        .groupBy { it.date }
        .mapValues { (_, daySales) -> daySales.sumOf { it.revenue } }
        .toSortedMap()
        .forEach { (date, revenue) ->
            println("  $date: $${"%,.2f".format(revenue)}")
        }

    // High value orders
    println()
    println("--- High Value Orders (> \$100) ---")
    sales
        .filter { it.revenue > 100.0 }
        .sortedByDescending { it.revenue }
        .forEach { sale ->
            println("  ${sale.date}, ${sale.region}: ${sale.quantity}x ${sale.product} = $${"%,.2f".format(sale.revenue)}")
        }
}
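
The group-then-aggregate shape appears three times above. To make the intermediate values visible, here is a small sketch that runs the same steps on made-up (day, amount) pairs with integer amounts, printing the map after each stage. The data is illustrative only and is not the sample sales data.

```kotlin
fun main() {
    // Illustrative (day, amount) pairs, not the sample sales data.
    val orders = listOf("tue" to 5, "mon" to 2, "tue" to 1, "mon" to 3)

    // Step 1: groupBy keeps whole elements, bucketed by key, in first-seen key order.
    val byDay: Map<String, List<Pair<String, Int>>> = orders.groupBy { it.first }
    println(byDay) // {tue=[(tue, 5), (tue, 1)], mon=[(mon, 2), (mon, 3)]}

    // Step 2: mapValues collapses each bucket to one number and leaves the keys alone.
    val totals: Map<String, Int> = byDay.mapValues { (_, dayOrders) -> dayOrders.sumOf { it.second } }
    println(totals) // {tue=6, mon=5}

    // Step 3: toSortedMap reorders the entries by key.
    println(totals.toSortedMap()) // {mon=5, tue=6}
}
```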

A few things to notice if you’re coming from TS/Go:

sumOf { it.revenue } adds up a projection in one call — no accumulator variable, no reduce with a seed. There are typed overloads (sumOf over Int vs Double), so the result type matches what your lambda returns.
groupBy { it.region } returns a Map<String, List<Sale>>. From there .map transforms each entry into a summary object, exactly the group-then-aggregate move you’d hand-roll with a map[string][]Sale in Go.
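Destructuring in the lambda: .forEach { (region, revenue, count) -> … } unpacks a RegionSummary straight into three named parameters because it’s a data class. The unpacking is positional (component1(), component2(), component3() in constructor order), not name-based as in TypeScript, so revenue here receives totalRevenue and count receives orderCount. .map { (region, regionSales) -> … } does the same for a Map.Entry.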
mapValues rewrites only the values of a map, leaving the keys; toSortedMap reorders by key so the daily trend comes out in date order for free (ISO dates sort lexicographically).
forEachIndexed hands you the position alongside the element — that’s where the 1., 2., 3. ranking comes from.
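The format strings (%-6s, %,.2f, %d) are Java’s String.format reached via the .format extension: %-6s left-justifies in a six-character field (padding on the right), %,.2f adds thousands separators and fixes two decimals, and together they give the aligned columns. Without an explicit Locale argument, .format uses the JVM’s default locale, so the separator characters can differ on machines set to another locale.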
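
Two of the points above are easy to trip over, so here is a small sketch of both: destructuring by position, and what the format specifiers produce. Passing Locale.US to pin the separators is an assumption of this sketch, not something the worked solution does.

```kotlin
import java.util.Locale

// Same shape as the RegionSummary above, repeated so the block compiles on its own.
data class RegionSummary(val region: String, val totalRevenue: Double, val orderCount: Int)

fun main() {
    val summary = RegionSummary("East", 499.68, 2)

    // Positional: the names on the left are arbitrary, only the order matters.
    val (name, money, orders) = summary
    println("$name $money $orders") // East 499.68 2

    // %-6s left-justifies in a 6-character field, padding on the right.
    println("[%-6s]".format(name)) // [East  ]

    // Pinning the locale keeps ',' as the grouping separator and '.' as the decimal point.
    println("%,.2f".format(Locale.US, 1458.65)) // 1,458.65
}
```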

Wiring it together

main holds the sample CSV in a trimIndent-ed raw string, parses it, and reports. The one idiom worth calling out is .also { … }: it runs a side effect (the “Parsed N records” log) and returns the original value untouched, so it slots into the middle of an expression without breaking the chain. Because the log runs before generateReport, the program prints “Parsed 10 sales records” and a blank line ahead of the report itself.

fun main() {
    val csvData = """
        date,region,product,quantity,unit_price
        2024-01-15,North,Widget,10,9.99
        2024-01-15,South,Gadget,5,24.99
        2024-01-16,North,Gadget,3,24.99
        2024-01-16,East,Widget,20,9.99
        2024-01-17,South,Widget,15,9.99
        2024-01-17,North,Doohickey,8,4.99
        2024-01-18,East,Gadget,12,24.99
        2024-01-18,West,Widget,7,9.99
        2024-01-19,North,Widget,25,9.99
        2024-01-19,South,Doohickey,30,4.99
    """.trimIndent()

    val sales = parseCsv(csvData)
        .also { println("Parsed ${it.size} sales records\n") }

    generateReport(sales)
}
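
To make the also behaviour concrete, and to contrast it with let from the practice list, here is a minimal sketch on a plain list of numbers (the values are made up for illustration). also returns its receiver, so the chain continues with the same list; let returns whatever the lambda returns.

```kotlin
fun main() {
    val revenues = listOf(99.9, 124.95, 74.97) // illustrative values only

    // also: run a side effect, then hand back the same list.
    val same = revenues.also { println("Parsed ${it.size} records") }
    println(same === revenues) // true

    // let: transform the receiver; the result is the lambda's value.
    val count = revenues.let { it.size }
    println(count) // 3

    // let with ?. is the usual way to run a block only when a value is non-null.
    val firstLarge: Double? = revenues.firstOrNull { it > 100.0 }
    firstLarge?.let { println("First order over 100: $it") } // First order over 100: 124.95
}
```

A rough rule of thumb: reach for also when the step is a log or a check, and for let when the step produces the next value in the chain.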

Run it

Build the project:
```
./gradlew build
```
Run it — the sample data is embedded, so no input needed:
```
./gradlew run
```