Batch Pipeline Designer
Design batch data processing pipelines for large-scale, bounded datasets processed offline. Use when building ETL workflows, processing logs or clickstream data at scale, generating ML feature pipelines or search indexes, or joining two large datasets that cannot fit in memory. Trigger phrases: "design a batch pipeline", "should I use Spark or MapReduce", "how do I join two large datasets", "build an ETL workflow", "process server logs at scale", "how do I handle skewed data in joins", "implement PageRank on a distributed graph", "design an offline processing job". Covers MapReduce vs dataflow engines (Spark, Flink, Tez), three join strategies (sort-merge, broadcast hash, partitioned hash) with selection criteria, graph processing via the Pregel/BSP model, and fault tolerance via materialization vs recomputation. Does not apply to unbounded input streams (see stream-processing-designer) or low-latency OLTP query serving. Produces a pipeline architecture recommendation with engine choice, join strategy, and fault tolerance approach.
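The selection criteria for the three join strategies can be sketched roughly as follows. This is a minimal illustration, not a definitive rule set: the `Dataset` class, the `choose_join_strategy` function, the single `worker_memory_bytes` budget, and the assumption of evenly sized partitions are all hypothetical, and real engines apply their own thresholds and cost models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical description of one join input."""
    name: str
    size_bytes: int
    partition_key: Optional[str] = None   # key the data is already partitioned on, if any
    num_partitions: Optional[int] = None  # number of partitions, if known


def choose_join_strategy(left: Dataset, right: Dataset,
                         join_key: str, worker_memory_bytes: int) -> str:
    """Pick a join strategy from input sizes and existing partitioning.

    Assumption: worker_memory_bytes is the memory a single task may
    spend on an in-memory hash table.
    """
    smaller = min(left, right, key=lambda d: d.size_bytes)

    # Broadcast hash join: the smaller input fits in memory on every
    # worker, so it can be loaded into a hash table and probed while
    # streaming the larger input.
    if smaller.size_bytes <= worker_memory_bytes:
        return "broadcast hash"

    # Partitioned hash join: both inputs are already partitioned the same
    # way on the join key, so each partition pair can be hash-joined
    # independently.
    co_partitioned = (
        left.partition_key == join_key
        and right.partition_key == join_key
        and left.num_partitions is not None
        and left.num_partitions == right.num_partitions
    )
    if co_partitioned:
        # Assumption: partitions are roughly equal in size (no skew).
        per_partition = smaller.size_bytes / left.num_partitions
        if per_partition <= worker_memory_bytes:
            return "partitioned hash"

    # Sort-merge join: the general fallback; both inputs are shuffled and
    # sorted by the join key, then merged.
    return "sort-merge"
```

The sketch ignores skew: a hot key can overload a single partition even when the average partition is small, so skewed inputs may need separate handling before any of these strategies is applied.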
