Batch Pipeline Designer

Name: Batch Pipeline Designer
Author: BookForge

Data Systems · Designing Data-Intensive Applications

Design batch data processing pipelines for large-scale, bounded datasets processed offline. Use when building ETL workflows, processing logs or clickstream data at scale, generating ML feature pipelines or search indexes, or joining two large datasets that cannot fit in memory. Trigger phrases: "design a batch pipeline", "should I use Spark or MapReduce", "how do I join two large datasets", "build an ETL workflow", "process server logs at scale", "how do I handle skewed data in joins", "implement PageRank on a distributed graph", "design an offline processing job". Covers MapReduce vs dataflow engines (Spark, Flink, Tez), three join strategies (sort-merge, broadcast hash, partitioned hash) with selection criteria, graph processing via the Pregel/BSP model, and fault tolerance via materialization vs recomputation. Does not apply to unbounded input streams (see stream-processing-designer) or low-latency OLTP query serving. Produces a pipeline architecture recommendation with engine choice, join strategy, and fault tolerance approach.
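As a rough illustration of two of the join strategies named above, the following single-process Python sketch contrasts a broadcast hash join (the small side is loaded into an in-memory hash table and the large side is streamed past it) with a sort-merge join (both sides are sorted by key and merged). The function names, the `user_id` key, and the sample records are hypothetical and chosen only for illustration. A real engine would run these steps across partitions on many machines.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def broadcast_hash_join(small, large, key):
    """Build a hash table on the small input, then stream the large input."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    joined = []
    for row in large:
        # Rows with no match on the small side are dropped (inner join).
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def _sorted_groups(rows, key):
    """Yield (key value, rows) pairs in ascending key order."""
    get = itemgetter(key)
    for value, group in groupby(sorted(rows, key=get), key=get):
        yield value, list(group)


def sort_merge_join(left, right, key):
    """Sort both inputs by key, then advance two cursors in lockstep."""
    left_groups = _sorted_groups(left, key)
    right_groups = _sorted_groups(right, key)
    l, r = next(left_groups, None), next(right_groups, None)
    joined = []
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_groups, None)
        elif l[0] > r[0]:
            r = next(right_groups, None)
        else:
            # Same key on both sides: emit every pairing, then advance both.
            joined.extend({**a, **b} for a in l[1] for b in r[1])
            l, r = next(left_groups, None), next(right_groups, None)
    return joined


# Hypothetical sample data: a small user table and a larger event log.
users = [{"user_id": 1, "name": "ana"}, {"user_id": 2, "name": "bo"}]
events = [
    {"user_id": 2, "url": "/a"},
    {"user_id": 1, "url": "/b"},
    {"user_id": 2, "url": "/c"},
    {"user_id": 3, "url": "/d"},
]

hash_result = broadcast_hash_join(users, events, "user_id")
merge_result = sort_merge_join(users, events, "user_id")
```

Both functions return the same inner-join rows, though in different orders: the hash join follows the order of the streamed input, while the sort-merge join returns rows in key order.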


Install

1. Add marketplace

› /plugin marketplace add bookforge-ai/bookforge-skills

2. Install plugin

› /plugin install designing-data-intensive-applications@bookforge-skills

3. Use the skill

› /batch-pipeline-designer

CC-BY-SA · Open source

What You'll Need

Read · TodoWrite · Grep (optional) · Bash (optional)

Source Book

Designing Data-Intensive Applications

Martin Kleppmann

More from Designing Data-Intensive Applications

Transactions · full

Concurrency Anomaly Detector

Scan application code, SQL queries, or ORM code for exposure to the 6 database concurrency anomalies and produce a findings report with severity, affected locations, and fix recommendations. Use when: debugging a nondeterministic data corruption or race condition bug under concurrent load; auditing transaction code before deployment or after switching databases (isolation defaults differ across engines); a read-modify-write cycle or check-then-act pattern may be exposed to lost updates or write skew; an aggregate query (COUNT, SUM) guards an INSERT or UPDATE (phantom read exposure); or multiple tables are updated in one transaction without serializable isolation. Distinct from transaction-isolation-selector (which chooses the isolation level) — this skill scans code to find which anomalies existing code is already exposed to. Covers Python, Java, Go, JavaScript, Ruby; raw SQL; ORM code (SQLAlchemy, Hibernate, ActiveRecord, GORM); PostgreSQL, MySQL InnoDB, Oracle, SQL Server, and distributed databases. Maps code patterns (read-modify-write, SELECT/INSERT pairs, cross-table boundaries, snapshot boundary reads) to anomaly type, trigger conditions, and minimum fix (isolation upgrade vs. application-level mitigation).

Consistency · hybrid

Consistency Model Selector

Choose the correct consistency model (linearizability, causal consistency, or eventual consistency) for each operation in a distributed system, and select the matching implementation mechanism. Use when designing a new distributed data system, deciding whether ZooKeeper or etcd is needed for coordination, evaluating whether two-phase commit is appropriate for cross-node transactions, debugging correctness violations (stale reads, split-brain, uniqueness constraint failures), or distinguishing linearizability from serializability. Also use when applying the CAP theorem correctly (beyond the "pick 2 of 3" oversimplification), selecting total order broadcast as a consensus primitive, evaluating 2PC failure modes and lock-holding cost, or assessing whether causal consistency is sufficient in place of linearizability. Produces a per-operation consistency recommendation with replication mechanism, ordering guarantee, and — when consensus is needed — protocol selection (Raft, Zab, Paxos) with documented failure modes. Does not cover replication topology or failure recovery strategy (see replication-strategy-selector, distributed-failure-analyzer).

Data Integration · hybrid

Data Integration Architect

Design the integration architecture for systems with multiple specialized data stores (Postgres, Elasticsearch, Redis, data warehouses) that must stay in sync. Use when deciding how data flows between components, avoiding dual writes, reasoning about correctness across system boundaries (idempotency, end-to-end operation identifiers), choosing between Lambda and Kappa architecture, or applying the "unbundling databases" pattern to compose specialized tools instead of relying on a single monolith. Trigger phrases: "how do I keep Postgres and Elasticsearch in sync?", "should I use CDC or event sourcing to propagate data?", "how do I avoid dual writes across microservices?", "my downstream systems are going out of sync — how do I fix the architecture?", "how do I design derived data pipelines?", "what is the system of record pattern?", "how do I integrate OLTP with a search index and an analytics warehouse?", "how do I design for end-to-end idempotency?". This is the capstone skill for data systems design — it synthesizes batch pipelines, stream integration, consistency, and replication into a single architecture recommendation. Produces a component map (systems of record vs derived views), data flow diagram, and correctness analysis. Does not replace batch-pipeline-designer or stream-processing-designer — delegates to them for pipeline internals.

Data Model · hybrid

Data Model Selector

Choose between relational, document, and graph data models for an application by analyzing data shape, relationship complexity, and query patterns. Use when asked "should I use MongoDB or PostgreSQL?", "when does a graph database make sense?", "how do I choose between SQL and NoSQL?", or "what data model fits my access patterns?" Also use for: evaluating impedance mismatch between data model and application code; deciding schema-on-read vs. schema-on-write for heterogeneous data; diagnosing whether many-to-many relationships call for relational or graph model; choosing between property graphs and triple-stores; deciding when polyglot persistence is appropriate. Produces a concrete recommendation with trade-off analysis — not "it depends." Covers relational (PostgreSQL, MySQL), document (MongoDB, CouchDB), and graph (Neo4j, Datomic) models including schema enforcement strategies and data locality trade-offs. For storage engine internals (LSM-tree vs B-tree), use storage-engine-selector instead. For OLTP vs. analytics routing, use oltp-olap-workload-classifier instead.