Connsense-TAP
HPC Workflow Analysis Pipeline · 2020–2024
Overview
Connsense-TAP (Topological Analysis Pipeline) is a computational framework for large-scale analysis of digitally reconstructed brain circuits on HPC clusters. It separates scientific inquiry from computational engineering — the scientist focuses on what to measure and why, while the framework automates the how.
The Problem
Four challenges make large-scale circuit analysis difficult:
- Scale — Millions of neurons, billions of synapses. Analysis is computationally prohibitive on a single machine.
- Complexity & Reproducibility — A typical analysis is a multi-stage workflow: define regions, extract data, apply transformations, run measurements. Managing parameters and intermediate results across hundreds of subtargets is a "bookkeeping headache."
- Accessibility — Underlying data formats (SONATA) and libraries (bluepy) are laden with informatics jargon, creating barriers for scientists whose primary goal is to ask scientific questions.
- Scientific Evolution — A scientist's preferred analysis changes as they learn more about their subject. The framework must track not just computations but the development of the analysis itself.
Architecture: Three Core Components
1. tap-config: Configuration as Scientific Document
YAML configuration files that are not just parameter dumps but structured, version-controllable
documents narrating the entire study. A pipeline.yaml defines subjects
(subtargets), measurements (analyses), statistical controls, and variations (slicing).
The configuration file is a primary artifact of reproducibility.
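As a rough illustration of what such a document might contain, the sketch below builds a pipeline configuration as a Python dictionary, checks it, and serializes it. The section names echo terms used on this page, but the schema, the `validate_config` and `dump_config` helpers, and all parameter values are assumptions made for illustration, not the actual tap-config format. JSON stands in for YAML so the example needs only the standard library.

```python
import json

# Hypothetical configuration; the real pipeline.yaml schema may differ.
PIPELINE_CONFIG = {
    "define-subtargets": {
        "flatmap-columns": {"description": "Columns laid out on a flatmap grid"},
    },
    "analyze-connectivity": {
        "analyses": {
            "simplex-counts": {"description": "Simplex counts per dimension"},
            "degree-distribution": {"description": "Histogram of node degrees"},
        },
        # Statistical controls and variations, as named in the text above.
        "controls": {"erdos-renyi": {"seeds": [0, 1, 2]}},
        "slicing": {"layer": {"description": "Repeat each analysis per layer"}},
    },
}

REQUIRED_SECTIONS = ("define-subtargets", "analyze-connectivity")


def validate_config(config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")
    analyses = config.get("analyze-connectivity", {}).get("analyses", {})
    if not analyses:
        problems.append("no analyses defined")
    return problems


def dump_config(config):
    """Serialize with sorted keys so diffs under version control stay small."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Sorting keys on output is one way to keep the file stable under version control, which matters if the configuration is to serve as a record of how the study evolved.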
2. tap-env: Automated HPC Execution
A CLI (tap) that manages the three-stage workflow:
setup → launch → collect. During setup, it batches inputs,
estimates computational load, balances jobs, and generates SLURM sbatch scripts.
The scientist does not have to manage parallel jobs by hand and can run
large-scale analyses with a few commands.
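The setup stage described above can be sketched as follows. This is not the framework's actual code: `balance_batches` uses a generic greedy heuristic (largest estimated cost first, always into the currently lightest batch), and `render_sbatch` writes a SLURM job-array script whose worker command, `run_batch.py`, is a hypothetical placeholder. The `#SBATCH` options shown are standard SLURM directives.

```python
import heapq


def balance_batches(loads, n_batches):
    """Assign inputs to batches so that estimated loads are roughly even.

    loads: dict mapping input name -> estimated cost (arbitrary units).
    Returns a list of n_batches lists of input names.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches = [[] for _ in range(n_batches)]
    heap = [(0.0, index) for index in range(n_batches)]  # (load so far, batch)
    heapq.heapify(heap)
    # Largest jobs first; ties broken by name so the result is deterministic.
    for name, cost in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)  # currently lightest batch
        batches[index].append(name)
        heapq.heappush(heap, (total + cost, index))
    return batches


def render_sbatch(job_name, n_batches, time_limit="01:00:00"):
    """Return the text of a SLURM job-array script with one task per batch."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array=0-{n_batches - 1}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%a.out",
        "",
        "# Hypothetical worker command; the real entry point is not shown here.",
        'python run_batch.py --batch "$SLURM_ARRAY_TASK_ID"',
    ]
    return "\n".join(lines) + "\n"
```

A job array keeps the submission to a single script while still letting the scheduler run batches independently, which is one plausible way to get from "setup" to "launch" with a single command.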
3. tap-store: Intelligent Data Store
Results are collected into a single, structured HDF5 file (connsense.h5)
with a sophisticated Python interface:
- Lazy loading: data loaded only when requested, via lightweight DataCall objects that know how to retrieve data
- Rich indexing: access data by meaningful names, e.g. tap.analyses["connectivity"]["simplex-counts"](subtarget="R18;C0")
- Variation handling: query controls (control="erdos-renyi-0") and slices (slicing="layer") seamlessly
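The access pattern above can be imitated in a few lines of plain Python. The classes below are hypothetical stand-ins, not the tap-store implementation: `LazyDataCall` defers a loader until first use and caches the result, and `LazyAnalysis` indexes such entries by subtarget, control, and slicing. In the real store the loader would presumably read a dataset from connsense.h5; here it returns made-up placeholder values and records that a read took place.

```python
class LazyDataCall:
    """Hypothetical stand-in for a DataCall: runs its loader on first use only."""

    def __init__(self, loader):
        self._loader = loader
        self._loaded = False
        self._value = None

    def __call__(self):
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value


class LazyAnalysis:
    """Index of lazy entries keyed by (subtarget, control, slicing)."""

    def __init__(self):
        self._entries = {}

    def register(self, loader, subtarget, control=None, slicing=None):
        self._entries[(subtarget, control, slicing)] = LazyDataCall(loader)

    def __call__(self, subtarget, control=None, slicing=None):
        return self._entries[(subtarget, control, slicing)]()


reads = []  # records which entries were actually loaded


def make_loader(label, value):
    def load():
        reads.append(label)
        return value
    return load


# Placeholder values, for illustration only.
simplex_counts = LazyAnalysis()
simplex_counts.register(make_loader("original", [10, 20, 5]), subtarget="R18;C0")
simplex_counts.register(
    make_loader("control", [10, 18, 1]),
    subtarget="R18;C0",
    control="erdos-renyi-0",
)
analyses = {"connectivity": {"simplex-counts": simplex_counts}}
```

Calling `analyses["connectivity"]["simplex-counts"](subtarget="R18;C0")` triggers exactly one read; repeating the call reuses the cached value, and the control entry stays unread until it is requested.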
Pipeline Stages
- define-subtargets — Generate subvolumes as collections of node-ids (e.g., flatmap columns)
- extract-node-populations — Pull neuron properties (layer, position, cell type)
- extract-edge-populations — Create adjacency matrices for connectivity
- analyze-connectivity — Compute metrics (simplex counts, degree distributions), apply statistical controls
- collect-results — Aggregate into unified HDF5 store
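To make the stages concrete, here is a toy version of the middle of the pipeline on a small directed graph: restrict the edges to a subtarget, measure an out-degree distribution, and draw an Erdős–Rényi control with the same number of nodes and edges. The function names and the edge-list representation are assumptions for illustration; the actual pipeline works on adjacency matrices at a far larger scale, and its control definitions may differ.

```python
import random
from collections import Counter


def induced_edges(edges, subtarget):
    """Keep directed edges whose source and target both lie in the subtarget."""
    nodes = set(subtarget)
    return [(s, t) for s, t in edges if s in nodes and t in nodes]


def out_degree_distribution(edges, subtarget):
    """Map out-degree -> number of subtarget nodes with that out-degree."""
    degree = Counter(source for source, _ in edges)
    return dict(Counter(degree[node] for node in subtarget))


def erdos_renyi_control(subtarget, n_edges, seed):
    """Sample n_edges distinct directed edges uniformly, without self-loops.

    Enumerating every candidate pair is only practical for small subtargets.
    """
    nodes = sorted(subtarget)
    candidates = [(s, t) for s in nodes for t in nodes if s != t]
    return random.Random(seed).sample(candidates, n_edges)


def analyze(edges, subtarget, seed=0):
    """Run the measurement on the original graph and on one control."""
    original = induced_edges(edges, subtarget)
    control = erdos_renyi_control(subtarget, len(original), seed)
    return {
        "original": out_degree_distribution(original, subtarget),
        "erdos-renyi": out_degree_distribution(control, subtarget),
    }
```

Fixing the seed makes each control reproducible, which is what allows a name such as "erdos-renyi-0" to refer to the same randomized graph every time it is queried.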
Impact
- Enabled topological analysis of brain circuits at unprecedented scale (4.2M neurons, 14B synapses)
- Reduced analysis setup from days of manual scripting to a single YAML configuration file
- Framework designed for generalization beyond neuroscience — applicable to any domain-specific analysis requiring HPC-scale computation
Technical Stack
Python · SLURM / HPC orchestration · HDF5 (lazy-loading API) · YAML/JSON configuration · SONATA circuit format · NumPy / Pandas / SciPy