This is a rare opportunity to apply serious data engineering in a domain where latency, correctness, and reliability carry direct commercial weight.
Requirements
- 6+ years of data engineering in production environments; Python expertise - idiomatic, well-tested, production-grade code, not notebook scripts
- ETL/ELT pipeline design and implementation at scale; orchestration with Airflow, Prefect, or equivalent; a reliability-first mindset including backfill, retry, and exactly-once semantics
- Azure data platform - Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Data Lake Storage; infrastructure as code for data workloads (Terraform or Bicep)
- Databricks - Delta Lake, Unity Catalog, job vs interactive cluster trade-offs, cost-aware compute management, Spark job optimisation
- Relational databases: PostgreSQL at production scale - query optimisation, indexing strategies, table partitioning, replication, schema design for both OLTP and analytical workloads
- MongoDB - document modelling, aggregation pipelines, indexing strategy, replica sets; clear judgment on when document vs relational storage is the right architectural call
- Containerisation: Docker and Kubernetes-based deployment of data workloads; reproducible, environment-agnostic data infrastructure
- Data modelling for analytical workloads - dimensional modelling, data vault, or equivalent; schema evolution, slowly changing dimensions, and downstream impact analysis
- Stream and batch processing patterns; late-data handling, watermarking, and backfill strategies; throughput vs latency trade-offs in pipeline design
- Production data observability - data lineage, quality checks, SLA monitoring, alerting on freshness and completeness; treating data correctness as a first-class concern
- CI/CD for data infrastructure - version-controlled pipelines, automated data quality testing, reproducible and auditable deploys
- Ability to work directly with quant researchers, risk managers, and traders, translating business requirements into reliable, well-documented data products
Nice to Have
- Financial markets data - market data feeds (Bloomberg, Refinitiv), tick data, trade history, reference data, or instrument master management
- Apache Spark or Flink for large-scale stream and batch processing beyond the Databricks ecosystem
- dbt or equivalent SQL transformation layer; experience building and maintaining dbt projects in a production data warehouse
- Event streaming with Kafka or Confluent Platform - topic design, consumer group management, exactly-once delivery guarantees
- OLAP-optimised stores - ClickHouse, DuckDB, or equivalent; understanding of columnar storage and vectorised query execution
- Energy, commodities, or broader financial markets domain knowledge
What We're Looking For
You treat data as a product, not a side effect. You know what it takes to make a pipeline trustworthy - not just running, but observable, tested, and recoverable when something upstream changes at 3am. You think in systems: schema evolution, lineage, freshness SLAs, and the downstream impact of every modelling decision. At ETrading, that data is the foundation of billion-dollar trading decisions. You are the reason it is right.