Job Description
At H1, we believe access to the best healthcare information is a basic human right. Our mission is to provide a platform that can optimally inform every doctor interaction globally. This promotes health equity and builds needed trust in healthcare systems. To accomplish this, our teams harness the power of data and AI technology to unlock groundbreaking medical insights and convert those insights into actions that result in optimal patient outcomes and accelerate an equitable and inclusive drug development lifecycle. Visit h1.co to learn more about us.
Data Engineering is responsible for the development and delivery of our most important asset—our data. With thousands of data sources from around the world, the team ensures that data is accurate, normalized, and delivered at a velocity that keeps up with real-world changes. As we expand our markets and the scope of data we provide to our customers, our team must scale to meet that demand.
The Data Lake is the foundation of H1’s platform, responsible for the validation, accuracy, standardization, and quality of the data powering every downstream product and team across the organization. You will help lead the evolution of this platform while supporting and mentoring a growing team of engineers.
You will:
- Architect, build, and scale distributed ETL/ELT pipelines and large-scale ingestion frameworks across structured and unstructured healthcare datasets.
- Lead the evolution of H1’s Data Lake architecture with a focus on scalability, observability, reliability, and cost optimization.
- Own and improve data quality, validation, normalization, and standardization workflows across thousands of global data sources.
- Design and optimize batch and near real-time data processing frameworks using cloud-native distributed systems.
- Optimize distributed compute and storage systems, including Spark workloads, query performance, partitioning strategies, and infrastructure efficiency.
- Drive improvements in monitoring, governance, operational excellence, and production reliability across the platform.
- Troubleshoot complex production data and infrastructure issues across distributed systems.
- Partner closely with Product, Infrastructure, Security, Compliance, and downstream engineering teams to support scalable and secure data delivery.
- Mentor engineers through technical leadership, architecture reviews, and engineering best practices.
- Help define technical roadmap priorities and contribute to long-term platform strategy and execution planning.
- Support production operations, incident response, and platform health as part of overall ownership of the Data Lake ecosystem.
About you:
- You have deep experience designing and scaling distributed data platforms and large-scale pipelines in cloud-native environments.
- You excel at building reliable, observable, and maintainable data systems supporting critical business and analytics workloads.
- You have strong expertise in distributed processing, performance optimization, and modern data architecture patterns.
- You are comfortable leading technical initiatives and influencing architecture decisions across teams.
- You communicate effectively with both technical and non-technical stakeholders.
- You enjoy mentoring engineers and helping raise the engineering bar across teams.
- You are energized by ownership, autonomy, and solving ambiguous technical challenges.
Requirements:
- Demonstrated technical leadership experience, along with an interest in, or experience with, mentoring and leading engineers.
- Strong proficiency in Python (PySpark), Java, Scala, or similar programming languages.
- Advanced SQL expertise, including performance tuning and optimization across large datasets.
- Deep experience with Apache Spark and cloud-native big data platforms, preferably within AWS environments (EMR, Glue, S3, Athena, Redshift, or similar).
- Experience designing and scaling modern cloud-native data lake architectures and large-scale ingestion frameworks.
- Experience with orchestration and workflow management tools such as Argo, Airflow, or similar technologies.
- Strong understanding of distributed storage systems, partitioning strategies, and file formats such as Parquet, Avro, and ORC.
- Experience with Docker, Kubernetes, and modern containerization technologies.
- Experience implementing monitoring, observability, and data quality frameworks within production environments.
- Experience with large-scale data cleaning, parsing, normalization, and validation workflows preferred.
- Experience working with healthcare, life sciences, publication, or large-scale entity-resolution datasets preferred.
- Exposure to ML/AI-driven data enrichment, parsing, or validation workflows is a plus.
Anticipated role close date: 8/1/2026
