Data Engineer (Glue/Snowflake/Spark/Airflow) - January 2026
Remote
For one of our clients in the pharma industry, we are looking for a Data Engineer (Glue/Snowflake/Spark/Airflow).
Project Name:
Biobase 2
Project Description:
Biobase is a digital product that gives scientists easy access to the relevant data during the drug development process. It harmonizes data from different sources across the product stages (development, production) into a combined data set. The services mentioned in Point 4 and needed for Biobase 2 are to be provided within the framework of an agile development methodology.
Background:
The product is developed along a roadmap with different use cases that reflect the needs of the relevant business departments.
The overall goal is to make all existing data available, reducing the effort of collecting data and creating analyses, and to enable analyses that were not possible in the past in order to generate additional knowledge in the development and production of biological drugs.
Due to the specific knowledge and the Scrum way of working required, the client has no internal expertise in this area; the contractor therefore holds a unique position and provides significantly different services than the internal staff.
Tasks:
Technical development of a data processing pipeline within the AWS Glue and Snowflake ecosystem, based on the provided functional and non-functional requirement specifications. The development is carried out independently using Apache Spark and GraphX, with Scala and Python as the primary programming languages. The goal upon completion is a fully functional and tested data processing pipeline that meets the documented requirements.
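For illustration only, the sketch below shows what such a Glue job could look like: it reads raw data from S3, applies a simple harmonization step in Spark, and writes the combined data set to Snowflake via the Spark connector. All bucket, table, and connection names are placeholders; the actual sources and transformations are defined by the project's requirement specifications.

```python
# Minimal sketch of a Glue job: read raw data from S3, harmonize it with Spark,
# and write the result to Snowflake via the Spark connector.
# Bucket, table, and connection values below are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw development/production data landed in S3 (placeholder path).
raw = spark.read.parquet("s3://example-biobase-raw/batches/")

# Example harmonization step: normalize column names and deduplicate.
harmonized = (
    raw.withColumnRenamed("batch_id", "BATCH_ID")
       .withColumn("LOADED_AT", F.current_timestamp())
       .dropDuplicates(["BATCH_ID"])
)

# Write the combined data set to Snowflake (placeholder connection options).
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfDatabase": "BIOBASE",
    "sfSchema": "HARMONIZED",
    "sfWarehouse": "ETL_WH",
    "sfUser": "ETL_USER",
    "sfPassword": "***",
}
(harmonized.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "BATCHES")
    .mode("overwrite")
    .save())

job.commit()
```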
Technical creation of test definitions, followed by the independent implementation and execution of unit, integration, and end-to-end tests. The goal is the delivery of a test report documenting the test cases, execution results, and any identified defects.
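As an illustration of the unit-test level only, the sketch below uses pytest with a local SparkSession to test a harmonization step; the function under test and the column names are placeholders, not the project's actual test definitions.

```python
# Minimal unit-test sketch (pytest + local SparkSession) for a harmonization step.
# The harmonize() function and column names are illustrative placeholders.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("biobase-tests").getOrCreate()


def harmonize(df):
    # Placeholder for the transformation under test.
    return df.withColumnRenamed("batch_id", "BATCH_ID").dropDuplicates(["BATCH_ID"])


def test_harmonize_deduplicates_batches(spark):
    source = spark.createDataFrame([("B-001", 1), ("B-001", 1)], ["batch_id", "value"])
    result = harmonize(source)
    assert result.count() == 1
    assert "BATCH_ID" in result.columns
```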
Creation of comprehensive technical documentation for the implemented solution. The documentation is created independently and details the architecture, data flow, component design, configuration, and operational procedures necessary to operate the solution.
Automation of the data processing pipeline's workflow and scheduling utilizing Apache Airflow. This includes the creation of Directed Acyclic Graphs (DAGs) to define task dependencies, schedule execution intervals, and configure alerting mechanisms for job failures. The goal is a fully automated, scheduled, and self-recovering data pipeline.
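A minimal sketch of such a DAG is shown below, assuming Airflow 2.4+ and the Amazon provider package; the DAG id, Glue job name, schedule, and alert address are placeholders rather than the project's actual configuration.

```python
# Minimal sketch of an Airflow DAG that schedules the Glue pipeline daily,
# defines task dependencies, retries, and a failure alert.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator


def run_data_quality_checks(**_):
    # Placeholder for post-load validation against Snowflake.
    print("running data quality checks")


default_args = {
    "owner": "data-engineering",
    "retries": 2,                              # simple self-recovery on transient failures
    "retry_delay": timedelta(minutes=10),
    "email": ["data-alerts@example.com"],      # placeholder alert address
    "email_on_failure": True,
}

with DAG(
    dag_id="biobase_harmonization",
    start_date=datetime(2026, 1, 15),
    schedule="0 2 * * *",                      # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = GlueJobOperator(
        task_id="run_glue_harmonization",
        job_name="biobase-harmonization",      # placeholder Glue job name
    )

    validate = PythonOperator(
        task_id="data_quality_checks",
        python_callable=run_data_quality_checks,
    )

    ingest >> validate
```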
Independent technical monitoring of the deployed data pipeline's performance against predefined Service Level Objectives (SLOs). This service includes the investigation and resolution of defects identified within the contractor-developed components of the pipeline, which are reported via a formal Jira ticketing system.
Technical knowledge transfer via sprint meetings to coach the existing development team on the above-mentioned data specifications and architecture.
Required experience:
Good knowledge of AWS, Glue, Snowflake, Spark, Python/Scala, and Airflow
Good experience with an agile (Scrum) working mode and good coding practices
Language requirements - English mandatory, German is a plus
Start: 15.01.2026 Duration: until 30.06.2026 Capacity: 40h/week Location: remote