Job Details
Job Description
Responsibilities:
Incident & Support: Provide technical support for cloud-based data platforms (data warehouses, pipelines, distributed computing) and swiftly resolve production incidents including performance degradation, stability issues, and data reliability concerns.
Root Cause Analysis (RCA): Perform and document root cause analysis of incidents to prevent recurrence.
Monitoring & Observability: Proactively monitor system health and data pipeline performance using cloud-native tools, developing dashboards, alerts, and reporting frameworks for real-time insight.
Automation: Build and maintain automation scripts using Python, PowerShell, and Bash to reduce repetitive tasks and enhance operational efficiency.
Platform Improvement: Suggest and implement improvements to increase platform resilience, reliability, and performance.
Collaboration: Work closely with Data Engineers, Full Stack Support Engineers, Data Scientists, and client-facing teams for troubleshooting and resolution.
Knowledge Base: Write, maintain, and share runbooks and troubleshooting guides.
On-Call: Be available for extended working hours during critical outage events.
Minimum Requirements:
Education: Bachelor’s degree in Computer Science, Information Systems, Engineering, or a closely related field.
Certifications: Professional certifications in Microsoft Azure (Data Engineering, Administration, or Solution Architecture).
Experience: 3+ years of hands-on experience in support engineering, cloud operations, or data engineering within a cloud environment (Microsoft Azure preferred).
Data Platform: Strong practical experience with cloud-hosted data platforms, including data warehouses, pipeline orchestration services, and distributed compute engines.
Analytics Platforms: Experience working with modern scalable analytics platforms such as Databricks, Spark, Azure Synapse, or Microsoft Fabric.
Containerization: Familiarity with container orchestration and virtualization technologies like Kubernetes and Docker.
Monitoring: Familiarity with cloud-native monitoring and observability tools.
Automation: Ability to build and maintain automation scripts using languages like Python, PowerShell, and Bash.
Troubleshooting: Proven ability to investigate and resolve issues using SQL/T-SQL, Python, and Spark workloads.
Operations: Knowledge of incident management practices (escalation, resolution, and prevention) and experience upholding high standards for reliability in business-critical production systems.
Benefits:
Competitive salary, commensurate with experience and skills.
If you meet the above requirements and want to make a career-changing move, apply today by emailing your CV to [email protected]