JOB PURPOSE:
We are seeking a highly skilled Command Center Senior Application Performance Engineer to join our IT Operations team. This role is responsible for monitoring, analyzing, and optimizing the performance of enterprise applications to ensure high availability and reliability. The ideal candidate will have deep expertise in observability tools, incident analysis, and root cause analysis (RCA), and will use it to proactively identify and resolve system issues.
Key Responsibilities:
• Observability & Monitoring: Utilize industry-leading observability tools (such as Dynatrace, New Relic, AppDynamics, Datadog, Prometheus, or Splunk) to monitor application performance and system health.
• Incident Management: Lead investigations into performance degradation, outages, and application failures to ensure swift resolution and minimal business impact.
• Root Cause Analysis (RCA): Conduct in-depth RCA for recurring issues and drive permanent fixes to prevent recurrence.
• Performance Optimization: Collaborate with development and infrastructure teams to fine-tune application performance, reduce latency, and optimize resource utilization.
• Automation & Alerting: Implement proactive monitoring solutions and automation scripts to enhance system resilience and minimize manual intervention.
• Collaboration: Act as a bridge between DevOps, SRE, and application support teams, ensuring alignment on performance objectives and continuous improvement.
• Reporting & Documentation: Generate performance reports, incident post-mortems, and knowledge base articles to improve operational efficiency.
• Capacity Planning: Assess application workloads, anticipate scaling needs, and recommend solutions that keep systems stable as demand grows.
Requirements:
• Education: Bachelor's degree in Information Technology, Computer Science, or a related field. ITIL certification (ITIL v4 preferred) is highly desirable.
• 5+ years of experience in application performance engineering, monitoring, or a similar role.
• Strong knowledge of observability tools (e.g., Dynatrace, New Relic, AppDynamics, Splunk, Prometheus, Grafana, ELK Stack).
• Hands-on experience with incident analysis, troubleshooting, and RCA methodologies.
• Proficiency in scripting and automation using Python, Shell, PowerShell, or similar.
• Familiarity with cloud platforms (AWS, Azure, GCP) and microservices architectures.
• Experience working in a DevOps- or SRE-driven environment is a plus.
• Excellent problem-solving skills and the ability to work under pressure in a 24x7 production environment.
• Strong communication skills for working with both technical and non-technical stakeholders.