Technology Architect

T-Systems ICT India Pvt. Ltd.

Pune, MH, India · Posted 1 month ago

Full-time · Hybrid · Not Applicable

Job Description

1. Platform Operations & Technical Ownership

3rd-Level Technical Support & Troubleshooting as key knowledge resource

Acts as the primary 3rd-level contact for:
- Wazuh SIEM
- PostgreSQL
- S3 MinIO Object Storage
- DNS Infrastructure
- Remote platform access / bastion systems
- Linux OS (SUSE, RHEL, Ubuntu)
- NSX-T networking and firewalling
- SUSE Manager
Performs deep root-cause analyses including multi-system debugging.
Handles cross-team, business-critical incidents requiring broad platform knowledge.

Capacity & Performance Management

End-to-end responsibility for FCI and Kubernetes cluster capacity management.
Continuous assessment of resource utilization, trends, and scaling requirements.

Platform Stability & Reliability

Drives improvements in platform stability and deployment reliability.
Optimizes operational models and CI/CD processes.
Ensures smooth transitions from project delivery to stable operations.

2. Platform Engineering & Automation

Prepares, designs, and executes Proofs of Concept (PoCs) for:
- Ansible / AWX to enable automated deployments and configuration management.
- Oracle-related technologies, including integration and migration scenarios.
Develops automation strategies and contributes reusable modules and deployment templates.
Defines technical standards for automated operations.

3. Security, Compliance & Governance

Audit Management & Collaboration with Auditors

Designs, reviews, and explains technical audit controls to internal and external auditors.
Coordinates audit activities for both platform and application-related topics.

Security-Driven Engineering

Embeds security controls into automated deployment workflows.
Creates and maintains compliance policies and technical guardrails.

Wazuh SIEM Responsibility

Designs, maintains, and operates the Wazuh security platform.
Develops use cases, alerts, dashboards, and security incident processes.
Troubleshoots performance issues, agent behavior, and platform scalability.

4. Collaboration, Stakeholder Management & Enablement

Coordinates work packages across AO teams, development teams, and infrastructure units.
Works closely with software teams to onboard applications onto the platform.
Supports service portfolio development and provides technical input for presales activities.
Shares best practices and mentors engineers regarding platform processes and tools.

5. Architecture, Design & Technology Evaluation

Executes PoCs and evaluates new platform components.
Defines integration strategies for new technologies in alignment with architecture standards.
Creates reference architectures, deployment blueprints, and operational concepts.
Evaluates solutions based on scalability, resilience, security, and cost efficiency.

6. Project Involvement

Project: Icinga Replacement

Coordinates work and dependencies with classic AO teams.
Supports AO teams in deploying and configuring exporters/agents on legacy VMs.
Standardizes client-side configurations and data mappings.
Implements standardized dashboards for platform service observability.
Defines monitoring and alerting for existing components and applications.
Performs advanced troubleshooting, including:
- missing or incomplete metrics
- high scrape latency
- time-series cardinality challenges
- Kubernetes monitoring (Prometheus Operator, ServiceMonitor/PodMonitor resources)

Project: MIF

Analyzes the existing application architecture and its components.
Conducts a PoC for Cognos.
Supports DB2 → PostgreSQL migration, including data validation, performance assessment, and migration tooling.

7. Technical Skills & Competencies

Linux Platform Engineering & Operations

Advanced administration of enterprise-grade Linux systems (RHEL, Ubuntu, hardened distributions).
Deep OS-level troubleshooting (CPU, memory, IO bottlenecks, process diagnostics).
Service lifecycle management using systemd, including journald log analysis.
Kernel parameter tuning, optimization, and performance diagnostics.
Host-level incident investigation and forensic log analysis.
Definition and execution of patching and lifecycle management strategies.
Filesystem operations and troubleshooting (LVM, XFS, ext4, mount and IO issues).
User and remote access configuration, including SSH hardening and bastion host concepts.

Kubernetes Platform Operations

Operational support for Kubernetes clusters across control plane and worker nodes.
Troubleshooting pod failures, scheduling issues, container crashes, and resource exhaustion.
Debugging of networking-related problems (CNI layers, service routing, DNS resolution).
Management of persistent volumes, storage classes, and dynamic provisioning behaviors.
Resource forecasting and capacity planning for cluster growth (CPU, memory, storage).
Execution and validation of Kubernetes cluster upgrades.
Operational support for multi-cluster and multi-environment setups.
Analysis of Kubernetes system logs (kube-apiserver, kubelet, controller-manager).
Maintenance and enhancement of the Kubernetes stack, including version upgrades and feature adoption.

Observability & Security Platform (Wazuh)

Design, deployment, and operational management of the Wazuh SIEM platform.
Full lifecycle management of Wazuh agents, including policy enforcement and tuning.
Troubleshooting log ingestion pipelines, decoders, enrichment rules, and alert logic.
Integration of Wazuh with platform services and infrastructure.
Analysis of security alerts and support of incident investigations.
Performance optimization of SIEM components to ensure reliable event processing.
Maintenance of compliance dashboards and generation of audit-relevant evidence.
Continuous improvement of the Wazuh stack via upgrades, new features, and configuration optimization.

Observability & Monitoring Platform (Prometheus / Grafana / Alerting)

Deployment, configuration, and operations of Prometheus-based monitoring stacks (standalone and Kubernetes-integrated).
Administration of scraping configurations, service discovery rules, and target troubleshooting.
Design and maintenance of recording rules and alert rules for platform components.
Alert noise reduction through tuning and improved signal quality.
Integration and troubleshooting of exporters (node, database, Kubernetes, etc.).
Resolution of metric gaps, scrape latency issues, and cardinality-related performance problems.
Capacity planning for Prometheus TSDB retention, storage requirements, and query performance.
Development and lifecycle management of Grafana dashboards for platform and infrastructure services.
Troubleshooting dashboard performance, data source connectivity, and visualization accuracy.
Implementation of standardized dashboard templates across platform services.
Integration of alerting workflows into incident management systems.
Definition of platform SLIs/SLOs and reliability indicators.
Correlation of metrics and logs (including Wazuh and OS logs) for root-cause analysis.
Support and lifecycle management of Kubernetes monitoring components (Prometheus Operator, ServiceMonitor/PodMonitor).
Validation of monitoring coverage for newly onboarded components and applications.

Database Platform Operations (PostgreSQL / Oracle PoC)

Operational management of PostgreSQL clusters across environments.
Monitoring key metrics (connections, locks, long-running queries, replication lag).
Backup, restore, and disaster recovery validation.
Growth and capacity planning for compute and storage layers.
Support for database failover scenarios and resilience testing.
Preparation and execution of Oracle-related proofs of concept.
Evaluation of database deployment models (VM-based, containerized, or managed).
Maintenance and enhancement of the database stack, including upgrades and feature adoption.

Object Storage Platform (MinIO / S3 APIs)

Deployment and operations of MinIO-based object storage clusters.
Troubleshooting of S3 API access, authentication, and compatibility issues.
Monitoring capacity usage, planning storage expansions, and scaling clusters.
Configuration of lifecycle policies, data retention, and archival strategies.
Integration of MinIO with platform workloads, CI/CD, and backup systems.
Performance analysis and troubleshooting of replication and erasure coding.

Networking & Firewall Operations (VMware NSX-T)

Operational support of software-defined networking environments using NSX-T.
Troubleshooting of routing issues, overlay networking, and cross-segment connectivity.
Management of distributed firewall policies and micro-segmentation rules.
Support for load balancers, service exposure, and virtual networking components.
Administration of DNS infrastructure (zones, records, service discovery).
Throughput, latency, and capacity analysis for critical network paths.

Remote Platform Access & Identity Integration

Design and support of secure remote access solutions using Apache Guacamole and Entra ID.
Troubleshooting identity flows, authentication chains, and access control policies.
Integration with enterprise identity providers using OIDC and directory services.
Implementation of secure access patterns for administrators and application teams.

Automation & Platform Engineering (Ansible / AWX)

Preparation and execution of Ansible and AWX proofs of concept.
Development of automation playbooks for platform configuration, provisioning, and lifecycle tasks.
Integration of configuration management workflows into operational routines.
Evaluation and optimization of automated operational processes.
Automated deployment validation and configuration compliance checks.

Incident Management & Reliability Engineering

3rd-level escalation point for complex incidents across infrastructure and platform services.
Root cause analysis using logs, metrics, and system-level diagnostics.
Coordination of incident response across multiple technical domains.
Identification and remediation of recurring incident patterns.
Implementation of platform stabilization and hardening measures.
Transition of engineered solutions into long-term operational models.

Security, Compliance & Audit Support

Design and discussion of audit controls with internal and external auditors.
Preparation of audit evidence for platform and application compliance.
Integration of security controls and guardrails into automated deployment workflows.
Maintenance of compliance-sensitive configuration baselines.
Support for remediation of audit findings and compliance gaps.

Architecture & Technology Evaluation

Execution of proofs of concept for emerging technologies and platform components.
Assessment of scalability, resilience, operational complexity, and security posture.
Creation of technical blueprints and reference architectures.
Definition of integration strategies for new services within existing platform ecosystems.
Evaluation of cost efficiency, maintainability, and operational impact of architectural decisions.

Collaboration & Platform Enablement

Coordination of cross-team technical work packages across operations and engineering units.
Support for application onboarding to shared platform services.
Documentation of platform standards, operational procedures, and best practices.
Contribution to presales discussions and service portfolio evolution.

Delivery of knowledge transfer and enablement sessions for operations and development teams.

Please note: Fraudulent job postings and job scams are increasingly common. Beware of misleading advertisements and fraudulent communications issuing 'offer letters' on behalf of T-Systems in exchange for a fee. Please look for an authentic T-Systems email ID - [email protected].

Stay vigilant. Protect yourself from recruitment fraud!

To learn more, please visit: Fraud Alert
