Technology Architect
Job Description
1. Platform Operations & Technical Ownership
3rd-Level Technical Support & Troubleshooting as key knowledge resource
- Acts as the primary 3rd-level contact for:
  - Wazuh SIEM
  - PostgreSQL
  - S3 MinIO Object Storage
  - DNS Infrastructure
  - Remote platform access / bastion systems
  - Linux OS (SUSE, RHEL, Ubuntu)
  - NSX-T networking and firewalling
  - SUSE Manager
- Performs deep root-cause analyses including multi-system debugging.
- Handles cross-team, business-critical incidents requiring broad platform knowledge.
Capacity & Performance Management
- End-to-end responsibility for FCI and Kubernetes cluster capacity management.
- Continuous assessment of resource utilization, trends, and scaling requirements.
Platform Stability & Reliability
- Drives improvements in platform stability and deployment reliability.
- Optimizes operational models and CI/CD processes.
- Ensures smooth transitions from project delivery to stable operations.
2. Platform Engineering & Automation
- Prepares, designs, and executes Proofs of Concept (PoCs) for:
  - Ansible / AWX to enable automated deployments and configuration management.
  - Oracle-related technologies, including integration and migration scenarios.
- Develops automation strategies and contributes reusable modules and deployment templates.
- Defines technical standards for automated operations.
3. Security, Compliance & Governance
Audit Management & Collaboration with Auditors
- Designs, reviews, and explains technical audit controls to internal and external auditors.
- Coordinates audit activities for both platform and application-related topics.
Security-Driven Engineering
- Embeds security controls into automated deployment workflows.
- Creates and maintains compliance policies and technical guardrails.
Wazuh SIEM Responsibility
- Designs, maintains, and operates the Wazuh security platform.
- Develops use cases, alerts, dashboards, and security incident processes.
- Troubleshoots performance issues, agent behavior, and platform scalability.
4. Collaboration, Stakeholder Management & Enablement
- Coordinates work packages across AO teams, development teams, and infrastructure units.
- Works closely with software teams to onboard applications onto the platform.
- Supports service portfolio development and provides technical input for presales activities.
- Shares best practices and mentors engineers regarding platform processes and tools.
5. Architecture, Design & Technology Evaluation
- Executes PoCs and evaluates new platform components.
- Defines integration strategies for new technologies in alignment with architecture standards.
- Creates reference architectures, deployment blueprints, and operational concepts.
- Evaluates solutions based on scalability, resilience, security, and cost efficiency.
6. Project Involvement
Project: Icinga Replacement
- Coordinates work and dependencies with classic AO teams.
- Supports AO teams in deploying and configuring exporters/agents on legacy VMs.
- Standardizes client-side configurations and data mappings.
- Implements standardized dashboards for platform service observability.
- Defines monitoring and alerting for existing components and applications.
- Performs advanced troubleshooting, including:
  - missing or incomplete metrics
  - high scrape latency
  - time-series cardinality challenges
  - Kubernetes monitoring (Prometheus Operator, ServiceMonitor/PodMonitor resources)
Project: MIF
- Analyzes the existing application architecture and its components.
- Conducts a PoC for Cognos.
- Supports DB2 → PostgreSQL migration, including data validation, performance assessment, and migration tooling.
7. Technical Skills & Competencies
Linux Platform Engineering & Operations
- Advanced administration of enterprise-grade Linux systems (RHEL, Ubuntu, hardened distributions).
- Deep OS-level troubleshooting (CPU, memory, IO bottlenecks, process diagnostics).
- Service lifecycle management using systemd, including journald log analysis.
- Kernel parameter tuning, optimization, and performance diagnostics.
- Host-level incident investigation and forensic log analysis.
- Definition and execution of patching and lifecycle management strategies.
- Filesystem operations and troubleshooting (LVM, XFS, ext4, mount and IO issues).
- User and remote access configuration, including SSH hardening and bastion host concepts.
Kubernetes Platform Operations
- Operational support for Kubernetes clusters across control plane and worker nodes.
- Troubleshooting pod failures, scheduling issues, container crashes, and resource exhaustion.
- Debugging of networking-related problems (CNI layers, service routing, DNS resolution).
- Management of persistent volumes, storage classes, and dynamic provisioning behaviors.
- Resource forecasting and capacity planning for cluster growth (CPU, memory, storage).
- Execution and validation of Kubernetes cluster upgrades.
- Operational support for multi-cluster and multi-environment setups.
- Analysis of Kubernetes system logs (kube-apiserver, kubelet, kube-controller-manager).
- Maintenance and enhancement of the Kubernetes stack, including version upgrades and feature adoption.
Observability & Security Platform (Wazuh)
- Design, deployment, and operational management of the Wazuh SIEM platform.
- Full lifecycle management of Wazuh agents, including policy enforcement and tuning.
- Troubleshooting log ingestion pipelines, decoders, enrichment rules, and alert logic.
- Integration of Wazuh with platform services and infrastructure.
- Analysis of security alerts and support of incident investigations.
- Performance optimization of SIEM components to ensure reliable event processing.
- Maintenance of compliance dashboards and generation of audit-relevant evidence.
- Continuous improvement of the Wazuh stack via upgrades, new features, and configuration optimization.
Observability & Monitoring Platform (Prometheus / Grafana / Alerting)
- Deployment, configuration, and operation of Prometheus-based monitoring stacks (standalone and Kubernetes-integrated).
- Administration of scraping configurations, service discovery rules, and target troubleshooting.
- Design and maintenance of recording rules and alert rules for platform components.
- Alert noise reduction through tuning and improved signal quality.
- Integration and troubleshooting of exporters (node, database, Kubernetes, etc.).
- Resolution of metric gaps, scrape latency issues, and cardinality-related performance problems.
- Capacity planning for Prometheus TSDB retention, storage requirements, and query performance.
- Development and lifecycle management of Grafana dashboards for platform and infrastructure services.
- Troubleshooting dashboard performance, data source connectivity, and visualization accuracy.
- Implementation of standardized dashboard templates across platform services.
- Integration of alerting workflows into incident management systems.
- Definition of platform SLIs/SLOs and reliability indicators.
- Correlation of metrics and logs (including Wazuh and OS logs) for root-cause analysis.
- Support and lifecycle management of Kubernetes monitoring components (Prometheus Operator, ServiceMonitor/PodMonitor).
- Validation of monitoring coverage for newly onboarded components and applications.
Database Platform Operations (PostgreSQL / Oracle PoC)
- Operational management of PostgreSQL clusters across environments.
- Monitoring key metrics (connections, locks, long-running queries, replication lag).
- Backup, restore, and disaster recovery validation.
- Growth and capacity planning for compute and storage layers.
- Support for database failover scenarios and resilience testing.
- Preparation and execution of Oracle-related proofs of concept.
- Evaluation of database deployment models (VM-based, containerized, or managed).
- Maintenance and enhancement of the database stack, including upgrades and feature adoption.
Object Storage Platform (MinIO / S3 APIs)
- Deployment and operation of MinIO-based object storage clusters.
- Troubleshooting of S3 API access, authentication, and compatibility issues.
- Monitoring capacity usage, planning storage expansions, and scaling clusters.
- Configuration of lifecycle policies, data retention, and archival strategies.
- Integration of MinIO with platform workloads, CI/CD, and backup systems.
- Performance analysis and troubleshooting of replication and erasure coding.
Networking & Firewall Operations (VMware NSX-T)
- Operational support of software-defined networking environments using NSX-T.
- Troubleshooting of routing issues, overlay networking, and cross-segment connectivity.
- Management of distributed firewall policies and micro-segmentation rules.
- Support for load balancers, service exposure, and virtual networking components.
- Administration of DNS infrastructure (zones, records, service discovery).
- Throughput, latency, and capacity analysis for critical network paths.
Remote Platform Access & Identity Integration
- Design and support of secure remote access solutions using Apache Guacamole and Entra ID.
- Troubleshooting identity flows, authentication chains, and access control policies.
- Integration with enterprise identity providers using OIDC and directory services.
- Implementation of secure access patterns for administrators and application teams.
Automation & Platform Engineering (Ansible / AWX)
- Preparation and execution of Ansible and AWX proofs of concept.
- Development of automation playbooks for platform configuration, provisioning, and lifecycle tasks.
- Integration of configuration management workflows into operational routines.
- Evaluation and optimization of automated operational processes.
- Automated deployment validation and configuration compliance checks.
Incident Management & Reliability Engineering
- 3rd-level escalation point for complex incidents across infrastructure and platform services.
- Root cause analysis using logs, metrics, and system-level diagnostics.
- Coordination of incident response across multiple technical domains.
- Identification and remediation of recurring incident patterns.
- Implementation of platform stabilization and hardening measures.
- Transition of engineered solutions into long-term operational models.
Security, Compliance & Audit Support
- Design and discussion of audit controls with internal and external auditors.
- Preparation of audit evidence for platform and application compliance.
- Integration of security controls and guardrails into automated deployment workflows.
- Maintenance of compliance-sensitive configuration baselines.
- Support for remediation of audit findings and compliance gaps.
Architecture & Technology Evaluation
- Execution of proofs of concept for emerging technologies and platform components.
- Assessment of scalability, resilience, operational complexity, and security posture.
- Creation of technical blueprints and reference architectures.
- Definition of integration strategies for new services within existing platform ecosystems.
- Evaluation of cost efficiency, maintainability, and operational impact of architectural decisions.
Collaboration & Platform Enablement
- Coordination of cross-team technical work packages across operations and engineering units.
- Support for application onboarding to shared platform services.
- Documentation of platform standards, operational procedures, and best practices.
- Contribution to presales discussions and service portfolio evolution.
- Delivery of knowledge transfer and enablement sessions for operations and development teams.