no name tainguyenbp

Hello there 👋

I'm a Site Reliability Engineer on an SRE team from Vietnam.

On the way to Site Reliability Engineering, with Server Reboot Engineer experience as well as System Reboot Engineer and Service Restart Engineer experience.

Graduated with a bachelor's degree in Computer Science.

Skills: Linux, Bash, Python, Go, Ansible, Terraform, Prometheus, Security, AWS, Azure, PLG stack, LGTM stack, ELK stack, Docker, Kubernetes, Helm, Kustomize, GitLab CI, GitHub Actions, Jenkins.

🏢 I'm currently working for an E-Commerce company.
💻 I'm currently learning about Linux, open source, Kubernetes, Helm, Security, CI/CD, Python, and Golang, and I'm particularly interested in everything related to cloud-native and cloud technologies.
😄 Hobbies: Football ⚽, Swimming 🏊, Jogging 🏃, Walking 🚶.
🚍 Communication: HTTP, WebSocket, gRPC, REST, GraphQL, Kafka.
🧰 Version Control: Git, GitHub, GitLab.
🔨 Tools: Vim, Xcode, Visual Studio Code, Atom, Postman, Jira, SonarQube, Neovim.
🧊 Apache: Solr.
🤿 DevOps: Bash, Docker, Kubernetes, CI/CD, Jenkins, Grafana, Loki, Prometheus, Thanos, VictoriaMetrics, Terraform, Ansible, Vault, Nginx, Consul, OpenTelemetry.
☁️ Cloud: AWS, GCP, Microsoft Azure.
💾 Database: PostgreSQL, MySQL, Redis, MongoDB, MariaDB, MSSQL.
🔬 Analytics: Elasticsearch, Apache Spark.
🖥️ Operating system: Windows, macOS, Linux, Ubuntu, Fedora, Arch Linux, Linux Mint.

👋 Connect With Me

🛠️ Languages & Tools

⭐ GitHub Stats

Contributing to Open-Source 🔥

Nguyen Ngoc Tai - Site Reliability Engineer

Address: [Address], [City, State, ZIP]
Phone: [Phone Number] Email: [Email Address] Github: [Github Address]

Objective

I'm a DevOps Engineer and Site Reliability Engineer (SRE) with over 3 years of hands-on experience designing, building, maintaining, and administering large-scale Kubernetes clusters, optimizing the performance and cost of highly scalable and reliable systems, and managing data centers across Linux-based environments, networking, and services. I specialize in supporting, automating, and optimizing mission-critical deployments on cloud platforms such as AWS and Azure, leveraging tools for configuration management, CI/CD pipelines, and DevOps best practices. Outside of work, I dedicate my free time to learning new technologies and contributing to the open-source community, driven by my deep passion for Linux and innovation. I am seeking a challenging position where I can leverage my expertise in automation, monitoring, incident response, and infrastructure management to ensure the availability, performance, and efficiency of critical applications and services.

Education

[Bachelor's/Master's Degree] in [Computer Science/Engineering/Information Technology]
[University Name], [Year]

Certifications

[0xAWS], [Certifying Organization], [2024]
[0xAzure], [Certifying Organization], [2024]
[0xGCP], [Certifying Organization], [2024]
[0xCKA], [Certifying Organization], [2024]
[0xCKS], [Certifying Organization], [2024]

Technical Skills

Programming Languages: Shell scripting, Python, Go
Cloud Technologies: AWS, Azure, Google Cloud Platform
Containerization and Orchestration: Docker, Kubernetes
Infrastructure as Code (IaC): Terraform, Ansible
Continuous Integration/Continuous Delivery (CI/CD): Jenkins, GitLab CI/CD
Monitoring and Alerting: Prometheus, Grafana, ELK Stack
Incident Response and Troubleshooting: PagerDuty, Splunk, New Relic
Reliability Engineering: SLA/SLO, Error Budgets, Chaos Engineering
Networking: TCP/IP, DNS, Load Balancing
Collaboration and Communication: Agile, Scrum, Jira, Confluence

Experience

[OnPoint E-commerce], [Ho Chi Minh - Vietnam]

Site Reliability Engineer | DevOps/SRE Teams, [May 2022 - Present]

Collaborated with SWE, QC, and PO teams to automate and accelerate the build, test, release, and deployment of applications into runtime environments quickly and reliably, and improved application performance and reliability through performance tuning, load testing, and code optimization.
Integrated security best practices into CI/CD pipelines by establishing a DevSecOps culture. Automated security scanning and compliance validation with tools like Snyk and Soblew, achieving a significant reduction in vulnerabilities and meeting all regulatory audit requirements.
Designed and implemented a GitOps-based continuous deployment pipeline using GHA and Jenkins for CI and ArgoCD for CD, accelerating release cycles, improving code quality through a significant reduction in production bugs, and reducing configuration drift and rollback incidents across 8+ Kubernetes clusters, ensuring greater stability and reliability in production environments.
Collaborated with DE and DA teams to set up DataOps, accelerating continuous integration and continuous deployment of Data Engineering solutions built on Airflow, Spark, PostgreSQL, ClickHouse, and PowerBI. Worked with Data Engineers and Data Analysts to set up automated monitoring with Prometheus and Grafana that continuously tracks the reliability, availability, and performance of data pipeline components and processes, so incidents are detected in real time and resolved in near real time. Reduced deployment time and increased release frequency.
Designed and implemented a centralized observability platform with metrics, monitoring, and alerting built on Prometheus, Grafana, Percona, and OpenTelemetry for real-time monitoring of application services. Reduced incident response time by implementing proactive alerting and leveraging the ELK Stack and PLG Stack for enhanced logging and analysis. Used Sentry and Elasticsearch APM for improved error and exception tracking, APM tracing, log aggregation, and incident diagnostics.
Designed and implemented multi-node Kubernetes clusters with autoscaling and intelligent resource management using Karpenter and KEDA. Achieved optimized container resource allocation and enhanced system performance while ensuring high availability, fault tolerance, and seamless operation for mission-critical applications.
Automated Jenkins pipelines and manifest template deployments for containerized applications and migrated them to GitHub Actions (GHA), Helm, and Kustomize, enabling faster and more reliable release cycles.
Reduced AWS cloud infrastructure costs by 35% through monitoring with AWS Cost Explorer and Azure Cost Management, optimized K8s cluster autoscaling, and spot instances across AWS and Azure, all while maintaining system stability and performance.
Led incident response and troubleshooting efforts, ensuring timely resolution of critical incidents and minimizing downtime.
Implemented infrastructure automation using Terraform and Ansible, reducing manual provisioning time by 70% and improving consistency across environments.
Designed and built scalable Kubernetes clusters on AWS/GCP for deploying microservices, improving application scalability and fault tolerance.
Developed and maintained CI/CD pipelines using Jenkins and GitLab CI/CD, enabling automated building, testing, and deployment of applications.
Implemented monitoring and alerting solutions using Prometheus, Grafana, and ELK Stack, enabling proactive issue detection and reducing mean time to resolution.
Collaborated with development teams to improve application performance and reliability through performance tuning, load testing, and code optimization.
Conducted Chaos Engineering experiments to proactively identify system weaknesses and improve resilience.
Participated in on-call rotations, responding to incidents and performing root cause analysis to prevent recurrence.

[VinCSS Cyber], [Full-time], [Ho Chi Minh - Vietnam]

DevOps Engineer | SecOps/Defense Teams, [March 2021 - May 2022], [Full-time]

Collaborated with R&D teams to develop and maintain CI/CD pipelines using GitLabCI, GHA. Integrated automated building, testing, code scanning, release and deployment of applications into a runtime environment quickly and reliably, reducing build and deployment times. Orchestrated the migration of Dockerized services/web applications to Kubernetes (RKE).
Designed and implemented a GitOps workflow for vCenter On-Premises infrastructure, leveraging GitLab CI/CD, Packer, Terraform and Ansible. Automated the provisioning, configuration, and management of virtual machines, reducing manual intervention and deployment times while ensuring consistency and scalability across the environment.
Collaborated with Security and Incident Response teams to implement and maintain system hardening, mitigations, hotfixes, patches, updates, and security controls, ensuring compliance with industry standards.
Implemented a robust monitoring and alerting system to ensure system availability and reliability using GitLab CI/CD, Prometheus, and Grafana, resulting in decreased system downtime and faster issue resolution. Developed applications to proactively identify and address issues. Implemented centralized logging and log analysis using PLG Stack and ELK Stack, enhancing troubleshooting and monitoring capabilities.
Collaborated with cross-functional IT teams (Network, Support, Development, Security) to deploy, secure, and optimize IT systems across development, testing, and production environments. Designed, configured, and deployed tools such as VMware, GitLab, and Jira to enhance team productivity and improve system reliability across On-Premises infrastructure and GCP infrastructure (Compute Engine, Cloud Storage, Cloud SQL).
Managed infrastructure on AWS, including EC2, S3, RDS, and VPC, ensuring high availability, scalability, and security.
Automated infrastructure provisioning and configuration using Terraform and Ansible, reducing deployment time by 50% and improving infrastructure consistency.
Implemented centralized logging and log analysis using ELK Stack, improving troubleshooting and monitoring capabilities.
Worked closely with development teams to implement performance monitoring and optimization strategies.
Collaborated with security teams to implement and maintain security controls and ensure compliance with industry standards.
Conducted disaster recovery planning and testing exercises to ensure business continuity.

[YouNet Group], [Full-time], [Ho Chi Minh - Vietnam]

SysOps Engineer | Infrastructure Teams, [July 2019 - February 2021], [Full-time]

Managed and maintained services/web applications using Docker containerization. Led the migration of services/web applications to Docker, ensuring seamless integration. In the next phase, orchestrated the migration of legacy monolithic applications to a containerized architecture on Kubernetes (Kubeadm), enhancing scalability and operational efficiency.
Implemented a GitOps infrastructure automation workflow for VMware On-Premises infrastructure using GitLab CI, Packer, Terraform, Ansible, and RunDeck, reducing manual provisioning time and improving consistency across environments.
Collaborated with cross-functional teams to design and implement network, system, and storage infrastructures. Partnered with development teams to deploy and manage development, staging, and production environments using GitLab CI/CD. Successfully supported multiple frameworks, including LAMP, LEMP, PHPFox, and WordPress, ensuring seamless integration and optimal performance.
Built, managed, operated, and monitored network systems and service infrastructure, and researched, evaluated, and selected solutions for them (Prometheus & Grafana, Loki, Zabbix). Implemented centralized logging, log analysis, and network performance and traffic analysis (ELK Stack) with alerts sent to Telegram and Jira, improving troubleshooting and monitoring capabilities and resulting in less system downtime and faster issue resolution.
Managed and maintained IT software infrastructure, including upgrades, mitigations, hotfixes, patches, and security controls for both On-Premises environments and AWS infrastructure (EC2, S3, RDS). Designed and implemented robust backup and disaster recovery policies to ensure the durability and availability of company systems.

[LCS Group], [Full-time], [Ho Chi Minh - Vietnam]

System Engineer | System/Network Teams, [April 2016 - May 2019], [Full-time]

Installed, configured, and maintained development, staging, and production environments for over 400 virtual machines on VMware vCenter within a Data Center environment. Managed and maintained a network infrastructure of over 100 Cisco and Nexus devices, ensuring optimal performance, security, and reliability. Configured systems and installed, deployed, and administered services on Linux-based operating systems and network devices, ensuring they met organizational requirements and management directives.
Designed and maintained a robust monitoring and alerting system using Zabbix and PRTG, resulting in less network and system downtime and faster issue resolution. Monitored the company's IT infrastructure: servers, backup systems, switch and router configurations, cameras, firewalls, internet connections, applications, and software systems.
Collaborated with SWE teams to develop and maintain automated product releases to the production environment using Bash and Ansible. Developed and maintained the end-to-end CI/CD pipeline using Bash and Ansible, reducing install, build, and deployment times.
Installed, configured, managed, and supported VOIP systems, including call routing and voicemail management, and used troubleshooting tools and techniques to diagnose and resolve network and voice quality issues. Engineered and deployed scalable VOIP solutions for over 3000 users within the organization. Implemented a VOIP system for a new call center, enabling support for an additional 7000 concurrent calls. Migrated traditional telephony systems to a centralized VOIP platform. Designed and maintained robust monitoring and alerting for VOIP calls using Prometheus and Grafana.
Database Administrator (MariaDB, MySQL): Managed and optimized standalone and clustered environments, implemented replication, automated backups, tuned performance, ensured high availability, and supported data queries and visualization reports for team leaders.

Projects

Project Name: Implemented a comprehensive observability solution using Prometheus and Grafana, providing real-time monitoring and alerting for critical applications and services.
Project Name: Led the migration of legacy infrastructure to a containerized architecture using Kubernetes, resulting in improved scalability and reduced operational overhead.

References

Available upon request
