Site Reliability Engineer at Google
Sunnyvale, California
Site Reliability Engineer @ Google
Google, Mountain View
January 2015 - Present
Borg (internal Google cloud) Site Reliability Engineer
Responsible for machine-level performance SLIs and SLOs.
Lead and originator of ‘Node Platform Experiments’, an initiative allowing devs to safely push experimental changes onto a limited subset of production machines.
Lead and originator of ‘First-Class Metrics’, which, from a single line of Python describing SLI metadata, grants the ability to:
  A/B those SLIs against another team’s rollouts.
  Generate dashboards to monitor those rollouts.
  Generate alerts/SLOs to be used for alerting.
Used for all machine-level software rollouts.
SRE lead for ‘Borg Launch Eval’, a joint project with statisticians with the goal of establishing reusable statistical methodology for determining whether a 2% improvement/regression is within noise.
Lead and originator of ‘Shared Borg Pilot’, which allows Google to qualify new hardware (e.g. Zen, Optane) by allowing small amounts of experimental hardware to run live in production.
On-call for Borg, handling major GCE/GCP customer-impacting incidents.
Local team expert for machine performance analysis. Often personally requested by other SRE teams during outages to debug resource isolation issues.
Developer for ‘Machines Correlator’, a service which tries to determine the root cause behind outages by automatically performing correlation and random forest analysis when an incident is reported.
Works with a wide variety of teams, including: Kernel devs, Cloud Node-level (Borglet) devs, Cloud Cluster (Borgmaster) devs, Hardware Platforms devs, App Engine SREs, YouTube SREs, Storage SREs, Compiler experts, Gmail SRE, Websearch SRE, etc.
1600 Amphitheatre Pkwy

Staff Software Developer for J9 JavaVM Team @ IBM Canada Ltd.
Worked mostly on garbage collection/memory management.
Worked with universities (University of New Brunswick) to develop new VM-related technologies, e.g. object sharing between VMs.
Worked on improving garbage collection performance by improving cache locality, reducing lock contention, etc.
Worked on exploiting new hardware features, e.g. NUMA, transactional memory.
Worked on a region-based garbage collector and a pause-less copying garbage collection algorithm.
Debugged for major customers (banks, etc.).
Developed debug extensions using a Java-based core-file reader.
Received an Award for Excellence for performance work with the Extreme Scale product.
From June 2010 to December 2015 (5 years 7 months)
Ottawa, Canada Area

Extreme Blue Software Developer Co-op for J9 Java VM Team @ IBM Canada
Worked on an Extreme Blue team to rapidly develop a project and pitch it to executives.
Worked on the Java Virtual Machine, separating Java code from native code (normally accessed through JNI) so that they can run in two separate processes (possibly on separate machines).
Designed a ring buffer for fast inter-process communication transport using shared memory.
From August 2009 to December 2009 (5 months)
Ottawa, Canada Area

Software Developer Co-op for RAS/PD team for DB2 @ IBM Canada
Worked on the RAS/PD (Reliability Availability Serviceability / Problem Determination) team, responsible for development of diagnostic functionality, fault injection, monitor utilities, etc. for DB2 (database management system).
Designed the security plan for execution of db2pd (low-level utility for collecting statistics and diagnostic information by reading shared memory).
Responsible for shipping db2top (high-level monitor utility for DB2 designed for multiple partitions) out of alpha.
Worked on db2trace (DB2 tracing engine) to enable logging to multiple files.
Developed an API for installing signal (Unix)/exception (Win32) handlers to perform back traces, core dumps, and general diagnostic information dumps on traps.
From January 2008 to August 2008 (8 months)
Toronto, Canada Area

Computer Technician @ Computers for Schools
Repaired computers. Wrote scripts to automate data entry. Prepared hard-disk images for new computers. Worked as job captain.
From May 2006 to August 2006 (4 months)
Edmonton, Canada Area
Site Reliability Engineer
1600 Amphitheatre Pkwy
IBM Canada Ltd.
Staff Software Developer for J9 JavaVM Team
June 2010 to December 2015
Ottawa, Canada Area
IBM Canada
Extreme Blue Software Developer Co-op for J9 Java VM Team
August 2009 to December 2009
Ottawa, Canada Area
IBM Canada
Software Developer Co-Op for RAS/PD team for DB2
January 2008 to August 2008
Toronto, Canada Area
Computers for Schools
Computer Technician
May 2006 to August 2006
Edmonton, Canada Area
What company does Lai Nguyen work for?
Lai Nguyen works for Google.
What is Lai Nguyen's role at Google?
Lai Nguyen is a Site Reliability Engineer at Google.
What industry does Lai Nguyen work in?
Lai Nguyen works in the Computer Software industry.