Postmortem Report: Database Misconfiguration

INCIDENT OVERVIEW:

DATE: 25TH JANUARY 2024

START TIME: 1400 HRS (EAT)

END TIME: 1530 HRS (EAT)

DURATION: 1 HR 30 MINS

IMPACT:

On 25th January 2024 at 1400 hrs (EAT), an incident occurred that brought down the main server. The root cause was identified as a misconfiguration of the database settings, which caused key database services to fail.

INCIDENT RESPONSE TIMELINE:

1405 HRS - DETECTION TIME:

  • The anomaly was first detected when alerts from the server monitors fired, indicating a rise in connection failure errors.
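The kind of alert described above can be sketched as a sliding-window failure-rate check. This is only an illustration, not the monitoring setup actually used during the incident: the class name `ConnectionErrorMonitor`, the window size, and the threshold are all assumptions.

```python
from collections import deque


class ConnectionErrorMonitor:
    """Tracks recent connection attempts and flags a high failure rate."""

    def __init__(self, window_size=100, threshold=0.2):
        # window_size and threshold are illustrative values, not the
        # settings of the real monitors mentioned in this report.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, succeeded):
        """Store the outcome of one connection attempt (True = success)."""
        self.results.append(bool(succeeded))

    def failure_rate(self):
        """Fraction of failed attempts in the current window (0.0 if empty)."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        """True once the window is full and failures exceed the threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.failure_rate() > self.threshold
```

Waiting for a full window before alerting avoids firing on the first one or two failures after a restart, at the cost of a slightly later alert.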

1410 HRS - ACKNOWLEDGEMENT:

  • A ticket was immediately raised to notify the IT department of the server outage so that the incident could be investigated promptly.

1425 HRS - ASSESSMENT:

  • Initial tests on the server revealed errors in the database logs and some missing configuration settings.

1440 HRS - ESCALATION:

  • The issue was immediately escalated to the Database Administration team for a more detailed analysis.

1455 HRS - ROOT CAUSE IDENTIFICATION:

  • After in-depth analysis of the database, it was identified that a junior engineer had committed new settings without full authorization from his immediate supervisor, and that these settings led to the connectivity issues.

1520 HRS - RESOLUTION:

  • The settings were immediately reverted to the last committed configuration, returning the database to its last known stable state.
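Reverting to a last known stable state presumes that earlier settings are kept and that it is recorded which of them were verified. A minimal sketch of that idea follows; the `ConfigHistory` class and the `max_connections` setting in the usage example are hypothetical and do not describe the tooling used in this incident.

```python
import copy


class ConfigHistory:
    """Keeps every committed settings snapshot and whether it was verified."""

    def __init__(self, initial_settings):
        self._versions = [copy.deepcopy(initial_settings)]
        # Assumption: the initial settings are known to be good.
        self._stable = [True]

    def commit(self, settings):
        """Record a new snapshot; it is untrusted until marked stable."""
        self._versions.append(copy.deepcopy(settings))
        self._stable.append(False)

    def mark_stable(self):
        """Flag the latest snapshot as verified (e.g. after health checks)."""
        self._stable[-1] = True

    def current(self):
        """Return a copy of the latest snapshot."""
        return copy.deepcopy(self._versions[-1])

    def rollback_to_last_stable(self):
        """Discard unverified snapshots until the latest one is stable."""
        while not self._stable[-1]:
            self._versions.pop()
            self._stable.pop()
        return self.current()
```

For example, after `history = ConfigHistory({"max_connections": 100})` and an unverified `history.commit({"max_connections": 0})`, calling `history.rollback_to_last_stable()` returns the original settings.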

1530 HRS - SERVICE RESTORATION:

  • The services were gradually restored, and a number of tests were run to confirm that the database was in a stable state.

ROOT CAUSE ANALYSIS AND RESOLUTION:

  • The connectivity issue was caused by a misconfiguration of database settings.

  • The changes were not properly tested before they were committed.

CORRECTIVE AND PREVENTIVE ACTIONS:

SHORT-TERM:

  • The committed changes were immediately reverted to the last known stable commit.

LONG-TERM:

  • The right procedures should be followed for confirming and authorizing any changes.

  • A well-documented commit procedure should be in place to prevent such mishaps in the future.

  • Enhanced monitoring should be put in place to allow early detection of such incidents.
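One way the authorization and commit-procedure points above could be enforced is an automated check that runs before any settings change is accepted. The sketch below is an assumption-laden illustration: the required setting names, the port rule, and the `approved_by` field are hypothetical and would need to match the real database configuration and approval process.

```python
# Hypothetical setting names; replace with the keys the real database needs.
REQUIRED_KEYS = {"host", "port", "max_connections"}


def validate_change(settings, approved_by=None):
    """Return a list of problems; an empty list means the change may proceed."""
    problems = []

    # Catch missing settings, one of the symptoms seen in this incident.
    missing = sorted(REQUIRED_KEYS - set(settings))
    if missing:
        problems.append("missing settings: " + ", ".join(missing))

    # Basic sanity check on a value that commonly breaks connectivity.
    port = settings.get("port")
    if port is not None and not (isinstance(port, int) and 0 < port < 65536):
        problems.append("port must be an integer between 1 and 65535")

    # Require a named approver before the change is committed.
    if not approved_by:
        problems.append("change has no recorded approver")

    return problems
```

A commit would then be rejected whenever `validate_change(...)` returns a non-empty list, which makes both the missing-settings case and the missing-approval case visible before the live server is touched.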

CONCLUSION:

The incident has been a valuable lesson in the need for a standard procedure for committing changes and for thorough testing before changes are made to the live server. The corrective actions above are to be implemented to avoid a similar mishap in the future.

Sincerely,

Francis Ng’ethe