Investigation into Self-Monitoring and Self-Healing (SMASH) Software Processes, 10-R9628
Inclusive Dates: 04/01/06 - 01/01/08
Background - Over the last few years, requests for proposals (RFPs) received by SwRI have contained numerous requirements related to system recovery and redundancy. These requirements have included the ability to roll over processes to different machines, provide database recovery, and support system clustering. Additionally, many RFPs request burn-in periods for software, followed by maintenance contracts. As more complex ITS systems are developed, the need for more intelligent monitoring of their processes has become apparent. While SwRI has developed tools for monitoring these processes externally, methods of internal monitoring had not been advanced.
Approach - The intent of this research program was to determine which components of a software system were most prone to problems. The team selected the following components for inclusion in the system: sockets, threads, and timers. For each of these components, developers were canvassed to identify the problems that typically occurred. From this, a framework was designed to monitor the components for the occurrence of those problems. When a problem occurs, the process initiates a repair on the affected component. The framework also allows custom components to be monitored. A library of tools was developed to perform component monitoring.
Accomplishments - During the previous fiscal year, work was completed on a C# library for component monitoring, including the monitorable components; process, socket, and thread monitor classes; and a generic monitor. These elements monitor the status of the various components and instruct a component to repair itself if needed. When a component reports a status other than normal operation, the monitor requests a textual description of the problem, logs a message containing this description, and instructs the component to repair the problem. For instance, one problem reported was timers that stopped firing. The time at which a timer last fired is monitored, and the timer can be restarted if that time is not consistent with its firing period.
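The monitoring cycle described above can be illustrated with a minimal sketch. The project library was written in C#; the sketch below uses Python for brevity, and every name in it (Status, MonitoredTimer, GenericMonitor, the 1.5x tolerance factor) is a hypothetical choice for illustration, not taken from the SwRI library. It shows a timer component that reports a non-normal status when its last firing time is inconsistent with its period, and a monitor that logs the component's description of the problem and then requests a repair.

```python
import logging
import time
from enum import Enum


class Status(Enum):
    NORMAL = "normal"
    STALLED = "stalled"


class MonitoredTimer:
    """Hypothetical monitorable timer that tracks when it last fired."""

    def __init__(self, period_s, tolerance=1.5, clock=time.monotonic):
        self.period_s = period_s
        self.tolerance = tolerance  # assumed slack factor before declaring a stall
        self.clock = clock
        self.last_fired = clock()
        self.restarts = 0

    def fire(self):
        """Called each time the underlying timer fires."""
        self.last_fired = self.clock()

    def status(self):
        elapsed = self.clock() - self.last_fired
        if elapsed > self.period_s * self.tolerance:
            return Status.STALLED
        return Status.NORMAL

    def describe_problem(self):
        elapsed = self.clock() - self.last_fired
        return f"timer last fired {elapsed:.1f}s ago; period is {self.period_s:.1f}s"

    def repair(self):
        """Restart the timer; in this sketch that only resets the firing baseline."""
        self.restarts += 1
        self.last_fired = self.clock()


class GenericMonitor:
    """Polls any object exposing status(), describe_problem() and repair()."""

    def __init__(self, components, logger=None):
        self.components = list(components)
        self.logger = logger or logging.getLogger("smash.monitor")

    def check_once(self):
        """Check every component once and return how many repairs were requested."""
        repaired = 0
        for component in self.components:
            if component.status() is not Status.NORMAL:
                self.logger.warning(component.describe_problem())
                component.repair()
                repaired += 1
        return repaired
```

Because GenericMonitor relies only on the three method names, a custom component can be monitored by providing the same methods. A real implementation would also need to restart the underlying timer object in repair(), which is omitted here.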
Beyond the monitoring framework, further functionality was included in the system. Logging was implemented at the per-component level so that detailed output can be turned on for a problem component (e.g., a socket can report packets transmitted and received). For threads, deadlock prevention code was added. Finally, to minimize the latency introduced by monitoring, the monitor level can be revised automatically based on the number of repairs in a given period.
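One way the automatic revision of the monitor level might work is sketched below, in Python rather than the project's C#. The source does not say how levels are defined or which thresholds are used, so the three polling intervals, the window length, and the repairs-per-step threshold are all assumptions made for illustration. The idea is that a component with many recent repairs is checked more often, while a quiet component is checked less often to reduce overhead.

```python
from collections import deque


class AdaptiveMonitorLevel:
    """Chooses a polling interval from the number of recent repairs (hypothetical policy)."""

    # Assumed levels: seconds between checks, from relaxed to aggressive.
    INTERVALS_S = (30.0, 5.0, 1.0)

    def __init__(self, window_s=60.0, repairs_per_step=2):
        self.window_s = window_s                  # length of the period examined
        self.repairs_per_step = repairs_per_step  # repairs needed to go up one level
        self._repair_times = deque()

    def record_repair(self, now):
        """Note that a repair happened at time `now` (seconds)."""
        self._repair_times.append(now)

    def level(self, now):
        """Return the current level index, 0 being the most relaxed."""
        # Discard repairs that fall outside the current window.
        while self._repair_times and now - self._repair_times[0] > self.window_s:
            self._repair_times.popleft()
        steps = len(self._repair_times) // self.repairs_per_step
        return min(steps, len(self.INTERVALS_S) - 1)

    def interval_s(self, now):
        """Return the number of seconds to wait before the next check."""
        return self.INTERVALS_S[self.level(now)]
```

A monitor loop could call record_repair() whenever it requests a repair and use interval_s() to decide how long to sleep before the next check, so the monitoring cost falls back to the relaxed interval once repairs age out of the window.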