Investigation into Self-Monitoring and Self-Healing (SMASH) Software Processes, 10-R9628
Inclusive Dates: 04/04/06 to 01/01/08
Background - Over the past few years, proposals received by SwRI have contained numerous requirements related to system recovery and redundancy. These requirements have included the ability to roll processes over to different machines, provide database recovery, and support system clustering. Additionally, many proposals request burn-in periods for software followed by maintenance contracts. As Intelligent Transportation Systems grow more complex, the need for more intelligent monitoring of their processes is apparent. While SwRI has developed tools for monitoring these processes externally, methods of internal monitoring have not been advanced.
Approach - The intent of this research project is to determine which components of a software system are most prone to problems. From experience, the team is aware of issues that can occur with sockets, threads, and timers, and will identify additional problematic components by interviewing a diverse group of developers. Further investigation will determine which problems occur frequently and what repairs can be made when a problem is encountered. The project will design a framework that gives a software process the ability to monitor the identified components. When a problem occurs, the process will initiate a repair on the affected component. The framework will also allow custom components to be monitored. Once the design is complete, a library of tools will be developed containing monitored components and a process monitor.
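The framework described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's code; the names MonitoredComponent, ProcessMonitor, FlakyConnection, is_healthy, repair, and check_all are all hypothetical, chosen only to show how a process monitor might poll components, trigger repairs, and accept custom components.

```python
from abc import ABC, abstractmethod


class MonitoredComponent(ABC):
    """Hypothetical base class: anything the process monitor can watch."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def is_healthy(self):
        """Return True when the component is working normally."""

    @abstractmethod
    def repair(self):
        """Attempt to restore the component to a healthy state."""


class ProcessMonitor:
    """Hypothetical monitor: checks registered components and repairs failures."""

    def __init__(self):
        self._components = []

    def register(self, component):
        self._components.append(component)

    def check_all(self):
        """Run one monitoring pass; return names of components that were repaired."""
        repaired = []
        for component in self._components:
            if not component.is_healthy():
                component.repair()
                repaired.append(component.name)
        return repaired


class FlakyConnection(MonitoredComponent):
    """Custom component for illustration only: a connection that can drop."""

    def __init__(self, name):
        super().__init__(name)
        self.connected = True

    def is_healthy(self):
        return self.connected

    def repair(self):
        self.connected = True  # stand-in for a real reconnect
```

In this sketch a custom component only has to supply its own health check and repair action; the monitor needs no knowledge of what the component actually does.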
Accomplishments - During the previous fiscal year, component investigation, some repair investigation, and the initial framework design were completed. This year, research into component repairs was completed, although some additional time may still be spent researching socket methods. The repairs address timers that fail to fire, prevention of thread deadlock, and detection and repair of socket disconnections.
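This summary does not state how the library prevents thread deadlock. One widely used technique, shown here purely as an assumption, is to acquire locks in a fixed global order so that two threads can never hold them in opposite order. The sketch is in Python, and OrderedLock, acquire_in_order, transfer, and run_demo are hypothetical names.

```python
import threading
from contextlib import contextmanager


class OrderedLock:
    """Hypothetical lock wrapper carrying a fixed rank used for ordering."""

    def __init__(self, rank):
        self.rank = rank
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()


@contextmanager
def acquire_in_order(*locks):
    """Acquire all locks in ascending rank and release them in reverse order."""
    ordered = sorted(locks, key=lambda lock: lock.rank)
    acquired = []
    try:
        for lock in ordered:
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):
            lock.release()


def transfer(first, second, counter, rounds):
    """Worker that needs both locks; the caller's argument order does not matter."""
    for _ in range(rounds):
        with acquire_in_order(first, second):
            counter[0] += 1


def run_demo(rounds=1000):
    """Two threads request the same locks in opposite order without deadlocking."""
    a, b = OrderedLock(1), OrderedLock(2)
    counter = [0]
    t1 = threading.Thread(target=transfer, args=(a, b, counter, rounds))
    t2 = threading.Thread(target=transfer, args=(b, a, counter, rounds))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    # Second value is True only if a thread is still blocked after the timeout.
    return counter[0], t1.is_alive() or t2.is_alive()
```

Without the sorting step, the two workers would take the locks in opposite order and could block each other indefinitely.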
Significant work has been completed on the C# library, including a MonitorableTimer, deadlock prevention code, and the process and component monitoring classes. Because the desired result of the research is a highly reliable monitoring system, extensive unit tests have been written for the various classes. Work is currently progressing on the MonitorableSocket methods to ensure that problem conditions are successfully detected and repaired. In addition to the tool code, a viewer was created both to confirm the usability of the external interface and to provide an easy view of a process's status.
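The idea of detecting a timer that has stopped firing can be sketched as a heartbeat check. The following Python example is an illustration only and does not reproduce the C# MonitorableTimer, whose interface is not described here. The class name MonitorableTimerSketch, the tolerance parameter, and the injected clock are assumptions made so the behavior can be shown deterministically.

```python
import time


class MonitorableTimerSketch:
    """Hypothetical stand-in for a monitored timer; not the SwRI C# class."""

    def __init__(self, interval, callback, clock=time.monotonic, tolerance=2.0):
        self.interval = interval      # expected seconds between firings
        self.callback = callback      # work to run on each firing
        self._clock = clock           # injectable time source (assumption)
        self._tolerance = tolerance   # allowed multiple of interval before alarm
        self._last_fired = clock()
        self.restarts = 0

    def fire(self):
        """Called by the underlying scheduler each time the timer elapses."""
        self._last_fired = self._clock()
        self.callback()

    def is_healthy(self):
        """Healthy if the timer fired within tolerance * interval seconds."""
        elapsed = self._clock() - self._last_fired
        return elapsed <= self._tolerance * self.interval

    def repair(self):
        """Stand-in for restarting the underlying timer; resets the heartbeat."""
        self.restarts += 1
        self._last_fired = self._clock()
```

A monitor that calls is_healthy periodically would notice a missed firing after at most tolerance times the interval and could then call repair to restart the timer.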