Network Reliability Report

Networking and Telecommunications Standing Committee
Council on Information Technology and Services
University of Florida

June 29, 1994

Prepared by R. E. Newman-Wolfe
CSE-346
392-1488
nemo@cis.ufl.edu
http://www.cis.ufl.edu/ nemo

1 Introduction

This document addresses the issues that the Networking and Telecommunications Standing Committee (NTSC) of the Council on Information Technologies and Services (CITS) considers most important with regard to network reliability. Extending service to new users should not come at the expense of the level of reliability delivered to existing users. We recommend that new service receive the quality specified here, and that older parts of the network migrate to this standard as required or as the opportunity presents itself. Procedures affecting the whole network should be implemented as soon as practical.

First, it is critical that the users of the campus network be surveyed to determine both the level of service they expect and the degree to which they perceive the network as critical to their mission. This survey should cover expectations of procedures, response times, training opportunities, and information distribution; the nature of tolerable outages; the way in which the network is used; what other groups of users and resources are accessed using the network; and how adversely an interruption of network access would affect their mission. The NTSC does not propose to conduct this survey itself, but advises the CITS to find an appropriate mechanism by which it can be conducted. The results of the survey should be made available to the involved deans, to the provost and president, to CITS, and to the NTSC.

Network reliability is difficult to measure meaningfully. An outage of five minutes every week is not the same as a two-hour outage every six months.
Users generally want "100% reliability" (no outages whatsoever), but this is not possible. The degree to which reliability is achieved depends on the resources applied and the ways in which they are used. Perhaps a more important measure, and a more achievable one, is the perception of reliability. This includes procedures that give users the sense that their needs are being addressed, and that they are informed of the state of the network as it affects them and of the measures being taken to keep services available to them. In what follows, perceived reliability is the major focus, rather than some other metric such as a measure of up-time. While the procedures and mechanisms described here are the best known to the NTSC at the present time, newer procedures and products may supersede them.

For the purposes of the procedures and planning outlined in this document, we focus primarily on the campus backbone, but we cannot ignore "end-to-end" reliability issues. For successful communication to occur between two hosts, all the networks connecting the two hosts must function properly, including in particular the local area networks (LANs) to which the hosts are connected. Even though the campus network is managed in a distributed fashion, it is important that the user perceive support all the way to the desktop.

2 Recommendations for Network Reliability

1. Expected Procedures

   (a) Response

       · The required response times must really come from the users, but we consider the numbers below to be the minimum.
       · A central trouble reporting, dispatching, and tracking desk (one-call shopping)
       · 4-ring response to phone-in trouble reports 5 days a week, from 8 a.m. to 12 a.m.
         Options that various user groups may require include:
         - days by 8 hours a day
         - days by 12 hours a day
         - days by 8 hours a day
         - days by 12 hours a day
         - days by 16 hours a day
         - days by 24 hours a day
       · 1-hour response time 5 x 8 (technician on the spot)
       · 4-hour response time 7 x 24 (off-hours)

   (b) Procedures

       · Use compatible systems for logging and tracking trouble reports and updates across campus.
       · Log each trouble report, issue a ticket at the time of the call, and tell the caller the ticket number (so that they can check status later). Software packages for this already exist.
       · If possible, users and managers should be informed of problems that affect them directly and given the corresponding ticket numbers. Some of the calls (those that report obviously non-local problems) should be posted to the CCC and LAN-Mgrs lists.
       · The status of problems should be made available via phone mail, the finger program, and other means so that users and managers can track progress (using the ticket number). Updates and follow-ups must be posted as progress is made.
       · Users and managers should be informed of scheduled downtime for preventative maintenance, which should be performed during off-hours (as determined by the affected user groups).
       · The physical infrastructure of the network should be inspected at regular intervals, with all of the visible portions of the network (POPs, conduit, etc.) inspected at least once a year. The inspection results should be documented and any irregularities given trouble report tickets.

2. Proactive Monitoring

   (a) Goals

       · Monitor 7 days a week, 24 hours a day, to anticipate and respond to trouble for all major protocols.
       · Trend detection: This requires maintaining archives over a multi-year period on a per-interface basis, to be used for anticipating requirements for throughput upgrades.
       · Failure detection: Short-term records (at least a month) should be kept on-line, including hourly volume and exceptions.
         These help, once a failure has been detected, to determine whether symptoms manifested themselves earlier or whether related conditions led to the failure.
       · Renegade detection (e.g., unauthorized hosts, bogus addresses, rogue machines)
       · Net-hog detection (e.g., detection of excessive capacity usage, especially if it causes congestion and intolerably poor service to others)

   (b) Mechanisms

       · Monitoring tools
       · Manageable hardware wherever possible (e.g., SNMP, RMON, CNMP)
       · Out-of-band management: This is critical; otherwise, when the network fails, the management functions needed to tell what is going on or to fix the problem cannot be reached.

3. Physical Security

   (a) Access

       · Wiring closets should be locked, and far fewer keys should be in circulation. Access should be limited to the physical plant and network management personnel appropriate to the facility. A policy consistent with these guidelines should be implemented and enforced. Card key access may be appropriate.
       · Wiring closets should not be used for storage of non-networking and non-telecom items. Extraneous tools (e.g., mops, brooms, ladders) and other items (boxes, spare parts) should be kept in separate closets or, at least, in a physically separated part of the same closet (in those instances where a retrofit causes a telecom closet to share space with another function).

   (b) Conditioned Environment

       · Humidity and temperature control is essential for each point of presence (POP). This may be achieved by conditioning the wiring closet or by using a self-contained rack in an unconditioned room.
       · UPS (managed): An uninterruptible power supply (UPS) is necessary to permit continuation of service over short power outages, and to permit controlled shutdown of network services if the outage is longer.
         A managed UPS is desirable because it can inform network agents (programs that can then alert network management personnel) when it detects a power loss, so that other actions can be taken and losses anticipated. In-band communication (using the campus network) between the UPS and the agent is acceptable, since a power loss is unlikely to coincide with massive data congestion or other circumstances that would render the network unusable for management purposes. Use of the existing building automation network (currently Arcnet-based) for management of the UPSs should be considered.

   (c) Connectivity

       · Guidelines for conduit, classroom wiring, etc. should be followed. These should be updated as technology and requirements evolve.
       · Minimize single points of failure. For example, backpaths (redundant physical routes) should be available for critical services and paths. In particular, we recommend that there be:
         - more than one connection to the Internet
         - true physical redundancy throughout the campus backbone
         - true router redundancy between campus backbone segments
       · Maintain the level of reliability delivered to existing users when service is extended to new users. For example, it is not advisable to overload a router already serving a population for which the network is a mission-critical resource. Rather, some form of upgrade should maintain or improve service to both the old and the new users.

4. Network Disaster Recovery Planning

   (a) Scope

       · Identify likely network failure modes (for example, loss of a router, backhoe fade, lightning strike, or a specific user or protocol creating congestion).
       · Determine what effect the identified failure modes would have on the ability to continue with operations.
       · Identify priority user populations: This is to prioritize recovery of service if needed. To determine this, the users and deans should be consulted (we note that this is hard to implement).
       · The design should create "user islands" so that failures lead to "graceful degradation" of service: users can still get most, though not all, of what they want done.
       · Describe procedures for fault detection, isolation, and recovery.
       · If an outside governing body applies, then we must address its requirements.

   (b) Mechanisms

       · Design reviews: These should be done periodically, at specified intervals, to identify likely problem areas and suggest improvements.
       · Outsourcing: Identify companies that provide backup/disaster recovery, and evaluate them for cost effectiveness.
       · Wireless: Wireless can be used for rapid recovery of basic service.
       · Testing of recovery plans, so that troubleshooters are adept at implementing the recovery methods, and so that the plans themselves may be improved.
       · Off-site storage of a copy of recovery plans and network documentation.
       · Network Disaster Recovery Plan Reviews: These plans should be reviewed periodically for completeness and currency. Use of an outside consultant would give us a different viewpoint.

5. Personnel

   (a) Staffing

       · Adequacy of staffing: The staff size should be adequate for the level of service desired. This should be reviewed regularly as well.
       · Backup: Key positions should have clearly identified backups, kept current through cross training.
       · Maintain an updated list of LAN managers and network personnel.

   (b) Education

       · Basic Training: Campus backbone staff, college-level staff, and LAN managers should receive initial and continuing training (including training in recovery procedures).
       · Cross Training: Staff should also be cross trained so that each key position has at least one trained backup.
       · Users: The user population should be provided with training not only in how to use the network, but also in what effects their actions may have and what to do in case of failures.
       · Information Distribution: There should be wide distribution of information on where to turn for help and trouble reports.
Also, scheduled maintenance times should be posted electronically in well-known locations prior to the downtime.
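
As a worked illustration of the Introduction's observation that an outage of five minutes every week is not the same as a two-hour outage every six months, the following sketch compares the two patterns over one six-month period. It is a minimal sketch under stated assumptions, not a proposed metric: the 26-week reading of "six months", the OutagePattern class, and the availability_percent function are all introduced here for illustration only.

```python
from dataclasses import dataclass

MINUTES_PER_WEEK = 7 * 24 * 60
WEEKS_IN_PERIOD = 26  # assumption: "six months" is taken as 26 weeks


@dataclass(frozen=True)
class OutagePattern:
    """Hypothetical description of how outages fall within one period."""
    name: str
    outages: int        # number of separate interruptions in the period
    minutes_each: int   # duration of each interruption, in minutes

    @property
    def downtime_minutes(self) -> int:
        # Total minutes of lost service over the whole period.
        return self.outages * self.minutes_each


def availability_percent(pattern: OutagePattern, period_minutes: int) -> float:
    """Up-time as a percentage of the period (the conventional metric)."""
    return 100.0 * (1 - pattern.downtime_minutes / period_minutes)


period = WEEKS_IN_PERIOD * MINUTES_PER_WEEK

# The two patterns named in the Introduction.
weekly = OutagePattern("five minutes every week", WEEKS_IN_PERIOD, 5)
rare = OutagePattern("two hours every six months", 1, 120)

for p in (weekly, rare):
    print(
        f"{p.name}: {p.downtime_minutes} min down, "
        f"{availability_percent(p, period):.2f}% available, "
        f"{p.outages} interruption(s), longest {p.minutes_each} min"
    )
```

Under these assumptions both patterns come out at roughly 99.95% availability, yet one interrupts users 26 times for five minutes and the other once for two hours. A single up-time figure therefore says little about how reliable the network is perceived to be, which is why the recommendations above concentrate on procedures rather than on an up-time measure.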