System Administration

Audience:

NU Departments

Policy Statement:

This document is a non-technical, practical guide to the duties and practices involved in taking responsibility for and maintaining a server. It assumes that three positions are required to perform the job of System Administrator: the Primary Administrator, the Backup Administrator, and their Manager. Together they compose the Team.

Initialization of Server/Service:

Initialization means deciding who will own what, describing the responsibilities of each Team member to the others, and determining how duties will be divided among Team members.

The Manager, through discussion with Team members, will decide who will act as the Primary Administrator and the Backup Administrator of the system or service.

The responsibilities of the Primary at this stage are to:

Prepare for the initial setup and steady operation of the server.
Keep the Backup abreast of any special circumstances or details involved in the setup.
Begin the documentation of the server or service, recording significant installation instructions and parameters.

Once the server or service is running, delegation of regular duties should be done by the Primary with approval by the Manager.

By default, the Primary should play the leading role in all things related to the upkeep, configuration, and troubleshooting of the server or service.
The Primary should also inform the Team of the list of duties needed to properly administer the server or service, and of any changes to that list.
At minimum, the Backup should have duties that allow the Backup to maintain familiarity with the system or service.

The Primary should also be responsible for making sure that all of the items vital to the reliability and compliance of the system or service are in place or accounted for:

Client Access Licenses (CALs)
Backup rotations
Uninterruptible power supplies (UPSs), etc.
Is there a response time that the Team is responsible for? If so, how will it be maintained?
Is the system secure? How will it remain secure? Firewalls, router filters, etc.

In general, the Primary should retain the default settings as much as possible, changing them only where necessary and at a pace that allows the Primary and Backup to observe the changes that result. Being clever is not a bad thing, but being too clever results in a system that is difficult to debug, which can cost a department downtime and other resources. In practice, this approach not only makes it easier to track the source of a problem or conflict, but also helps to ensure that the system or service remains similar to the published norm, which eases the transition to or addition of a new Team member.

Communication:

Communication allows the Manager to quickly make informed decisions, allows for the sharing and discovery of new ideas and better methods, reduces the likelihood of a misstep or failure on the part of a single individual or the Team, and can facilitate symbiotic working relationships between Team members.

Depending on the level of the server or service (e.g., enterprise level), regular meetings may be held between the Team and other parties related to the server or service.
Communications with the Manager should be done mainly through the Primary. The frequency of communication is at the Manager's discretion and depends on the relative importance of the server or service and on the guidance the Primary desires from the Manager. At minimum, the Manager should be briefed on the status of the server once a month.
The Primary needs to learn what information is required by the Manager, so that the Primary can properly identify and communicate pertinent information to the Manager in a timely manner. It is the mutual responsibility of the Manager and the Primary that the Primary learn what the Manager needs to have communicated and when.
The Primary should prepare documentation of the processes and settings in tandem with the setup and initial operation of the server or service. The Primary should continue to append to and edit this documentation as settings and services change.
The Primary and Backup need to be in regular communication about changes and ideas for improvement. The Backup needs to be available to the Primary for this reason, and should also touch base with the Primary on a regular basis to inquire about any significant changes or updates. It is the mutual responsibility of the Primary and Backup that the Backup remain up to date.
The Primary needs to communicate with the Manager whenever there is a significant change, including changes that will have an impact on the user population (such as maintenance time, planned or emergency). This communication should occur in a timely manner, allowing enough time to give the user population 24 hours' notice of the possible disruption or change in services.
Depending on the server or service, interim communication may be necessary between the Team or their department and the user population that uses or relies on the server or service. This communication should be handled by the Primary or Backup, upon Manager approval. During emergencies, this communication should happen as the emergency is being dealt with. The Manager should communicate the preferred method of contact to the Team for these situations.
The Backup should communicate significant changes made, as well as events noted, to the Primary. The Backup will communicate with the Manager in place of the Primary if the Primary is unavailable.
When considering a major change to a server or service, the Primary and Backup should prepare a written list of items that will need to be taken into consideration, and create a plan from that list, in order to enable the Manager to make informed decisions.
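
The 24-hour notice guideline above comes down to a simple date calculation. The following is a minimal sketch in Python, not a prescribed tool: the function names and the example maintenance time are hypothetical, and the notice period is fixed at the 24 hours stated in this document.

```python
from datetime import datetime, timedelta

# Notice period taken from the guideline above (24 hours).
NOTICE_PERIOD = timedelta(hours=24)


def latest_notice_time(maintenance_start: datetime) -> datetime:
    """Return the latest moment at which users can still get the full notice period."""
    return maintenance_start - NOTICE_PERIOD


def has_enough_notice(now: datetime, maintenance_start: datetime) -> bool:
    """Return True if an announcement sent at `now` still gives the full notice period."""
    return now <= latest_notice_time(maintenance_start)


if __name__ == "__main__":
    # Hypothetical maintenance window, used only for illustration.
    start = datetime(2007, 5, 15, 18, 0)
    print("Announce no later than:", latest_notice_time(start))
```

A check like this can help the Primary decide whether a planned change can still be announced in time or should be rescheduled.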

Other Functional Considerations:

Significant changes to a server or service should only be executed after Team discussion and Managerial approval. Additional communication and approval may be required by the customer contact or by other controlling staff, faculty or administration.
A production server or service is a system that the department or institution relies on to accomplish its objectives, and whose failure would constitute a significant interruption, the failure of a primary goal, or the loss of crucial data.
A Team should move a server into production only after it has been tested thoroughly in a non-production environment.
Once the Team confirms that any errors have been worked out and determines that the server will scale reliably to meet demand, the server can be moved into production.
Moving a server into production should occur with proper communications and in concert with whatever groups need to be involved in the move or activation of the 'live' service.
Projects often take much longer to complete than one would estimate. The Team should take this into consideration when planning any significant server project and build extra time into the proposed timeline.
When considering putting a server or service into production or making significant changes to a server or service, everything must first be tested to satisfaction, and a project should not be considered complete until it has been tested thoroughly.
Documentation should continue to be created and updated as the server or service evolves. There are several ways to accomplish this: text files, PDFs or HTML, for example.
When considering making changes to a server or service, be sure to schedule adequate time for testing and resolving any issues, major or minor, that may arise from these changes.
It is prudent to schedule changes early in the week, which reduces the risk of downtime extending over the weekend.
It may also benefit the Team to make changes earlier in a given day so there is ample time to deal with whatever tweaking or debugging may be necessary.
The Team should plan what would be required to undo any changes, should that become necessary. This includes additional time, staff resources, backups, etc.
Keep in mind that even if you have seen a new system or service in action, it may not perform exactly to your expectations. Thorough research and testing of products help ensure that the system or service will meet your expectations and needs.
When solving problems that seem to be related, look for the common source of the problem.
Utilize log data and system tools to trace the problem.
Tracing problems to their ultimate cause can take time, but doing so often saves a lot of time overall.
Use your test environment for its intended purpose; do not try anything you do not understand on a production server.
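
The advice above about using log data to find the common source of related problems can be illustrated with a short script. This is only a sketch under stated assumptions: the log line layout (date, time, level, component, message), the function name, and the sample lines are all made up for illustration and do not describe any particular system's logs.

```python
from collections import Counter
from typing import Iterable


def count_error_sources(lines: Iterable[str]) -> Counter:
    """Count ERROR entries per component in an assumed, hypothetical log layout.

    Assumed layout per line: "<date> <time> <LEVEL> <component>: <message>"
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(maxsplit=4)
        # Skip lines that do not match the assumed layout or are not errors.
        if len(parts) < 4 or parts[2] != "ERROR":
            continue
        counts[parts[3].rstrip(":")] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    "2007-05-14 09:12:01 ERROR disk: write failed on /var",
    "2007-05-14 09:12:03 ERROR mail: cannot write spool file",
    "2007-05-14 09:12:04 INFO web: request served",
    "2007-05-14 09:12:07 ERROR disk: write failed on /var",
]

if __name__ == "__main__":
    # List components from most to fewest errors.
    for source, count in count_error_sources(SAMPLE_LOG).most_common():
        print(source, count)
```

In this made-up sample, the component with the most errors is a reasonable first place to look, since the other failure may be a side effect of it. Run this kind of analysis on copies of the logs rather than on the production server itself.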

Important Dates

Original Issue Date:

March 2004

Revision Dates:

October 2005, May 2007