XTERMINALS AND HOST PERFORMANCE Part 2
Originally published November, 1994
by Carlo Kopp
© 1994, 2005 Carlo Kopp
The Xterminal has been at the receiving end of much criticism, primarily in relation to performance in situ. The reality of performance problems experienced at larger Xterminal sites is, however, more closely related to inappropriate host and network sizing. This feature will address the central technical issues involved in proper sizing, and outline some strategies for successfully sizing hosts and networks in this demanding environment.

Sizing Hosts for Use with Xterminals

The issue of whether to formally size a host for use with Xterminals is driven by the size of the Xterminal population to be supported, and whether a formal performance requirement exists in relation to operator response times. A site with half a dozen Xterminals scattered across offices, running clients upon various hosts, need not bother, as the result simply doesn't justify the effort expended. A site with a dozen or more Xterminals, particularly if intended to be driven by a single host, should look very carefully at sizing. If the numbers are greater, and the total system must meet a specification, such as x hundred milliseconds of screen response time to keyboard input, sizing becomes a make or break issue.

A central caveat which must be stated at this point is that sizing is application specific, and given the idiosyncrasies of most applications, to ignore this relationship is to court disaster.

The sizing process can be broken down into several components: sizing host memory, sizing host CPU performance, sizing host network interface performance, assessing the aggregate interrupt load upon the host, and evaluating operating system IPC performance. Fortuitously, most of these tasks can be performed in isolation from one another, and therefore aggregate system performance can be estimated with some measure of accuracy. Should very high accuracy be required, there is no substitute for an appropriate benchmark, but this can be tricky to perform without a large Xterminal population on site.

Sizing Host Memory

The sizing of host memory is a conventional affair, and is no different to sizing a host's memory for tty based applications. The objective of host memory sizing is to prevent the host's memory management system from coming under any load other than that required to page applications and data into memory, and to carry the paging load which is characteristic of the application's normal operation. Should the host's memory be undersized, the memory management facilities will become very active, juggling seldom used memory pages between applications to fulfill the instantaneous need for memory. This incurs additional CPU load as well as disk I/O load. Should the demand for free memory cross a certain threshold, those processes which are least active will be swapped out to the disk's swap partition, incurring further demands upon system performance. The visible effect will be host saturation, with heavy CPU and I/O load, very little work being done, and response times which are painfully long.

The reason why host memory sizing is important in the X environment is that an X server can, and in practice will, support many X client applications concurrently, and each of these clients is a process in its own right, making all of the demands upon memory and host resources which any process makes.
Whereas in a conventional tty environment each user will run a copy of their given interactive application (eg editor, database application or other), and typically a shell and the port's associated getty process, an Xterminal user will typically have half a dozen or more clients on the host. Whether these are active or not is immaterial, as good response time dictates that they should be memory resident rather than sitting out on the swap disk. Swapping them back into memory will gobble up resources and time, and is therefore to be avoided.

Assuming that the users are captive users, that is, that they will only run the application(s) which they are meant to, and don't fire up _xterms_ or even playthings like _xeyes_, _plaids_ or _icos_, the memory sizing process is straightforward.

The first step in the exercise is to quantify the demand for common memory resources, such as the operating system itself, and common applications, such as the database server should this be the case. This is best achieved by gaining access to a system running the operating system in question as well as the common applications, and using the _ps_ command (see the man pages for platform specific details) to look at the resident set sizes and text segment sizes. Where the common application is a database product, care must be taken, as these products when running typically have many active processes, as well as shared memory areas and internal caches. These must all be accounted for. A matter not to lose sight of here is that the principal consumers of memory in any process are the text, data and stack areas, and that the instantaneous memory consumption of each of these varies with time. Ideally we aim for the peak or maximum load, as this will provide a safe figure.

The mechanics of this task are simple, and involve summing up the constituent memory usages of all processes on the system. A good sanity check is to compare the resident set sizes with the total set sizes, as the latter is an indicator of how far consumption may go with increased load. At this point we will have a very good idea of how much memory will be used by the system without its clients.

Assessing the memory usage of clients is equally straightforward and equally tedious. Fortuitously, again, this can be done with a single Xterminal, using the above method. Once we have the memory usage for each client type, we sum the figures to get the memory usage per user. Once we have the memory usage per user, we multiply it out by the number of users, and this yields the magic figure. Adding the common memory requirements to the per user memory requirements completes the exercise, and sizing is done. A point worth making here is that shared text segments reduce the demand for memory significantly, and where the option is available, applications should use dynamic linking as much as possible. Accounting for shared text usage involves a minor modification to the method outlined.

The sizing figures produced by this method are a good approximation to what a real system can be expected to use. Should the opportunity be available to look at a smaller system, eg one with fewer Xterminals than planned but running the same applications and operating system, this should be taken, as it is a very good sanity check to verify that the figures are indeed reasonable. Should a disagreement exist, the cause must be isolated and the calculation adjusted appropriately.
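As a concrete illustration of the bookkeeping involved, the following is a minimal sketch in shell and awk. It assumes a BSD style _ps aux_ output in which the resident set size appears in the sixth column, in Kbyte; the client names, user count and common memory figure are invented purely for illustration, and the column layout varies between platforms, so check the local man page first.

    #!/bin/sh
    # Sketch only: sum the resident set sizes of one user's X clients,
    # then scale out by the number of users and add the common figure.
    # Client names, user count and common memory are invented examples.
    NUSERS=20
    COMMON=16384        # Kbyte measured for the OS and shared servers
    ps aux | egrep 'dbclient|wpclient|xclock' | grep -v egrep |
    awk -v nusers=$NUSERS -v common=$COMMON '
        { rss += $6 }                     # RSS in Kbyte, column 6 here
        END {
            printf "per user:       %d Kbyte\n", rss
            printf "all users:      %d Kbyte\n", rss * nusers
            printf "total estimate: %d Kbyte\n", rss * nusers + common
        }'

Accounting for shared text is then essentially a matter of counting each shared text segment once, rather than once per process instance.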
Sizing Host CPU

Sizing memory is a technically trivial exercise. Sizing CPU is not. This is primarily because the CPU is shared by the operating system and the user processes, and thus activity in any area influences the system as a whole. Assumptions must therefore be made very carefully. The objective is to find out how fast a CPU is required, or how many CPUs of a given type are required, to achieve the desired performance.

Unlike memory sizing, where a semi-analytical solution yields good results, CPU sizing in a Unix multiprocessing environment can be difficult to quantify precisely with an analytical model (specialised real time kernels are much easier to work with in this respect, as scheduling priorities for processes can be cast into concrete easily, and operating system performance effects are more precisely specified by vendors). The most common method used for this purpose is scaling, which if done carefully can yield useful results.

The basic idea behind scaling is that a CPU of a given architecture and a given clock speed will do a certain amount of work running a particular OS and application. Doubling either the clock speed or the number of CPUs will almost double the amount of work which can be done, and multiplying the clock speed or the number of CPUs by N will multiply the achieved work by almost N. The critical factor here is the "almost", as in practice the curve of work performed vs CPU cycles expended is not linear, but rather logarithmic in shape. What this means is that the linear scaling estimate becomes less and less accurate with an increasing number of CPUs or an increasing ratio of clock speeds. There are a number of culprits here, all of which are shared operating system or hardware resources. These, subject to queuing theory, will saturate at some given amount of load, limiting further increases in performance.

Having set the boundaries for our estimate, we can proceed to the actual task. The principal tools to be used are performance monitoring utilities such as _ps_ or, where available, proprietary tools. The starting point for measurement is to gain access to a lightly configured specimen of the host type to be evaluated, a suitable Xterminal and the required X client applications, preferably on a private Ethernet segment. The applications are then started, and the percentage of CPU time consumed by the processes is monitored from a separate console. Because of the bursty nature of X activity, the best strategy is to take a large number of evenly spaced measurements, log these to a file, and using tools such as _awk_, extract and statistically analyse the results. What we aim for is an average and a peak percentage of CPU time consumed by each application, as it is driven by a user. A useful sanity check is to use the _time_ utility as well, as it provides some estimate of the average time consumption, as well as the ratio of system to user time.
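A minimal sketch of such a measurement run, again in shell and awk, might look like the following. The sampling interval, sample count, BSD style _ps u_ column positions and the client pid are all assumptions for illustration; SVR4 _ps_ options and columns differ.

    #!/bin/sh
    # Sketch only: sample the %CPU of one client process at 5 second
    # intervals for an hour, log the samples, then reduce the log to
    # average and peak figures. The pid and columns are examples only.
    PID=1234
    LOG=/tmp/cpusample.$$
    i=0
    while [ $i -lt 720 ]
    do
        ps u | awk -v pid=$PID 'NR > 1 && $2 == pid { print $3 }' >> $LOG
        sleep 5
        i=`expr $i + 1`
    done
    awk '{ sum += $1; if ($1 > peak) peak = $1 }
         END { printf "average %.1f%%  peak %.1f%%\n", sum / NR, peak }' $LOG

Repeating the run for each client type gives the per client averages and peaks needed for the scaling step which follows.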
Once the CPU times or percentages are measured for the client applications, we can add these up to gain an idea of the per user aggregate CPU load. The next step in the exercise is to scale the numbers in relation to the number of users and the target machines. An example provides the best illustration. Starting with a notional machine of a given architecture and a clock speed of 50 MHz, we measure the average CPU load for the application and find that it consumes 27.6% of the aggregate CPU time. We would like to run ten users, each doing exactly the same thing. That amounts to 276% of the given CPU type. We then add a safety margin of, say, 25% to account for non-linear scaling, which yields 345% (this is a gross approximation, increasingly less accurate with a growing scaling factor). A first order estimate therefore suggests that we either look for a 175 MHz version of the same CPU, or a 4 CPU multiprocessor configuration. Alternately, we get a superscalar CPU of the same architecture, which will run faster at the same clock rate, and repeat the exercise.

While the scaling method provides a good initial estimate of sizing requirements, it has important limitations and must therefore be used very carefully. A common trap the unwary fall into is comparing like architecture engines on the basis of clock speed alone. The proliferation of superscalar chips such as the RISC SuperSPARC, HyperSPARC and R4400, and the CISC Pentium, makes scaling against earlier scalar architecture engines meaningless, as these chips attempt multiple instruction launches per clock cycle. A central caveat is therefore to compare scalar to scalar, and superscalar to superscalar. Different vendors' implementations of like superscalar architectures also differ (eg SuperSPARC vs HyperSPARC), and this must be accounted for in the process. A large disparity in cache sizes may also distort results (eg the various subtypes of SuperSPARC).

Sizing CPU performance is often described as a black art, and it is fair to say that this is no exaggeration. The accuracy of the estimate thus derived is very sensitive to the quality of the initial assumptions, and a good knowledge of machine architecture, operating system idiosyncrasies and application behaviour is essential if the numbers are to mean anything. Appropriately benchmarking the actual application is the only ultimate proof.

Network Interface and Network Sizing

The sizing of network interfaces is no less pathological a task than sizing a CPU to an application. The objective of the exercise is to estimate the traffic load of the Xterminal and relate this to the saturation point of the interface. The test environment to be used is identical to that used for the CPU sizing tests, with the addition of a network analyser or a functional equivalent, such as a SunOS machine running the _etherfind_ utility.

The first phase of the exercise is to quantify the behaviour of the applications to be used. This is done, much as with CPU sizing, by running the applications with a typical user load and carefully observing with the network monitor. Ideally, network traffic is to be logged and timestamped. Statistically the best result will be produced from a large population of packets, and the author usually uses 32768 (mainly because _etherfind_ has a packet count argument). Once the statistics are collected, some handiwork with _awk_ can be applied to produce the average and peak per second packet rates. Peak rates are important because of the burstiness of X traffic, and these figures will look very different from the average rates. Should several clients try to send at once, the packet rate seen by the interface is the sum of the peak rates, rather than the average rate. Fortuitously, usually only one client is active per Xterminal, and this will somewhat alleviate the total load. Again, some trivial arithmetic will yield a reasonable estimate of aggregate peak packet rates across the intended Xterminal population.
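The reduction of the packet log is equally simple. The sketch below assumes the log has already been massaged into one line per packet, with the capture time truncated to whole seconds as the first field; the output format of _etherfind_ or the analyser in use will dictate the massaging required.

    # Sketch only: average and peak packets per second from a packet log
    # with a whole-second timestamp in the first field.
    awk '{ count[$1]++ }
         END {
             for (t in count) {
                 total += count[t]
                 if (count[t] > peak) peak = count[t]
                 secs++
             }
             printf "average %.1f pkt/s (over active seconds), peak %d pkt/s\n",
                    total / secs, peak
         }' packets.log

Run per Xterminal under a typical user load, this yields the per terminal average and peak rates; multiplied out across the planned population, the peak figure gives the aggregate estimate referred to above.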
This figure can then be applied to sizing the Ethernet and the interfaces (this is another gross approximation, as the proper method is to treat the client population as a Markovian process and apply a queuing model).

Finding the saturation point of an interface is somewhat messier, and involves using a tool such as _ttcp_, and a fast workstation as a traffic sink. The test method is straightforward, and involves hammering the interface with packet traffic of an appropriate packet size until its throughput limit is found, and then calculating from the test parameters the number of packets per second at which the interface starts incurring serious queuing delays. The result is an approximation, but should be of the order of magnitude which is required. A good interface will saturate above 50% of the Ethernet's throughput.

In possession of these figures, we can now calculate the number of peak client loads the interface and network will carry. That will in turn determine how many Xterminals can fit on a single interface, and hence network segment. The speed performance of the Xterminal will have a bearing on the end result, and with heavily loaded high performance Xterminals half a dozen turns out to be a reasonably safe number. If the Xterminals are older and slower types, or PCs running Xterminal emulators, the figure can be much larger (it is worth noting that the abysmal TCP/IP stack and IPC performance of Windows based environments has at least one redeeming feature). The end result of this process will determine whether the host can support the Xterminals on a single interface, or whether multiple interfaces and network segments are required.

A useful notional model is to think of the network segment as a multidrop high speed serial cable, and the Xterminals as ordinary terminals. A single interface can carry a certain amount of traffic, and should the load be too high, more than one card will be required (the author will point to the long past practice of partially loading multiport serial cards with terminals set to high baud rates, to avoid saturating the serial cards - the general issues are really one and the same).
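Returning to the saturation test, a typical _ttcp_ run, with illustrative parameters only, might look like the following; the buffer length should be varied to approximate the packet sizes observed in the X traffic logs, and the capacity arithmetic uses invented figures purely to show the method.

    # Sketch only: find the interface saturation point with ttcp.
    # On the fast workstation acting as the traffic sink:
    ttcp -r -s
    # On the host interface under test (1 Kbyte buffers, 65536 of them):
    ttcp -t -s -l 1024 -n 65536 sinkhost
    #
    # Back-of-envelope capacity check, all figures invented:
    #   interface saturates at roughly    5000 pkt/s
    #   measured peak per Xterminal        400 pkt/s
    #   5000 / 400 = 12 terminals at best; halving this as a margin
    #   against coincident peaks gives about half a dozen per segment,
    #   in line with the rule of thumb above.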
Operating System and Interrupt Loading Issues

Inappropriate operating system configuration and performance idiosyncrasies can both constrain the host's ability to drive a large number of Xterminals. Configuration issues are essentially specific to particular flavours of Unix, but there are a number of common points which must be addressed. The performance idiosyncrasy which can most dramatically impact performance under these conditions is the Inter Process Communications scheme used, and its buffering. Should inadequate buffering be available, traffic may become bottlenecked waiting for buffers to be freed. Typically the default buffer space sizing is adequate, but should a large number of Xterminals need to be supported, particularly with other network traffic present, it is conceivable that problems could arise, resulting in received packets being discarded or I/O being blocked until space is available. Should such problems be encountered in operation, it is worth checking that this isn't the cause. In BSD systems the default per protocol queue size should be examined; in SVR4 the limit on streams buffer space should be checked. If necessary these should be adjusted, and a new kernel made.

Another network parameter which can make a big difference is the default maximum TCP window size, which is a measure of how much data can be sent before an acknowledgement is required from the other party. Increasing the default window size allows more traffic in the pipeline, and this can significantly improve throughput. The mechanism is related to the time overheads of flow control: in a sustained load situation, every stop-start due to flow control results in dead transmission time until the next acknowledgement is received. Enlarging the window size reduces the frequency at which the flow control mechanism interrupts transmission. Both ends of the link must have adequate buffering, though, or packets will be lost. A note of caution is required here, as the parameter is usually system wide, and not every other device on the network can necessarily adjust its window size accordingly. The author recalls an OS engineer working for a US vendor's OS group, who proudly boasted of having vastly improved network throughput by bumping up the window size from 4 Kbyte to 16 Kbyte. Once the particular OS release reached the customers, certain terminal servers started inexplicably refusing connections. Fifteen minutes with a network analyser isolated the problem, and five minutes of _adb_ against the live kernel binary removed it.

A point worth reiterating here is that native sockets are somewhat faster than native streams, and should the choice be available, a Unix flavour using sockets should be used. It is worth noting that at least one major vendor is adding native socket support to its port of the streams based SVR4.

Aggregate interrupt loading can also be an issue, as every packet received or sent will typically produce an interrupt. Every interrupt costs a fixed amount of CPU time to switch from executing the system or user processes to executing the device driver code, and at a certain interrupt rate every system will saturate. Workstation motherboards give out at thousands of interrupts per second; larger systems will tolerate more. Multiprocessors are sensitive to how interrupts are distributed between CPUs, and asymmetrical kernels can produce saturation of the single CPU which services the interrupts. If you know that the flavour of Unix you intend to use generates an interrupt per packet, the interrupt load contribution due to network traffic is easily inferred from the figures above. Xterminals are very demanding devices in terms of host and operating system performance, and poor OS implementations will become evident very quickly once serious sizing work is done.
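To make that inference concrete, a first order interrupt budget can be knocked together in a few lines; the per interrupt cost and traffic figures below are invented for illustration and will vary widely with hardware and driver quality.

    # Sketch only: first order interrupt load estimate, invented figures.
    awk 'BEGIN {
        pkts = 6 * 400       # aggregate peak pkt/s for 6 Xterminals
        cost = 50e-6         # seconds of CPU per interrupt (assumed)
        printf "interrupt rate: %d/s, CPU share: %.0f%%\n",
               pkts, pkts * cost * 100
    }'

If the resulting rate approaches the point at which the host gives out, or the CPU share eats too far into an already tight CPU budget, more interfaces and more network segments are called for.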
Summary

Sizing hosts for use with Xterminals is a non-trivial exercise demanding a good insight into the inner workings of the operating system and hardware. Reasonable estimates can be made by systematic measurement and calculation, and even if not perfect, these can provide a good indication of what is safe and what is unsafe.

Nearly all complaints about Xterminal performance which the author has investigated over the last few years were attributable to undersized hosts and excessive network interface loading. The guilty parties fell into both the customer and vendor camps, and in every instance the history of the site revealed that no attempt had ever been made to size the system and network properly. It is perhaps a sad reflection upon our industry that in every such instance the finger was squarely pointed at the Xterminal. But, such is life. If it were not the case, there wouldn't be an interesting technical tale to be told ...
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp