XTERMINALS AND HOST PERFORMANCE Part 1
Originally published October 1994
by
Carlo Kopp
© 1994, 2005 Carlo Kopp
Of all the display devices to emerge in the last two decades, the Xterminal must be without doubt the most controversial. Supporters of the Xterminal will point out the advantages of high speed, host independence, security and centralised management, while detractors will argue that performance in situ is unspectacular and that the Xterminal is less flexible than the workstation. Who is right and who is wrong? This feature will take a close look at the system level performance arguments and show why, in matters of performance, both sides of the argument have a point to make. You may find the conclusions surprising.

Xterminal Performance

The conventional Xterminal is a dedicated X11 protocol display device, using much of the technology base which evolved for Unix workstations. A workstation is a wholly self contained computer system, running its own operating system, managing its own virtual memory and executing its own applications. An Xterminal, like a workstation, is typically built around a motherboard with a single processor and local memory. The similarities end here, as the Xterminal does not run an operating system in the conventional sense, but a dedicated locally resident X11 server, which communicates with a client application over a network. In most instances, this will involve TCP/IP protocol traffic running over an Ethernet LAN.

Traditionally, Xterminal capability has been measured in terms of feature set and speed, speed being the key competitive metric. The most common measure of speed performance is the Xstone, calculated from the weighted sum of the results of a series of tests (Xbench). Three years ago a good Xterminal delivered 70,000 Xstones on paper and a poor Xterminal below 30,000, while today the yardstick is well beyond 100,000 Xstones. Those who have benchmarked Xterminals will have noted that performance can vary across hosts; indeed, at least one vendor the author knows of had actually tuned the TCP/IP protocol stack, Ethernet driver and hardware of their test host specifically to improve the achieved Xstone figure on their Xterminal product.

So why is it that detractors so vociferously attack the performance aspect of the Xterminal? Repeatedly we hear arguments very much like "...we have a lab of them and they run like lemons...". Sadly, these detractors do have a case, albeit an empirical one, but the root cause of the problem is not to be found within the Xterminal itself, but rather within the host platform. To understand why, we must delve into the bowels of the machinery.

Xterminals and Hosts

The X server and client application model is central to understanding the performance limitations of specific X display implementations. In this model, the X server is the entity which paints images upon the user's screen and reads the input from the user's keyboard and mouse. This is its sole task in the scheme of things. The client program implements the X application, using extensive X library support. The X library code then sends X protocol messages to, and receives them from, the X server, which acts upon them.

There are two basic forms in which an X server is implemented. The first form is as a user process on a workstation, where the X server writes to a display device driver and reads from serial device drivers associated with the keyboard and mouse.
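In either form, it is the display name handed to Xlib which selects the IPC channel used to reach the server. A minimal sketch in C (the Xterminal hostname "xt1" is purely hypothetical):

    /* Sketch: the display name passed to XOpenDisplay() selects the
     * transport used to reach the X server. Compile with -lX11. */
    #include <stdio.h>
    #include <X11/Xlib.h>

    int main(void)
    {
        /* ":0" - a server on the same machine; on most Unix
         * implementations Xlib will use a local IPC channel such as
         * a Unix domain socket. */
        Display *local = XOpenDisplay(":0");

        /* "xt1:0" - a server on the host named "xt1", for example an
         * Xterminal; Xlib opens a TCP connection to port 6000 plus
         * the display number on that host. */
        Display *remote = XOpenDisplay("xt1:0");

        if (local)  { printf("local server reached\n");  XCloseDisplay(local); }
        if (remote) { printf("remote server reached\n"); XCloseDisplay(remote); }
        return 0;
    }

Whichever transport is selected, the same server code sits at the far end; what differs, as we shall see, is the cost of delivering each request to it.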
Performance in such implementations is determined primarily by the performance of the CPU in executing the X server binary, and by the performance of the graphics hardware in the machine, be it a frame buffer or an accelerator device.

An Xterminal can be thought of notionally as a workstation stripped of all disk I/O and running only the X server process (indeed many sites cleverly put obsoleted workstations such as Sun 3/50s, 3/60s and 3/80s to exactly this use). In practice dedicated Xterminals are simpler, as many of the functions of a complete Unix system are not required, and there is no point in sacrificing all-important CPU cycles to run them. Xterminal hardware performance is subject to much the same constraints as workstation hardware performance, and the dominance of workstation CPU types in the top end of the market, as well as the proliferation of accelerator chips in such products, illustrates this.

The X model is built around X protocol communication between a client and a server, and this communication takes place over an InterProcess Communication (IPC) channel. Much of the power of X stems from its independence from operating systems and communications channels, but by the same token the performance of the X model is critically dependent upon the performance of the underlying IPC mechanism and the hardware which supports it. To better understand how this influences performance, we will follow the path of an X protocol request from an application to the user's screen. It is by any measure an interesting journey.

As an example we will look at a client application which draws a line (for clarity we will avoid discussing the graphics context (GC) and other important issues). It makes an XDrawLine call against the Xlib library, with all the proper arguments, and the library will produce an appropriate X protocol request which it places in a buffer. At some time it will flush this buffer and send the message on its way; a code sketch of such a client follows this discussion.

The X protocol message will be written to the host's IPC channel. For the purpose of discussion, this will be a Berkeley (BSD) socket. The message, now queued in its channel, will fall into a pool of IPC buffers (the mbuf pool), where it will wait its turn to be sent to the recipient server. If the server is another process on the same machine, it will pop out of the server process' socket connection and be processed by the server. However, if the server is an Xterminal, a few more things must happen before our line is drawn.

The message, sitting in the buffer pool, must wait its turn for access to the host's network interface. Once ready to be sent, it must have a TCP protocol header computed and attached, then an IP header. At this point the message has gained an extra forty bytes in size (or more if padded), and is ready to go to the device driver. The device driver is the code which manages the Ethernet network interface, and it has its own buffers, its read and write calls to these buffers, as well as interrupt servicing code to manage packet transfers between the chip proper and the driver buffers. The device driver will then program the chip with the destination address, load the X/TCP/IP packet into a buffer and cut the chip loose to transmit the message, encapsulated in an Ethernet packet, over the network to its recipient.
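To make the client's side of this journey concrete, here is a minimal sketch of the line-drawing client described above, against stock Xlib (compile with -lX11; the geometry is arbitrary, and a real client would wait for an Expose event before drawing):

    /* Minimal sketch of a line-drawing X client. */
    #include <X11/Xlib.h>
    #include <unistd.h>

    int main(void)
    {
        Display *dpy = XOpenDisplay(NULL);  /* transport chosen by $DISPLAY */
        if (dpy == NULL)
            return 1;

        int scr = DefaultScreen(dpy);
        Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr),
                                         10, 10, 200, 200, 1,
                                         BlackPixel(dpy, scr),
                                         WhitePixel(dpy, scr));
        GC gc = XCreateGC(dpy, win, 0, NULL);
        XMapWindow(dpy, win);

        /* Xlib queues the resulting X protocol request in its buffer... */
        XDrawLine(dpy, win, gc, 10, 10, 190, 190);

        /* ...and the flush pushes it into the IPC channel, on its way
         * to the server. */
        XFlush(dpy);

        sleep(5);                           /* keep the window up briefly */
        XCloseDisplay(dpy);
        return 0;
    }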
The Ethernet chip will then, bit by bit (no pun intended), clock the packet out on to the network cable, where the recipient's Ethernet chip will clock it in bit by bit, and accumulate it in a buffer, byte by byte, until the whole packet is received. Assuming the checksum is correct, the chip signals the Xterminal's network driver. At this point the Xterminal will do the reverse of what the host has done, decoding and checking the IP header, then the TCP header, and finally handing the X protocol message to the server program proper. The server will then decode the message and determine what to say to the display device, which in turn paints the screen (the latter another gross simplification). The XDrawLine request has thus had to follow a complex and tortuous path before it can be executed, and much of this path has run through the innards of the host's operating system.

Performance Issues

As is clearly evident, the speed with which an X client can communicate with an X server is critically dependent upon the performance of the host system's IPC. While tracing the path of a message paints a nice picture of the mechanisms involved, it can only qualitatively illustrate what we already know, which is that a workstation's local server can be more efficiently accessed than an Xterminal based server. The central issue from a performance and systems engineering perspective is the total volume and the characteristics of X protocol messaging, and how well a host and its operating system can handle this traffic.

Droms and Dyksen, in their 1990 paper, did a very nice job of looking at how much of an Ethernet's capacity will be chewed up by an X server and its client, concluding that a single client session can consume up to 15% of the bandwidth of the Ethernet itself. This is however only the tip of the iceberg. X protocol traffic is inherently bursty, and usually characterised by a mix of a large number of small packets interspersed with larger packets. While the network is idle for much of the time, when activity occurs it is usually a veritable blizzard, at rates of up to hundreds of packets per second. This is exacerbated during those X operations where the traffic is bidirectional.

Herein lies the central obstacle to achieving good performance with Xterminals. The network interface on any host will queue up packets for transmission, and thus exhibits all of the behavioural idiosyncrasies of a queuing system. What interests us in this context is that a queuing system exhibits an effect termed saturation as it comes under increasing load. As the rate at which events arrive approaches the rate at which the queuing system can process them, the waiting time in the queue asymptotically grows to infinity. In practical terms, this means that a network interface on a host will exhibit either a short waiting time to send a packet, when it is under a light load, or a very large waiting time, when it is under a heavy load. If packets arrive at a faster rate than the interface can send them, the queue will start backing up. If packets stop arriving, the interface will eventually flush the queue until it is empty, and all is well again.

The burstiness of X protocol traffic means that X clients can and will transiently saturate a host network interface, given the opportunity. Should several X clients try to send at about the same time, the packets will queue up in the buffer pool, and be flushed through the interface as their turn arrives, incurring in the process time delays of varying lengths.
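Some back-of-envelope figures put the overheads of this journey into perspective (the X request size below is illustrative, and the TCP and IP headers are assumed to carry no options):

    X protocol line request:        ~20 bytes (assumed, for illustration)
    + TCP header:                    20 bytes
    + IP header:                     20 bytes
    + Ethernet header and FCS:       18 bytes
                                    ---------
    frame on the wire:               78 bytes

    transmit time at 10 Mbit/s:      (78 + 8 preamble) x 8 bits / 10^7 = ~69 us
    + 9.6 us interframe gap:         ~79 us per request

In other words, the cable itself could carry of the order of 12,000 such small frames per second; as the measurements discussed below show, real host interfaces give up long before that.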
The natural question at this point is of course: at what load will a network interface saturate? The quick answer is that this is host and operating system dependent. Several factors influence the saturation point of an interface: the performance of the CPU, the efficiency of the network protocol stack, the buffering strategy used at the network interface, the efficiency of the driver layer, and the performance of the network interface chip and its bus access path.

Measurements carried out by the former SPARC manufacturer Solbourne during the early nineties indicated that the AMD LANCE (7990) interface on a multiprocessing server saturated at some point between 500 and 1000 packets per second, when running a large population of NCD-16 Xterminals. A bit of arithmetic suggests that this load could be up to 50% of the maximum throughput of the Ethernet (given ambiguity about packet size statistics). Interestingly, the 50% load level also turns out to be just below the knee in the queuing saturation curve for a Markovian M/M/1 queuing system. Surprise, surprise: theory agrees with empirical measurement!

The author was so intrigued by this that in 1992 he did some tinkering of his own, using a number of SPARC machines and ttcp, a simple socket based application which fires a large number of buffers of data from one host to another, memory to memory. It is commonly used by network interface designers for testing and protocol stack tuning. The results were most interesting. A Cypress CYM-6002K-40 based Galaxy clone could achieve, for large window and buffer sizes, a net throughput of 83% of the Ethernet's total bandwidth. Other SPARC machines did not perform so well, and throughputs as low as 25% were achieved. Even the Galaxy clone ran out of steam when the packet size was chopped down to 128 bytes, and the author's conclusion was that the Solbourne figures, albeit by then obsolete, were of the proper order of magnitude (if others have tried this, I am most interested in seeing results!).

Determining where your host saturates will require some effort, but regardless it is fair to say that packet rates in excess of 750 per second are taking the interface close to saturation, and will begin to incur queuing delays of several packet durations (a 512 byte packet takes about 400 microseconds to send on a 10 Mbit/s Ethernet). With single X clients capable of generating substantial fractions of a network interface's saturation load, it only takes several active clients to use up what interface bandwidth is available, assuming all else on the host is performing properly.

Another common source of performance problems in this area is inefficiency and queuing delays in the upper layers of the IPC channel. Poor implementation, as well as architectural limitations within an IPC mechanism, can cause channel saturation upstream from the network interface. Systems based on SVR4, which uses native STREAMS rather than sockets for network IPC traffic, are penalised in exactly this area against their BSD brethren, due to the additional complexity of the tty oriented STREAMS mechanism. The proprietary Unixes do not fare better; the author recalls one system, some years ago, which topped the SPECmarks on floating point, but throttled itself on network/IPC with three active clients on a single Xterminal.
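The knee invoked above is easy to reproduce numerically. A minimal sketch, treating the interface as an M/M/1 queue with an assumed service rate of 1500 packets per second (a figure of the right order for the hardware discussed here, not a measured one):

    /* M/M/1 queue: mean time in system W = 1/(mu - lambda), i.e. the
     * unloaded service time 1/mu multiplied by 1/(1 - rho), where
     * rho = lambda/mu is the utilisation. The service rate mu is an
     * assumed figure, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        const double mu = 1500.0;  /* packets/sec the interface can send */
        double rho;

        printf("  load rho   packets/sec   mean delay (ms)\n");
        for (rho = 0.1; rho < 0.96; rho += 0.1) {
            double lambda = rho * mu;        /* offered load, packets/sec */
            double w = 1.0 / (mu - lambda);  /* mean time in system, sec */
            printf("   %4.1f      %7.0f       %8.2f\n",
                   rho, lambda, w * 1000.0);
        }
        return 0;
    }

At 50% load the mean delay has already doubled over the unloaded case, and beyond 80% it climbs precipitously, which is why the packet rates quoted above sit where they do.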
The conclusion to be reached here is that the full speed performance of an Xterminal can only be exploited if the host system has fast IPC, such as native sockets, adequate CPU performance, and a number of Xterminals per network interface such that the host's network interfaces operate well below the saturation point. Should any of these conditions be violated, performance will inevitably degrade, due to the basic mathematics of queuing systems.

Multimedia and Xterminal Performance

Multimedia is the buzzword of the month, and Xterminals, like all other systems, are being adapted to this emerging marketplace. Multimedia support in an Xterminal means the ability to support audio, document imaging and ultimately video. Of these features, document imaging (defined by the XIE extensions to X11R6) appears to be the most mature, providing support for grayscale, colour and bitonal imaging, and, with proprietary enhancement, Display PostScript. Audio transmission over a network is still in the domain of proprietary technology, with some possibility that X11R7 may accommodate it. Video transmission over a network is in a similar position, with the ISO/IEC/CCITT JPEG standard competing against the MPEG standard. Both JPEG and MPEG are standards for the compression of digitised imagery.

Multimedia, be it document imaging, audio or video, will further exacerbate the existing limitations of host Ethernet adapters when operated with Xterminals. Video and document imaging are both characterised by large volumes of data to be sent from a client to the display, and these will bite very seriously into the available bandwidth of a network adapter. Audio and video are further complicated by the need to supply samples evenly, at rates which won't irritate users. Motion picture frame rates below 20 frames per second (TV runs at 25-30, subject to standards) are simply uncomfortable to watch, whereas telephone quality speech requires samples evenly spaced at a rate of about 8000 samples per second (at 8 bits per sample, a steady 64 kbit/s per conversation). Even a modest uncompressed image stream of 320 x 240 pixels at 8 bits per pixel and 15 frames per second amounts to over 9 Mbit/s, essentially a whole Ethernet, which is precisely why compression matters. Should facilities like teleconferencing or videophone be required, the issue of picture/sound synchronisation also comes into play.

Supporting these facilities will be technically challenging, whether on workstations or on Xterminals, due to the basic nature of the bursty, packet oriented networking environment which is common to both of these devices. Given what we already know about the performance limitations of the existing Ethernet and its host interfaces, it is fair to say that multimedia will probably become the driving force for another evolutionary step in local area networking. That is of course worth a feature in itself.

Conclusions

High performance Xterminals offer the potential for excellent display performance when properly used. In practice this is seldom the case, as typical installations will see too many Xterminals attached to a single Ethernet segment, with all of the traffic between these devices and their client applications funnelled through a single Ethernet adaptor on the clients' host system. Basic theory predicts that under such conditions the system will exhibit miserable performance, in spite of the host being adequately sized for speed and the Xterminals having Xstone ratings over 100,000. This typically proves to be the case in practice, and is the linchpin of the Xterminal hater lobby's misdirected arguments. It is not difficult to see that much of this problem historically stems from competition between Xterminal and workstation vendors, both of whom battle for seats above all.
A workstation/host vendor will have commercial incentives to undersize the host system, both to make their offer more price competitive and to make a case for substituting "non-performing" Xterminals with workstations. Once performance problems are evident, the temptation for the pot to call the kettle black must be overwhelming. The idea of having to split the customer's Ethernet into multiple segments, and to fit multiple Ethernet adaptors to the host system, will always be unattractive to any vendor, as it is both expensive and operationally disruptive to the customer's site, and thus an impediment to a sale.

As with many other areas in our industry, Xterminal non-performance is another area where mythology has successfully overwhelmed reality. Sizing hosts for use with Xterminals, and basic guidelines for network sizing, will be the subject of Part 2 of this feature.
$Revision: 1.1 $
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp