An Introduction to Inter-Process Communications
Originally published May, 1995
by
Carlo Kopp |
© 1995, 2005 Carlo Kopp
Inter-Process Communication (IPC) is one of the least appreciated facilities in any operating system, yet it can have a truly dramatic impact on application performance. Efficient IPC translates into good performance for networked and client-server applications; inefficient IPC can lead to dismal performance in spite of the availability of generous amounts of CPU time.

This can be of particular importance with modern 4GL database products, most of which are built around the model of a central server process (or processes) which services queries from individual user sessions, running as client processes. Where the designers of such products do not appreciate the implications of using a particular style of IPC, they can severely limit the ability of their product to extract useful performance from the platform in use. This becomes a major constraint on the scalability of such applications. Whilst ten users on a small platform may get the performance they want, fifty or one hundred users on a proportionately scaled platform may not.

An example the author saw last year is well worth describing. The application in question bottlenecked on Unix message queue throughput at about twenty users. With any larger number of user sessions, system performance would asymptotically decline, while the operating system statistics indicated 80% to 90% idle time. The system's message queues were backed up, and the server process had about 50 queries waiting to be serviced. Renicing the server process and resizing the queue parameters shifted the saturation point by several users, but could not solve the problem (as one would expect, applying queueing theory). This behaviour was observed on no fewer than four different Unix vendors' platforms, with some variation in severity. The 4GL product vendor's assertion that the problem lay with the hardware vendors, needless to say, held little credibility.

What is important about this example, from an application developer's perspective, is that the basic environment for the application was ill matched to the underlying operating system services. Had the 4GL product been interfaced differently, say via streams, sockets or shared memory, performance and scalability would have been far better, translating into much lower hardware related costs for an operational system. Application developers therefore need a good understanding of the performance implications of using a particular operating system service. If an application development environment doesn't hook into the operating system appropriately, no amount of host or application tuning and hardware performance will provide the desired result.

IPC - A Perspective View

What IPC is all about is getting messages from one process to another. IPC can be reliable or unreliable; in the latter instance there is no guarantee that a message will arrive at its destination, which means that the application must manage retries should a message fail to arrive. This ultimately boils down to a tradeoff between putting the plumbing into the application, and relying on the operating system's plumbing to ensure reliability of transfers.

This tradeoff depends on the characteristics of the traffic between the processes, and determining the appropriate method can be a non-trivial task, particularly where little prior experience exists, as with a new application being developed. This is why application environment developers should be very careful about how they approach this design decision. Making the wrong choice can burden a product with performance problems which can be very expensive to fix late in the product life cycle, when customers try to load up systems with larger numbers of users.
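To make the reliability tradeoff concrete, the sketch below shows the kind of plumbing an application must carry when it relies on an unreliable transport: a timeout-and-retransmit loop. It is only an illustrative sketch, not taken from any particular product: it assumes a datagram (unreliable) socket has already been created and connected to its peer, that each request is answered by a single reply, and the one second timeout and retry budget are arbitrary. Sockets themselves are covered later in this article.

    /* Minimal sketch of application-managed retries over an unreliable channel.
       Assumes 'sock' is an already connected datagram socket, and that every
       request is answered by exactly one reply. */
    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    ssize_t request_with_retry(int sock, const void *req, size_t reqlen,
                               void *reply, size_t replylen, int max_tries)
    {
        fd_set fds;
        struct timeval tv;
        int attempt;

        for (attempt = 0; attempt < max_tries; attempt++) {
            if (send(sock, req, reqlen, 0) < 0)
                return -1;                              /* local failure: give up */

            FD_ZERO(&fds);
            FD_SET(sock, &fds);
            tv.tv_sec = 1;                              /* wait up to one second per attempt */
            tv.tv_usec = 0;

            if (select(sock + 1, &fds, NULL, NULL, &tv) > 0)
                return recv(sock, reply, replylen, 0);  /* a reply has arrived */
            /* timed out: fall through and retransmit the request */
        }
        return -1;                                      /* retries exhausted: the caller must cope */
    }

With a reliable, connection oriented transport, all of this logic lives in the operating system, and the application simply writes and reads.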
The simplest strategy for IPC is the use of shared memory pages. Shared memory is a scheme where the host operating system's memory management system is used to map one or more pages into the address spaces of more than one process. This method is potentially the fastest of all IPC schemes, as there is no transfer overhead associated with a process making data available to another process. The processes simply write into particular areas in their address maps, and the data is immediately (i.e. in the next time slice of the OS) readable by the other process(es).

The drawback with shared memory is the potential for one or another form of inconsistency, should data structure locking schemes be imperfect. Every writer must lock the data it is writing, and clear the lock once finished. This protocol must be rigidly enforced; if it is not, there is always the possibility that another process will alter the data from under its peer. Where multiple structures must be altered, this can introduce considerable complexity. Significant problems can develop should programs have bugs associated with uninitialised pointers, as these can result in feral program behaviour, where the integrity of the whole application can be jeopardised by a single process clobbering shared lock structures or shared data.

Another limitation of shared memory schemes is that there is no inherent support in current production operating systems for operation over networks. Third party applications do exist to provide such capabilities, but all such schemes must deal with the fact that multiple processes on multiple hosts run asynchronously, and therefore time delays exist between a process writing a location and all other processes sharing the mapped page seeing the result of the write operation. Synchronisation of access to lock structures therefore requires support for a contention protocol of one or another variety, whereby a process makes a request for access which may be granted, denied or queued.

The increasing demand for networked applications has seen massive growth in the use of stream based schemes, which are particularly well supported by the mainstream Unix flavours. Stream based schemes, such as BSD Sockets and System V Streams, are built around the idea of a unidirectional or bidirectional character stream between two processes. A process at the sending end of the stream can perform write operations (fwrite(), fputc(), fputs(), etc.) on the stream. The data is then buffered by the operating system, eventually becoming available for read operations (fread(), fgetc(), fgets(), etc.) at the receiving end.

A nice aspect of this scheme is the use of elastic buffering, whereby the writer can send data asynchronously, and in large bursts, without necessarily having to wait for the receiving process to read it. If the writer and the reader can service the connection at the same average rate, very high throughputs can be achieved, because the writing party does not have to block very often, if at all, waiting for service by the reading party. Another nice aspect of the stream model is that it is very easy to integrate seamlessly into a networking scheme. Herein lies much of the power of Unix as an environment to support applications, as processes may transparently communicate with their peers locally and remotely.
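The stream model is easy to demonstrate with the oldest stream style IPC mechanism of all, the Unix pipe. The sketch below is illustrative only: a parent writes a short burst into the pipe and carries on, while the child reads it whenever it is next scheduled; the message text and buffer size are arbitrary.

    /* A minimal sketch of stream style IPC over a pipe: the writer's write()
       returns as soon as the kernel has buffered the data, and the reader
       picks it up when it runs. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        char buf[64];
        ssize_t n;

        if (pipe(fd) < 0) {
            perror("pipe");
            return 1;
        }

        if (fork() == 0) {                           /* child: the reading end of the stream */
            close(fd[1]);
            n = read(fd[0], buf, sizeof(buf) - 1);   /* blocks until data has been buffered */
            if (n > 0) {
                buf[n] = '\0';
                printf("reader got: %s\n", buf);
            }
            close(fd[0]);
            return 0;
        }

        close(fd[0]);                                /* parent: the writing end */
        write(fd[1], "a burst of data", 15);         /* returns once the kernel has buffered it */
        close(fd[1]);
        wait(NULL);                                  /* reap the child */
        return 0;
    }

The same pattern, with a socket in place of the pipe, extends transparently across a network.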
The central limitation of the stream model is that it alone cannot provide a program with the kind of transparent address space access which exists with shared memory schemes. Data must be read or written serially, which means that an overhead must exist at either end of the connection, to map the data from a location in one address space to another. The emergence of client-server distributed object schemes is an attempt to address this limitation by adding another layer above the stream. This is a major subject in itself, which will be covered in a future article.

A stream based scheme can be put to very good use if it is supported by a user shell, because process input and output piping can then be implemented. Piping from a shell is heavily used in Unix and is another of its major technical strengths, not well supported by many of its would-be proprietary competitors. The throughput performance of an operating system's stream mechanism will become very apparent where multiple pipes are used, as in command line or shell script constructs such as:

    find . -name "*.c" -print | xargs grep "Errcode:" | awk '{ print $2 }' | grep fred | rsh rigel "cat > logfile"

Should the native IPC transport be sluggish, such constructs will run very slowly. This can be a major issue with third party POSIX shells on non-Unix operating systems, and even trivial tests of this kind can be a good indicator of whether trouble lies ahead.

The final major style of IPC in use is the queued message scheme. In this arrangement, blocks of data termed messages are pushed by the sending process, FIFO style, onto a queue managed by the operating system. At some later time, the receiving process polls its queue and, should a message have arrived, extracts it from the queue. The now obsolete but still supported Unix message queue is an example of this style of IPC, and its use is not recommended for anything other than auxiliary message passing, as its performance is not spectacular. Should you ever be choosing a 4GL product, one of your checks on OS compatibility should be whether message queues are used as the primary channel between client and server processes. Should this be the case, reject the product and spare yourself a lot of performance heartache later in the development cycle.

Any discussion of IPC should include semaphores, which are often classified as IPC mechanisms, although the only information they convey is a single 1 or 0 state. Semaphores are used primarily for synchronisation purposes, to provide for mutual exclusion on access to shared data.
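As an illustration of the mutual exclusion role, the sketch below uses a System V semaphore to guard a critical section, in the manner required of every writer to shared data. It is a sketch only: the use of IPC_PRIVATE, the permissions and the single-semaphore set are arbitrary choices, and error checking is kept to a minimum.

    /* Minimal sketch of mutual exclusion with a System V semaphore:
       the semaphore starts at 1, a writer decrements it to enter the
       critical section, and increments it again on the way out. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* Many systems require the caller to define this union for semctl(). */
    union semun {
        int val;
        struct semid_ds *buf;
        unsigned short *array;
    };

    int main(void)
    {
        int semid;
        union semun arg;
        struct sembuf lock = { 0, -1, 0 };      /* wait (P): decrement semaphore 0 */
        struct sembuf unlock = { 0, 1, 0 };     /* signal (V): increment semaphore 0 */

        semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);  /* a set of one semaphore */
        if (semid < 0) {
            perror("semget");
            return 1;
        }

        arg.val = 1;                            /* initialise to 1: unlocked */
        semctl(semid, 0, SETVAL, arg);

        semop(semid, &lock, 1);                 /* blocks if another process holds the lock */
        /* ... critical section: update the shared data structures ... */
        semop(semid, &unlock, 1);               /* release the lock */

        semctl(semid, 0, IPC_RMID);             /* remove the semaphore set */
        return 0;
    }

In practice the semaphore set would be created with an agreed key, rather than IPC_PRIVATE, so that all of the cooperating processes can find it.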
System V Shared Memory

The System V shared memory scheme is commonly used in a range of Unix implementations. Its function is to map an area of a process' virtual address space into the virtual address space of another process. Using shared memory involves a series of simple steps, via which an identifier is initialised, a segment attached and, after it is no longer required, detached and deleted.

The segment identifier is initialised with a shmget() call, which takes three arguments: an identifier key, a size parameter and a flags field. The flags field contains mode bits, which determine access rights and modes such as private or shared. The parameters which describe the segment are held in a shmid_ds descriptor structure, which may be accessed via a shmctl() system call. If the page is to be locked into memory, to prevent swapping from compromising access speed, a shmctl() call is also used.

Once the shmget() is done, the shared segment must be attached to the address space of the process with a shmat() call. This call has three arguments: the identifier returned by shmget(), an address specifier and a flags field. While the use of shmat() provides the user with some freedom in placing the segment, it will also impose system specific limitations. The most common of these is a requirement to align the segment on page address boundaries, a constraint imposed by virtual memory hardware which makes assumptions about page tables.

Once a process has finished using its segment, it should detach it using a shmdt() system call. Should the detaching process be the last one attached to a segment which has been marked for removal, the segment is deleted and its pages freed. A common problem with ill-behaved applications is a failure to clean up shared segments when a program crashes or hits an error condition. This results in shared segments cluttering up memory after the process has died, but also holding on to shared memory identifiers, of which there are only SHMMNI in the system. If a debugging session involves repeated instances of this behaviour, at some point the SHMMNI identifier table fills up and no more segments can be created. The segments must then be manually deleted, using a utility such as ipcrm.

A facility worth mentioning in this context is the mmap() system call, which allows the mapping of files into the address space of a process. Like shared memory, mapped areas may be private or shared. Whilst not offering the performance of shared memory as a means of IPC, memory mapped files can be usefully applied in situations where files are to be read and written by multiple processes. Unlike shared memory, which is volatile, mmapped files are not, and, subject to the frequency of cache flushes, they provide a degree of resilience to system crashes.
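Returning to the shared memory calls, the sequence described above is short enough to show in full. The sketch below is a writer only, and makes a few arbitrary choices: the key is derived with ftok() from /tmp, the segment is 4096 bytes, and the segment is marked for removal immediately after use; a real application would use an agreed key, have a reader attach the same segment, and guard the data with a lock.

    /* Minimal sketch of the System V shared memory sequence:
       get an identifier, attach the segment, use it, detach, remove. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        key_t key;
        int shmid;
        char *seg;

        key = ftok("/tmp", 'K');                     /* arbitrary, agreed-upon key */
        shmid = shmget(key, 4096, IPC_CREAT | 0600); /* one small segment, owner read/write */
        if (shmid < 0) {
            perror("shmget");
            return 1;
        }

        seg = (char *) shmat(shmid, NULL, 0);        /* let the system choose the address */
        if (seg == (char *) -1) {
            perror("shmat");
            return 1;
        }

        strcpy(seg, "hello from the writer");        /* visible to any other attached process */

        shmdt(seg);                                  /* detach the segment */
        shmctl(shmid, IPC_RMID, NULL);               /* mark it for removal */
        return 0;
    }

A peer process would call shmget() with the same key, and shmat(), to see the same pages.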
BSD Sockets

The BSD socket scheme evolved during the early eighties, in response to DARPA demand for efficient networked applications. The 4BSD release was largely funded by DARPA, with the intention of providing a standard operating system for contractor research sites connected to the ARPANET. The socket interface is a far simpler scheme than System V Streams, and was designed with a focus on throughput performance rather than architectural elegance and modularity. In practice this is reflected in the fact that BSD based kernels have traditionally been slightly faster than System V kernels running on the same hardware.

Central to the design of the BSD IPC scheme is the characterisation of a communication channel by a set of semantic properties:

    1. in-order delivery of data
    2. unduplicated delivery of data
    3. reliable delivery of data
    4. preservation of message boundaries
    5. support for out-of-band (OOB) messages
    6. connection oriented operation
A pipe connection, for instance, has properties 1, 2 and 3. A datagram socket is unreliable, whereas a stream socket is reliable and may also carry OOB messages. Sockets must have a naming scheme, so that processes can connect without having to know anything about each other.

The socket interface is a model of technical elegance in its simplicity, and is therefore very easy to use. The following example is for a client process. Sockets are created with a socket() call, which returns a file descriptor number. This is analogous to the creation of a file, in that a kernel data structure has been allocated, via which read and write operations on the object may be carried out.

    int socket(int domain, int type, int protocol);

where the domain may be one of the following:

    AF_UNIX    local communication within a single host, using the filesystem name space
    AF_INET    the Internet domain, using the TCP/IP protocol suite

(other domains, such as AF_NS for the Xerox NS protocols, are also defined),
and the type one of the following:

    SOCK_STREAM    a reliable, connection oriented byte stream
    SOCK_DGRAM     an unreliable, connectionless datagram service
    SOCK_RAW       raw access to the underlying protocol layers
The protocol argument allows a choice, in some instances, of which protocol to use within the domain. Once the socket has been created, it must be bound to a name, and a connection must be opened to allow the transmission of data.

    int bind(int socket, struct sockaddr *name, int namelen);

The socket created by the socket() call exists in its protocol name space, but does not have a name assigned to it. The bind() call assigns a unique name to the socket.

    int connect(int socket, struct sockaddr *name, int namelen);

The connect() call is then used by the client to open the transmission path to the server process.

Server process operation is slightly more complex, and involves listening for and accepting connections. This is accomplished with the following calls:

    int listen(int socket, int backlog);

The backlog argument specifies the number of pending connections which may be queued up awaiting acceptance.

    int accept(int socket, struct sockaddr *addr, int *addrlen);

The accept() call is somewhat more powerful than its peers, in that it will actually create a new file descriptor (socket) for each accepted connection. It returns the value of the new descriptor, but leaves the original socket open to accept further requests.

Once the connection is open, data may be read and written with conventional read() and write() calls, as well as the socket specific send() and recv() calls, the latter providing support for OOB messaging. Once the socket is no longer needed, it is destroyed with a close() call.

As is clearly evident, the socket scheme is very simple to use, and is intentionally blended into the existing file descriptor scheme. Interestingly, the CSRG designers chose to provide a separate mechanism for opening a socket connection from that used for opening files; the alternative would have been to overload the open() system call to provide a common interface.
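Putting the server side calls together, the sketch below shows a minimal connection oriented server in the Internet domain. It is a sketch only: the port number (5000), the backlog of 5 and the echo of a single buffer are arbitrary choices, and a real server would loop around accept(), dispatching each connection to a child process or handler.

    /* Minimal sketch of a connection oriented (SOCK_STREAM) server:
       create, bind, listen, accept, then use ordinary read()/write(). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        int s, conn;
        struct sockaddr_in addr, peer;
        socklen_t peerlen = sizeof(peer);
        char buf[256];
        ssize_t n;

        s = socket(AF_INET, SOCK_STREAM, 0);         /* a stream socket in the Internet domain */
        if (s < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);    /* accept connections on any interface */
        addr.sin_port = htons(5000);                 /* arbitrary port number */

        if (bind(s, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }
        listen(s, 5);                                /* queue up to five pending connections */

        conn = accept(s, (struct sockaddr *) &peer, &peerlen);  /* new descriptor per connection */
        if (conn < 0) {
            perror("accept");
            return 1;
        }

        n = read(conn, buf, sizeof(buf));            /* conventional read()/write() work on sockets */
        if (n > 0)
            write(conn, buf, n);                     /* echo the data back to the client */

        close(conn);                                 /* destroy the per-connection socket */
        close(s);                                    /* and the listening socket */
        return 0;
    }

The corresponding client simply calls socket(), fills in the server's address, and calls connect(), after which it too can use read() and write() on the connection.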
System V Streams

The Stream mechanism used by System V Unixes is one of the most architecturally elegant ideas in the original AT&T product. That elegance is, however, costly both in terms of complexity of use and CPU performance, in comparison with the simpler BSD socket scheme, and can become a significant performance factor where a system is running a large number of users over network connections.

Streams first appeared in System V Release 3.0, and are the brainchild of Dennis Ritchie. The central idea behind Streams is that of a modular interface for an I/O stream, which allows the stacking of multiple modules on a stream. These modules may implement communication protocols, serial I/O cooking (i.e. line disciplines) or multiplexing and demultiplexing of multiple I/O streams into a single channel. One nice way of looking at Streams is that every module is in effect a transform engine, which performs a specific mapping of the input I/O stream into an output I/O stream.

The possibilities available to a designer using Streams are very broad, and the design philosophy of Streams, if religiously adhered to, should allow considerable portability of Streams module code. Space limitations preclude a more detailed discussion of the inner workings of Streams, and of interfacing to Streams; this will be the subject of a future item. Interested readers are referred to Berny Goodheart's "Magic Garden" book (Goodheart and Cox, The Magic Garden Explained).

Summary

Nearly all Unix implementations today support System V Shared Memory and both the BSD socket and System V Streams interfaces. In the latter two instances, however, this in itself says very little about the achievable throughput performance of either interface, as implementations can be quite different from one another. A typical implementation on a contemporary generic SVR4 system will see sockets implemented by a socket library, which essentially piggybacks a socket connection on top of the OS native Streams IPC mechanism. From a performance perspective this is most unfortunate, as the computational overheads of Streams operations are further increased by the overheads of the socket interface layer above them. This is why networked applications written around sockets typically perform worse on generic SVR4 systems. One major Unix vendor is adding native socket support to its SVR4 kernel simply to work around this limitation, but ultimately the problem will persist on most platforms until the established base of socket applications is ported across to Streams.

As is evident, IPC is not an aspect of application and OS integration to be trifled with, if performance is an issue. The naive view that the vendors of 4GL products and the vendors of platforms have addressed all of these issues is exactly that: naive. If system IPC performance is a serious issue, which it usually is in a client-server environment, then this area must be on the checklist of items to be evaluated. Those who ignore it do so at their peril.
$Revision: 1.1 $
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp