Dr Carlo Kopp's Publications Archive

The C Stdio Library

Originally published April, 1995

by Carlo Kopp

The humble statement #include <stdio.h> is probably the first item which a beginning C programmer must grapple with, when creating a first "hello world" program. Little does the beginner suspect the underlying complexity beneath the C stdio library, which is a particularly clever exercise in hiding the native I/O system of the platform in question.

The C language, as we all know, originated in the seventies with the development of early Unix at Bell Labs. C is a language which has often been described as underfeatured in its native I/O support - I/O is primarily a creature of the stdio library, which provides the functionality sought by users of I/O.

To unravel the historical complexities behind this state of affairs is a major story in itself. Suffice to say that it was very much a case of evolution in action, to paraphrase a certain novel. What is of importance today is that a C environment with its collection of libraries will, with varying degrees of success, impose aspects of the Unix I/O model onto whatever platform is in use. In many instances this works very nicely, in some it is a bit of a shotgun marriage.

File Objects and the C/Unix stdio I/O Model

One of the most powerful features of the Unix operating system, and one which has substantially contributed to the system's cross platform portability, is its scheme of abstracting the platform's I/O as file type devices. This arrangement means that devices are accessed in a uniform fashion, as the OS conceals the ugly idiosyncrasies of the hardware. Outside of Unix, it is a big bad world when it comes to device handling.

The idiosyncrasies of hardware and non-Unix operating systems which may support it are manifold. The first major issue in this area is that devices can be accessed in character mode, or with mass storage devices, block mode. This behaviour is inherent within the hardware, and should you wish to write a file to disk, it is copied on to the medium as a block, rather than the selective writing of whatever has been changed.

Character mode devices, such as dumb terminals, are another world of their own. Asynchronous terminals using the 7-bit ASCII alphabet will need to be driven with control/escape sequences, and will also require support for "raw" and "cooked" mode operation; in the latter case the operating system filters out certain control codes as required. Character streams may also need to deal with line termination issues, particularly where the terminal expects to see newline (or return) characters handled in specific ways.

Another issue is how the hardware and OS handle process to process stream communication (eg pipes), if it is at all supported. Where it is supported, the issue is of course how this appears to the programmer.

In summary there are at least three major entities which a program needs to read and write to. These are character devices, block devices and other programs.

Unix set a number of interesting technical precedents in terms of how I/O is handled, and these have since become conventions with the proliferation of the C language into the wider marketplace.

The designers of the Unix/C environment chose to use the file descriptor/pointer mechanism, which had earlier appeared in PL/I and Multics, and further developed the idea. The familiar FILE pointer construct used for I/O conceals the machinery of the file descriptor. File descriptors differ between various flavours of Unix, as well as being unique to every C language port to a non-Unix platform. The descriptor's purpose is to provide a uniform interface to a device, regardless of its native behaviour. The user is no longer responsible for managing the internals of his or her device interface; this is taken over by the C library and the operating system.

A good illustration of different design strategies in file descriptors can be made by comparing the SVR4 and 4.4 BSD descriptors, both of which will appear functionally virtually identical to the programmer.

The SVR4 file descriptor (Fig.1) has seven key elements. The first two are file table linked list pointers, which point to the next and previous descriptor in the file table. Most systems will use a linked list scheme for managing file descriptors in the table, as the rate of file opening and closing is usually very modest and the linked list is a nice and simple mechanism.

The next element in the table is a mask of open mode bits, which specify the mode in which the file has been opened. An example would be read-only and no-delay. This field is important because it effectively caches the open mode in a readily accessible place.

The mode flag field is then followed by the reference count. The reference count mechanism is all about multiple pointers to a single descriptor. Should multiple pointers be assigned to the one file, open() and close() operations increment and decrement the reference count so as to ensure that only after the last pointer is closed is the file actually closed. Should it be otherwise, the closing of one pointer could close the file out from under another pointer, causing a subsequent error on a read or write operation.

The reference count is then followed by a pointer to a vnode structure, which contains filesystem specific parameters for the opened file (see Sept 94 OSR). It is via the vnode that the file is actually accessed.

The next field in the file descriptor is the offset, which indicates how deep we are in the file. During writes or reads, the offset is advanced on every operation so that the subsequent operation knows where to start.

The final element in the structure is a pointer to a structure containing process credentials.

The BSD 4.4 design (Fig.2) is somewhat newer, and given the former Berkeley group's somewhat more adventurous approach to integrating new object types into the operating system, the descriptor also reflects this.

The first field in this file struct is a status flag, indicating file lock status. It is then followed by a type field, distinguishing eg between vnodes and sockets. The reference count is retained, as with SVR4, for all the same reasons.

The msgcount field is used when passing descriptors to other processes.

The big difference between the SVR4 and 4.4 BSD descriptors lies in the latter's use of an object oriented approach to operations on the file. This is reflected in the next field in the struct, which is an array of file operations pointers, which point to file type specific functions. When a file is opened and a table entry occupied, it is loaded with an array of pointers to file type specific operations (functions).

These are read/write, ioctl, select() and close(), which are implemented differently for each class of device. The internal plumbing of functions such as fread, fwrite, read, write, close, fclose() is thus radically simplified.

The remaining fields are a vnode/socket identifier, a file offset, a credentials field and file table pointers.

These two examples do a good job of depicting how different implementations can perform the same task on identical or different hardware platforms. Porting C and its libraries to a non-Unix platform creates problems of its own, as the underlying OS may or may not easily accommodate the required mechanisms.

The easiest non-Unix platform to deal with will be a modern microkernel, as the libraries have access to devices at a very low level and thus the writer of the library can produce his or her own machinery to replicate the Unix style I/O model. While this involves more effort, the environment will usually allow for a closer fit in the final product.

Where the host OS is a proprietary monolithic kernel (eg MPE, VMS, NT), the task becomes far more complex. The programmer porting a stdio library will have to carefully work around the idiosyncrasies of the platform to get the desired result. Many platforms strongly type their file objects, and thus the Unix model of text and binary file transparency no longer applies. While the ANSI standard accommodates this, it will result in additional complexity.

Another area which can create some heartache is producing a mechanism to interface to the native interprocess communications to implement support for functions such as piping. While this exceeds the scope of the stdio library, it can become a major issue when porting Unix like environments such as those defined by POSIX 1003.1 and 1003.2.

In summary it is fair to say that replicating the Unix like I/O model inherent in the stdio library can be a non-trivial task, subject very much to the degree of intractability in the host platform. Those who have experienced peculiar behaviour with I/O on non-Unix platforms should not be surprised, as the writer of the library may have had to do some very interesting coding gymnastics to get the required behaviour in the central areas of the library's operating envelope. This often means that less frequently used modes fall victim to the host OS.

Unix I/O Libraries

Modern Unix implementations provide a user with access to two different I/O function sets. The first of these are the "low level" I/O functions, which are comprised of system calls enabling very precise control of I/O parameters. The second is the stdio library, which in most modern systems is compliant with the ANSI/ISO standard.

The low level I/O functions open(), chmod(), close(), dup(), lseek(), read(), write(), umask() are Unix specific, and will often vary in detail between various vendors' implementations. In most situations, these functions are the lowest level at which a program can carry out I/O operations, and therefore provide high speed as well as fine control of file opening parameters, as compared to the stdio library which is usually one tier up the hierarchy.

The undesirable aspect of using low level Unix I/O is often doubtful portability, particularly to non SVR4/BSD platforms. Where code is expected to be moved to non-Unix platforms this problem is further exacerbated, as some of these functions will simply not be supported. It follows that these functions should only be used by those who know exactly what they are doing.

The industry standard for low level Unix-like I/O is defined by the IEEE POSIX 1003.1 document. Not all Unix implementations are at this time fully, or even partially, POSIX.1 compliant; caution must therefore be exercised when assessing the portability of applications written to this standard, or written to use native low level I/O.

In many implementations, the stdio library is layered on top of the low level Unix I/O. Functions in the stdio library are thus wrappers for the low level I/O functions, with additional argument processing and buffering as may be required.

The ANSI X3.159-1989 Standard stdio Library

The ANSI and subsequent ISO C Language standards grew out of the 1984 /usr/group C language standard, which was focussed on Unix standardisation. When the need for a more general C Language standard emerged, as various vendors produced non-Unix hosted C compilers, ANSI formed the X3J11 committee to produce a non-platform specific definition for the language and its suite of libraries.

The need to accommodate operating systems other than Unix, and thus devoid of many of the powerful services Unix offers, led to a de facto split in the standardisation of the language. Unix centred features gravitated to the IEEE's POSIX.1 standard, and the ANSI standard stripped out those features which are considered Unix specific. Support for pipes, ioctls, process control facilities, curses and file permissions is not provided in the ANSI C standard.

The ANSI X3J11 committee had an interesting time when formalising the set of functions to be used in the new standard. This was because the established Unix environments provided powerful native unbuffered I/O functions (read(), write()), whereas many other platforms were more restricted in services provided. A major issue was the inability of many platforms to support the transparent binary character I/O stream model of Unix.

This was further complicated by line termination conventions used in non Unix platforms. Whereas Unix will transparently pass newline terminated lines through an unbuffered I/O stream, other platforms need not do so and this had the potential to break programs.

Another major issue was file descriptor handling. Unix reserves descriptors 0, 1 and 2 for stdin, stdout and stderr streams. Many systems could not accommodate this, so a more general mechanism was required.

In the end the committee decided that all of the I/O services would be built around a buffered stream model, so that the library code could implement much of the Unix like functionality missing in other platforms. Buffer management would use implementations of the setbuf()/setvbuf() family of functions.

Changes against existing Unix environments were numerous. For instance FILENAME_MAX, FOPEN_MAX and TMP_MAX parameters were added to allow for platform specific limits.

The buffered stream mechanism used in the standard is as close to the Unix model as can be expected. Key features are :

Line definition - in Unix lines are delimited by newline characters. Other systems require newline-return pairs (NL/CR), or pad blanks after the line to a complete record size. The ANSI standard reconciles this by enforcing Unix like behaviour at the application level, but allowing the library routines to map the line format to the platform's native format if required. Trailing blank handling is specified to be implementation dependent.

Transparency - the ANSI standard supports text and binary stream formats. This is so an application which needs to see raw I/O, can use a binary stream on a platform which would otherwise mangle the record structure of the native format in mapping to a text stream. In this fashion, ill behaved environments can be accommodated.

Random Access - in Unix a file position index points to the position of the next character in a file object, regardless of character type. On some platforms, this mechanism breaks down as a newline line delimiter in C may in fact map into an arbitrary number of padding characters in the native format. The fseek(), fgetpos() and fsetpos() operations are defined to provide suitable position encoding on arbitrary platforms (see 7.9.9 in the standard).

Buffering - in Unix the user has considerable flexibility in specifying how the I/O stream is buffered, or unbuffered. As other systems may not provide this functionality, the ANSI standard provides setbuf() and setvbuf() functions but allows considerable latitude in implementation.

The biggest single deviation from the Unix model is of course the split between binary and text streams. This was unavoidable. Even this model ran into difficulties with some implementations, where the native system is unable to handle exact file sizes - in these instances the standard requires padding with blanks. The philosophy behind this is to allow I/O from C programs to be digestible by programs native to the platform.

File handling functions were also altered against the Unix model, in a number of areas.

remove() - the original unlink() function was too closely tied to Unix. It was therefore replaced by a system specific remove() function.

rename() - the Unix like link() function was deemed to be too closely tied to its native environment. An implementation defined rename() was introduced instead.

tmpfile() - this function provides scratch space for working, using a binary stream format.

tmpnam() - this function is related to tmpfile(), but only generates a unique file name; a work file created under that name is non-volatile, as it is not removed automatically.

File access functions were altered in the following fashion:

fclose() - some systems may require something be written to the file before closing.

fflush() - fflush() provides a portable mechanism to fsync() a stream on an arbitrary platform.

fopen() - the principal change is in the provision of the "b" for binary stream qualifier to be used where required (above). The fopen() function has provision for additional platform specific qualifiers, should these be required. Portability suggests these should be avoided like the plague (eg fd = fopen("blogs","wb, recordsize=256"); ).

setbuf() - maps into setvbuf() and retained for compatibility.

setvbuf() - this function retains the syntax of System V, but may in fact do nothing useful. The expectation is that programmers will produce code which is not functionally dependent on this call, but can take advantage of it given opportunity.

Formatted I/O functions also experience a number of notable changes:

fprintf() - an L modifier (eg %Lf) was added to accommodate long double. The %p qualifier was added for pointers, which need not be int sized. The %n qualifier was added to provide an awk like count of the characters written so far.

fscanf() - the design rationale is that a single conversion failure causes an invocation failure. Single char pushback only is supported. Format conversions are compatible with strtod() and strtol().

vprintf(), vfprintf() and vsprintf() - adopted from System V.

fgetc(), fputc() - not to be implemented by macro, as is customary with Unix.

fgets() - replaces gets(), which was prone to overrun if inadequately buffered.

puts() - also terminates with a newline.

ungetc() - no longer requires the reading of a char prior to invocation, this in turn requires storage for the pushback char. The value of the file position indicator (index) becomes undefined with an ungetc() operation.

File positioning functions were also altered in some areas:

fgetpos(), fsetpos() - provide for larger file sizes than supportable by fseek() and ftell().

fseek(), ftell() - these retain Unix like behaviour for binary objects, but provide for encoding of record position / byte position in text files, if required. ftell() will fail on terminal streams.

rewind() - resets EOF and ERR indicator flags for the file.

As is evident, the ANSI standard is a reasonably successful attempt to reconcile often diametrically opposed platform specific requirements. Even so, programmers writing I/O intensive applications intended to be ported across multiple and dissimilar platforms should give careful consideration to how they use the stdio library functions. A good read of P.J. Plauger's "The Standard C Library" is highly recommended, also the ANSI Standard rationale document (ftp.uu.net:/doc/standards/ansi/X3.159-1989/rationale.PS.Z) is well worth reading.

As for those fortunate souls who have had the pleasure of writing stdio libraries for non-Unix platforms, nothing else needs to be said !

typedef struct file {
    struct file   *f_next;      /* pointer to next entry */
    struct file   *f_prev;      /* pointer to previous entry */
    ushort         f_flag;      /* state flags */
    cnt_t          f_count;     /* reference count */
    lock_t         f_lock;      /* lock for f_flag and f_count */
    struct vnode  *f_vnode;     /* pointer to vnode structure */
    off_t          f_offset;    /* read/write character pointer */
    struct cred   *f_cred;      /* credentials of user who opened it */
    cnt_t          f_msgcount;  /* message reference count */
} file_t;

Figure 1.

struct file {
    struct file *f_filef;       /* list of active files */
    struct file **f_fileb;      /* list of active files */
    short f_flag;               /* see fcntl.h */
    short f_type;               /* descriptor type */
    short f_count;              /* reference count */
    short f_msgcount;           /* references from message queue */
    struct ucred *f_cred;       /* credentials associated with descriptor */
    struct fileops {            /* per-type operations; argument lists omitted */
        int (*fo_read)();
        int (*fo_write)();
        int (*fo_ioctl)();
        int (*fo_select)();
        int (*fo_close)();
    } *f_ops;
    off_t f_offset;
    caddr_t f_data;             /* vnode or socket */
};

Figure 2.

Last Updated: Sun Apr 24 11:22:45 GMT 2005