Embedded Systems and Y2K
Originally published April, 1999
by
Carlo Kopp
© 1999, 2005 Carlo Kopp
Heralded by many as the impending collapse and demise of civilisation as we know it, and rejected as vendor and overpaid consultant FUD by many others, one thing which is certain about the Y2K issue is that there is no shortage of opinion on the subject! Scenarios of impending "turn of the millennium doom" aside, there are quite a few technically interesting issues and non-issues surrounding the subject, so in the interests of injecting some clarity into the debate, at the editor's suggestion I will attempt to elucidate some of these.

Clearly Y2K failures in your digital microwave, toaster or kitchen range are unlikely to spell the end of civilisation, unless of course an overdone roast planned for post New Year's Eve nibbling is going to cost you that elusive multimillion dollar order from the client you are wining and dining on that occasion. A somewhat less frivolous issue might be a Y2K problem which causes a critical production process control system to decide to take a turn of the millennium holiday, just to celebrate the event.

To best gain insight into the issues and their implications, it is most useful to explore the mechanics of how a "classical" Y2K failure arises, and how it is likely to affect the victim system.

Y2K Failure Modes

The most basic definition of a Y2K failure is one which arises as a result of a realtime clock (RTC), or supporting software, failing to correctly recognise a date beyond the 31st December, 1999.

At the lowest level within any computer system, embedded or non-embedded, timekeeping is performed by an RTC circuit or device. The simplest design for an RTC comprises a high speed, precision, quartz crystal oscillator driving a counter circuit. The counter may be a simple straight binary design, a binary coded decimal (BCD) design, or a smarter design which can count in years, months, days, hours, minutes, seconds and fractions of seconds.

In the instance of binary or BCD counters, the operating system or embedded program accesses the time through a memory or I/O mapped register. The binary or BCD time value is then crunched by a little piece of code which spits out the date in human comprehensible terms, for use in programs and internal housekeeping. Should the counter actually decode into genuine dates and times, then the process is even simpler and the numbers are just read out for direct use.

The starting point for a possible Y2K problem lies in the design of the hardware, which may or may not provide four decimal digits worth of storage for the year field. A thrifty designer might have, many years ago, decided that an additional eight latches bumps up the chip real estate by an unacceptable 4%, and thus decided to earn the praise of his project leader or manager and limit the storage available to eight latches, enough for the last two digits alone. After all, who is likely to be using this junk in 10, 12, 15 or 20 years time?

Of course, in the absence of the upper two date digits, come the magic turn of the millennium the counters roll over to zero, and the layers of software above essentially conclude that the glorious twentieth century hath dawned upon us, and return a value of 1st January, 1900. This is the most deeply embedded source of a possible Y2K fault, and in most instances also one of the easiest to ferret out, but potentially the most expensive to fix. A caveat here is that a clever chip designer might have also hardwired the upper two register digits to 19 decimal, using live counters only for the lower two digits.
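To make this failure mode concrete, here is a minimal sketch of the kind of read-back code involved, assuming a hypothetical RTC which stores only two BCD digits of year; the register contents are simulated here for illustration:

```c
#include <stdio.h>

/* Convert a packed BCD byte (0x00..0x99) to binary. */
static unsigned bcd_to_bin(unsigned char bcd)
{
    return (bcd >> 4) * 10u + (bcd & 0x0Fu);
}

/* The classic trap: the century is assumed rather than stored, so
 * once the two digit year register rolls over to 0x00, this code
 * dutifully reports the year as 1900. */
static unsigned year_from_rtc(unsigned char year_reg)
{
    return 1900u + bcd_to_bin(year_reg);
}

int main(void)
{
    printf("31 Dec 1999, year register 0x99 -> %u\n", year_from_rtc(0x99));
    printf(" 1 Jan 2000, year register 0x00 -> %u\n", year_from_rtc(0x00));
    return 0;
}
```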
So the apparent presence of four BCD coded date digits need not mean that the device is well behaved.

Even if the RTC chip does not carry the Y2K "feature" within its basic design, we are still not out of the woods. The next possible place the dreaded Y2K can raise its ugly head is in the device driver code which reads back the RTC register value. If the system is based upon a venerable 8 or 16 bit CPU, odds are that a programmer may have decided to store the read back register value in a single 16 bit word, as a pair of BCD digits, or in a single byte, as an 8 bit value. In either event, the imperative to squeeze hand crafted code into a tiny microcontroller chip may lead the programmer to strip away the upper two decimal digits of the date value. So even if the hardware is spitting out the proper numbers, they may simply be ignored by the operating system, or by the date scanning code in an event loop based system.

Let us however assume that the hardware, the operating system kernel and the device drivers are behaving themselves as we would like them to. Are we out of the woods yet? The answer is no. The next layer up at which we can get into difficulty is the system's runtime libraries, whether shared and loaded at runtime, or static and linked at compile time. The date handling subroutines which may exist in these libraries may or may not allocate the required amount of storage. If a library was designed for an 8 or 16 bit CPU, or was written for such and ported upward to a 32 bit CPU, then it may strip away the upper two digits, or simply ignore them, returning a fixed value of 19 to plug the evident gap.

At this point we will know whether the hardware, OS and libraries are safe or unsafe. If they are safe, we still have to contend with the application program running in this environment, since it too may have been coded up with the insidious Y2K "feature". Where memory is at a premium, there are always very strong incentives to strip away redundant storage, and a decade ago the digits 19 in most instances qualified as exactly that. This is particularly true of systems which maintain a database, or log data. Extra bytes added to each and every record in a population of hundreds of thousands or millions of records amount to an appreciable additional storage requirement, especially in very tightly coded systems. Legacy applications, "uphosted" from 8 and 16 bit architectures to modern 32 bit architectures, are obvious candidates for this mode of failure. Applications written in the age of 32 bit machines and cheap megabytes of memory are far less likely to exhibit this category of Y2K "feature".

In summary, the most rigorous approach to understanding the problem derives from basic reliability theory and the famous Lusser's Product Law, which states that the probability of survival of a functional serial chain is the product of the probabilities of survival of each and every component in the chain. Each and every step in which a date is processed within the system constitutes an element in this model, and must be Y2K safe for the system to be safe.
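To put an illustrative number on this: if each of the five layers involved (RTC hardware, device driver, kernel, libraries and application) independently had, say, a 99% probability of being Y2K safe, the probability of the whole chain being safe would be 0.99^5, or roughly 95%. A seemingly small per-layer defect rate compounds into an appreciable risk at the system level.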
If any of these five system "layers" exhibits the Y2K "feature", the serial chain is broken and the Y2K bug will arise at the system level. Fixing the Y2K bug therefore requires the modification or replacement of any items of hardware or code which exhibit the Y2K feature. This is often easier said than done.

If the equipment is based on legacy hardware, such as that from the first generation of 16 bit minicomputers, or 8 bit microcontrollers, a Y2K problem in the hardware is most likely unmodifiable, unless one has design schematics for the board and is also a reasonably competent hardware designer. Even then the effort may be futile, since the layers of code sitting on top of the hardware will probably also need to be modified. If you have the source code and the inclination, then modifying the code is a technically feasible choice. Whether it is economical is another issue. It may be cheaper to trash the system and replace it from the ground up, or to replace the hardware and rehost the application, modifying the code in the process.
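Where the source code can be modified, one widely used repair (not specific to this article, but a staple of Y2K remediation work) was the date "window": two digit years below a pivot are interpreted as 20xx, the rest as 19xx. A minimal sketch, with the pivot value of 70 chosen arbitrarily for illustration:

```c
#include <stdio.h>

/* Date windowing: map a two digit year to a full year using a pivot.
 * The pivot of 70 is an assumption for illustration: two digit years
 * 00..69 are read as 2000..2069, and 70..99 as 1970..1999. */
static unsigned window_year(unsigned yy)
{
    return (yy < 70u) ? 2000u + yy : 1900u + yy;
}

int main(void)
{
    printf("99 -> %u\n", window_year(99u)); /* 1999 */
    printf("00 -> %u\n", window_year(0u));  /* 2000 */
    return 0;
}
```

Note that a window merely defers the ambiguity, here to the year 2069, which is why it belongs in the bandaid category rather than among the technically proper fixes.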
Y2K and Embedded Systems

We discussed the internals of embedded systems in the last issue of Systems, in the context of the suitability of Unix for such applications. For those readers who may have missed the last issue, I will paraphrase the definition of an embedded system from that feature: " ... embedded applications are characterised most frequently by the requirement to interact with the external world (here meaning machinery within which they are embedded) more frequently than with a human operator. Indeed, many embedded applications are characterised by completely hidden interfaces, the typical ignition computer on a car would be an excellent example in that it knows about the position of the accelerator pedal transducer and that is the total of its interaction with the operator."

Embedded systems are today ubiquitous in both the modern household and industry. The myriad of applications spans not only the previously noted household microwave, but also televisions, VCRs, stereos, telephones, fax machines, modems, communications equipment of every variety and speed, automotive systems, computer peripherals, industrial process control systems, every variety of regulator or controller (from sprinklers to motor speed controls), air traffic control systems, flight control computers in aircraft and satellite launchers, and every imaginable variety of military equipment; the list indeed is very long.

The scale of the disasters arising from misbehaving embedded software can best be gauged by two notable recent examples: the failed launch of the Ariane 5 prototype vehicle, where the navigation software got a little confused and tipped the 100 ton plus gross weight booster sideways at about Mach 1, and the interminable woes of an unnamed US airport which simply could not get its digital, fully software controlled luggage handling system to behave itself. While no lives were lost in either instance, the difference between a lethal disaster and a comical tale for code cutters to contemplate can often be very slim indeed. Had the Ariane booster's range safety system failed, it had the capacity to lay waste to an area the size of a large village or a decent suburb. If an airliner navigation system or flight control system exhibits a systemic failure, the results can definitely be tragic.

The series of Airbus A320 crashes resulting from pilots punching the wrong modes into the flight management system, or the hapless Korean Jumbo shot to bits by Russian fighters, ostensibly due to the captain transposing digits when programming the navigation equipment, should be a good reminder for those who may wish to trivialise the importance of embedded system failures.

What is the potential impact of the Y2K "feature/bug" on embedded systems? As is frequently the case, there is no trivial, all encompassing answer. The starting point in evaluating the importance of Y2K in embedded systems is that of the specific system level effects which may arise.

Many embedded systems simply do not care about dates, and for such systems Y2K is a total non-issue. In many embedded systems, dates are used strictly for purposes of logging activity, and the worst possible side effect might be that logs printed off after the turn of the millennium are dated starting from the 1st January, 1900. Annoying, but unlikely to cause loss of limb, life or production.

However, in some systems the date information may be used in calculations, and a spurious value, an undefined value, or dates starting from 1900 may cause a piece of code to get confused and crash. Now this piece of code may not be functionally critical, and thus may not impair the function of the system as a whole, other than trashing internal logging. However, depending on the robustness of the system design, one task crashing may cause other tasks to either crash or hang, resulting in a system crash or hang. This in turn may cause other failures at a higher system level.

So Y2K does have the potential to cause serious problems in some embedded systems, with other, more serious consequences resulting in turn. An air traffic control system which decides that it ought to lock up could produce dire effects. A railway signalling system could also produce unhealthy failure modes. A control system for a chemical plant or petrochemical/gas facility crashing could likewise produce very ugly side effects, ranging from valves being stuck open or closed, to stuff being pumped into the wrong places. Embedded code which has decided to behave weirdly can exhibit the most bizarre of symptoms at times.

Dealing with Embedded Y2K

The first step in dealing with the issue of Y2K in embedded systems is to determine what systems are in use, and the likely or known consequences of a possible Y2K problem. We can divide Y2K failures, for clarity, into the categories of "beancounting failure" and "system functional failure". The former means that data logging may be compromised, but the system will remain in operation. The latter means that the system may crash, hang or exhibit other non-date related failure modes.

A "beancounting" failure will at worst result in records being damaged from the onset of the problem, in a manner which may or may not be easily corrected. Whether this results in a loss of production or an inability to bill a client correctly is a system specific issue. Taking an ASCII file of log data and replacing all instances of 1900 in a specific field with 2000 is, in many instances, not a costly fix, and may work out to be much cheaper than rooting out the cause of the problem in the bowels of the system and fixing it. A bandaid may be a much more cost effective solution than a technically proper fix.
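As an illustration of just how small such a bandaid can be, here is a minimal sketch of a log fixing filter, assuming a hypothetical log format in which each record begins with a four digit year:

```c
#include <stdio.h>
#include <string.h>

/* Minimal "bandaid" log fixer: copies stdin to stdout, rewriting
 * records whose leading year field reads 1900 to read 2000. The
 * log format (year first on each line) is a hypothetical example. */
int main(void)
{
    char line[1024];

    while (fgets(line, sizeof line, stdin) != NULL) {
        if (strncmp(line, "1900", 4) == 0)
            memcpy(line, "2000", 4);
        fputs(line, stdout);
    }
    return 0;
}
```

Run over the damaged logs as a filter, a throwaway tool of this kind repairs the records without anybody having to touch the system which produced them.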
A "system functional failure" will be much more serious, especially if it prevents the system from being rebooted and brought online again. Until the Y2K problem is isolated the system will be down and a loss of production or function will result. The likely consequences of this will be system specific. If we have a critical system which is known to use dates, and we cannot afford any downtime, then it is prudent to dig a little deeper, especially if the system has its development origins in the 8/16 bit machine period, or still uses hardware of that generation. The next step is to explore what options are available for detecting a Y2K problem. For many systems, this is quite straightforward, since it involves setting up an offline, non-operational copy of the operational system, and winding its clock forward to the 31st December, 1999. This testbed system is then observed closely for failures as it rolls over the date and runs for the next several days. For system designed and integrated in-house, this might be a little messy but is still fairly straightforward. For instance a redundant gasline telemetry system which I co-designed a couple of decades ago was built from the outset for the "hot" standby system to monitor the activity of the "live" system, but be isolated from the network so it was unable to issue commands to compressor stations and other control elements. With such a system architecture there are no difficulties in running an offline Y2K validation check with live data. Whether this is feasible depends on the system architecture and the user's knowledge of that architecture. With a propensity for many organisations to "dumb" themselves down and go for turnkey systems, getting rid of in-house development and detailed technical support, this may become a little trickier. In such instances the end user organisation is basically at the mercy of the turnkey vendor and subject to the risks involved in a major Y2K failure, is likely to have to pay whatever the original integrator asks for to set up and run the test. If system under test hiccups, then we have cause to dig much deeper and see what the results of the hiccup may have been. If it is serious, then we know the system has a genuine Y2K problem and we can decide what to do about it. The "what" being highly specific to the system in question. The strategies for dealing with the problem are the replacement of the whole system, partial replacement and modification of the system, or modification of the problematic hardware and code modules. Which of these to pursue, at this late date, will be driven primarily by timelines. The Y2K issue has its serious aspects, and its comical aspects, the latter primarily in some of the public speculation we hear on this subject. As a former cutter of embedded code, I cannot help but chuckle frequently when I listen to many of the doom and gloom predictions in the media. For most embedded systems out there, Y2K will be a non-issue. For many embedded systems it may produce benign faults. For a small handful of systems which are both functionally critical, and exhibit a Y2K problem, the result of a Y2K fault might vary from the benign to the serious. Only the latter category represents a cause for concern. Systematic and rigourous operational and technical analysis, complemented by testing where appropriate, is the best strategy for dealing with Y2K, and one which will in most instances reduce any possible risk down to an infinitesimal value. |
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp