Introduction
The performance of computer hardware typically increases monotonically with time. Even if the same could be said of software, the rate at which software performance improves is usually very slow compared to that of hardware. In fact, many might opine that there is plenty of software whose performance has deteriorated consistently over time. Moreover, it is rather difficult to establish an objective performance metric for software as complex as an operating system: a "faster OS" is a very subjective, context-dependent phrase.
An operating system's architecture has a much greater longevity than that of common hardware. Operating system researchers do not come up with new, much faster algorithms as consistently or frequently as hardware updates happen. Nevertheless, those involved in "producing" operating systems -- researchers, designers, implementers, and even marketeers -- have the arduous task of ensuring that the associated performance curves keep going up. There are not many viable players in the OS market (some might argue, even if rhetorically, that essentially there's only one). Still, it is a very tough market, and OS vendors must "improve" their systems incessantly.
Now, given that you are not likely to run into earth-shattering algorithmic breakthroughs in every OS release cycle, how do you make your system faster? The problem has a multi-pronged solution:
- Rather than pedantically pursuing generalized optimizations, you could look into making the system faster for one (or a few) "common" usage scenarios.
- You could consider numerous minor and mundane performance improvements, even if they are technically unimpressive/uninteresting, or ugly/kludgy to implement. Together, such improvements could lead to a perceptible performance gain from the users' point of view.
- You could vary the granularity at which you would usually make improvements. For example, in addition to improving typical OS algorithms, you could look into improving more meta operations, such as the entire boot process.
- The most important kind of performance is the kind perceived by the eventual users of a system. Thus, in any usage scenario, a "faster workflow" is tantamount to "higher performance". It might be possible to make the workflow faster without making fundamental changes to the design and implementation of the components involved. With a priori knowledge of how the system will typically be used, you could rearrange the order in which things happen (even if the resulting order is unnatural or unclean), if doing so makes the user believe that things are happening faster.
Example: Mac OS X
This document discusses ten things that Apple did (beyond initial/fundamental OS design and implementation) to improve Mac OS X's performance. Some of these are simply good ideas and obvious candidates for implementation; some are guidelines or tools for developers to help them create high-performance applications; while others are proactive attempts at extracting performance from strategically chosen quarters. Consider the following a sampling of such optimizations, in no particular order:
Summary
- BootCache
- Kernel Extensions Cache
- Hot File Clustering
- Working Set Detection
- On-the-fly Defragmentation
- Prebinding
- Helping Developers Create Code Faster
- Helping Developers Create Faster Code
- Journaling in HFS Plus
- Instant-on
1. BootCache
Mac OS X uses a boot-time optimization (effectively a smart read-ahead) that monitors the pattern of incoming read requests to a block device (the boot disk), and sorts the pattern into a "playlist", which is used to cluster reads into a private cache. This "boot cache" is then used for satisfying incoming read requests, if possible. The scheme also measures the cache hit rate, and stores the request pattern into a "history list" so that it can adapt in the future. If the hit rate is too low, the caching is disabled.
The loadable (sorted) read pattern is stored in /var/db/BootCache.playlist. Once this pattern is loaded, the cache comes into effect. The entire process is invisible to users.
This feature is only supported on the root device. Further, it requires at least 128 MB of physical RAM before it is enabled (automatically).
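The mechanism described above can be sketched as a simple simulation. Everything below is an illustrative assumption, not Apple's implementation: the class name, the hit-rate cutoff, and the data are all made up. Read requests recorded during one boot are sorted into a playlist, the playlist is prefetched into a private cache for the next boot, and caching is abandoned if the measured hit rate is too low.

```python
# Toy model of a BootCache-style adaptive read-ahead.
# All names and the hit-rate threshold are hypothetical.

MIN_HIT_RATE = 0.5  # assumed cutoff below which caching is disabled

class BootCache:
    def __init__(self, playlist):
        # Prefetch: sort the recorded blocks so the disk is read
        # mostly sequentially, then hold them in a private cache.
        self.cache = set(sorted(playlist))
        self.hits = 0
        self.requests = 0
        self.history = []        # pattern recorded for the next boot
        self.enabled = True

    def read(self, block):
        self.history.append(block)
        if not self.enabled:
            return "disk"
        self.requests += 1
        if block in self.cache:
            self.hits += 1
            return "cache"
        return "disk"

    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0

    def evaluate(self):
        # If the cache is not paying off, stop using it.
        if self.hit_rate() < MIN_HIT_RATE:
            self.enabled = False
        return self.enabled

bc = BootCache(playlist=[7, 3, 9, 12])
sources = [bc.read(b) for b in [3, 7, 12, 40]]   # three hits, one miss
```

With three of four requests satisfied from the cache, the hit rate is 0.75 and the (assumed) evaluation keeps caching enabled.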
/System/Library/Extensions/BootCache.kext is the location of the kernel extension implementing the cache, while Contents/Resources/BootCacheControl within that directory is the user-level control utility (it lets you load the playlist, among other things).
The effectiveness of BootCache can be gauged from the following: in a particular update to "Panther", a reference to BootCacheControl was broken. BootCache is started (via BootCacheControl, the control utility) in /etc/rc, and a prefetch tag is inserted (unless the system is booting in safe mode). /etc/rc looks for BootCacheControl in the Resources directory of the BootCache.kext bundle, as well as in /usr/sbin, and finds it in the former (it doesn't exist in the latter). However, another program (loginwindow.app) accesses /usr/sbin/BootCacheControl directly, and does not find it. For what it's worth, making BootCacheControl available in /usr/sbin, say via a symbolic link, reduces the boot time (measured from clicking the "Restart" confirmation button to the point where absolutely everything has shown up in the system menu) from 135 seconds to 60 seconds on one of my machines.
2. Kernel Extensions Cache
There may be close to a hundred kernel extensions that are loaded on a typical Mac OS X system, and perhaps twice as many residing in the system's "Extensions" folder(s). Kernel extensions may have dependencies on other extensions. Rather than scan all these every time the system boots (or worse, every time an extension is to be loaded), Mac OS X uses caching for kernel extensions, and the kernel itself.
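The dependency-scanning cost that these caches avoid can be illustrated with a sketch: resolving a load order for extensions that declare dependencies on other extensions amounts to a topological sort. The extension names and dependency graph below are fabricated for illustration.

```python
# Hypothetical kext dependency resolution: each extension names the
# extensions it depends on; a valid load order loads dependencies first.

def load_order(deps):
    order, seen = [], set()
    def visit(kext):
        if kext in seen:
            return
        seen.add(kext)
        for d in deps.get(kext, []):
            visit(d)          # load dependencies before the dependent
        order.append(kext)
    for k in deps:
        visit(k)
    return order

# Illustrative graph (not real Mac OS X extensions):
deps = {
    "MyDriver.kext": ["IOFamily.kext"],
    "IOFamily.kext": ["KernelCore"],
    "KernelCore": [],
}
order = load_order(deps)
```

Walking such a graph for a hundred loaded extensions on every boot is exactly the work the caches below amortize.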
There are three types of kernel/kext caches used in this context:
- The kernel cache contains the kernel code, linked kernel extensions, and info dictionaries of any number of kernel extensions. The default cache directory for this type of cache is /System/Library/Caches/com.apple.kernelcaches. The cache files in this directory are named kernelcache.XXXXXXXX, where the suffix is a 32-bit Adler checksum (the same algorithm as used by Gzip).
- The multi-extension, or mkext, cache contains multiple kernel extensions and their info dictionaries. Such caches are used during early system startup. BootX, the bootloader, tries to load a previously cached list of device drivers (created/updated by /usr/sbin/kextcache). If the mkext cache is corrupt or missing, BootX looks in /System/Library/Extensions for extensions that are needed in the current scenario (as determined by the value of the OSBundleRequired property in the Info.plist file of the extension's bundle). The mkext cache exists by default as /System/Library/Extensions.mkext. You can use /usr/sbin/mkextunpack to extract the contents of an mkext archive.
- The kext repository cache contains the info dictionaries for all kernel extensions in a single repository directory, including their plugins. This cache exists by default as /System/Library/Extensions.kextcache. Note that this file is simply a large property list (XML) file that is Gzip-compressed.
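The kernelcache.XXXXXXXX suffix mentioned above is a 32-bit Adler checksum, the same Adler-32 algorithm used by zlib/Gzip. A minimal implementation:

```python
# Adler-32, the checksum algorithm whose hex digest names kernelcache files.
MOD_ADLER = 65521  # largest prime below 2^16

def adler32(data: bytes) -> int:
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a

# A cache file name would then look something like:
name = "kernelcache.%08X" % adler32(b"example input")
```

Adler-32 is cheaper than CRC-32 (two running sums, no table lookups), which is why zlib adopted it; for naming cache files, any stable digest of the input suffices.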
3. Hot File Clustering
Hot File Clustering (HFC) aims to improve the performance of small, frequently accessed files on HFS Plus volumes. This optimization is currently used only on boot volumes. HFC is a multi-staged clustering scheme that records "hot" files (except journal files, and ideally quota files) on a volume, and moves them to the "hot space" on the volume (0.5% of the total filesystem size located at the end of the default metadata zone, which itself is at the start of the volume). The files are also defragmented. The various stages in this scheme are DISABLED, IDLE, BUSY, RECORDING, EVALUATION, EVICTION, and ADOPTION. At most 5000 files, and only files less than 10 MB in size, are "adopted" under this scheme.
The "metadata zone" referred to in the above description is an area on disk that may be used by HFS Plus for storing volume metadata: the Allocation Bitmap File, the Extents Overflow File, the Journal File, the Catalog File, Quota Files, and Hot Files. Mac OS X 10.3.x places the metadata zone near the beginning of the volume, immediately after the volume header.
HFC (and the metadata zone policy) are used only on journaled HFS Plus volumes that are at least 10 GB in size.
Note that what constitutes the set of hot files on your system will depend on your usage pattern over a few days. If you are doing extensive C programming for a few days, say, then it is likely that many of your hot files will be C headers. You can use hfsdebug to explore the working of Hot File Clustering.
% sudo hfsdebug -H -t 10
# Top 10 Hottest Files on the Volume
rank temperature cnid path
1 537 7453 Macintosh HD:/usr/share/zoneinfo/US/Pacific
2 291 7485 Macintosh HD:/private/var/db/netinfo/local.nidb/Store.128
3 264 7486 Macintosh HD:/private/var/db/netinfo/local.nidb/Store.160
4 204 7495 Macintosh HD:/private/var/db/netinfo/local.nidb/Store.96
5 204 2299247 Macintosh HD:/Library/Receipts/iTunes4.pkg/Contents\
/Resources/package_version
6 192 102106 Macintosh HD:/usr/include/mach/boolean.h
7 192 102156 Macintosh HD:/usr/include/mach/machine/boolean.h
8 192 102179 Macintosh HD:/usr/include/mach/ppc/boolean.h
9 188 98711 Macintosh HD:/usr/include/string.h
10 178 28725 Macintosh HD:/%00%00%00%00HFS+ Private Data/iNode1038632980
3365 active Hot Files.
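The "temperature" ranking that hfsdebug displays can be modeled simply. As a hedged assumption (this matches the commonly described heuristic, not necessarily the exact kernel code), take a file's temperature to be the bytes read from it during the recording period divided by its size; adoption then picks the hottest files subject to the documented limits (at most 5000 files, each under 10 MB):

```python
# Toy Hot File Clustering: rank files by "temperature" and pick
# adoption candidates under the limits stated in the text.
MAX_HOT_FILES = 5000
MAX_HOT_SIZE = 10 * 1024 * 1024   # files of 10 MB or more are never adopted

def temperature(bytes_read, file_size):
    # Assumed heuristic: roughly, how many times over the file was read.
    return bytes_read // file_size if file_size else 0

def adoption_candidates(files):
    # files: list of (path, file_size, bytes_read)
    hot = [(temperature(r, s), path)
           for path, s, r in files if s < MAX_HOT_SIZE]
    hot.sort(reverse=True)
    return [path for t, path in hot[:MAX_HOT_FILES] if t > 0]

files = [
    ("/usr/include/string.h", 4096, 4096 * 188),          # read ~188 times
    ("/tmp/huge.iso", 700 * 1024 * 1024, 700 * 1024 * 1024),  # too large
    ("/tmp/cold.txt", 1024, 0),                           # never read
]
cands = adoption_candidates(files)
```

Only the small, repeatedly read header survives the filters, which is consistent with the C-headers-dominated listing above.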
4. Working Set Detection
The Mach kernel uses physical memory as a cache for virtual memory. When new pages are to be brought in as a result of page faults, the kernel would need to decide which pages to reclaim from amongst those that are currently in memory. For an application, the kernel should ideally keep those pages in memory that would be needed very soon.
In a utopian OS, one would know ahead of time which pages an application will reference as it runs. There have been several algorithms that approximate such optimal page replacement. Another approach is to make use of the locality of reference of processes. According to the Principle of Locality, a process refers to a small, slowly changing subset of its set of pages. This subset is the working set. Studies have shown that the working set of a process needs to be resident (in memory) in order for it to run with acceptable performance (that is, without causing an unacceptable number of page faults).
The Mac OS X kernel incorporates a subsystem (let us call it TWS, for Task Working Set) for detecting and maintaining the working sets of tasks. This subsystem is integrated with the kernel's page fault handling mechanism. TWS builds and maintains a profile of each task's fault behavior. The profiles are per-user, and are stored on-disk, under /var/vm/app_profile/
. This information is then used during fault handling to determine which nearby pages should be brought in.
Several aspects of this scheme contribute to performance:
- Bringing a number of pages in (that would hopefully be needed in the near future) results in a single large request to the pager.
- TWS captures, and stores on disk, an application's (initial) working set the first time it is started (by a particular user). This file is used for seeding (sometimes called pre-heating) the application's working set, as its profile is built over time.
- The locality of reference of memory is usually captured on disk (because files on disk usually have good locality on HFS Plus volumes). Thus, there should not be too much seeking involved in reading the working set from disk.
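The profile-driven fault handling above can be sketched as follows. The data structures, names, and window size are invented for illustration: on a fault, the handler consults the stored profile and asks the pager for the faulting page plus any profiled neighbors, turning several small requests into one large one.

```python
# Toy model of profile-guided page-in: one clustered request instead
# of several single-page faults. Names and window size are assumptions.
PAGE = 4096
WINDOW = 8 * PAGE   # assumed neighborhood considered around a fault

def pages_to_bring_in(fault_addr, profile):
    # profile: set of page-aligned addresses seen in the task's
    # working set during earlier runs.
    base = fault_addr - (fault_addr % PAGE)
    cluster = {base}
    for addr in profile:
        if abs(addr - base) <= WINDOW:
            cluster.add(addr)
    return sorted(cluster)   # one large, mostly sequential pager request

profile = {0x1000, 0x2000, 0x3000, 0x90000000}
request = pages_to_bring_in(0x2010, profile)
```

A fault at 0x2010 pulls in the two profiled neighbors as well, while the distant profiled page is left alone.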
For a user with uid U, the application profiles are stored as two page-cache files, #U_names and #U_data, under /var/vm/app_profile/ (#U is the hexadecimal representation of U).
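The naming convention just described can be expressed directly. One assumption is hedged here: the text does not say whether the hex digits are upper- or lowercase, so lowercase is assumed below.

```python
# Build the per-user profile file names described in the text.
APP_PROFILE_DIR = "/var/vm/app_profile"

def profile_paths(uid):
    tag = format(uid, "x")   # "#U" is the hex representation of the uid
    return (f"{APP_PROFILE_DIR}/#{tag}_names",
            f"{APP_PROFILE_DIR}/#{tag}_data")

# The first ordinary user on Mac OS X typically has uid 501 (0x1f5):
names_path, data_path = profile_paths(501)
```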
The "names" file, essentially a simple database, contains a header followed by profile elements:
typedef unsigned int natural_t;
typedef natural_t vm_size_t;

struct profile_names_header {
    unsigned int number_of_profiles;
    unsigned int user_id;
    unsigned int version;
    off_t        element_array;
    unsigned int spare1;
    unsigned int spare2;
    unsigned int spare3;
};

struct profile_element {
    off_t        addr;
    vm_size_t    size;
    unsigned int mod_date;
    unsigned int inode;
    char         name[12];
};
The "data" file contains the actual working sets.
5. On-the-fly Defragmentation
When a file is opened on an HFS Plus volume, the following conditions are tested:
- If the file is less than 20 MB in size
- If the file is not already busy
- If the file is not read-only
- If the file has more than eight extents
- If the system has been up for at least three minutes
If all of the above conditions are satisfied, the file is relocated -- it is defragmented on-the-fly.
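The five checks translate directly into code. The function name and parameter names below are illustrative; the thresholds mirror the list (with uptime taken in seconds):

```python
# Relocation check mirroring the five conditions in the text.
MAX_SIZE = 20 * 1024 * 1024    # file must be less than 20 MB
MIN_EXTENTS = 8                # file must have more than eight extents
MIN_UPTIME = 3 * 60            # system up for at least three minutes

def should_defragment(size, busy, read_only, extents, uptime):
    return (size < MAX_SIZE
            and not busy
            and not read_only
            and extents > MIN_EXTENTS
            and uptime >= MIN_UPTIME)

# A small, idle, writable, badly fragmented file on a warmed-up system:
ok = should_defragment(size=5 * 1024 * 1024, busy=False,
                       read_only=False, extents=9, uptime=600)
```

Note that eight extents is the boundary: a file with exactly eight extents is left alone.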
File contiguity (regardless of file size) is promoted in general as a consequence of the extent-based allocation policy in HFS Plus, which also delays actual allocation. Refer to Fragmentation In HFS Plus Volumes for more details.
6. Prebinding
Mac OS X uses a concept called "prebinding" to optimize Mach-O applications (Mach-O is the default executable format) so that they launch faster (by reducing the work of the runtime linker).
The dynamic link editor resolves undefined symbols in an executable (and dynamic libraries) at run time. This activity involves mapping the dynamic code to free address ranges and computing the resultant symbol addresses. If a dynamic library is compiled with prebinding support, it can be predefined at a given (preferred) address range. This way, dyld can use predefined addresses to reference symbols in such a library. For this to work, libraries cannot have preferred addresses that overlap. Apple marks several address ranges as either "reserved" or "preferred" for its own software, and specifies allowable ranges for 3rd-party (including end users') libraries to use to support prebinding.
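The non-overlap requirement can be checked with a simple interval scan. The library names and address ranges below are fabricated for illustration:

```python
# Verify that prebound libraries' preferred address ranges don't overlap.
def find_overlaps(ranges):
    # ranges: list of (name, start, end), with end exclusive
    out = []
    ordered = sorted(ranges, key=lambda r: r[1])
    for (a, a_start, a_end), (b, b_start, b_end) in zip(ordered, ordered[1:]):
        if b_start < a_end:        # next range begins before this one ends
            out.append((a, b))
    return out

libs = [
    ("libA.dylib", 0x90000000, 0x90200000),
    ("libB.dylib", 0x90200000, 0x90300000),   # abuts libA: fine
    ("libC.dylib", 0x902F0000, 0x90400000),   # overlaps libB: bad
]
conflicts = find_overlaps(libs)
```

Abutting ranges are legal; any pair reported here would have to be relocated before both libraries could be prebound.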
update_prebinding is run to (attempt to) synchronize prebinding information when new files are added to a system. This can be a time-consuming process even if you add or change a single file, say, because all libraries and executables that might dynamically load the new file must be found (package information is used to help in this, and the process is further optimized by building a dependency graph), and eventually redo_prebinding is run to prebind files appropriately.
Prebinding is the reason you see the "Optimizing ..." message when you update the system, or install certain software.
/usr/bin/otool can be used to determine if a binary is prebound:
# otool -hv /usr/lib/libc.dylib
/usr/lib/libc.dylib:
Mach header
magic cputype cpusubtype filetype ncmds sizeofcmds flags
MH_MAGIC PPC ALL DYLIB 10 1940 \
NOUNDEFS DYLDLINK PREBOUND SPLIT_SEGS TWOLEVEL
7. Helping Developers Create Code Faster
Mac OS X includes a few optimizations that benefit developers by making development workflow -- the edit-compile-debug cycle -- faster. Some of these were introduced with Mac OS X Panther.
- Precompiled Headers: Xcode (gcc, specifically) supports precompiled headers. Xcode uses this functionality to precompile prefix headers.
% cat foo.h
#define FOO 10
% cat foo.c
#include "foo.h"
#include <stdio.h>
int
main()
{
printf("%d\n", FOO);
}
% ls foo.*
foo.c foo.h
% gcc -x c-header -c foo.h
% ls foo.*
foo.c foo.h foo.gch
% gcc -o foo foo.c
% ./foo
10
% rm foo.h
% gcc -o foo foo.c
% ./foo
10
- Distributed Builds: Xcode (through distcc) supports distributed builds, wherein it is possible to distribute builds across several machines on the network.
- Predictive compilation runs the compiler in the background (as soon as it can, even as you edit the source). Once you are ready to build, the hope is that much of the building would have been done already.
- Zero Link, a feature useful for development builds, links at runtime instead of compile time, whereby only the code needed to run the application is linked in and loaded (that is, as an application runs within Xcode, each object file is linked as needed). A related feature is "Fix and Continue", courtesy of which you can (with caveats) make a change to your code and have it compiled and inserted into a running program.
8. Helping Developers Create Faster Code
Apple provides a variety of performance measurement/debugging tools for Mac OS X. Some of these are part of Mac OS X, while many others are available if you install the Apple Developer Tools. Quite expectedly, Apple encourages its own developers, as well as 3rd party developers, to create code in conformance with performance guidelines.
As mentioned earlier, perceived performance is quite important. For example, it is desirable for an application to display its menu bar and to start accepting user input as soon as possible. Reducing this initial response time might involve deferring certain initializations or reordering the "natural" sequence of events, etc.
Mac OS X Tools
Mac OS X includes several common GNU/Unix profiling/monitoring/dissecting tools, such as gprof, lsof, nm, top, and vm_stat, as well as many others:
- fs_usage Report system calls and page faults related to filesystem activity.
- heap List all malloc-allocated buffers in a process's heap.
- ktrace/kdump Enable/view (from a trace) kernel process tracing.
- leaks Search a process's memory for unreferenced malloc buffers.
- malloc_history Show a process's malloc allocations.
- otool Display various parts of an object file.
- pagestuff Display information on specified pages of a Mach-O file.
- sample Profile a process during a time interval.
- sc_usage Show system call usage statistics.
- vmmap Display virtual memory regions allocated in a process.
Refer to Apple's documentation for these tools for more details.
Performance Measurement Tools
- MallocDebug Tracks and analyzes allocated memory.
- ObjectAlloc Tracks Objective-C and Core Foundation object allocations and deallocations.
- OpenGL Profiler Tool for profiling OpenGL applications.
- PEFViewer Viewer for the contents of a PEF binary.
- QuartzDebug Visualizer for an application's screen drawing behavior -- the areas being redrawn are flashed briefly.
- Sampler Viewer for execution behavior of a program.
- Spin Control Samples applications that cause the spinning cursor to appear.
- Thread Viewer Viewer for threads and their activity.
CHUD Tools
The Computer Hardware Understanding Development (CHUD) Tools package, an optional installation, provides tools such as the following:
- BigTop A graphical equivalent to top, vm_stat, etc. Displays system statistics.
- CacheBasher Measures cache performance.
- MONster Tool for collecting and visualizing hardware level performance data.
- PMC Index Tool for searching Performance Monitoring Counter (PMC) events.
- Reggie SE A viewer (and editor) for CPU and PCI configuration registers.
- Saturn Tool for profiling applications at the function-call level, and visualizing the profile data.
- Shark Performs system-wide sampling/profiling to create a profile of the execution behavior of a program, so as to help you understand where time is being spent as your code runs.
- Skidmarks GT Processor performance benchmark (integer, floating-point, and vector benchmarks).
- SpindownHD Utility for displaying the sleep/active status of attached drives.
- acid Analyzes traces generated by amber (only the TT6E format).
- amber Traces all threads of execution in a process, recording every instruction and data access to a trace file.
- simg4 A cycle-accurate core simulator of the Motorola PowerPC G4 processor.
- simg5 A cycle-accurate core simulator of the IBM PowerPC 970 (G5) processor.
9. Journaling in HFS Plus
While modern filesystems are often journaled by design, journaling came to HFS Plus rather late. Apple retrofitted journaling onto HFS Plus as a mechanism supplementary to the filesystem's existing operation, with Panther being the first version to have journaling turned on by default.
On a journaled HFS Plus volume, file object metadata and volume structures are journaled, but not file object data (fork contents, that is). The primary purpose of the journal is to make recovery faster and more reliable, in case a volume is unmounted uncleanly, but it may improve the performance of metadata operations.
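The recovery benefit can be illustrated with a toy write-ahead journal. Everything below is a simplification invented for illustration: metadata updates are committed to the journal before the on-disk structures are touched, so that after an unclean unmount, recovery is a single replay of complete journal entries rather than a full-volume consistency check.

```python
# Toy metadata journal: commit intent records first, replay on recovery.
class Journal:
    def __init__(self):
        self.entries = []          # committed metadata transactions

    def commit(self, txn):
        # A transaction is a list of (key, value) metadata updates;
        # it reaches the journal in full before structures are modified.
        self.entries.append(list(txn))

def replay(journal, metadata):
    # Recovery: reapply every committed transaction, in order.
    for txn in journal.entries:
        for key, value in txn:
            metadata[key] = value
    return metadata

j = Journal()
j.commit([("catalog:/etc/rc", "inode 42"), ("bitmap:block 7", "used")])
# Suppose the machine crashed before the in-place updates finished;
# replaying the journal restores consistent metadata:
recovered = replay(j, metadata={})
```

Because only metadata goes through the journal (fork contents do not, as noted above), the replay is small and fast compared to scanning the whole volume.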
10. Instant-on
Apple computers do not hibernate. Rather, when they "sleep", enough devices (in particular, the dynamic RAM) are kept alive (at the cost of some battery life, if the computer is running on battery power). Consequently, upon wakeup, the user perceives instant-on behavior: a very desirable effect.
Similarly, by default the system tries to keep network connections alive even if the machine sleeps. For example, if you log in (via SSH, say) from one PowerBook to another, and both of them go to sleep, your login should stay alive within the constraints of the protocols involved.
Epilogue
Using Mac OS X as an example, we looked at a few kinds of optimizations that "OS people" (particularly those involved in creating an end-user system) adopt to improve performance. The integration of all such optimizations is perhaps even more important than the optimizations themselves. The end result should be a perceptible improvement in performance. A desirable manifestation of such improvement would be a faster workflow for the end user.
It must be noted that most, if not all, of the optimizations listed here are not specific to Mac OS X. Microsoft uses similar techniques to make Windows faster. In particular, techniques similar or equivalent to (but not limited to) BootCache, Hot File Clustering, and Working Set Detection/Maintenance are also used in Windows.