Wednesday, March 12, 2008

BSD filesystem services

4.4BSD
For user processes, all I/O is done through descriptors. The kernel uses a file descriptor to index into the descriptor table of the current process and locate a file entry (or file structure). The descriptor table is kept in the filedesc structure, a substructure of the process structure.

File descriptor (user process) -->
descriptor table (filedesc process substructure) -->
file entry (kernel list) -->
  • vnode
  • interprocess communication
  • virtual memory
The file entry provides a file type and a pointer to the underlying object for the descriptor. For data files, the file entry points to a vnode structure that references a substructure containing the filesystem-specific information. A 4.4BSD file entry may also reference a socket instead of a file. Sockets have a different file type, and the file entry points to a system block that is used for interprocess communication.
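As a rough C sketch, a file entry of this kind might look like the following; the field names are illustrative approximations, not the actual 4.4BSD declarations:

#include <sys/types.h>

struct fileops;			/* type-specific read/write/ioctl/close vector */

#define DTYPE_VNODE	1	/* descriptor refers to a file (vnode) */
#define DTYPE_SOCKET	2	/* descriptor refers to a socket */

/* Illustrative file entry; field names approximate, not the exact
 * 4.4BSD struct file declaration. */
struct file {
	short		 f_type;	/* DTYPE_VNODE or DTYPE_SOCKET */
	short		 f_count;	/* references from descriptor tables */
	short		 f_flag;	/* open mode: read, write, append, ... */
	off_t		 f_offset;	/* current offset within the file */
	struct fileops	*f_ops;		/* operations for this file type */
	void		*f_data;	/* vnode for files, socket for IPC */
};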

Special files do not have data blocks allocated on the disk; they are handled by the special-device filesystem that calls appropriate drivers to handle I/O for them.

The virtual memory system supports the mapping of files into a process's address space. Here, the file descriptor must reference a vnode that will be mapped into the user's address space, either partially or completely.
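From a user process, such a mapping is requested with mmap(2). A minimal example (the path used here is just an arbitrary readable file):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/etc/motd", O_RDONLY);	/* any readable file */
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	/* Map the whole file; the descriptor's vnode backs these pages. */
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	fwrite(p, 1, st.st_size, stdout);	/* read the file through memory */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}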

inode
In 4.3BSD, the file entries directly referenced the local filesystem inode. An index node, or inode, is a data structure that describes the contents of a file:
  • The type and access mode for the file
  • The file's owner
  • The group-access identifier
  • The time that the file was most recently read and written
  • The time that the inode was most recently updated by the system
  • The size of the file in bytes
  • The number of physical blocks used by the file (including blocks used to hold indirect pointers)
  • The number of references to the file
  • The flags that describe characteristics of the file
  • The generation number of the file (a unique number selected to be the approximate creation time of the file and assigned to the inode each time the latter is allocated to a new file; the generation number is used by NFS to detect references to deleted files)
Note: Filenames are maintained in directories, rather than in inodes, because a file may have many names, or links, and the name of a file may be large (up to 255 bytes in length).
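Schematically, a directory entry pairs a name with an inode number; the layout below is an illustrative sketch, not the exact on-disk format:

#include <stdint.h>

#define MAXNAMLEN 255

/* Illustrative directory entry: the name maps to an inode number;
 * several entries (links) may name the same inode. */
struct dir_entry {
	uint32_t d_ino;			/* inode number of the file */
	uint16_t d_reclen;		/* length of this record */
	uint8_t	 d_namlen;		/* length of the name */
	char	 d_name[MAXNAMLEN + 1];	/* name, NUL-terminated */
};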

The structure of an inode:
  • mode
  • owners(2)
  • timestamps(3)
  • size
  • direct blocks --> data
  • single indirect
  • double indirect
  • triple indirect
  • block count
  • reference count
  • flags
  • generation number

The inode contains an array of pointers to the blocks in the file. The system converts a logical block number to a physical sector number by indexing into this array with the logical block number. Because inodes are statically allocated and most files are small, the array of pointers must be small for efficient use of space. The first 12 array entries are allocated in the inode itself. For typical filesystems, this allows the first 48 or 96 Kbyte of data to be located directly via a simple indexed lookup.
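The following toy C program sketches that translation under the assumption of 12 direct pointers, a 4096-byte block, and 4-byte block pointers; it only identifies which level of lookup a logical block needs:

#include <stdio.h>

#define NDADDR	12		/* direct block pointers in the inode */
#define BSIZE	4096		/* illustrative filesystem block size */
#define NINDIR	(BSIZE / 4)	/* 4-byte pointers per indirect block */

/*
 * Classify a logical block number: blocks 0..11 are found directly in
 * the inode; larger blocks go through one, two, or three levels of
 * indirect blocks.
 */
static void
classify(long lbn)
{
	long single = NDADDR;
	long dbl = single + NINDIR;
	long triple = dbl + (long)NINDIR * NINDIR;

	if (lbn < single)
		printf("lbn %ld: direct, inode slot %ld\n", lbn, lbn);
	else if (lbn < dbl)
		printf("lbn %ld: single indirect, slot %ld\n", lbn, lbn - single);
	else if (lbn < triple)
		printf("lbn %ld: double indirect\n", lbn);
	else
		printf("lbn %ld: triple indirect\n", lbn);
}

int
main(void)
{
	classify(0);	/* within the first 48 Kbyte: direct */
	classify(11);	/* last direct block */
	classify(12);	/* first block reached via the single indirect block */
	classify(5000);	/* needs the double indirect block */
	return 0;
}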

vnode
Sun Microsystems implemented the first virtual filesystem by adding a new object-oriented layer, the VFS layer, between the file entry and the inode. A vnode used by a local filesystem refers to an inode. A vnode used by a remote filesystem refers to a protocol control block that describes the location and naming information necessary to access the remote file. Unlike the original Sun Microsystems vnode implementation, 4.4BSD allows dynamic addition of vnode operations at system boot time. As part of the booting process, each filesystem registers the set of vnode operations that it is able to support. The kernel then builds a table that lists the union of all operations supported by any filesystem. From that table, it builds an operations vector for each filesystem. Moreover, unlike most vendors' vnode implementations, which allocate a fixed number of vnodes to each filesystem type, the 4.4BSD kernel keeps a single systemwide collection of vnodes. The benefit of a single global vnode table is that the kernel memory dedicated to vnodes is used more efficiently than when several filesystem-specific collections are used. For example, when an application shifts its activity from the local filesystem to a network filesystem, vnodes simply migrate to where they are needed instead of sitting idle in a per-filesystem pool.
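A small sketch of how such an operations vector might be built, with hypothetical operation names and filesystem routines; unsupported slots fall back to a default routine:

#include <stdio.h>

/* Global indices assigned to every operation known to any filesystem.
 * Names and numbers here are illustrative, not the 4.4BSD tables. */
enum { OP_LOOKUP, OP_READ, OP_WRITE, OP_MAXOP };

typedef int (*vop_t)(void *args);

static int default_op(void *args) { (void)args; return -1; /* not supported */ }
static int ffs_read(void *args)   { (void)args; printf("ffs read\n"); return 0; }
static int ffs_write(void *args)  { (void)args; printf("ffs write\n"); return 0; }

/* What a filesystem registers at boot: the subset of ops it supports. */
struct vop_entry { int op; vop_t fn; };

/* Build the per-filesystem operations vector from its registered subset,
 * filling unsupported slots with a default routine. */
static void
build_vector(vop_t vec[OP_MAXOP], const struct vop_entry *ents, int n)
{
	for (int i = 0; i < OP_MAXOP; i++)
		vec[i] = default_op;
	for (int i = 0; i < n; i++)
		vec[ents[i].op] = ents[i].fn;
}

int
main(void)
{
	struct vop_entry ffs_ops[] = {
		{ OP_READ, ffs_read }, { OP_WRITE, ffs_write },
	};
	vop_t ffs_vector[OP_MAXOP];

	build_vector(ffs_vector, ffs_ops, 2);
	ffs_vector[OP_READ](NULL);	/* dispatch through the vector */
	ffs_vector[OP_LOOKUP](NULL);	/* falls through to default_op */
	return 0;
}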

In 4.3BSD, the local filesystem code provided both the semantics of hierarchical filesystem naming and the details of on-disk storage management. To enable experimentation with other disk-storage techniques without having to reproduce the entire naming semantics, 4.4BSD splits the naming and storage code into separate modules. At the vnode layer, there is one set of operations defined for hierarchical filesystem naming and a separate set defined for storage of variable-sized objects in a flat name space. About 60% of the traditional filesystem code became the name-space management, and the remaining 40% became the code implementing the on-disk file storage.

Buffer Cache
Another important service provided by the filesystem-independent layer is the management of the kernel's buffer space. The buffer cache has two tasks. One is to manage the memory that buffers data being transferred to and from the disk or network. The other, more important, task is to act as a cache of recently used blocks. Each buffer is composed of two parts. The first part is the buffer header, which contains information used to find the buffer and to describe its contents. The content information includes:
  • hash link
  • free-list link
  • flags (buffer status)
  • vnode pointer
  • file offset
  • byte count
  • buffer size
  • buffer pointer --> buffer contents MAXBSIZE (64 Kbyte)
The second part is the actual buffer contents. Rather than the header being prepended to the data area of the buffer, as is done with mbufs, the data areas are maintained separately. The buffer size is always at least as big as the data block that the buffer contains. Data are maintained separately from the header to allow easy manipulation of buffer sizes via the page-mapping hardware. If the headers were prepended, either each header would have to occupy a page by itself, or the kernel would have to avoid remapping buffer pages that contain headers.
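A rough C rendering of a buffer header carrying the fields listed above; the names are illustrative rather than the exact 4.4BSD struct buf:

#include <stddef.h>
#include <sys/types.h>

#define MAXBSIZE	65536	/* largest buffer: 64 Kbyte */

struct vnode;			/* opaque here */

/* Illustrative buffer header; the data area is referenced through b_data
 * rather than being prepended to the header. */
struct buf {
	struct buf	*b_hash;	/* hash-chain link for lookup */
	struct buf	*b_freelist;	/* free-list link */
	int		 b_flags;	/* buffer status (busy, dirty, ...) */
	struct vnode	*b_vp;		/* vnode whose data this buffer holds */
	off_t		 b_offset;	/* starting offset within the file */
	size_t		 b_bcount;	/* valid bytes in the buffer */
	size_t		 b_bufsize;	/* allocated size, up to MAXBSIZE */
	char		*b_data;	/* pointer to the separate data area */
};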

Buffer management
On a typical 4.4BSD system, over 85% of the implied disk or network transfers can be skipped because the requested block already resides in the buffer cache. Depending on the available memory, a system is configured with from 100 to 1000 buffers. The sizes of buffer requests from a filesystem range from 512 bytes up to 65,536 bytes. To allow the system to adapt efficiently to the changing needs for buffer sizes and numbers, the kernel allocates MAXBSIZE bytes of virtual memory to each buffer, but the address space is not fully populated with physical memory. Initially, each buffer is assigned 4096 bytes of physical memory. As smaller buffers are allocated, they give up their unused physical memory to buffers that need to hold more than 4096 bytes.

In earlier versions of BSD and in most other versions of UNIX, buffers were identified by their physical disk block number. 4.4BSD changes this convention to identify buffers by their logical block number within the file. For filesystems such as NFS, only a logical block number can be used, as there is no way for the local client to compute the physical block address of a logical block on the server. The savings are also considerable for a local filesystem, where the computation may require traversing up to three indirect blocks. The drawback to using a logical-address cache is that it is difficult to detect aliases between a block belonging to a local file and the same block accessed through the block device for the disk, whose logical-block address is the same as its physical-block address.
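The cache key thus becomes the pair (vnode, logical block number). A self-contained toy lookup, with stand-in types and a simplistic hash, might look like this:

#include <stddef.h>
#include <stdint.h>

#define BUFHASH_SIZE 128

/* Minimal stand-in types so this sketch is self-contained. */
struct vnode { int v_id; };
struct buf {
	struct buf   *b_hash;		/* hash-chain link */
	struct vnode *b_vp;		/* vnode the block belongs to */
	long	      b_lblkno;		/* logical block number within the file */
};

static struct buf *bufhash[BUFHASH_SIZE];

/* Hash on the (vnode, logical block) pair rather than on a physical disk
 * address, so the same lookup works for local filesystems and for NFS. */
static unsigned
bufkey(const struct vnode *vp, long lblkno)
{
	return (unsigned)(((uintptr_t)vp >> 6) + (uintptr_t)lblkno) % BUFHASH_SIZE;
}

static struct buf *
buf_lookup(struct vnode *vp, long lblkno)
{
	struct buf *bp;

	for (bp = bufhash[bufkey(vp, lblkno)]; bp != NULL; bp = bp->b_hash)
		if (bp->b_vp == vp && bp->b_lblkno == lblkno)
			return bp;	/* hit: skip the disk or network transfer */
	return NULL;			/* miss: caller must read the block */
}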

Stackable filesystems
The early vnode interface was simply an object-oriented interface to an underlying filesystem. The vnode interface implemented vnode operations as indirect function calls. As the demand grew for new filesystem features, the stacking idea was implemented in the 4.4BSD system. The bottom of a vnode stack tends to be a disk-based filesystem, whereas the layers used above it typically transform their arguments and pass on those arguments to a lower layer. Stacking uses the mount command to create new layers: the mount command pushes a new layer onto a vnode stack.

When a file access occurs to a vnode in the stack, that vnode has several options:
  • Do the requested operations and return a result
  • Pass the operation without change to the next-lower vnode on the stack. When the operation returns from the lower vnode, it may modify the results, or simply return them.
  • Modify the operands provided with the request, then pass it to the next-lower vnode. When the operation returns from the lower vnode, it may modify the results, or simply return them.
If an operation is passed to the bottom of the stack without any layer taking action on it, then the interface will return the error "operation not supported."

To implement stacking (bypassing operations to lower layers and adding new operations to the system at boot time), the kernel places the vnode operation name and its arguments into an argument structure. That structure is then passed as the single parameter to the vnode operation; thus, every call on a vnode operation has exactly one parameter, a pointer to the argument structure. If the vnode operation is unknown to a layer, a generic bypass routine can call the same operation in the next-lower layer, passing it the same argument structure that it received. In addition, the first argument of every operation is a pointer to the vnode operation description, which gives a bypass routine information about the operation, including the operation's name and the location of the operation's parameters.
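A schematic sketch of that calling convention, with hypothetical names: every operation takes a single argument structure, and a generic bypass routine forwards it down the stack:

struct vnode;
struct vop_generic_args;

typedef int vop_fn(struct vop_generic_args *);

/* Description of one vnode operation: its name and its slot in every
 * filesystem's operations vector (field names here are hypothetical). */
struct vnodeop_desc {
	int	    vdesc_offset;	/* index into the operations vector */
	const char *vdesc_name;		/* e.g. "read", "write", "lookup" */
};

/* Every operation receives exactly one parameter: a pointer to an
 * argument structure whose first member describes the operation. */
struct vop_generic_args {
	struct vnodeop_desc *a_desc;
	struct vnode	    *a_vp;	/* operand vnode (simplified to one) */
};

struct vnode {
	vop_fn	    **v_ops;	/* this layer's operations vector */
	struct vnode *v_lower;	/* next-lower vnode in the stack, if any */
};

/* Generic bypass: a layer that does not implement an operation rewrites the
 * operand to the next-lower vnode and re-invokes the same operation there,
 * handing down the very argument structure it received. */
static int
vop_bypass(struct vop_generic_args *ap)
{
	struct vnode *lower = ap->a_vp->v_lower;

	if (lower == NULL)
		return -1;	/* bottom of the stack: operation not supported */
	ap->a_vp = lower;
	return lower->v_ops[ap->a_desc->vdesc_offset](ap);
}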