SUSE Linux Enterprise Server 10 SP1 EAL4 High-Level Design Version 1.2.1...
Novell, the Novell logo, the N logo, and SUSE are registered trademarks of Novell, Inc. in the United States and other countries. IBM, IBM logo, BladeCenter, eServer, iSeries, i5/OS, OS/400, PowerPC, POWER3, POWER4, POWER4+, POWER5+, pSeries, S390, System p, System z, xSeries, zSeries, zArchitecture, and z/VM are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
This document describes the High Level Design (HLD) for the SUSE® Linux® Enterprise Server 10 Service Pack 1 operating system. For ease of reading, this document uses the phrase SUSE Linux Enterprise Server and the abbreviation SLES as a synonym for SUSE Linux Enterprise Server 10 SP1.
2 System Overview The Target of Evaluation (TOE) is SUSE Linux Enterprise Server (SLES) running on an IBM eServer host computer. The SLES product is available on a wide range of hardware platforms. This evaluation covers the SLES product on the IBM eServer System x™, System p™, and System z™, and eServer 326 (Opteron).
Control List (ACL) for the named objects under its control. This chapter documents the SUSE Linux Enterprise Server and IBM eServer product histories, provides an overview of the TOE system, and identifies the portion of the system that constitutes the TOE Security Functions (TSF).
The Common Criteria for Information Technology Security Evaluation [CC] and the Common Methodology for Information Technology Security Evaluation [CEM] demand breaking the TOE into logical subsystems that can be either (a) products, or (b) logical functions performed by the system. The approach in this section is to break the system into structural hardware and software subsystems that include, for example, pieces of hardware such as planars and adapters, or collections of one or more software processes such as the base kernel and kernel modules.
The SLES kernel includes the base kernel and separately-loadable kernel modules and device drivers. (Note that a device driver can also be a kernel module.) The kernel consists of the bootable kernel image and its loadable modules. The kernel implements the system call interface, which provides system calls for file management, memory management, process management, networking, and other TSF (logical subsystems) functions addressed in the Functional Descriptions chapter of this document.
LAN segment to a host on a third LAN segment, and from there are routed to the target host. The number of hops from the client to the server is irrelevant to the security provided by the system, and is transparent to the user.
Also, note that the network client and server can be on the same host system. For example, when User B uses ssh to log in to Host 2, the user's client process opens an ssh connection to the ssh server process on Host 2.
Refer to the appropriate command man pages for detailed information about how to set up and maintain users and groups. 2.2.6 TSF interfaces The TSF interfaces include local interfaces provided by each host computer, and the network client-server interfaces provided by pairs of host computers.
The local TSF interfaces provided by an individual host computer include:
• Files that are part of the TSF database that define the configuration parameters used by the security functions.
• System calls made by trusted and untrusted programs to the privileged kernel-mode software. As...
The SLES operating system is distributed as a collection of packages. A package can include programs, configuration data, and documentation for the package. Analysis is performed at the file level, except where a particular package can be treated collectively. A file is included in the TSF for one or more of the following reasons:
• It contains code, such as the kernel, kernel modules, and device drivers, that runs in a privileged...
3 Hardware architecture The TOE includes the IBM System x, System p, System z, and eServer 326. This section describes the hardware architecture of these eServer systems. For more detailed information about Linux support and resources for the entire eServer line, refer to http://www.ibm.com/systems/browse/linux.
3.2.1 System p hardware overview The IBM System p servers offer a range of systems, from entry level to enterprise class. The high-end systems offer support for gigabytes of memory, large RAID configurations of SCSI and fiber channel disks, and options for high-speed networking. The IBM System p servers are equipped with a real-time hardware clock.
The System z hardware runs z/Architecture™ and the S/390™ Enterprise Server Architecture (ESA) software. The IBM System z server is equipped with a real-time hardware clock. The clock is powered by a small battery, and continues to tick even when the system is switched off. The real-time clock maintains reliable time for the system.
3.4.1 eServer 326 hardware overview The IBM eServer 326 systems offer support for up to two AMD Opteron processors, up to twelve GB of memory, hot-swap SCSI or IDE disk drives, RAID-1 mirroring, and options for high-speed networking. The IBM eServer 326 server is equipped with a real-time hardware clock. The clock is powered by a small battery and continues to tick even when the system is switched off.
processor extensions are activated, allowing the processor to operate in one of two sub-modes of LMA. These are the 64-bit mode and the compatibility mode.
• 64-bit mode: In 64-bit mode, the processor supports 64-bit virtual addresses, a 64-bit instruction pointer, 64-bit general-purpose registers, and eight additional general-purpose registers, for a total of 16 general-purpose registers.
4 Software architecture This chapter summarizes the software structure and design of the SLES system and provides references to detailed design documentation. The following subsections describe the TOE Security Functions (TSF) software and the TSF databases for the SLES system. The descriptions are organized according to the structure of the system and describe the SLES kernel that controls access to shared resources from trusted (administrator) and untrusted (user) processes.
Figure 4-1: Levels of Privilege System x: The System x servers are powered by Intel processors. Intel processors provide four execution modes, identified with processor privilege levels 0 through 3. The highest privilege level execution mode corresponds to processor privilege level 0; the lowest privilege level execution mode corresponds to processor privilege level 3.
• When the processor is in kernel mode, the program has hardware privilege because it can execute certain privileged instructions that are not available in user mode.
Thus, any code that runs in kernel mode executes with hardware privileges. Software that runs with hardware privileges includes:
• The base SLES kernel.
4.1.2.1 The DAC model allows the owner of the object to decide who can access that object, and in what manner. Like any other access control model, DAC implementation can be explained by which subjects and objects are under the control of the model, security attributes used by the model, access control and attribute transition rules, and the override (software privilege) mechanism to bypass those rules.
4.1.2.3 Programs with software privilege
Examples of programs running with software privilege are:
• Programs that are run by the system, such as the cron and init daemons.
• Programs that are run by trusted administrators to perform system administration.
• Programs that run with privileged identity by executing setuid programs.
The concept of breaking the TOE product into logical subsystems is described in the Common Criteria. These logical subsystems are the building blocks of the TOE, and are described in the Functional Descriptions chapter of this paper. They include logical subsystems and trusted processes that implement security functions.
4.2.1.1 Logical components The kernel consists of logical subsystems that provide different functionalities. Even though the kernel is a single executable program, the various services it provides can be broken into logical components. These components interact to provide specific functions. Figure 4-3 schematically describes logical kernel subsystems, their interactions with each other, and with the system call interface available from user space.
• Audit subsystem: This subsystem implements functions related to recording of security-critical events on the system. Implemented functions include those that trap each system call to record security-critical events and those that implement the collection and recording of audit data.
4.2.1.2 Execution components
The execution components of the kernel can be divided into three components: base kernel, kernel threads,...
The crontab file and cron daemon are the client-server pair that allow the execution of commands on a recurring basis at a specified time.
• The init program is the userspace process that is ancestor to all other userspace processes. It...
• The crontab program is the program used to install, deinstall, or list the tables used to drive the cron daemon. Users can have their own crontab files that set up the time and frequency of execution, as well as the command or script to execute.
• The gpasswd command administers the /etc/group file and /etc/gshadow file if...
• The chfn command allows users to change their finger information. The finger command displays that information, which is stored in the /etc/passwd file.
• The date command is used to print or set the system date and time. Only an administrative user...
This section briefly describes the functional subsystems that implement the required security functionalities and the logical subsystems that are part of each of the functional subsystems. The subsystems are structured into those implemented within the SLES kernel, and those implemented as trusted processes.
• gpasswd
• chage
• useradd, usermod, userdel
• groupadd, groupmod, groupdel
• chsh
• chfn
• openssl
4.4.5 User-level audit subsystem
This subsystem contains the portion of the audit system that lies outside the kernel. It contains the auditd trusted process, which reads audit records from the kernel buffer and transfers them to on-disk audit logs; the ausearch trusted search utility; the autrace trace utility; the audit configuration file; and the audit libraries.
5 Functional descriptions The kernel structure, its trusted software, and its Target of Evaluation (TOE) Security Functions (TSF) databases provide the foundation for the descriptions in this chapter. File and I/O management The file and I/O subsystem is a management system for defining objects on secondary storage devices. The file and I/O subsystem interacts with the memory subsystem, the network subsystem, the inter-process communication (IPC) subsystem, the process subsystem, and the device drivers.
In order to shield user programs from the underlying details of different types of disk devices and disk-based file systems, the SLES kernel provides a software layer that handles all system calls related to a standard UNIX file system. This common interface layer, called the Virtual File System, interacts with disk-based file systems whose physical I/O devices are managed through device special files.
Figure 5-3: ext3 and CD-ROM file systems after mounting The root directory is contained in the root file system, which is ext3 in this TOE. All other file systems can be mounted on subdirectories of the root file system. The VFS allows programs to perform operations on files without having to know the implementation of the underlying disk-based file system.
inode: Stores general information about a specific file, such as file type and access rights, file owner, group owner, length in bytes, operations vector, time of last file access, time of last file write, and time of last inode change. An inode is associated with each file and is described in the kernel by a struct inode data structure.
Figure 5-5: VFS pathname translation and access control checks...
5.1.1.2 open() The following describes the call sequence of an open() call to create a file: 1. Call the open() system call with a relative pathname and flags to create a file for read and write. 2. open() calls open_namei(), which ultimately derives the dentry for the directory in which the file is being created.
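Seen from user space, step 1 of this sequence might look like the following minimal sketch. The helper name create_rw_file and the permission mode 0600 are assumptions made for illustration; they do not come from the TOE sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) a file by relative pathname for
 * reading and writing. Returns the file descriptor, or -1 on error. */
int create_rw_file(const char *relative_path)
{
    /* O_CREAT asks the kernel to create the file if it does not exist.
     * The mode 0600 (owner read/write) is an assumption for this sketch
     * and is further restricted by the umask of the process. */
    return open(relative_path, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
}
```

Because the pathname is relative, the kernel resolves it starting from the current working directory of the calling process, and the caller needs write and search permission on that directory for the creation to succeed.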
5.1.1.3 write() Another example of a file system operation is a write() system call to write to a file that was opened for writing. The write() system call in VFS is very straightforward, because access checks have already been performed by open(). The following list shows the call sequence of a write() call: 1.
• Unbindable Mount: This mount does not forward or receive propagation. This mount type cannot be bind-mounted, and it is not valid to move it under a shared mount.
• Slave Mount: A slave mount remains tied to its parent mount and receives new mount or unmount...
5.1.2.1.1.1 Access Control Lists ACLs provide a way of extending directory and file access restrictions beyond the traditional owner, group, and world permission settings. For more details about the ACL format, refer to Discretionary Access Control, Section 5.1.5, of this document, and section 6.2.4.3 of the SLES Security Target document. EAs are stored on disk blocks allocated outside of an inode.
• ext3_group_desc: Disk blocks are partitioned into groups. Each group has its own group descriptor. ext3_group_desc stores information such as the block number of the inode bitmap, and the block number of the block bitmap.
• ext3_inode: The on-disk counterpart of the inode structure of VFS, ext3_inode stores...
Figure 5-8: New data blocks are allocated and initialized for an ext3 field...
Figure 5-9 shows how for a file on the ext3 file system, inode_operations map to ext3_file_inode_operations. Figure 5-9: Access control on ext3 file system Similarly, for directory, symlink, and special-file types of objects, inode operations map to ext3_dir_inode_operations, ext3_symlink_inode_operations, and ext3_special_inode_operations, respectively.
from the s_root field of the superblock, and then invokes isofs_find_entry() to retrieve the object from the CD-ROM. On a CD-ROM file system, inode_operations map to isofs_dir_inode_operations. Figure 5-10: File lookup on CD-ROM file system 5.1.3 Pseudo file systems 5.1.3.1 procfs The proc file system is a special file system that allows system programs and administrators to manipulate the...
Since VM is volatile in nature, tmpfs data is not preserved between reboots. Hence this file system is used to store short-lived temporary files. An administrator is allowed to specify the memory placement policies (the policy itself and the preferred nodes to be allocated) for this file system. 5.1.3.3 sysfs sysfs is an in-memory file system, which acts as repository for system and device status information,...
5.1.3.6 binfmt_misc binfmt_misc provides the ability to register additional binary formats to the kernel without compiling an additional module or kernel. Therefore, binfmt_misc needs to know magic numbers at the beginning, or the filename extension of the binary. binfmt_misc works by maintaining a linked list of structs that contain a description of a binary format, including a magic number with size, or the filename extension, offset and mask, and the interpreter name.
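The lookup described above can be pictured with a simplified user-space sketch. The structure layout, field names, and function names below are hypothetical and are not the kernel's own; the sketch only shows a walk over a linked list of format descriptions, comparing either masked magic bytes at an offset or the filename extension.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified description of one registered binary format. */
struct fmt_entry {
    const char *interpreter;      /* program invoked for a matching file */
    const unsigned char *magic;   /* magic bytes, or NULL to match by extension */
    const unsigned char *mask;    /* optional per-byte mask, or NULL for 0xff */
    size_t offset;                /* position of the magic bytes in the file */
    size_t size;                  /* number of magic bytes */
    const char *extension;        /* filename extension, used when magic is NULL */
    struct fmt_entry *next;       /* next entry in the linked list */
};

/* Return 1 if name ends in ".<ext>", 0 otherwise. */
int has_extension(const char *name, const char *ext)
{
    const char *dot = strrchr(name, '.');
    return dot != NULL && strcmp(dot + 1, ext) == 0;
}

/* Walk the list and return the interpreter of the first matching entry,
 * or NULL if no registered format matches the file. */
const char *find_interpreter(const struct fmt_entry *head, const char *name,
                             const unsigned char *buf, size_t len)
{
    const struct fmt_entry *e;
    size_t i;

    for (e = head; e != NULL; e = e->next) {
        if (e->magic == NULL) {
            if (e->extension != NULL && has_extension(name, e->extension))
                return e->interpreter;
            continue;
        }
        if (e->offset + e->size > len)
            continue;                      /* file too short for this magic */
        for (i = 0; i < e->size; i++) {
            unsigned char m = e->mask ? e->mask[i] : 0xff;
            if ((buf[e->offset + i] & m) != (e->magic[i] & m))
                break;                     /* masked bytes differ */
        }
        if (i == e->size)
            return e->interpreter;
    }
    return NULL;
}
```

A caller would read the first bytes of the file into buf and pass them together with the filename; the first entry in list order wins, so the order of registration matters in this sketch.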
chown() system call. The owner and the root user are allowed to define and change access rights for an object. The following subsection looks at the kernel functions implementing the access checks. The function used depends on the file system; for example, vfs_permission() invokes permission(), which then calls specific *_permission() routines based on the inode’s inode operation vector i_op.
• If the process is neither the owner nor a member of an appropriate group, and the permission bits for world allow the type of access requested, then the subject is permitted access.
• If none of the conditions above are satisfied, and the effective UID of the process is not zero, then the...
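The owner, group, and world selection in the permission bit algorithm can be sketched in user-space C as follows. This is a simplified illustration under stated assumptions, not the kernel routine: the function name and WANT_* constants are hypothetical, only the primary effective group is compared (supplementary groups are ignored), and the override for a process with an effective UID of zero is deliberately left out.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical request flags, using the traditional rwx bit values. */
#define WANT_READ  4
#define WANT_WRITE 2
#define WANT_EXEC  1

/* Return 1 if the mode bits permit the requested access, 0 otherwise.
 * Exactly one class of bits (owner, group, or world) is consulted. */
int mode_bits_permit(mode_t mode, uid_t file_uid, gid_t file_gid,
                     uid_t euid, gid_t egid, int want)
{
    int granted;

    if (euid == file_uid)
        granted = (mode >> 6) & 7;   /* owner bits */
    else if (egid == file_gid)
        granted = (mode >> 3) & 7;   /* group bits */
    else
        granted = mode & 7;          /* world bits */

    return (granted & want) == want;
}
```

Note that the classes are mutually exclusive: in this sketch an owner whose owner bits deny access is refused even when the world bits would allow it.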
5.1.5.2.3 ACL permissions An ACL entry can define separate permissions for read, write, and execute or search. 5.1.5.2.4 Relationship to file permission bits An ACL contains exactly one entry for each of the ACL_USER_OBJ, ACL_GROUP_OBJ, and ACL_OTHER types of tags, called the required ACL entries. An ACL can have between zero and a defined maximum number of entries of the ACL_GROUP and ACL_USER types.
5.1.5.2.8 ACL enforcement The ext3_permission() function uses ACLs to enforce DAC. The algorithm goes through the following steps: 1. Performs checks such as “no write access if read-only file system” and “no write access if the file is immutable.” 2. For ext3 file systems, the kernel calls the ext3_get_acl() to get the ACL corresponding to the object.
file by adding ACLs with the setfacl command. For example, the following command allows a user named john read access to this file, even if john does not belong to the root group.
# setfacl -m user:john:4,mask::4 /aclfile
The ACL on the file will look like:
# owner: root
# group: root
user::rw-...
application, the I/O scheduler is considered an important kernel component in the I/O path. SLES includes four I/O scheduler options to optimize system performance. 5.1.7.1 Deadline I/O scheduler The deadline I/O scheduler available in the Linux 2.6 kernel incorporates a per-request expiration-based approach, and operates on five I/O queues.
requests. This capability makes it behave similarly to the Anticipatory I/O scheduler. The I/O priorities of processes, which are derived from their CPU priority, are also considered. 5.1.7.4 Noop I/O scheduler The noop I/O scheduler is a minimal I/O scheduler that performs basic merging and sorting.
5.1.8.4 Tasklets Tasklets are dynamically linked and built on top of softirq mechanisms. Tasklets differ from softirqs in that a tasklet is always serialized with respect to itself. In other words, a tasklet cannot be executed by two CPUs at the same time.
Process control and management A process is an instance of a program in execution. Process management consists of creating, manipulating, and terminating a process. Process management is handled by the process management subsystems of the kernel. The kernel interacts with the memory subsystem, the network subsystem, the file and I/O subsystem, and the inter-process communication (IPC) subsystem.
The SLES kernel maintains information about each process in a process descriptor of type task_struct. Each process descriptor contains information such as the run state of the process, its address space, the list of open files, the process priority, which files the process is allowed to access, and security-relevant credentials fields including the following: uid and gid, which describe the user ID and group ID of a process.
Figure 5-12: The task structure The kernel maintains a circular doubly-linked list of all existing process descriptors. The head of the list is the init_task descriptor referenced by the first element of the task array. The init_task descriptor belongs to process 0 or the swapper, the ancestor of all processes. 5.2.2 Process creation and destruction The SLES kernel provides these system calls for creating a new process: clone(), fork(), and...
5.2.2.2.4 setresuid() and setresgid() These set the real user and group ID, the effective user and group ID, and the saved set-user and group ID of the current process. Normal user processes (that is, processes with real, effective, and saved user IDs that are nonzero) may change the real, effective, and saved user and group IDs to either the current uid and gid, the current effective uid and gid, or the current saved uid and gid.
Hyperthreading is a technique in which a single physical processor masquerades at the hardware level as two or more processors. It enables multi-threaded server software applications to execute threads in parallel within each individual server processor, thereby improving transaction rates and response times.
Figure 5-14: Hyperthreaded scheduling For more information about hyperthreading, refer to http://www.intel.com/technology/hyperthread/. 5.2.6 Kernel preemption The kernel preemption feature has been implemented in the Linux 2.6 kernel. This should significantly lower latency times for user-interactive applications, multimedia applications, and the like. This feature is especially good for real-time systems and embedded devices.
The following code snippet demonstrates the per-CPU data structure problem in an SMP system:
    int arr[NR_CPUS];
    arr[smp_processor_id()] = i;
    /* kernel preemption could happen here */
    j = arr[smp_processor_id()];
    /* i and j are not equal as smp_processor_id() may not be the same */
In this situation, if kernel preemption had happened at the specified point, the task would have been assigned to some other processor upon re-schedule, in which case smp_processor_id() would have returned a different value.
5.3.1 Pipes Pipes allow the transfer of data in a FIFO manner. The pipe() system call creates unnamed pipes. Unnamed pipes are only accessible to the creating process and its descendants through file descriptors. Once a pipe is created, a process may use the read() and write() VFS system calls to access it. In order to allow access from the VFS layer, the kernel creates an inode object and two file objects for each pipe.
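As a user-space illustration of the FIFO behavior described above, the following minimal sketch creates an unnamed pipe, writes a short message to its write end, and reads it back from its read end within one process. The helper name pipe_round_trip is hypothetical; the sketch assumes the message is small enough to fit in the pipe buffer so that write() does not block.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send msg through a fresh unnamed pipe and read it
 * back into out. Returns the number of bytes read, or -1 on error. */
ssize_t pipe_round_trip(const char *msg, char *out, size_t out_size)
{
    int fds[2];                  /* fds[0] is the read end, fds[1] the write end */
    size_t len = strlen(msg);
    ssize_t n = -1;

    if (pipe(fds) != 0)
        return -1;

    if (write(fds[1], msg, len) == (ssize_t)len) {
        /* Close the write end first so that read() sees end-of-file
         * instead of blocking once the data has been consumed. */
        close(fds[1]);
        fds[1] = -1;
        n = read(fds[0], out, out_size);
    }

    close(fds[0]);
    if (fds[1] >= 0)
        close(fds[1]);
    return n;
}
```

The two descriptors returned by pipe() correspond to the two file objects mentioned above; in practice the descriptors are usually shared with a child process created by fork() rather than used within a single process.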
pipe_inode_info: Contains generic state information about the pipe with fields such as base (which points to the kernel buffer), len (which represents the number of bytes written into the buffer and yet to be read), wait (which represents the wait queue), and start (which points to the read position in the kernel buffer).
The inode allocation routine of the disk-based file system does the allocation and initialization of the inode object; thus, object reuse is handled by the disk-based file system. 5.3.2.2 FIFO open A call to the open() VFS system call performs the same operation as it does for device special files. The regular DAC checks performed when the FIFO inode is read are identical to the access checks performed for other file system objects, such as files and directories.
• ipc_id: The ipc_id data structure describes the security credentials of an IPC resource with the p field, which is a pointer to the credential structure of the resource.
• kern_ipc_perm: The kern_ipc_perm data structure is a credential structure for an IPC...
5.3.3.3.3 msgget() This function is invoked to create a new message queue, or to get a descriptor of an existing queue based on a key. The newly created credentials of the message queue are initialized from the credentials of the creating process.
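The initialization of queue credentials from the creating process can be observed from user space. The following minimal sketch, which assumes System V IPC is available on the host, creates a private queue, reads back its permission structure with msgctl(), and compares the creator IDs with the effective IDs of the caller. The helper name and the mode 0600 are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the creator credentials of a newly
 * created message queue match the effective UID and GID of the caller,
 * 0 if they do not, and -1 on error. The queue is removed before return. */
int queue_creator_matches_caller(void)
{
    struct msqid_ds info;
    int match;
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (id < 0)
        return -1;

    if (msgctl(id, IPC_STAT, &info) != 0) {
        msgctl(id, IPC_RMID, NULL);
        return -1;
    }

    match = info.msg_perm.cuid == geteuid() &&
            info.msg_perm.cgid == getegid();

    msgctl(id, IPC_RMID, NULL);   /* do not leave the queue behind */
    return match;
}
```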
5.3.3.4.4 semctl() This function is invoked to set attributes, query status, or delete a semaphore. A semaphore is not deleted until the process waiting for a semaphore has received it. DAC is performed by invoking the ipcperms() function. 5.3.3.5 Shared memory regions Shared memory regions allow two or more processes to access common data by placing the processes in an IPC shared memory region.
Processes that communicate using sockets use a client-server model. A server provides a service, and clients make use of that service. A server that uses sockets first creates a socket and then binds a name to it. An Internet domain socket has an IP port address bound to it. The registered port numbers are listed in /etc/services.
For an Internet domain socket, the address of the server is its IP address and its port number. Sockets are created using the socket() system call. Depending on the type of socket, either UNIX domain or Internet domain, the socket family operations vector invokes either unix_create() or inet_create().
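From the point of view of an application, the choice between the two creation paths is made solely by the address family passed to socket(). A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a stream socket in the given address family
 * (AF_UNIX for UNIX domain, AF_INET for Internet domain).
 * Returns the socket descriptor, or -1 on error. */
int open_stream_socket(int family)
{
    /* Protocol 0 selects the default stream protocol of the family. */
    return socket(family, SOCK_STREAM, 0);
}
```

Calling open_stream_socket(AF_UNIX) and open_stream_socket(AF_INET) yields two descriptors that are used through the same system calls afterwards, even though the kernel dispatched their creation to different family-specific routines.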
For more information, see the TCP/IP Tutorial and Technical Overview IBM Redbook by Adolfo, John & Roland. It is available at http://www.redbooks.ibm.com/abstracts/gg243376.html...
The application layer consists of all the various application clients and servers, such as the Samba file and print server, the Apache web server, and others. Some of the application-level protocols include Telnet, for remote login; FTP, for file transfer; and SMTP, for mail transfer.
5.4.2 Transport layer protocols The transport layer protocols supported by the SLES kernel are TCP and UDP. 5.4.2.1 TCP is a connection-oriented, end-to-end, reliable protocol designed to fit into a layered hierarchy of protocols that support multi-network applications. TCP provides for reliable IPC between pairs of processes in host computers attached to distinct but interconnected computer communication networks.
The following section introduces Internet Protocol Version 6 (IPv6). For additional information about referenced socket options and advanced IPv6 applications, see RFC 3542. Internet Protocol Version 6 (IPv6) was designed to improve upon and succeed Internet Protocol Version 4 (IPv4). IPv4 addresses consist of 32 bits.
5.4.3.2.3 Flow Labels The IPv6 header has a field in which to enter a flow label. This provides the ability to identify packets for a connection or a traffic stream for special processing. 5.4.3.2.4 Security The IPv6 specifications mandate IP security. IP security must be included as part of an IPv6 implementation. IP security provides authentication, data integrity, and data confidentiality to the network through the use of the Authentication and Encapsulating Security Payload extension headers.
The phrase data integrity implies that the data received is as it was when sent. It has not been tampered with, altered, or impaired in any way. Data authentication ensures that the sender of the data is really who you believe it to be. Without data authentication and integrity, someone can intercept a datagram and alter the contents to reflect something other than what was sent, as well as who sent it.
In tunnel mode, the entire IP datagram is encapsulated, protecting the entire IP datagram. An IP Packet with tunnel mode AH 5.4.3.4.1.2 Encapsulating Security Payload Protocol (ESP) The Encapsulating Security Payload (ESP) header is defined in RFC 2406. Besides data confidentiality, ESP also provides authentication and integrity as an option.
5.4.3.4.1.3 Security Associations RFC 2401 defines a Security Association (SA) as a simplex or one-way connection that affords security services to the traffic it carries. Separate SAs must exist for each direction. IPSec stores the SAs in the Security Association Database (SAD), which resides in the Linux kernel. 5.4.3.4.1.4 Security Policy A Security Policy is a general plan that guides the actions taken on an IP datagram.
5.4.3.4.1.8 Cryptographic subsystem IPSec uses the cryptographic subsystem described in this section. The cryptographic subsystem performs several cryptography-related tasks, including Digital Signature Algorithm (DSA) signature verification, in-kernel key management, arbitrary-precision integer arithmetic, and verification of kernel module signatures. This subsystem was initially designed as a general-purpose mechanism, preserving the design ideas of simplicity and flexibility, to serve security-relevant network and file system services such as encrypted files and file systems, network file system security, strong file system integrity, and other kernel networking services where cryptography was required.
The other end, on the client, creates its own communication endpoint and actively connects to the server endpoint at its known address. Figure 5-19 shows the steps taken by both server and client to establish a connection, along with the system calls used at each stage.
The following subsections describe access control and object reuse handling associated with establishing a communications channel. 5.4.5.1 socket() socket() creates an endpoint of communication using the desired protocol type. Object reuse handling during socket creation is described in Section 5.3.5. socket() may perform additional access control checks by calling the security_socket_create() and security_socket_post_create() LSM hooks, but the SLES kernel does not use these LSM hooks.
Figure 5-21: bind() function for UNIX domain TCP socket Similarly, for UNIX domain sockets, bind() invokes unix_bind(). unix_bind() creates an entry in the regular ext3 file system space. This process of creating an entry for a socket in the regular file system space has to undergo all file system access control restrictions.
5.4.5.6 Generic calls read(), write() and close(): read(), write() and close() are generic I/O system calls that operate on a file descriptor. Depending on the type of object, whether regular file, directory, or socket, appropriate object-specific functions are invoked. 5.4.5.7 Access control DAC mediation is performed at bind() time.
• A system call interface provides restricted access to user processes. This interface allows user processes to allocate and free storage, and also to perform memory-mapped file I/O.
Figure 5-23: Memory subsystem and its interaction with other subsystems
This section highlights the implementation of the System Architecture requirements of a) allowing the kernel software to protect its own memory resources and b) isolating memory resources of one process from those of another, while allowing controlled sharing of memory resources between user processes.
5.5.1 Four-Level Page Tables Before the current implementation of four-level page tables, the kernel implemented a three-level page table structure for all architectures. The previous three-level structure consisted, from top to bottom, of the page global directory (PGD), the page middle directory (PMD), and the page table entry (PTE). In that implementation, the PMD was absent on systems that present only two-level page tables, so the kernel was able to treat all architectures as if they possessed three-level page tables.
Figure 5-25: New page-table implementation: the four-level page-table architecture The creation and insertion of a new level, the page upper directory (PUD) level, immediately below the top-level PGD directory aims to maintain portability and transparency, since all architectures have an active PGD at the top of the hierarchy and an active PTE at the bottom.
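To make the four levels concrete, the following sketch splits a virtual address into PGD, PUD, PMD, and PTE indices plus a page offset. The bit widths are an assumption taken from the common x86-64 layout (4 KB pages, 9 index bits per level, 48-bit virtual addresses) and are not stated in this document; other architectures use different widths. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (x86-64 with 4 KB pages): 12 offset bits and
 * 9 index bits for each of the four table levels. */
#define PAGE_SHIFT  12
#define LEVEL_BITS   9
#define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)

struct pt_indices {
    unsigned pgd;     /* index into the page global directory */
    unsigned pud;     /* index into the page upper directory */
    unsigned pmd;     /* index into the page middle directory */
    unsigned pte;     /* index into the page table */
    unsigned offset;  /* byte offset within the page */
};

/* Split a virtual address into its per-level indices and page offset. */
struct pt_indices split_virtual_address(uint64_t va)
{
    struct pt_indices ix;

    ix.offset = (unsigned)(va & ((1u << PAGE_SHIFT) - 1));
    ix.pte = (unsigned)((va >> PAGE_SHIFT) & LEVEL_MASK);
    ix.pmd = (unsigned)((va >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK);
    ix.pud = (unsigned)((va >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK);
    ix.pgd = (unsigned)((va >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK);
    return ix;
}
```

On an architecture with fewer hardware levels, the intermediate levels are folded away, which in terms of this sketch corresponds to an index width of zero bits for the folded level.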
The larger kernel virtual address space allows the system to manage more physical memory. Up to 64 GB of main memory is supported by SLES on x86-compatible systems. The larger user virtual address space allows applications to use approximately 30% more memory (3.7 to 3.8 GB), improving performance for applications that take advantage of the feature.
5.5.2.1.1 Segmentation The segmentation unit translates a logical address into a linear address. A logical address consists of two parts: a 16-bit segment identifier called the segment selector, and a 32-bit offset. For quick retrieval of the segment selector, the processor provides six segmentation registers whose purpose is to hold segment selectors.
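The 16-bit segment selector is itself a packed value. The sketch below decodes it using the field layout defined by the x86 architecture (a 13-bit descriptor table index, a table indicator bit, and a 2-bit requested privilege level); that layout is background knowledge assumed here rather than something stated in this section, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct selector_fields {
    unsigned index;  /* bits 15..3: index into the descriptor table */
    unsigned ti;     /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    unsigned rpl;    /* bits 1..0: requested privilege level, 0 to 3 */
};

/* Decode a 16-bit x86 segment selector into its three fields. */
struct selector_fields decode_selector(uint16_t selector)
{
    struct selector_fields f;

    f.index = (unsigned)selector >> 3;
    f.ti = ((unsigned)selector >> 2) & 1u;
    f.rpl = (unsigned)selector & 3u;
    return f;
}
```

The requested privilege level in the low two bits ties segmentation to the four processor privilege levels described earlier for System x.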
5.5.2.1.2 Paging The paging unit translates linear addresses into physical addresses. It checks the requested access type against the access rights of the linear address. Linear addresses are grouped in fixed-length intervals called pages. To allow the kernel to specify the physical address and access rights of a page instead of addresses and access rights of all the linear addresses in the page, contiguous linear addresses within a page are mapped to contiguous physical addresses.
Figure 5-30: Regular paging
In extended paging, the 32 bits of a linear address are divided into two fields:
• Directory: The most significant 10 bits represent the directory.
• Offset: The remaining 22 bits represent the offset.
Figure 5-31: Extended paging
Each entry of the page directory and of the page table is represented by the same data structure. This data structure includes fields that describe the page table or page entry, such as accessed flag, dirty flag, and page size flag.
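The two ways of dividing a 32-bit linear address can be compared side by side in a short sketch. The extended paging split (10-bit directory, 22-bit offset) follows the description above; the regular paging split (10-bit directory, 10-bit table, 12-bit offset) is the standard x86 layout for 4 KB pages and is assumed here. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct regular_fields {
    uint32_t directory;  /* most significant 10 bits */
    uint32_t table;      /* middle 10 bits */
    uint32_t offset;     /* least significant 12 bits (4 KB page) */
};

struct extended_fields {
    uint32_t directory;  /* most significant 10 bits */
    uint32_t offset;     /* remaining 22 bits (4 MB page) */
};

/* Split a linear address as regular paging does. */
struct regular_fields split_regular(uint32_t linear)
{
    struct regular_fields f;

    f.directory = linear >> 22;
    f.table = (linear >> 12) & 0x3ffu;
    f.offset = linear & 0xfffu;
    return f;
}

/* Split a linear address as extended paging does. */
struct extended_fields split_extended(uint32_t linear)
{
    struct extended_fields f;

    f.directory = linear >> 22;
    f.offset = linear & 0x3fffffu;
    return f;
}
```

Both schemes share the same directory index; extended paging simply skips the page table level and treats the remaining 22 bits as an offset into one large page.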
Page 102
User-Supervisor flag: This flag contains the privilege level that is required for accessing the page or page table. The User-Supervisor flag is either 0, which indicates that the page can be accessed only in kernel mode, or 1, which indicates that it can always be accessed. Figure 5-32: Access control through paging 5.5.2.1.2.1 Paging in the SLES kernel...
Page 103
For more information about call gates, refer to http://www.csee.umbc.edu/~plusquel/310/slides/micro_arch4.html
5.5.2.1.2.3 Translation lookaside buffers The System x processor includes other caches, in addition to the hardware caches. These caches are called Translation Lookaside Buffers (TLBs), and they speed up the linear-to-physical address translation. The TLB is built up as the kernel performs linear-to-physical translations.
OS instance and applications. LPAR allows multiple, independent operating system images of Linux to run on a single System p server. This section describes logical partitions and their impact on memory addressing and access control. To learn more about System p systems, see “PowerPC 64-bit Kernel Internals”...
Page 105
Figure 5-34: Logical partitions On System p systems without logical partitions, the processor has two operating modes, user and supervisor. The user and supervisor modes are implemented using the PR bit of the Machine State Register (MSR). Logical partitions on System p systems necessitate a third mode of operation for the processor. This third mode, called the hypervisor mode, provides all the partition control and partition mediation in the system.
Page 106
• 0: The processor is not in hypervisor state.
• 1: If MSR[PR] = 0, the processor is in hypervisor state; otherwise, the processor is not in hypervisor state.
The HV bit takes the value of 1 for hypervisor mode and 0 for user and supervisor mode. The following table describes the privilege state of the processor as determined by MSR[HV] and MSR[PR]: Privilege State privileged(supervisor mode)
Page 107
Figure 5-36: Determination of processor mode in LPAR Just as certain memory areas are protected from access in user mode, some memory areas, such as hardware page tables, are accessible only in hypervisor mode. The PowerPC and POWER architecture provides only one system call instruction.
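As a rough illustration of how MSR[HV] and MSR[PR] together select the processor mode, consider the sketch below. The function name is hypothetical. The polarity of the PR bit (1 meaning user or problem state) and the handling of HV = 1 with PR = 1 as user mode are assumptions based on the PowerPC architecture, not statements taken from this section.

```python
def processor_mode(hv, pr):
    """Return the processor mode implied by the MSR[HV] and MSR[PR] bits."""
    if pr == 1:
        # Assumption: PR = 1 is the non-privileged (user) state,
        # regardless of the HV bit.
        return "user"
    # PR = 0: privileged. HV decides between hypervisor and supervisor.
    return "hypervisor" if hv == 1 else "supervisor"
```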
Page 108
5.5.2.2.2 Hypervisor The hypervisor program is stored in a system flash module in the server hardware. During system initialization, the hypervisor is loaded into the first physical address region of system memory. The hypervisor program is trusted to create partition environments, and is the only program that can directly access special processor registers and translation table entries.
Page 109
5.5.2.2.4 Virtual mode addressing Operating systems use another type of addressing, virtual addressing, to give user applications an effective address space that exceeds the amount of physical memory installed in the system. The operating system does this by paging infrequently used programs and data from memory out to disk, and bringing them back into memory on demand.
Figure 5-38: DMA addressing 5.5.2.2.7 Run-Time Abstraction Services System p hardware platforms provide a set of firmware Run-Time Abstraction Services (RTAS) calls. In LPAR, these calls perform additional validation checking and resource virtualization for the partitioned environment. For example, although there is only one physical non-volatile RAM chip, and one physical battery-powered Time-of-Day chip, RTAS makes it appear to each partition as though it has its own non-volatile RAM area, and its own uniquely settable Time-of-Day clock.
Page 111
For further information about the PowerPC 64-bit processor, see PowerPC 64-bit Kernel Internals by David Engebretson, Mike Corrigan & Peter Bergner at http://lwn.net/2001/features/OLS/pdf/pdf/ppc64.pdf. You can find further information about System p hardware at http://www-1.ibm.com/servers/eserver/pseries/linux/. The following describes the four address types used in System p systems. They are effective, virtual, physical, and block:
• Effective address: The effective address, also called the logical address, is a 64-bit address included...
Page 112
Figure 5-41: Block address
To access a particular memory location, the CPU transforms an effective address into a physical address using one of the following address translation mechanisms.
• Real mode address translation, where address translation is disabled. The physical address is the...
Page 113
• DR: Data Address Translation. The value of 0 disables translation, and the value of 1 enables translation.
5.5.2.3.2 Page descriptor Pages are described by Page Table Entries (PTEs). The operating system generates and places PTEs in a page table in memory. A PTE on SLES is 128 bits in length. Bits relevant to access control are Page protection bits (PP), which are used with MSR and segment descriptor fields to implement access control.
Page 114
Figure 5-45: Block Address Translation entry
• Vs: Supervisor mode valid bit. Used with MSR[PR] to restrict translation for some block addresses.
• Vp: User mode valid bit. Used with MSR[PR] to restrict translation for some block addresses.
• PP: Protection bits for block.
...
Page 115
Real Mode Address Translation: Real Mode Address Translation is not technically the translation of any addresses. Real Mode Address Translation signifies no translation. That is, the physical address is the same as the effective address. The operating system uses this mode during initialization and some interrupt processing.
Page 116
Page address translation begins with a check to see if the effective segment ID, corresponding to the effective address, exists in the Segment Lookaside Buffer (SLB). The SLB provides a mapping between Effective Segment IDs (ESIDs) and Virtual Segment IDs (VSIDs). If the SLB search fails, a segment fault occurs. This is an instruction segment exception or a data segment exception, depending on whether the effective address is for an instruction fetch or for a data access.
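The SLB lookup step can be sketched as a simple table lookup. This is only a model: the dictionary standing in for the SLB, the `SegmentFault` exception class, and the function name are all hypothetical, and real hardware performs this search in the processor rather than in software.

```python
class SegmentFault(Exception):
    """Raised when an ESID has no entry in the (modeled) SLB."""


def lookup_vsid(slb, esid, is_instruction_fetch):
    """Map an ESID to its VSID, or raise the matching segment exception."""
    try:
        return slb[esid]
    except KeyError:
        # The kind of exception depends on the type of access.
        kind = "instruction segment" if is_instruction_fetch else "data segment"
        raise SegmentFault(f"{kind} exception for ESID {esid:#x}") from None
```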
Page 117
Figure 5-48: Page Address Translation and access control...
All CPUs, memory, and devices are directly under the control of the SLES operating system. Native Hardware mode is useful when a single server requires a large amount of memory. Native Hardware mode is not very common, because it requires device driver support in SLES for all attached devices, and Native Hardware does not provide the flexibility of the other two modes.
Page 119
Absolute address: An absolute address is the address assigned to a main memory location. An absolute address is used for a memory access without any transformations performed on it. Effective address: An effective address is the address that exists before any transformation takes place by dynamic address translation or prefixing.
Page 120
Figure 5-49: System z address types and their translation 5.5.2.4.7.1 Dynamic address translation Bit 5 of the current PSW indicates whether a virtual address is to be translated using paging tables. If it is, bits 16 and 17 control which address space translation mode (primary, secondary, access-register, or home) is used for the translation.
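The PSW decoding described above can be sketched as follows. The function names are hypothetical. The sketch assumes IBM bit numbering (bit 0 is the most significant bit) and takes only the first 32 bits of the PSW as an integer. The encoding of bits 16 and 17 (00 primary, 01 access-register, 10 secondary, 11 home) follows the z/Architecture Principles of Operation and is not spelled out in this section.

```python
# Assumed encoding of PSW bits 16-17 (address-space control).
AS_MODES = {
    0b00: "primary",
    0b01: "access-register",
    0b10: "secondary",
    0b11: "home",
}


def psw_bit(psw_high_word, bit):
    """Return bit `bit` of a 32-bit word, counting bit 0 as the MSB."""
    return (psw_high_word >> (31 - bit)) & 1


def translation_mode(psw_high_word):
    """Return the address-space mode, or None if DAT (bit 5) is off."""
    if psw_bit(psw_high_word, 5) == 0:
        return None
    as_bits = (psw_bit(psw_high_word, 16) << 1) | psw_bit(psw_high_word, 17)
    return AS_MODES[as_bits]
```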
Page 121
Figure 5-51: Address translation modes Each address-space translation mode translates virtual addresses corresponding to that address space. For example, primary address-space mode translates virtual addresses from the primary address space, and home address space mode translates virtual addresses belonging to the home address space. Each address space has an associated Address Space Control Element (ASCE).
Page 122
Figure 5-52: 64-bit or 31-bit Dynamic Address Translation 5.5.2.4.7.2 Prefixing Prefixing provides the ability to assign a range of real addresses to a different block in absolute memory for each CPU, thus permitting more than one CPU sharing main memory to operate concurrently with a minimum of interference.
Page 123
For a detailed description of prefixing as well as implementation details, see z/Architecture Principles of Operation at http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf. 5.5.2.4.8 Memory protection mechanisms In addition to separating the address space of user and supervisor states, the z/Architecture provides mechanisms to protect memory from unauthorized access. Memory protections are implemented using a combination of the PSW register, a set of sixteen control registers (CRs), and a set of sixteen access registers (ARs).
Page 124
5.5.2.4.8.2 Page table protection The page table protection mechanism is applied to virtual addresses during their translation to real addresses. The page table protection mechanism controls access to virtual storage by using the page protection bit in each page-table entry and segment-table entry. Protection can be applied to a single page or an entire segment (a collection of contiguous pages).
Page 127
5.5.2.4.8.3 Key-controlled protection When an access attempt is made to an absolute address, which refers to a memory location, key-controlled protection is applied. Each 4 KB page of real memory has a 7-bit storage key associated with it. These storage keys can only be set when the processor is in the supervisor state. The Program Status Word contains an access key corresponding to the currently running program.
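A simplified model of the key comparison is sketched below. The layout of the 7-bit storage key (four access-control bits, then the fetch-protection, reference, and change bits) and the matching rule are taken from the z/Architecture Principles of Operation rather than from this section, so treat them as assumptions. The function name is hypothetical, and override facilities are not modeled.

```python
def key_access_permitted(psw_key, storage_key, is_store):
    """Model key-controlled protection for one access to one page."""
    access_control = (storage_key >> 3) & 0xF  # upper 4 of the 7 bits
    fetch_protected = (storage_key >> 2) & 1   # fetch-protection bit
    # A PSW key of zero, or a matching key, permits any access.
    if psw_key == 0 or psw_key == access_control:
        return True
    # Keys differ: stores are refused; fetches are allowed only when
    # the page is not fetch-protected.
    if is_store:
        return False
    return fetch_protected == 0
```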
Page 128
Access control and protection mechanisms are part of both segmentation and paging. The following sections describe the four address types on IBM eServer 326 computers (logical, effective, linear, and physical), and how segmentation and paging are used to provide access control and memory resource separation by SLES on IBM eServer 326 systems.
Page 129
The segment selector specifies an entry in either the global or local descriptor table. The specified descriptor-table entry describes the segment location in virtual-address space, its size, and other characteristics. The effective address is used as an offset into the segment specified by the selector. 5.5.2.5.2 Effective address The offset into a memory segment is referred to as an effective address.
Page 130
• Requestor Privilege Level (RPL): RPL represents the privilege level of the program that created the segment selector. The RPL is stored in the segment selector used to reference the segment descriptor.
• Descriptor Privilege Level (DPL): DPL is the privilege level that is associated with an individual...
Page 131
calls. If the code segment is non-conforming (with conforming bit C set to zero in the segment descriptor), then the processor first checks to ensure that CPL is equal to DPL. If CPL is equal to DPL, then the processor performs the next check to see if the RPL value is less than or equal to the CPL.
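The two checks described for a non-conforming code segment can be written as a small predicate. This is a sketch of only the comparisons named in the text; the function name is hypothetical, and what the processor does when a check fails is not modeled here.

```python
def nonconforming_transfer_allowed(cpl, dpl, rpl):
    """Apply the CPL == DPL check, then the RPL <= CPL check."""
    if cpl != dpl:
        return False  # first check failed
    return rpl <= cpl  # second check
```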
Page 132
Figure 5-60: Contiguous linear addresses map to contiguous physical addresses The eServer 326 supports a four-level page table. The uppermost level is kept private to the architecture-specific code of SLES. The page-table setup supports up to 48 bits of address space. The x86-64 architecture supports page sizes of 4 KB and 2 MB.
Page 133
Figure 5-61: 4 KB page translation, virtual to physical address translation When the page size is 2 MB, bits 0 to 20 represent the byte offset into the physical page. That is, page table offset and byte offset of the 4 KB page translation are combined to provide a byte offset into the 2 MB physical page.
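The arithmetic behind the 21-bit offset can be checked with a short sketch. The 9-bit page table index and 12-bit byte offset used for 4 KB translation are the standard x86-64 widths; they are not stated in this section and are labeled here as assumptions. The function name is hypothetical.

```python
PAGE_TABLE_INDEX_BITS = 9   # assumed width of the page table offset
BYTE_OFFSET_BITS_4KB = 12   # assumed width of the byte offset (4 KB page)
# Combined, they form the byte offset into a 2 MB page: bits 0 to 20.
BYTE_OFFSET_BITS_2MB = PAGE_TABLE_INDEX_BITS + BYTE_OFFSET_BITS_4KB


def offset_in_2mb_page(virtual_address):
    """Return the byte offset of the address within its 2 MB page."""
    return virtual_address & ((1 << BYTE_OFFSET_BITS_2MB) - 1)
```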
Page 134
Figure 5-62: 2 MB page translation, virtual to physical address translation Each entry of the page map level-4 table, the page-directory pointer table, the page-directory table, and the page table is represented by the same data structure. This data structure includes fields that interact in implementing access control during paging.
• Read/Write flag: This flag contains access rights of the physical pages mapped by the table entry. The R/W flag is either read/write or read. If set to 0, the corresponding page can only be read; otherwise, the corresponding page can be written to or read. The R/W flag affects all physical pages mapped by the table entry.
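The effect of the R/W flag on a single access can be expressed as a one-line check. This sketch considers only the R/W flag of one table entry; the function name is hypothetical, and other flags and other levels of the page-table hierarchy are ignored.

```python
def rw_flag_permits(rw_flag, is_write):
    """Reads always pass this check; writes need the R/W flag set to 1."""
    return (not is_write) or rw_flag == 1
```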
NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high-speed interconnect. NUMA allows SLES to scale more efficiently for systems with dozens or hundreds of CPUs, because CPUs can access a dedicated memory bus for local memory.
Figure 5-65: Rmap VM For more information about Rmap VM, see http://lwn.net/Articles/23732/ and http://www-106.ibm.com/developerworks/linux/library/l-mem26/. 5.5.3.3 Huge Translation Lookaside Buffers This memory management feature is valuable for applications that use a large virtual address space. It is especially useful for database applications. The CPU Translation Lookaside Buffer (TLB) is a small cache used for storing virtual-to-physical mapping information.
Page 138
Huge TLB File system (hugetlbfs) is a pseudo file system, implemented in fs/hugetlbfs/inode.c. The basic idea behind the implementation is that large pages are being used to back up any file that exists in the file system. During initialization, init_hugetlbfs_fs() registers the file system and mounts it as an internal file system with kern_mount().
5.5.3.4 Remap_file_pages Remap_file_pages is another memory management feature that is suitable for large memory and database applications. It is primarily useful for x86 systems that use the shared memory file system (shmemfs). A shmemfs memory segment requires kernel structures for control and mapping functions, and these structures can grow unacceptably large given a large enough segment and multiple sharers.
5.5.3.6 Memory area management Memory areas are sequences of memory cells having contiguous physical addresses with an arbitrary length. The SLES kernel uses the buddy algorithm for dealing with relatively large memory requests, but in order to satisfy kernel needs of small memory areas, a different scheme, called slab allocator, is used. The slab allocator views memory areas as objects with data and methods.
Page 141
address returned by arch_get_unmapped_area() to contain a linear address that is part of another process’s address space. In addition to this process compartmentalization, the do_mmap() routine also makes sure that when a new memory region is inserted it does not cause the size of the process address space to exceed the threshold set by the system parameter rlimit.
5.5.5 Symmetric multiprocessing and synchronization The SLES kernel allows multiple processes to execute in the kernel simultaneously (the kernel is reentrant). It also supports symmetric multiprocessing (SMP), in which two or more processors share the same memory and have equal access to I/O devices. Because of reentrancy and SMP, synchronization issues arise. This section describes various synchronization techniques used by the SLES kernel.
5.5.5.3 Spin locks Spin locks provide an additional synchronization primitive for applications running on SMP systems. A spin lock is just a simple flag. When a kernel control path tries to claim a spin lock, it first checks whether or not the flag is already set.
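The flag-based behavior can be modeled in user space with the sketch below. It is an analogy only, not kernel code: `threading.Lock` with a non-blocking acquire stands in for the atomic test-and-set of the flag, and the class and function names are hypothetical. The busy-wait loop reflects the usual spin lock behavior of retrying until the flag is found clear.

```python
import threading


class SpinLock:
    """Busy-waiting lock built around a single flag."""

    def __init__(self):
        # Stand-in for the flag; a non-blocking acquire tests and
        # sets it in one atomic step.
        self._flag = threading.Lock()

    def acquire(self):
        # Spin until the flag is found clear and claimed.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()


def count_with_spinlock(num_threads, increments):
    """Increment a shared counter from several threads under the lock."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]
```

Spinning is only sensible when the lock is held briefly, because the waiting path consumes CPU time instead of sleeping.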
Figure 5-69: Audit framework components 5.6.1.1 Audit kernel components Linux Audit of the SLES kernel includes three kernel-side components relating to the audit functionality. The first component is a generic mechanism for creating audit records and communicating with user space. The communication is achieved via the netlink socket interface.
Page 145
The kernel checks the effective capabilities of the sender process. If the sender does not possess the right capability, the netlink message is discarded. 5.6.1.1.2 Syscall auditing The second component is a mechanism that addresses system call auditing. It uses the generic logging mechanism for creating audit records and communicating with user space.
Page 146
Figure 5-71: Task Structure
5.6.1.1.5 Audit context fields
• Login ID: Login ID is the user ID of the logged-in user. It remains unchanged through the setuid() or seteuid() system calls. Login ID is required by the Controlled Access Protection Profile to irrefutably associate a user with that user’s actions, even across su() calls or use of setuid binaries.
• serial: A unique number that helps identify a particular audit record. Along with ctime, it can determine which pieces belong to the same audit record. The (timestamp, serial) tuple is unique for each syscall and it lives from syscall entry to syscall exit.
• ctime: Time at system call entry.
When a filesystem object the audit subsystem is watching changes, the inotify subsystem calls the audit_handle_event() function. audit_handle_event() in turn updates the audit subsystem's watch data for the watched entity. This process is detailed in Section 5.6.3.1.3. 5.6.1.3 User space audit components The main user level audit components consist of a daemon (auditd), a control program (auditctl), a library (libaudit), a configuration file (auditd.conf), and an initial setup file (auditd.rules).
Figure 5-72: Audit User Space Components 5.6.2 Audit operation and configuration options 5.6.2.1 Configuration There are many ways to control the operation of the audit subsystem. The controls are available at compilation time, boot time, daemon startup time, and while the daemon is running. At compilation time, SLES kernel provides three kernel configuration options that control the level of audit support compiled into the kernel.
Page 150
Option
log_file
log_format
priority_boost
flush
freq
num_logs
max_log_file
max_log_file_action
space_left
space_left_action
admin_space_left
admin_space_left_action
disk_full_action
disk_error_action
Table 5-2: /etc/auditd.conf options
In addition to setting the audit filter rules, auditctl can be used to control the audit subsystem behavior in the kernel even when auditd is running. These settings are listed in Table 5-3.
Description
name of the log file
How to flush the data from...
Option Table 5-3: auditctl control arguments 5.6.2.2 Operation The audit framework operation consists of the following steps: On kernel startup, if audit support was compiled into the kernel, the following are initialized: 1. The netlink socket is created. 2. Four lists that hold the filter rules in the kernel are initialized. 3.
7. If audit is enabled, the kernel intercepts the system calls, and generates audit records according to the filter rules. Or, the kernel generates audit records for watches set on particular file system files or directories. 8. Trusted programs can also write audit records for security-relevant operations via the audit netlink, not directly to the audit log.
Page 153
Figure 5-73: Audit Record Generation 5.6.3.1.2 Syscall audit record generation Once attached, every security-relevant system call performed by the process is evaluated in the kernel. The process’s descendants maintain their attachment to the audit subsystem. 1. All security-relevant system calls made by the process are intercepted at the beginning or at the exit of the system call code.
Page 154
generates the audit record, and sends the record to the netlink socket. Both audit_syscall_entry() and audit_syscall_exit() call audit_filter_syscall() to apply filter logic and decide whether to audit the call. Figure 5-74: Extension to system calls interface Filtering logic allows an administrative user to filter out events based on the rules set, as described in the auditctl man page.
5.6.3.1.4 Socket call and IPC audit record generation Some system calls pass an argument to the kernel specifying which function the system call is requesting from the kernel. These system calls request multiple services from the kernel through a single entry point. For example, the first argument to the ipc() call specifies whether the request is for semaphore operation, shared memory operation, and so forth.
Page 156
timestamp of the record and the serial number are used by the user-space daemon to determine which pieces belong to the same audit record. The (timestamp, serial number) tuple is unique for each syscall and lasts from syscall entry to syscall exit. Each audit record for a system call contains the system call return code, which indicates whether the call was successful.
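How a consumer might use the tuple to reassemble a record can be sketched as below. The piece format (a dictionary with `timestamp`, `serial`, and `text` keys) and the function name are hypothetical and chosen for illustration; they do not reflect the actual audit message layout.

```python
from collections import defaultdict


def group_record_pieces(pieces):
    """Group audit record pieces that share a (timestamp, serial) tuple."""
    grouped = defaultdict(list)
    for piece in pieces:
        key = (piece["timestamp"], piece["serial"])
        grouped[key].append(piece["text"])
    return dict(grouped)
```

Pieces keep their arrival order within each group, so a record made of several messages stays readable after grouping.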
Page 157
Event Description
Startup and shutdown of audit functions
Modification of audit configuration files
Successful and unsuccessful file read/write
Audit storage space exceeds a threshold
Audit storage space failure
Operation on file system objects
Operations on message queue
Operations on semaphores
Operations on shared memory segments
Rejection or acceptance by the TSF of any tested secret.
Execution of the test of the underlying machine and the result of the test
Changes to system time
Setting up a trusted channel
Table 5-4: Audit Subsystem event codes
5.6.4 Audit tools In addition to the main components, the user level provides a search utility, ausearch, and a trace utility, autrace.
Lower-layer functions, such as scheduling and interrupt management, cannot be modularized. Kernel modules can be used to add or replace system calls. The SLES kernel supports dynamically-loadable kernel modules that are loaded automatically on demand. Loading and unloading occur as follows: 1.
Page 160
STRUCTURE
task_struct
linux_binprm
super_block
inode
file
sk_buff
net_device
kern_ipc_perm
msg_msg
Table 5-5: Kernel data structures modified by the LSM kernel patch and the corresponding abstract objects
The security_operations structure is the main security structure that provides security hooks for various operations such as program execution, file system, inode, file, task, netlink, UNIX domain networking, socket, and System V IPC mechanisms.
Figure 5-76: LSM hook architecture LSM adds a general security system call that simply invokes the sys_security hook. This system call and hook permits security modules to implement new system calls for security-aware applications. 5.7.2 LSM capabilities module The LSM kernel patch moves most of the existing POSIX.1e capabilities logic into an optional security module stored in the file security/capability.c.
● Administrative utilities provide a mechanism for administrators to configure, query, and control AppArmor.
For background information on AppArmor, which was originally named SubDomain, see SubDomain: Parsimonious Server Security by Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle, and Virgil Gligor at https://forgesvn1.novell.com/viewsvn/apparmor/trunk/docs/papers/subdomain lisa00.pdf?revision=3 [CRISP], as well as http://www.novell.com/documentation/apparmor/pdfdoc/apparmor2_admin/apparmor2_admin.pdf and http://forge.novell.com/modules/xfmod/project/?apparmor .
5.8.1 AppArmor administrative utilities The primary configuration file for AppArmor is /etc/apparmor/subdomain.conf. (SubDomain was the original name for AppArmor.) The configuration file defines the directory where AppArmor profiles are...
● px: discrete profile execute
● Px: discrete profile execute after scrubbing the environment
● ix: inherit execute
● m: allow PROT_EXEC with mmap(2) calls
● l: link
For more information about complete AppArmor profile syntax, see the apparmor.d man page.
AppArmor profiles are loaded into the kernel by the apparmor_parser tool. apparmor_parser can load new profiles, replace profiles, and remove profiles. Profiles can optionally and individually be loaded in “Complain” mode, in which AppArmor does not enforce the profile but only logs an error message whenever the profile would have denied access. For more information on apparmor_parser, see the apparmor_parser man page.
AppArmor also provides a status tool, apparmor_status. apparmor_status provides information about the number of profiles loaded in enforcing and complaining mode and the number of running processes being confined by AppArmor. For more information on apparmor_status, see the apparmor_status man page.
The confined program reports which programs with open network sockets are running without the protection of an AppArmor profile. The complain program allows an authorized administrator to switch AppArmor out of enforcing mode and into complaining mode for a targeted program. The enforce program allows an authorized administrator to do the opposite, switching from complain to enforcing mode for a particular profile. genprof can be used to generate a profile with all of the permissions that were exercised during a test run of the targeted program. See the confined, enforce, complain, and genprof man pages for more detail.
For an application contained by an AppArmor profile, access that is not explicitly allowed is denied.
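The default-deny rule and the difference between enforcing and complain mode can be modeled with a short sketch. Everything here is hypothetical: the profile is a plain dictionary from exact paths to sets of permission letters, whereas real AppArmor profiles use their own syntax and pattern matching, and mediation happens in the kernel.

```python
def mediate(profile, path, requested, complain=False, log=None):
    """Return True if access proceeds under the modeled profile."""
    granted = profile.get(path, set())
    if requested <= granted:
        return True  # every requested permission is explicitly allowed
    if complain:
        # Complain mode: record what would have been denied, then allow.
        if log is not None:
            log.append(f"would deny {sorted(requested)} on {path}")
        return True
    return False  # enforcing mode: not explicitly allowed means denied
```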
5.8.2 AppArmor access control functions AppArmor access control functions are called through LSM hooks from various points in the kernel when new subjects and objects are created, when access between subject and object is mediated, and when subject and object security attributes transition to different values (such as during an execve() call).
Device drivers A device driver is a software layer that makes a hardware device respond to a well-defined programming interface. The kernel interacts with the device only through these well-defined interfaces. For detailed information about device drivers, see Linux Device Drivers, 2nd Edition, by Alessandro Rubini and Jonathan Corbet.
guest program or interpreted machine. The interpreted and host machines execute guest and host programs, respectively. The interpretive-execution facility is invoked by executing the Start Interpretive Execution (SIE) processor instruction, which causes the CPU to enter the interpretive-execution mode and to begin execution of the guest program under control of the operand of the instruction, called the state description.
• Conditional interceptions refer to functions that are executed for the guest unless a specified condition is encountered that causes control to be returned to the host by the process that called the interception. Following are some of the controls that can cause interception when the related function is handled for the guest:
• Supervisor Call (SVC) instruction: It is to be specified whether all guest SVC instructions cause...
This extra level of indirection is needed for character devices, but not for block devices, because of the large variety of character devices and the operations they support. The following diagram illustrates how the kernel maps the file operations vector of the device file object to the correct set of operations routines for that device. Figure 5-77: Setup of f_op for character device specific file operations 5.9.3 Block device driver...
Architecture AMD® AMD64 IBM eServer System p IBM eServer System z Table 5-6: Boot Loaders by Architecture This section describes the system initialization process of eServer systems. Because part of the initialization...
Page 169
the system runlevel by controlling PID 1. For more information on the /etc/inittab file, please see the inittab(5) man page. For more information on the init program, please see the init(8) man page. The init program generally follows these startup steps: 1.
5.10.2.1 Boot methods SLES supports booting from a hard disk, a CD-ROM, or a floppy disk. CD-ROM and floppy disk boots are used for installation, and to perform diagnostics and maintenance. A typical boot is from a boot image on the local hard disk.
Page 171
14. The boot loader sets the IDT with null interrupt handlers. It puts the system parameters obtained from the BIOS and the parameters passed to the operating system into the first page frame. 15. The boot loader identifies the model of the processor. It loads the gdtr and idtr registers with the addresses of the Global Descriptor Table and Interrupt Descriptor Table, and jumps to the start_kernel() function.
Page 172
Figure 5-79: System x SLES boot sequence...
5.10.3 System p This section briefly describes the system initialization process for System p servers. 5.10.3.1 Boot methods SLES supports booting from a hard disk or from a CD-ROM. CD-ROM boots are used for installation and to perform diagnostics and maintenance. A typical boot is from a boot image on the local hard disk. 5.10.3.2 Boot loader A boot loader is the first program that is run after the system completes the hardware diagnostics setup in the...
Page 174
1. Yaboot allows an administrator to perform interactive debugging of the startup process by executing the /etc/sysconfig/init script. 2. Mounts the /proc special file system. 3. Mounts the /dev/pts special file system. 4. Executes /etc/rc.d/rc.local, which was set by an administrator to perform site-specific setup functions.
Figure 5-80: System p SLES boot sequence 5.10.4 System p in LPAR SLES runs in a logical partition on a System p system. The hypervisor program creates logical partitions; it interacts with the actual hardware and provides virtual versions of hardware to operating systems running in different logical partitions.
5.10.4.1 Boot process For an individual computer, the boot process consists of the following steps when the CPU is powered on or reset: 1. The hypervisor assigns memory to the partition as a 64 MB contiguous load area and the balance in 256 KB chunks.
Page 177
• Starts the agetty program.
For more details about services started at run level 3, see the scripts in /etc/rc.d/rc3.d on a SLES system. Figure 5-81 schematically describes the boot process of System p LPARs. Figure 5-81: System p LPAR SLES boot sequence...
5.10.5 System z This section briefly describes the system initialization process for System z servers. 5.10.5.1 Boot methods Linux on System z supports three installation methods: native, LPAR, and z/VM guest installations. SLES only supports z/VM guest installation. The process described below corresponds to the z/VM guest mode. The boot method for the SLES guest partition involves issuing an Initial Program Load (IPL) instruction to the Control Program (CP), which loads the kernel image from a virtual disk (DASD) device.
Page 179
4. Executes /etc/rc.d/rc.local, which was set by an administrator to perform site-specific setup functions. 5. Performs run-level specific initialization by executing startup scripts defined in /etc/inittab. The scripts are named /etc/rc.d/rcX.d, where X is the default run level. The default run level for a SLES system in the evaluated configuration is 3. The following lists some of the initializations performed at run level 3.
Page 180
Figure 5-82: System z SLES boot sequence 5.10.6 eServer 326 This section briefly describes the system initialization process for eServer 326 servers. For detailed information on system initialization, see AMD64 Architecture, Programmer’s Manual Volume 2: System Programming, at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf. 5.10.6.1 Boot methods SLES supports booting from a hard disk, a CD-ROM, or a floppy disk.
Page 181
5.10.6.2 Boot loader After the system completes the hardware diagnostics setup in the firmware, the first program that runs is the boot loader. The boot loader is responsible for copying the boot image from hard disk and then transferring control to it. SLES supports GRUB, which lets you set pointers in the boot sector to the kernel image and to the RAM file system image.
Page 182
17. x86_64_start_kernel() completes the kernel initialization by initializing Page Tables, Memory Handling Data Structures, IDT tables, slab allocator (described in Section 5.5.3.6), system date, and system time. 18. Uncompresses the initrd initial RAM file system, mounts it, and then executes /linuxrc. 19.
Figure 5-83: eServer 326 SLES boot sequence 5.11 Identification and authentication Identification is when a user professes an identity to a system in the form of a login ID. Identification establishes user accountability and access restrictions for actions on the system. Authentication is verification that the user’s claimed identity is valid, and is implemented through a user password at login time.
provides a way to develop programs that are independent of the authentication scheme. These programs need authentication modules to be attached to them at run-time in order to work. Which authentication module is to be attached is dependent upon the local system setup and is at the discretion of the local system administrator.
6. Each authentication module performs its action and relays the result back to the application. 7. The PAM library is modified to create a USER_AUTH type of audit record to note the success or failure from the authentication module. 8. The application takes appropriate action based on the aggregate results from all authentication modules.
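Steps 6 through 8 can be sketched as follows. This is a simplified model and not libpam code: the control flags (required, requisite, sufficient, optional) are the usual PAM stack semantics, which this section does not spell out, and the names `Module` and `run_stack` are hypothetical. The audit record of step 7 is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Module:
    """One entry in a hypothetical, simplified PAM authentication stack."""
    name: str
    control: str                       # "required", "requisite", "sufficient", "optional"
    check: Callable[[str, str], bool]  # returns True on success


def run_stack(stack: List[Module], user: str, password: str) -> bool:
    """Aggregate module results roughly the way a PAM stack does."""
    required_failed = False
    any_success = False
    for module in stack:
        if module.check(user, password):
            any_success = True
            # A sufficient success ends the stack early unless a required
            # module has already failed.
            if module.control == "sufficient" and not required_failed:
                return True
        elif module.control == "requisite":
            return False               # fail immediately, skip the rest
        elif module.control == "required":
            required_failed = True     # remember the failure, keep going
        # Failures of "sufficient" and "optional" modules are ignored.
    return any_success and not required_failed
```

The application only sees the aggregate result, which is why it can stay independent of the modules the administrator configures.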
Page 186
• pam_passwdqc.so: Performs additional password strength checks. For example, it rejects passwords such as “1qaz2wsx” that follow a pattern on the keyboard. In addition to checking regular passwords it offers support for passphrases and can provide randomly generated passwords.
• pam_env.so: Loads a configurable list of environment variables, and it is configured with the file...
5.11.2 Protected databases The following databases are consulted by the identification and authentication subsystem during user session initiation:
• /etc/passwd: For all system users, it stores the login name, user ID, primary group ID, real name, home directory, and shell. Each user’s entry occupies one line, and fields are separated by a colon (:). The file is owned by the root user and root group, and its mode is 644.
• /etc/ftpusers: The ftpusers text file contains a list of users who cannot log in using the File Transfer Protocol (FTP) server daemon. The file is owned by the root user and root group, and its mode is 644.
• /etc/apparmor/* and /etc/apparmor.d/*: The directories /etc/apparmor and...
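As an illustration of the colon-separated /etc/passwd record format described in this list, the following minimal sketch parses one entry. It assumes the standard seven-field layout, which includes a password placeholder as the second field in addition to the fields named above; the sample record in the usage note and the names `PasswdEntry` and `parse_passwd_line` are invented for the example.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str      # login name
    password: str  # placeholder, typically "x" when shadow passwords are used
    uid: int       # user ID
    gid: int       # primary group ID
    gecos: str     # real name / comment field
    home: str      # home directory
    shell: str     # login shell


def parse_passwd_line(line: str) -> PasswdEntry:
    """Split one /etc/passwd record into its colon-separated fields."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, password, uid, gid, gecos, home, shell = fields
    return PasswdEntry(name, password, int(uid), int(gid), gecos, home, shell)
```

For example, `parse_passwd_line("alice:x:1000:100:Alice Example:/home/alice:/bin/bash")` yields an entry whose `uid` is 1000 and whose `shell` is `/bin/bash`.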
Page 189
6. Execs the login program. The steps that are relevant to the identification and authentication subsystem are step 5, which prompts for the user’s login name, and step 6, which executes the login program. The administrator can also use a command-line option to terminate the program if a user name is not entered within a specific amount of time.
Page 190
17. Sets effective, real, and saved user ID. 18. Changes directory to the user’s home directory. 19. Executes shell. 5.11.3.4 mingetty mingetty, the minimal Linux getty, is invoked from /sbin/init when the system transitions from single-user mode to multi-user mode. mingetty opens a pseudo tty port, prompts for a login name, and invokes /bin/login to authenticate.
Page 191
16. Sets up signals. 17. Forks a child. 18. Parent waits on child's return; child continues: 19. Adds the new GID to the group list. 20. Sets the GID. 21. Logs an audit record. 22. Starts a shell if the -c flag was specified. 23.
4. Processes command-line arguments. 5. Sets up the environment variable array. 6. Invokes pam_start() to initialize the PAM library, and to identify the application with a particular service name. 7. Invokes pam_set_item() to record the tty and user name. 8. Validates the user that the application invoker is trying to become. 9.
Page 193
SSL, refer to the following:
• OpenSSL Web site at http://www.openssl.org/docs.
• IBM Redbook TCP/IP Tutorial and Technical Overview, by Adolfo Rodriguez, et al. at http://www.redbooks.ibm.com/redbooks/pdfs/gg243376.pdf.
• “The TLS Protocol version 1.1” by Tim Dierks and Eric Rescorla at...
Page 194
5.12.1.1 Concepts SSL is used to authenticate endpoints and to secure the contents of the application-level communication. An SSL-secured connection begins by establishing the identities of the peers, and establishing an encryption method and key in a secure way. Application-level communication can then begin. All incoming traffic is decrypted by the intermediate SSL layer and then forwarded on to the application;...
Page 195
Figure 5-87: Encryption Algorithm and Key Data confidentiality can be maintained by keeping the algorithm, the key, or both, secret from unauthorized people. In most cases, including OpenSSL, the algorithm used is well-known, but the key is protected from unauthorized people. 5.12.1.1.1.1 Encryption with symmetric keys A symmetric key, also known as a secret key, is a single key that is used for both encryption and decryption.
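The defining property of a symmetric key, that the same key both encrypts and decrypts, can be shown with a toy stream cipher. This sketch is for illustration only: it is not one of the ciphers used by OpenSSL, it is not secure, and the names `keystream` and `xor_cipher` are hypothetical.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random byte stream from the key (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR the data with the key-derived stream.

    Applying the function twice with the same key returns the original
    data, so one function serves as both encryption and decryption.
    """
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
```

Both parties must hold the same secret key, which is why the key, and not the well-known algorithm, is what has to be protected.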
Page 196
Figure 5-88: Asymmetric keys If encryption is done with a public key, only the corresponding private key can be used for decryption. This allows a user to communicate confidentially with another user by encrypting messages with the intended receiver’s public key. Even if messages are intercepted by a third party, the third party cannot decrypt them. Only the intended receiver can decrypt messages with his or her private key.
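The public-key relationship described above can be illustrated with textbook RSA on very small numbers. This is a teaching sketch under stated assumptions: the primes 61 and 53 and the exponent 17 are arbitrary toy values, no padding is applied, and real keys are vastly larger. It requires Python 3.8 or later for the modular inverse form of `pow`.

```python
# Toy key generation with tiny primes (illustration only, not secure).
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi


def encrypt(message: int, public_key=(e, n)) -> int:
    """Encrypt an integer smaller than n with the receiver's public key."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(ciphertext: int, private_key=(d, n)) -> int:
    """Recover the integer with the matching private key."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)
```

Anyone may call `encrypt` because `(e, n)` is public, but only the holder of `d` can reverse it.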
SSL consists of the SSL Handshake Protocol, the SSL Change Cipher Spec Protocol, and the SSL Alert Protocol. The SSL Handshake Protocol is used by the client and server to authenticate each other, and to agree on encryption and hash algorithms to be used by the SSL Record Protocol. The authentication method supported by SSL in the evaluated configuration is client and server authentication using X.509 certificates.
Page 198
A connection is identified by server and client random numbers, a server write MAC secret key, a client write MAC secret key, a server write key, a client write key, initialization vectors, and sequence numbers.
Page 199
2. Server key exchange message: The server key exchange message is sent by the server if it has no certificate, has a certificate only used for signing (that is, DSS [DSS] certificates, or signing-only RSA [RSA] certificates), or uses FORTEZZA KEA key exchange.
The protocol consists of a single message, which is encrypted with the current security parameters. Using the change cipher spec message, security parameters can be changed by either the client or the server. The receiver of the change cipher spec message informs the SSL record protocol of the updates to security parameters.
Page 201
• Data Encryption Standard (DES): DES is a symmetric key cryptosystem derived from the Lucifer algorithm developed at IBM. DES describes the Data Encryption Algorithm (DEA). DEA operates on a 64-bit block size and uses a 56-bit key.
• TDES (3DES): TDES, or Triple DES, encrypts a message three times using DES. This encryption...
On a local system, the user starts the SSH client to open a connection to a remote server running the sshd daemon. If the user is authenticated successfully, an interactive session is initiated, allowing the user to run commands on the remote system.
• diffie-hellman-group1-sha1 method specifies Diffie-Hellman key exchange with SHA-1 as HASH. Sections 5.12.2.1 and 5.12.2.2 briefly describe the implementation of SSH client and SSH server. For detailed information about the SSH Transport Layer Protocol, SSH Authentication Protocol, SSH Connection Protocol, and SSH Protocol Architecture, refer to the corresponding protocol documents at http://www.ietf.org/ids.by.wg/secsh.html.
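The idea behind a Diffie-Hellman exchange hashed with SHA-1 can be sketched as follows. This is a toy under stated assumptions: the modulus `P` and generator `G` are small illustrative values and not the group that diffie-hellman-group1-sha1 specifies, and the real SSH exchange hash covers more inputs than the shared secret alone. The function names are hypothetical.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime; far too small for real use
G = 3           # illustrative generator


def dh_keypair():
    """Pick a random private exponent and compute the public value."""
    private = secrets.randbelow(P - 3) + 2  # in the range [2, P - 2]
    public = pow(G, private, P)
    return private, public


def dh_shared_key(private: int, peer_public: int) -> bytes:
    """Combine our private value with the peer's public value, then hash."""
    secret = pow(peer_public, private, P)
    return hashlib.sha1(secret.to_bytes(16, "big")).digest()
```

Client and server each call `dh_keypair()`, exchange only the public values, and arrive at the same 20-byte key without ever sending it over the network.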
5.12.3 Very Secure File Transfer Protocol daemon Very Secure File Transfer Protocol daemon (VSFTPD) provides a secure, fast, and stable file transfer service to and from a remote host. The behavior of VSFTPD can be controlled by its configuration file /etc/vsftpd/vsftpd.conf.
Page 205
22. Checks for new jobs. 23. Restarts the server if there are no active clients or jobs, or the reload timeout has been reached; this causes cupsd to stop the server, reread the configuration file, and restart the server.
Page 206
38. Updates the root certificate every time a 5 minute timer has elapsed. 39. Goes back to step 20. 40. Upon exit from the main loop: 41. Logs a status message. 42. Stops the server. 43. Frees all jobs. 44. Frees file descriptor sets. 45. Closes audit file descriptor.
Page 207
• Calculation of message digests.
• Encryption and decryption with ciphers.
• SSL and TLS client and server tests.
• Handling of S/MIME signed or encrypted mail.
For detailed information about the openssl command and its usage, see: http://www.openssl.org/docs/apps/openssl.html.
# Service-level configuration
# ---------------------------
[ssmtp]
accept = 465
connect = 25
The above configuration secures localhost-SMTP when someone connects to it via port 465. The configuration tells stunnel to listen on the SSL port 465, and to send all information to the plain port 25 on localhost.
Page 209
14. Invokes pam_chauthtok() to rejuvenate user’s authentication tokens. 15. Exits. 5.13.1.2 chfn The chfn program allows users to change their finger information. The finger command displays the information stored in the /etc/passwd file. Refer to the chfn man page for detailed information. chfn generally follows these steps: 1.
11. Invokes setpwnam() to update appropriate database files with the new shell. 12. Exits. 5.13.2 User management 5.13.2.1 useradd The useradd program allows an authorized user to create new user accounts on the system. Refer to the useradd man page for more information. useradd generally follows these steps: 1.
Page 211
6. Processes command-line arguments. 7. Ensures that the user account being modified exists. 8. Invokes open_files() to lock and open authentication database files. 9. Invokes usr_update() to update authentication database files with updated account information. 10. Generates audit record to log actions of the usermod command. The logged actions include locking and unlocking of user account, changing of user password, user name, user ID, default user group, user shell, user home directory, user comment, inactive days, expiration days, mail file owner, and moving of user’s home directory.
5.13.3 Group management 5.13.3.1 groupadd The groupadd program allows an administrator to create new groups on the system. Refer to the groupadd man page for more detailed information on usage of the command. groupadd generally follows these steps: 1. Sets language. 2.
5.13.3.2 groupmod The groupmod program allows an administrator to modify existing groups on the system. Refer to the groupmod man page for more information. groupmod generally follows these steps: 1. Sets language. 2. Invokes getpwuid (getuid()) to obtain application user’s passwd structure. 3.
5.13.4 System Time management 5.13.4.1 date The date program, for a normal user, displays current date and time. For an administrative user, date can also set the system date and time. Refer to the date man page for more information. date generally follows these steps: 1.
Page 216
(see packet(7)) when opening the socket. AMTU performs the following: 1. Using the PF_PACKET communication domain, opens another connection to the listening server and 2. Ensures that the random data transmitted is also the data received. Steps 1 and 2 are repeated for each configured network device.
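The send-and-compare check in step 2 can be sketched as follows. This is not AMTU code: to stay runnable without root privileges it substitutes a local `socket.socketpair()` for the PF_PACKET socket that AMTU opens on each network device, and the name `loopback_check` is hypothetical.

```python
import os
import socket


def loopback_check(size: int = 1024) -> bool:
    """Send random bytes over a connected socket pair and verify them."""
    sender, receiver = socket.socketpair()
    try:
        payload = os.urandom(size)     # random test data
        sender.sendall(payload)
        received = bytearray()
        while len(received) < size:
            chunk = receiver.recv(size - len(received))
            if not chunk:              # peer closed early
                break
            received += chunk
        return bytes(received) == payload
    finally:
        sender.close()
        receiver.close()
```

A real test would repeat the check once per configured network device, as the text above describes.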
Page 217
5.13.5.1.5.1 System p The instruction set for the PowerPC processor is given in the book at the following URL: http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600682CC7/$file/booke_rm.pdf For each instruction, the description in the book lists whether it is available only in supervisor mode or not.
To test CPU control registers, use MOVL %cs, 28(%esp). This overwrites the value of the register that contains the code segment. The register that contains the address of the next instruction (eip) is not directly addressable. Note that in the Intel documentation of MOV it is explicitly stated that MOV cannot be used to set the CS register.
Page 219
2. Gets its euid and uid. 3. Transforms old-style command line argument syntax into new-style syntax. 4. Processes the command line arguments. 5. Sets up signal handling. 6. Initializes the fifo. 7. Initializes any remote connection. 8. Sets back the real UID. 9.
5.13.6 I&A support 5.13.6.1 pam_tally The pam_tally utility allows administrative users to reset the failed login counter kept in the /var/log/faillog. Please see the /usr/share/doc/packages/pam/modules/README.pam_tally file on a SLES system for more information. 5.13.6.2 unix_chkpwd The unix_chkpwd helper program works with the pam_unix PAM module (Section 5.11.1.3). It is intended only to be executed by the pam_unix PAM module and logs an error if executed otherwise.
Page 221
The crontab program is used to install, deinstall, or list the tables used to drive the cron daemon in Vixie Cron. The crontab program allows an administrator to perform specific tasks on a regularly-scheduled basis without logging in. Users can have their own crontabs that allow them to create jobs that will run at given times.
commands that are to be executed. Information stored in this job file, along with its attributes, is used by the daemon to recreate the invocation of the user’s identity while performing tasks at the scheduled time. 5.14.2 Batch processing daemons 5.14.2.1 cron The cron daemon executes commands scheduled through crontab or listed in /etc/crontab for...
5.15 User-level audit subsystem The main user-level audit components consist of the auditd daemon, the auditctl control program, the libaudit library, the auditd.conf configuration file, and the auditd.rules initial setup file. There is also the /etc/init.d/auditd init script that is used to start and stop auditd. When run, this script sources another file, /etc/sysconfig/auditd, to set the locale, and to set the AUDIT_CLEAN_STOP variable, which controls whether to delete the watch points and the filter rules when auditd stops.
2. Processes the command line arguments. 3. Attempts to raise its resource limits. 4. Sets its umask. 5. Resets its internal counters. 6. Emits a title. 7. Processes audit records from an audit log file or stdin, incrementing counters depending on audit record contents.
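Step 7 amounts to a counting pass over the audit records. The following minimal sketch counts records by type; it assumes each record is a text line beginning with a `type=NAME` field, as written by auditd, and it does not reproduce the reporting tool's actual counters. The name `count_record_types` and the sample records in the usage note are invented for the example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the leading "type=NAME" field of an audit record line.
TYPE_RE = re.compile(r"^type=(\S+)\s")


def count_record_types(lines: Iterable[str]) -> Counter:
    """Increment one counter per audit record type found in the input."""
    counts: Counter = Counter()
    for line in lines:
        match = TYPE_RE.match(line)
        if match:                      # skip lines that are not records
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so it works equally with an open log file or with `sys.stdin`, mirroring the two input sources named in step 7.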
5.16 Supporting functions Trusted programs and trusted processes in an SLES system use libraries. Libraries do not form a subsystem in the notation of the Common Criteria, but they provide supporting functions to trusted commands and processes. A library is an archive of link-edited objects and their export files. A shared library is an archive of objects that has been bound as a module with imports and exports, and is marked as a shared object.
Page 226
Table 5-7: TSF libraries
Library    Description
/lib/libc.so.6    C Run time library functions.
/lib/libcrypt.so.1    Library that performs one-way encryption of user and group passwords.
/lib/libcrypt.so.o.9.8b
/lib/security/pam_unix.so
/lib/security/pam_passwdqc
/lib/security/pam_wheel.so
/lib/security/pam_nologin.
/lib/security/pam_securetty.so
/lib/security/pam_tally.so
/lib/security/pam_listfile
/lib/security/pam_deny.so
/lib/security/pam_env.so
/lib/security/pam_xauth.so
/lib/security/pam_limits.s
/lib/security/pam_shells.s
/lib/security/pam_stack.so
/lib/security/pam_rootok.s
/usr/lib/libssl3.so
/lib/libcrypto.so.4
5.16.2 Library linking mechanism On SLES, a binary executable automatically causes the program loader /lib/ld-linux.so.2 to be loaded and run. This loader takes care of analyzing the library names in the executable file, locating the library in the system directory tree, and making requested code available to the executing process. The loader does not copy the library object code, but instead performs a memory mapping of the appropriate object code into the address space of the executing process.
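The effect of the loader can be observed from user space. The sketch below, which assumes a Linux system running CPython, asks the dynamic loader for a handle on the objects already mapped into the current process and calls a libc function through it; the name `libc_getpid` is hypothetical.

```python
import ctypes
import os


def libc_getpid() -> int:
    """Call getpid() through the shared C library mapped into this process."""
    # Passing None requests a handle for the running program itself, which
    # includes the shared libraries the loader has already mapped.
    libc = ctypes.CDLL(None)
    return libc.getpid()
```

Because the library code is mapped rather than copied, every process that uses libc shares the same read-only object code pages.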
system initialization, and sets the IDT entry corresponding to vector 128 (0x80) to invoke the system call exception handler. When compiling and linking a program that makes a system call, the libc library wrapper routine for that system call stores the appropriate system call number in the eax register, and executes the int 0x80 assembly language instruction to generate the hardware exception.
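The role of the system call number can be demonstrated by bypassing the per-call wrapper and using libc's generic `syscall()` entry point. This sketch makes several assumptions: it runs on Linux, it relies on the commonly documented getpid numbers for the listed architectures, and it leaves the choice of trap instruction to libc, which may not be int 0x80 on a given platform. The name `getpid_via_syscall` is hypothetical.

```python
import ctypes
import os
import platform
import struct
import sys

# getpid system call numbers; an assumption taken from the Linux syscall
# tables for these architectures.
_GETPID_NR = {"x86_64": 39, "i386": 20, "i686": 20, "aarch64": 172}


def getpid_via_syscall():
    """Invoke getpid by number; return None on unrecognized platforms."""
    if not sys.platform.startswith("linux"):
        return None
    machine = platform.machine()
    # A 32-bit process on a 64-bit kernel uses the 32-bit numbering.
    if machine == "x86_64" and struct.calcsize("P") != 8:
        machine = "i686"
    number = _GETPID_NR.get(machine)
    if number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.syscall(number)
```

The kernel never sees the function name; it dispatches purely on the number placed in the register, which is why the number differs between architectures.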
Page 229
passed as system-call parameters. For the sake of efficiency, and satisfying the access control requirement, the SLES kernel performs validation in a two-step process, as follows: 1. Verifies that the linear address (virtual address for System p and System z) passed as a parameter does not fall within the range of interval addresses reserved for the kernel.
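The range check in step 1 can be sketched as follows. The boundary value is an assumption: 0xC0000000 is the conventional start of kernel space on 32-bit x86 with a 3 GB/1 GB split, and other platforms use different limits. The names `PAGE_OFFSET` and `user_range_ok` are used here for illustration only.

```python
PAGE_OFFSET = 0xC0000000  # assumed start of kernel address space (32-bit x86)


def user_range_ok(addr: int, size: int, limit: int = PAGE_OFFSET) -> bool:
    """Return True if [addr, addr + size) lies entirely below the kernel range."""
    if addr < 0 or size < 0:
        return False
    # The whole buffer, not only its first byte, must stay in user space.
    return addr + size <= limit
```

A check like this is cheap because it only compares against one boundary; it does not establish that the address is actually mapped in the process.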
6 Mapping the TOE summary specification to the High-Level Design This chapter provides a mapping of the security functions of the TOE summary specification to the functions described in this High-Level Design document. Identification and authentication Section 5.11 provides details of the SLES Identification and Authentication subsystem. 6.1.1 User identification and authentication data management (IA.1) Section 5.11.2 provides details of the configuration files for user and authentication management.
6.2.3 Audit record format (AU.3) Section 5.6.3.2 describes information stored in each audit record. 6.2.4 Audit post-processing (AU.4) Section 5.15.2 describes audit subsystem utilities provided for post-processing of audit data. Discretionary Access Control Sections 5.1 and 5.2 provide details on Discretionary Access Control (DAC) on the SLES system. 6.3.1 General DAC policy (DA.1) Sections 5.1 and 5.2.2 provide details on the functions that implement general Discretionary Access policy.
6.5.1 Roles (SM.1) Section 5.13 provides details on various commands that support the notion of an administrator and a normal user. 6.5.2 Access control configuration and management (SM.2) Sections 5.1.1 and 5.1.2.1 provide details on the system calls of the file system that are used to set attributes on objects to configure access control.
Section 4.2.2 provides details on the non-kernel trusted process on the SLES system. 6.7.5 TSF Databases (TP.5) Section 4.3 provides details on the TSF databases on the SUSE Linux Enterprise Server system. 6.7.6 Internal TOE protection mechanisms (TP.6) Section 4.1.1 describes hardware privilege implementation for the System x, System p, System z and Opteron eServer 326.
• Kernel Modules
• Device Drivers
• Trusted process subsystems:
• System Initialization
• Identification and Authentication
• Network Applications
• System Management
• Batch Processing
• User-level audit subsystem
6.8.1 Summary of kernel subsystem interfaces This section identifies the kernel subsystem interfaces and structures them per kernel subsystem into: External Interfaces: System calls associated with the various kernel subsystems form the external interfaces.
Page 235
6.8.1.1.2 Internal Interfaces 6.8.1.1.3 Internal function permission vfs_permission get_empty_filp fget do_mount Specific ext3 methods ext3_create ext3_lookup ext3_get_block ext3_permission ext3_truncate Specific isofs methods isofs_lookup Basic inode operations. The create through the revalidate operations are described in Section 5.1.1, attribute and extended attribute functions are described in this document in Section 5.1.2.1 in the context of the ext3 file system.
Page 236
read_inode write_super read_inode2 write_super_lockfs dirty_inode unlockfs write_inode statfs put_inode remount_fs delete_inode clear_inode
Dentry operations: Note that they are not used by other subsystems, so there is no subsystem interface:
• d_revalidate
• d_hash
• d_compare
• d_delete
• d_release
• d_iput
...
System calls are listed in the Functional Specification mapping table. 6.8.1.2.2 Internal Interfaces Internal function current request_irq free_irq send_sig_info && check_kill_permission 6.8.1.2.3 Data Structures task_struct and include/linux/sched.h 6.8.1.3 Kernel subsystem inter-process communication This section lists external interfaces, internal interfaces, and data structures of the inter-process communication subsystem.
Page 238
6.8.1.3.1 External interfaces (system calls) TSFI system calls • Non-TSFI system calls • System calls are listed in the Functional Specification mapping table. 6.8.1.3.2 Internal Interfaces Internal function Interfaces defined in do_pipe Understanding the LINUX KERNEL, Chapter 19, 2nd Edition, Daniel P. Bovet, Marco Cesati, ISBN# 0-596-00213-0/ and this document, Section 5.3.1.1 pipe_read Understanding the LINUX KERNEL, Chapter 19, 2nd Edition, Daniel P.
6.8.1.4 Kernel subsystem networking This section lists external interfaces, internal interfaces and data structures of the networking subsystem. 6.8.1.4.1 External interfaces (system calls) TSFI system calls • Non-TSFI system calls • System calls are listed in the Functional Specification mapping table. 6.8.1.4.2 Internal interfaces Sockets are implemented within the inode structure as specific types of inodes.
System calls are listed in the Functional Specification mapping table 6.8.1.5.2 Internal interfaces Internal interfaces Interfaces defined in get_zeroed_page Linux Device Drivers, O’Reilly, Chapter 7, 2nd Edition June 2001, Alessandro Rubini /this document, chapter 5.5.2.1 __vmalloc Linux Device Drivers, O’Reilly, Chapter 7, 2nd Edition June 2001, Alessandro Rubini vfree Linux Device Drivers, O’Reilly, Chapter 7, 2nd Edition June 2001,...
• audit_sockaddr
• audit_ipc_perms
6.8.1.6.3 Data structures
• audit_sock: The netlink socket through which all user space communication is done.
• audit_buffer: The audit buffer is used when formatting an audit record to send to user space. The audit subsystem pre-allocates audit buffers to enhance performance.
• audit_context: The audit subsystem extends the task structure to potentially include an...
Page 242
driver methods for character device drivers and block device drivers, see [RUBN]. Chapter 3 describes the methods for character devices and chapter 6 describes the methods for block devices. 6.8.1.7.2.1 Character Devices Possible Character Device methods are: llseek flush read release write fsync...
6.8.1.7.3 Data structures device_struct file_operations block_device_operati 6.8.1.8 Kernel subsystems kernel modules This section lists external interfaces, internal interfaces, and data structures of the kernel modules subsystem. 6.8.1.8.1 External interfaces (system calls) TSFI system calls • Non-TSFI system calls • System calls are listed in the Functional Specification mapping table. 6.8.1.8.2 Internal interfaces Module dependent.
Page 244
Requirements [STALLS] Cryptography and Network Security, 2nd Edition, William Stallings, ISBN# 0-13- 869017-0 [LH] Linux Handbook, A Guide to IBM Linux Solutions and Resources, Nick Harris et al. [PSER] IBM eServer pSeries and IBM RS6000 Linux Facts and Features...
Page 245
Technology, U.S. Department of Commerce, 18 May 1994. [SCHNEIR] "Applied Cryptography, Second Edition: Protocols, Algorithms, and Source Code in C", 1996, Schneier, B. [FIPS-186] Federal Information Processing Standards Publication, "FIPS PUB 186, Digital Signature Standard", May 1994. [CRISP] SubDomain: Parsimonious Server Security by Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle, and Virgil Gligor at https://forgesvn1.novell.com/viewsvn/apparmor/trunk/docs/papers/subdomainlisa00.pdf?revision=3.
Page 246
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the...