In Linux, what happens to the state of a process when it needs to read blocks from a disk? Is it blocked? If so, how is another process chosen to execute?
Yes, the task gets blocked in the read() system call. Another task which is ready runs, or if no other tasks are ready, the idle task (for that CPU) runs.
A normal, blocking disc read causes the task to enter the "D" state (as others have noted). Such tasks contribute to the load average, even though they're not consuming the CPU.
Some other types of IO, especially ttys and network, do not behave quite the same - the process ends up in "S" state and can be interrupted and doesn't count against the load average.
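One way to observe the "S" state mentioned above is to start a process that simply sleeps and read its state from /proc. This is only a sketch: the helper name proc_state is made up for illustration, and it assumes a Linux /proc file system.

```shell
# proc_state PID: print the one-letter state (R, S, D, Z, ...) of a process,
# taken from the "State:" line of /proc/PID/status (Linux only).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 30 &          # a process that blocks in an interruptible sleep
pid=$!
sleep 1             # give it time to actually go to sleep
proc_state "$pid"   # typically prints S
kill "$pid"         # an S-state process can be interrupted by a signal
```

A task blocked in a disk read would show D instead, but usually for too short a time to catch this way.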
While waiting for read() or write() to or from a file descriptor to return, the process is put in a special kind of sleep, known as "D" or "Disk Sleep". This is special because the process cannot be killed or interrupted while in such a state. A process waiting for a return from ioctl() would also be put to sleep in this manner.
An exception to this is when a file (such as a terminal or other character device) is opened in O_NONBLOCK mode, which is passed when it's assumed that a device (such as a modem) will need time to initialize. However, you indicated block devices in your question. Also, I have never tried an ioctl() that is likely to block on a fd opened in non-blocking mode (at least not knowingly).
How another process is chosen depends entirely on the scheduler you are using, as well as what other processes might have done to modify their weights within that scheduler.
Some user space programs under certain circumstances have been known to remain in this state forever, until rebooted. These are typically grouped in with other "zombies", but the term would not be correct as they are not technically defunct.
When a process needs to fetch data from a disk, it effectively stops running on the CPU to let other processes run because the operation might take a long time to complete – at least 5ms seek time for a disk is common, and 5ms is 10 million CPU cycles, an eternity from the point of view of the program!
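The cycle count above follows from simple arithmetic if one assumes a 2 GHz clock (the clock rate is not stated, so it is an assumption here). A minimal sketch, with a hypothetical helper name:

```shell
# cycles_in_ms MS HZ: number of CPU cycles that elapse in MS milliseconds
# on a CPU clocked at HZ cycles per second (integer arithmetic).
cycles_in_ms() {
    echo $(( $1 * $2 / 1000 ))
}

cycles_in_ms 5 2000000000   # 5 ms at an assumed 2 GHz -> 10000000
```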
From the programmer's point of view (also called "in userspace"), this is called a blocking system call. If you call write(2) (which is a thin libc wrapper around the system call of the same name), your process does not exactly stop at that boundary: it continues, on the kernel side, running the system call code. Most of the time it goes all the way to a specific disk controller driver (filename → filesystem/VFS → block device → device driver), where a command to fetch a block on disk is submitted to the proper hardware; this is a very fast operation most of the time.
Then the process is put in a sleep state (in kernel space, blocking is called sleeping; nothing is ever 'blocked' from the kernel's point of view). It will be awoken once the hardware has finally fetched the proper data; the process is then marked as runnable and runs as soon as the scheduler allows it to.
Finally in userspace the blocking system call returns with proper status and data, and the program flow goes on.
It is possible to invoke most I/O system calls in non-blocking mode (see O_NONBLOCK in open(2) and fcntl(2)). In this case, the system calls return immediately and only report whether the operation was submitted properly. The programmer has to explicitly check at a later time whether the operation completed, successfully or not, and fetch its result (e.g. with select(2)). This is called asynchronous or event-based programming. Note that, according to open(2), O_NONBLOCK has no effect for regular files and block devices on Linux, so this applies mainly to pipes, sockets and character devices.
Most answers here mentioning the D state (whose exact name is TASK_UNINTERRUPTIBLE in the Linux state names) are incorrect. The D state is a special sleep mode which is only triggered in a kernel-space code path, when that code path can't be interrupted (because it would be too complex to program), most of the time in the hope that it will block only very briefly. I believe that most "D states" are actually invisible: they are very short-lived and can't be observed by sampling tools such as 'top'.
But you will sometimes encounter those unkillable processes in D state in a few situations. NFS is famous for that, and I've encountered it many times. I think there's a semantic clash between some VFS code paths, which assume they always reach local disks with fast error detection (on SATA, an error timeout would be around a few hundred ms), and NFS, which actually fetches data from the network, which is more resilient and has slow recovery (a TCP timeout of 300 seconds is common). Read this article for the cool solution introduced in Linux 2.6.25 with the TASK_KILLABLE state. Before this era there was a hack where you could actually send signals to NFS process clients by sending a SIGKILL to the kernel thread rpciod, but forget about that ugly trick…
Yes, tasks waiting for IO are blocked, and other tasks get executed. Selecting the next task is done by the Linux scheduler.
A process performing I/O will be put in D state (uninterruptible sleep), which frees the CPU until a hardware interrupt tells the CPU to return to executing the program. See man ps for the other process states.
Depending on your kernel, there is a process scheduler, which keeps track of a runqueue of processes ready to execute. It, along with a scheduling algorithm, tells the kernel which process to assign to which CPU. There are kernel processes and user processes to consider. Each process is allocated a time-slice, which is a chunk of CPU time it is allowed to use. Once the process uses all of its time-slice, it is marked as expired and given lower priority in the scheduling algorithm.
In early 2.6 kernels, there is an O(1) time complexity scheduler (replaced by CFS in 2.6.23), so no matter how many processes you have running, it will assign CPUs in constant time. It is more complicated, though, since 2.6 introduced preemption, and CPU load balancing is not an easy algorithm. In any case, it’s efficient, and CPUs will not remain idle while you wait for the I/O.
As already explained by others, processes in "D" state (uninterruptible sleep) are responsible for the ps process hanging. This has happened to me many times with RedHat 6.x and automounted NFS home directories.
To list processes in D state you can use the following commands:
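One possible way to do this, as a sketch that reads /proc directly; the helper name list_d_state is made up for illustration:

```shell
# list_d_state: print "PID NAME" for every process currently in
# uninterruptible sleep (state D), using the Linux /proc file system.
# Only the first word of the process name is printed.
list_d_state() {
    for f in /proc/[0-9]*/status; do
        # Processes can exit while we iterate, so ignore read errors.
        awk '/^Name:/  { name = $2 }
             /^State:/ { state = $2 }
             /^Pid:/   { pid = $2 }
             END       { if (state == "D") print pid, name }' "$f" 2>/dev/null
    done
    return 0
}

list_d_state
```

Where procps is installed, ps -eo pid,stat,comm | awk '$2 ~ /^D/' should give a similar listing.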
To find the current directory of the process and, possibly, the mounted NFS disk that has issues, you can use a command similar to the following example (replace 31134 with the PID of the sleeping process):
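For instance, the cwd entry under /proc is a symbolic link to the working directory of a process. A sketch, with proc_cwd as a hypothetical helper name; 31134 is the example PID from the text:

```shell
# proc_cwd PID: print the current working directory of a process by
# reading the /proc/PID/cwd symbolic link (Linux only; inspecting
# other users' processes needs root).
proc_cwd() {
    readlink "/proc/$1/cwd"
}

proc_cwd "$$"    # demo: working directory of the current shell
# For the stuck process from the text: proc_cwd 31134

# Mount points of NFS file systems currently mounted, to compare
# against the directory printed above.
awk '$3 ~ /^nfs/ { print $2 }' /proc/mounts
```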
I found that running the umount command with the -f (force) switch on the affected NFS file system was able to wake up the sleeping process:
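A sketch of such a call, wrapped in a hypothetical helper; the mount point /mnt/nfs_home in the example is made up and has to be replaced with the affected NFS mount:

```shell
# force_umount MOUNTPOINT: ask the kernel to force-unmount a file system.
# Needs root; on a busy file system the unmount itself may still fail.
force_umount() {
    umount -f "$1"
}

# Example (hypothetical mount point):
#   force_umount /mnt/nfs_home
```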
The file system wasn't unmounted, because it was busy, but the related process did wake up, and I was able to solve the issue without rebooting.