Linux Scheduler Statistics
/proc/schedstat and /proc/<pid>/stat format changes, version 6


If you have scripts that parse version 4, note that several fields were deleted (causing the remaining fields to shift position) and some new fields were added. Porting version 4 scripts should take a moderate effort, depending on how modular you made the field parsing.

Format for version 6 of schedstat:

tag 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
tag is cpuN or totals.

NOTE: In the sched_yield() statistics, the active queue is considered empty if it has only one process on it, since that one process is necessarily the one calling sched_yield().

First four are sched_yield() statistics:

  1. # of times both the active and the expired queue were empty
  2. # of times just the active queue was empty
  3. # of times just the expired queue was empty
  4. # of times sched_yield() was called

Next three are schedule() statistics:

  1. # of times the active queue had at least one other process on it
  2. # of times we switched to the expired queue and reused it
  3. # of times schedule() was called

Next six are statistics dealing with load_balance() (requires CONFIG_SMP):

  1. # of times load_balance() was called at an idle tick
  2. # of times load_balance() was called at a busy tick
  3. # of times load_balance() was called
  4. sum of imbalances discovered (if any) with each call to load_balance()
  5. # of times load_balance() was called when we did not find a "busiest" group
  6. # of times load_balance() was called when we did not find a "busiest" queue

Next six are statistics dealing with pull_task() (requires CONFIG_SMP):

  1. # of times pull_task() moved a task to this cpu when newly idle
  2. # of times pull_task() stole a task from this cpu when newly idle
  3. # of times pull_task() moved a task to this cpu when idle
  4. # of times pull_task() stole a task from this cpu when idle
  5. # of times pull_task() moved a task to this cpu when busy
  6. # of times pull_task() stole a task from this cpu when busy

Next three are statistics dealing with active_load_balance() (requires CONFIG_SMP):

  1. # of times active_load_balance() was called
  2. # of times active_load_balance() caused us to gain a task
  3. # of times active_load_balance() caused us to lose a task

Next two are simply call counters for two routines:

  1. # of times sched_balance_exec() was called
  2. # of times migrate_to_cpu() was called

Next two are statistics dealing with load_balance_newidle():

  1. # of times load_balance_newidle() was called
  2. sum of imbalances discovered (if any) with each call to load_balance_newidle()

Last three are statistics dealing with scheduling latency:

  1. sum of all time spent running by tasks on this processor (in ms)
  2. sum of all time spent waiting by tasks for this processor (in ms)
  3. # of tasks (not necessarily unique) given to the processor

The last three fields make it possible to find the average latency on a particular runqueue, or for the system as a whole. Given two samples taken at times A and B, (28B - 28A)/(29B - 29A) gives the average time (in ms) that processes had to wait after being made runnable but before actually running.
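
As a rough illustration (and not part of the patch or of latency.c), the sketch below samples /proc/schedstat twice, five seconds apart, and applies the formula above to each cpuN line. The helper name read_schedstat() and the fixed limits are this sketch's own choices; the only assumption taken from the text above is the version 6 layout, in which fields 28 and 29 are the wait-time sum and the task count.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAXCPUS 64
#define NFIELDS 29

/* Read /proc/schedstat and record the 29 counters from each cpuN line.
 * Returns the number of cpu lines found, or -1 if the file is unreadable. */
static int read_schedstat(unsigned long long stats[][NFIELDS])
{
    FILE *fp = fopen("/proc/schedstat", "r");
    char line[1024];
    int ncpus = 0;

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp) && ncpus < MAXCPUS) {
        char *p;
        int cpu, i;

        if (sscanf(line, "cpu%d", &cpu) != 1)
            continue;               /* skip "totals" and anything else */
        p = strchr(line, ' ');
        for (i = 0; i < NFIELDS && p; i++)
            stats[ncpus][i] = strtoull(p, &p, 10);
        ncpus++;
    }
    fclose(fp);
    return ncpus;
}

int main(void)
{
    unsigned long long a[MAXCPUS][NFIELDS], b[MAXCPUS][NFIELDS];
    unsigned long long wait, count;
    int ncpus, cpu;

    ncpus = read_schedstat(a);
    if (ncpus <= 0)
        return 1;
    sleep(5);
    if (read_schedstat(b) != ncpus)
        return 1;

    for (cpu = 0; cpu < ncpus; cpu++) {
        /* fields 28 and 29 are indices 27 and 28 in the array */
        wait  = b[cpu][27] - a[cpu][27];
        count = b[cpu][28] - a[cpu][28];
        printf("cpu%d: avgwait=%.2fms\n", cpu,
               count ? (double)wait / count : 0.0);
    }
    return 0;
}

The "totals" line, and any other line that does not begin with cpuN, is simply skipped, so the same sketch should tolerate minor additions to the file.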


/proc/<pid>/stat

The patch also extends the stat output for individual processes to include the same information (obtainable from /proc/<pid>/stat). There, the three new fields described above are tacked onto the end of the line, but apply only to that process. The program latency.c, mentioned on the previous page, uses these extra fields to report how well a particular process is faring under the scheduler's policies. The example below applies that program to a single process, rather than sampling runqueue statistics from /proc/schedstat.

The observed program, loadtest, is a simple cpu-bound program, so it uses up most of its allotted timeslice without voluntarily pausing for I/O. Processes such as cc and bash may well pause for I/O and give up the cpu at times, and thus appear to have much smaller timeslices. It is important to remember, though, that avgrun only tells us how long, on average, we were on the cpu each time; it does not tell us what our allotted timeslice was.

% latency 25611
25611 (loadtest) avgrun=60.36ms avgwait=0.00ms
25611 (loadtest) avgrun=92.56ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.02ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.92ms avgwait=0.02ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
Process 25611 (loadtest) has exited.
%

Since the above test was done on an unloaded multiple-cpu machine, loadtest had a cpu essentially to itself, was granted roughly a 100ms timeslice, and used virtually all of it before giving up the cpu. Renicing loadtest shows dramatically how the timeslice changes with the priority of the process. Running N+1 loadtests, where N is the number of processors on the machine, introduces contention, and the avgwait field starts to rise significantly.
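
The actual latency.c is not reproduced here, but a minimal sketch in the same spirit might look like the following. It assumes only what is described above: that the three new fields are the last three numbers on the /proc/<pid>/stat line (run time in ms, wait time in ms, and the number of timeslices handed to the process), and it samples them every two seconds. The helper read_pid_stat() is this sketch's own.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Pull the three schedstat fields tacked onto the end of /proc/<pid>/stat:
 * run time (ms), wait time (ms), and number of timeslices.
 * Returns 0 on success, -1 if the file could not be read. */
static int read_pid_stat(const char *pid, unsigned long long *run,
                         unsigned long long *wait, unsigned long long *slices)
{
    unsigned long long last[3] = { 0, 0, 0 };
    char path[64], buf[4096];
    char *tok;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%s/stat", pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (!fgets(buf, sizeof(buf), fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* The new fields are the last three numbers on the line, so keep a
     * sliding window of the last three tokens seen. */
    for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        last[0] = last[1];
        last[1] = last[2];
        last[2] = strtoull(tok, NULL, 10);
    }
    *run = last[0];
    *wait = last[1];
    *slices = last[2];
    return 0;
}

int main(int argc, char **argv)
{
    unsigned long long r0, w0, s0, r1, w1, s1, ds;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    if (read_pid_stat(argv[1], &r0, &w0, &s0) < 0) {
        fprintf(stderr, "cannot read /proc/%s/stat\n", argv[1]);
        return 1;
    }
    for (;;) {
        sleep(2);
        if (read_pid_stat(argv[1], &r1, &w1, &s1) < 0) {
            printf("Process %s has exited.\n", argv[1]);
            break;
        }
        ds = s1 - s0;
        printf("%s avgrun=%.2fms avgwait=%.2fms\n", argv[1],
               ds ? (double)(r1 - r0) / ds : 0.0,
               ds ? (double)(w1 - w0) / ds : 0.0);
        r0 = r1;
        w0 = w1;
        s0 = s1;
    }
    return 0;
}

Walking a sliding window over the tokens avoids having to count fields explicitly, and keeps the sketch from breaking if the command name in the second field happens to contain spaces.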

Questions to ricklind@us.ibm.com