Linux Scheduler Statistics
/proc/schedstat format and /proc/<pid>/stat format changes version 4


If you have scripts for version 3, note that the only difference with version 4 is three new fields appended to the end. Version 3 scripts should need little or no porting since no previous fields have moved or changed meaning.

Format for version 4 of schedstat:

tag 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
tag is cpuN or totals.

NOTE: In the sched_yield() statistics, the active queue is considered empty if it has only one process in it, since obviously the process calling sched_yield() is that process.

First four are sched_yield() statistics:

  1. # of times both the active and the expired queue were empty
  2. # of times just the active queue was empty
  3. # of times just the expired queue was empty
  4. # of times sched_yield() was called

Next three are schedule() statistics:

  5. # of times the active queue had at least one other process on it
  6. # of times we switched to the expired queue and reused it
  7. # of times schedule() was called

Next seven are statistics dealing with load_balance() (requires CONFIG_SMP):

  8. # of times load_balance() was called at an idle tick
  9. # of times load_balance() was called at a busy tick
 10. # of times load_balance() was called from schedule()
 11. # of times load_balance() was called
 12. sum of imbalances discovered (if any) with each call to load_balance()
 13. # of times load_balance() was called when we did not find a "busiest" queue
 14. # of times load_balance() was called from balance_node() (requires CONFIG_NUMA)

Next four are statistics dealing with pull_task() (requires CONFIG_SMP):

 15. # of times pull_task() moved a task to this cpu
 16. # of times pull_task() stole a task from this cpu
 17. # of times pull_task() moved a task to this cpu from another node (requires CONFIG_NUMA)
 18. # of times pull_task() stole a task from this cpu for another node (requires CONFIG_NUMA)

Next two are statistics dealing with balance_node() (requires CONFIG_SMP and CONFIG_NUMA):

 19. # of times balance_node() was called
 20. # of times balance_node() was called at an idle tick

Last three are statistics dealing with scheduling latency:

 21. sum of all time spent running by tasks on this processor (in ms)
 22. sum of all time spent waiting by tasks for this processor (in ms)
 23. # of tasks (not necessarily unique) given to the processor

The last three make it possible to find the average scheduling latency on a particular runqueue or on the system as a whole. Given the values of fields 22 and 23 at two points in time, A and B, (22B - 22A)/(23B - 23A) gives the average time processes had to wait after being scheduled to run but before actually running.
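
As a rough sketch of that calculation (this is not the latency.c program mentioned below), the following program samples /proc/schedstat twice and reports the average wait per cpu. It assumes the version 4 field layout described above, parses only the "cpuN" lines (skipping "totals"), and arbitrarily caps the number of cpus at 64.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 64   /* arbitrary limit for this sketch */

struct sample {
    unsigned long long wait[MAX_CPUS];   /* field 22: total wait time (ms) */
    unsigned long long count[MAX_CPUS];  /* field 23: tasks given to the cpu */
    int ncpus;
};

static int read_schedstat(struct sample *s)
{
    char line[1024];
    FILE *fp = fopen("/proc/schedstat", "r");

    if (!fp)
        return -1;
    s->ncpus = 0;
    while (fgets(line, sizeof(line), fp) && s->ncpus < MAX_CPUS) {
        unsigned long long f[23];
        char *tok = strtok(line, " \t\n");
        int i;

        /* only per-cpu lines; skip "totals" or anything else */
        if (!tok || strncmp(tok, "cpu", 3) != 0)
            continue;
        for (i = 0; i < 23 && (tok = strtok(NULL, " \t\n")); i++)
            f[i] = strtoull(tok, NULL, 10);
        if (i < 23)
            continue;                    /* short line; not version 4? */
        s->wait[s->ncpus]  = f[21];      /* field 22 */
        s->count[s->ncpus] = f[22];      /* field 23 */
        s->ncpus++;
    }
    fclose(fp);
    return 0;
}

int main(void)
{
    struct sample a, b;
    int cpu;

    if (read_schedstat(&a) < 0)
        return 1;
    sleep(5);                            /* interval between samples A and B */
    if (read_schedstat(&b) < 0)
        return 1;

    for (cpu = 0; cpu < b.ncpus; cpu++) {
        unsigned long long dwait  = b.wait[cpu]  - a.wait[cpu];
        unsigned long long dtasks = b.count[cpu] - a.count[cpu];

        printf("cpu%d: avgwait=%.2fms\n", cpu,
               dtasks ? (double)dwait / dtasks : 0.0);
    }
    return 0;
}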


/proc/<pid>/stat

This version of the patch also modifies the stat output of individual processes (obtainable from /proc/<pid>/stat) to include the same information. There, the above three new fields are tacked onto the end, but apply only to that process. The program latency.c, mentioned on the previous page, makes use of these extra fields to report on how well a particular process is faring under the scheduler's policies. The example below uses that program on a single process rather than sampling the runqueue statistics in /proc/schedstat.

The observed program, loadtest, is a simple cpu-intensive program and thus uses up most of its allotted timeslice without voluntarily pausing for I/O. Processes such as cc and bash may well pause for I/O and give up the cpu at times, and thus appear to have much smaller timeslices. But it is important to remember that avgrun only tells us, on average, how long the process was on the cpu each time it ran, not what its granted timeslice was.

% latency 25611
25611 (loadtest) avgrun=60.36ms avgwait=0.00ms
25611 (loadtest) avgrun=92.56ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.02ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.94ms avgwait=0.00ms
25611 (loadtest) avgrun=99.92ms avgwait=0.02ms
25611 (loadtest) avgrun=99.96ms avgwait=0.00ms
Process 25611 (loadtest) has exited.
%

Since the above test was done on an unloaded, multiple-cpu machine, loadtest pretty much had a cpu to itself, was granted about a 100ms timeslice, and used virtually all of it before giving up the cpu. Renicing loadtest shows dramatically how the timeslice changes with the priority of the process. Running N+1 loadtests, where N is the number of processors on the machine, introduces contention, and the avgwait field climbs significantly.
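
The following is a rough sketch of the same kind of measurement (again, not the actual latency.c program). It assumes the three appended fields appear in the same order as fields 21-23 above (total run time in ms, total wait time in ms, and number of timeslices), so it simply takes the last three numbers on the process's stat line and reports per-interval averages.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Fill v[] with the last three numbers on the pid's stat line. */
static int read_pid_stat(const char *pid, unsigned long long v[3])
{
    char path[64], buf[4096];
    char *tok, *last[3] = { NULL, NULL, NULL };
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%s/stat", pid);
    fp = fopen(path, "r");
    if (!fp || !fgets(buf, sizeof(buf), fp)) {
        if (fp)
            fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Walk all tokens, remembering the final three; taking the last
       three is safe even though the command name may contain spaces. */
    for (tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n")) {
        last[0] = last[1];
        last[1] = last[2];
        last[2] = tok;
    }
    if (!last[0])
        return -1;
    v[0] = strtoull(last[0], NULL, 10);   /* total run time (ms)  */
    v[1] = strtoull(last[1], NULL, 10);   /* total wait time (ms) */
    v[2] = strtoull(last[2], NULL, 10);   /* # of timeslices      */
    return 0;
}

int main(int argc, char **argv)
{
    unsigned long long a[3], b[3];

    if (argc != 2 || read_pid_stat(argv[1], a) < 0)
        return 1;
    for (;;) {
        unsigned long long dslices;

        sleep(2);                         /* sampling interval */
        if (read_pid_stat(argv[1], b) < 0) {
            printf("Process %s has exited.\n", argv[1]);
            return 0;
        }
        dslices = b[2] - a[2];
        printf("%s avgrun=%.2fms avgwait=%.2fms\n", argv[1],
               dslices ? (double)(b[0] - a[0]) / dslices : 0.0,
               dslices ? (double)(b[1] - a[1]) / dslices : 0.0);
        memcpy(a, b, sizeof(a));
    }
}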

Questions to ricklind@us.ibm.com