Batch Queue Predictor: FAQ
Q:
What information does this webpage provide?
A:
This webpage provides an interface which allows users to
gather information about the delay they can expect their jobs to
experience if submitted into batch queues on various machines.
Instead of providing point-value job wait time predictions (such as
the mean), we supply predictions which can answer questions of the
form, "If I submit my job to machine M in queue Q, what is the longest
it will wait before executing, P percent of the time?". We find
that in general, these types of predictions are more useful to users
because they support inferences about individual jobs, whereas
predictions of the mean wait time, for instance, are more appropriate
for making inferences about groups of jobs.
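
As a rough illustration of why a mean says little about an individual
job, the sketch below compares the mean of a small wait-time history
with its median and 75% quantile. The wait times are made up for the
example and the helper name empirical_quantile is hypothetical; none
of this is taken from the system behind this webpage.

```python
import math
import statistics

# Hypothetical wait times in seconds: most jobs start quickly,
# but two of them sit in the queue for hours.
waits = [30, 45, 60, 60, 90, 120, 150, 200, 7200, 14400]

def empirical_quantile(samples, q):
    """Smallest sample v such that at least a fraction q of samples are <= v."""
    ordered = sorted(samples)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]

mean_wait = statistics.mean(waits)
median_wait = empirical_quantile(waits, 0.50)
q75_wait = empirical_quantile(waits, 0.75)
jobs_below_mean = sum(1 for w in waits if w < mean_wait)

print(mean_wait, median_wait, q75_wait, jobs_below_mean)
```

In this made-up history the two long waits pull the mean far above
what most individual jobs experienced, while the quantiles describe
what a single job is likely to see.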
Q:
What does the "Predicted Wait Times" data mean?
A:
Our predictions are parameterized by a quantile and a confidence. The quantile is the fraction of all jobs submitted to a particular batch queue, within a specific node range, whose wait falls below the predicted bound. The confidence expresses how sure we are that the bound holds. For example, if we choose the .75 quantile and .1 confidence, and the prediction is 300 seconds, we can say with a confidence of 10% that 75% of jobs submitted to this queue and node range will take less than 300 seconds to exit the queue.
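
One standard way to produce a bound with this kind of quantile and
confidence is a nonparametric, binomial upper bound taken from a
history of observed wait times. The sketch below shows that idea under
loud assumptions: the function name quantile_upper_bound is
hypothetical, the historical waits are treated as independent samples
from one continuous distribution, and nothing in this FAQ states that
the webpage computes its predictions this way.

```python
from math import comb

def quantile_upper_bound(waits, quantile, confidence):
    """Return an observed wait time that is at least as large as the true
    `quantile` of the wait-time distribution with probability at least
    `confidence`, or None if the history is too short for such a bound.

    Assumption (illustrative only): waits are independent draws from a
    single continuous distribution.
    """
    ordered = sorted(waits)
    n = len(ordered)
    cumulative = 0.0
    for k in range(1, n + 1):
        # Add P(exactly k-1 of the n observations fall below the true quantile).
        cumulative += (
            comb(n, k - 1) * quantile ** (k - 1) * (1 - quantile) ** (n - k + 1)
        )
        # cumulative is now P(at most k-1 observations fall below the true
        # quantile), which is the chance that the k-th smallest observation
        # is at or above it.
        if cumulative >= confidence:
            return ordered[k - 1]
    return None

# Hypothetical history: 100 jobs that waited 1, 2, ..., 100 seconds.
history = list(range(1, 101))
low_confidence_bound = quantile_upper_bound(history, 0.75, 0.1)
high_confidence_bound = quantile_upper_bound(history, 0.75, 0.95)
```

Asking for more confidence at the same quantile can only move the
bound upward, which is why a bound quoted at .1 confidence is shorter
than the one quoted at .95 for the same history.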
Q:
I have a certain machine that I submit jobs to. What can this webpage tell me?
A:
If a user has a certain machine which they use to submit their
jobs, the user should select that machine from the dropdown list of
machines at the top of the homepage. This selection brings the user
to a new page which shows an overview of upper bound wait times for
jobs of various node request sizes within each queue on the selected
machine. There is a table showing, for each automatically determined
requested node range, circles which indicate upper bound wait time
ranges. For instance, the table may indicate that a job requesting
between one and four nodes will currently wait at most '4 to 10 hours'
if submitted to the shown queue.
Q:
I have a certain machine AND queue that I submit jobs to. What can this webpage tell me?
A:
If a user knows which machine and queue they will be submitting
their jobs to, the user should first select that machine from the
dropdown list of machines at the top of the homepage, and then click
on the desired queue from the overview table on the next page. Doing
so brings the user to a new page which shows current and past bound
predictions in the form of historical prediction graphs. There is one
graph for each range of nodes that a job might request, which shows
time along the X axis and bound delay predictions along the Y axis.
These graphs show how the bound predictions (95%, 75% and 50%
(median)) have been changing in recent history. In addition, to the
right of each graph, current 95%, 75%, and median predictions are
displayed in terms of seconds. For example, if the 95% prediction
displayed is '45 seconds' for jobs of size 1-4 nodes, the user knows
that a job requesting between one and four nodes, submitted to this
machine/queue right now, would wait in the queue no more than 45
seconds with 95% certainty.
Q:
Can I see the actual wait times that jobs are experiencing for a certain machine/queue/node range tuple?
A:
Yes! From the detailed machine/queue graph page (first select
the machine of interest from the homepage, then click the desired
queue from the color table), a user can click the '[more]' link above
each graph to obtain a graph showing time along the X axis, and queue
delay experienced by real jobs on the Y axis for a fairly long
history. These graphs can show how job delay is changing (or not) in
recent history.
Q:
Can I find out the probability that my job will wait for at most a specified amount of time?
A:
Yes. We call this feature 'inverted batch queue prediction' since,
instead of specifying a probability (95%, 75%, median) and looking at
the number of seconds which corresponds to that value, you specify a
number of seconds you are willing to tolerate, and we tell you the
probability of your job waiting at most that number of seconds. To
obtain this information, the user starts at the homepage and clicks on
'See Your Chances'. This brings the user to a form in which they
select the machine, queue, requested node number, requested max
walltime, and maximum start deadline. When the form is filled out and
'submit' is clicked, the system will display the current probability
that a job of the specified parameters will wait 'deadline' seconds or
less.
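
To make the inverted question concrete, the sketch below estimates the
chance of starting by a deadline as the fraction of past jobs that
waited that long or less. This is only a plain empirical estimate with
no confidence attached. The function name chance_of_starting_by and
the wait times are hypothetical, the history is assumed to be already
restricted to the machine, queue, node count, and walltime of
interest, and the method the webpage actually uses is not described in
this FAQ.

```python
def chance_of_starting_by(waits, deadline_seconds):
    """Fraction of historical jobs that waited `deadline_seconds` or less."""
    if not waits:
        raise ValueError("need at least one historical wait time")
    on_time = sum(1 for w in waits if w <= deadline_seconds)
    return on_time / len(waits)

# Hypothetical history of wait times in seconds for one queue and node range.
history = [30, 45, 60, 60, 90, 120, 150, 200, 7200, 14400]
chance = chance_of_starting_by(history, 300)
print(f"Chance of waiting 300 seconds or less: {chance:.0%}")
```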
Q:
My question is not answered here. What do I do?
A:
Please send email any time to nurmi@cs.ucsb.edu. We'll be
happy to work with you on using the web page, whether it be for simple
curiosity or for integration into your existing research projects!