Troubleshooting Failed Job Runs

There are several stages at which a job creation/submission can go wrong.

During Job Creation:

Trouble creating a job: If you see inexplicable behavior while trying to create a job (for example, you select a tool and the app opens an irrelevant page), please note that CIPRES behavior is unpredictable if you submit jobs through multiple browser tabs. The app detects this and warns you about it, but if you have pop-ups disabled you won't see the warning.

Submission to CIPRES:

Failure on submission is generally reported by an error message at the top of the submission page, which appears inside a red box. The message should describe why submission to the CIPRES application failed.

Common reasons would be:

Submission to a compute cluster:

These failures will usually cause the job in the Tasks list page to be highlighted in red. Click the View Error button on the right-hand side to see the error. Failure on submission to a compute cluster generally means there is an issue at CIPRES; the fastest solution is to report the issue to us.
Common reasons would be:

Failure to Start

You can follow the progress of your job by viewing intermediate results, and you are encouraged to do so for long runs. To view intermediate results for a running job, click the "View Status" button. When that page opens, you will see an "Intermediate results" hyperlink. Click it, and a pop-up will appear that displays all input files and any results for the selected job. The intermediate results pane shows input files and scheduling information files immediately, so you can make sure everything looks fine before the job begins. When the job starts (normally within a few minutes), more files will appear, beginning with start.txt. The start.txt file records the time when the job began executing.

How do I know if my job has started? If the job has not started, the intermediate results pane shows only input files and scheduling files. This means your job is still waiting in the queue. Jobs usually (but not always) start within 15 minutes. You can keep the intermediate results pane open and refresh it from time to time to see when the intermediate files appear. The first new file to appear in a running job is named start.txt.
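The check described above can be sketched in Python. This is only an illustration, assuming you have copied the file names shown in the intermediate results pane into a list by hand. The function name job_has_started and the example file names other than start.txt are hypothetical and are not part of CIPRES.

```python
def job_has_started(file_names):
    """Return True if start.txt appears among the intermediate result files.

    file_names is assumed to be a list of names copied by hand from the
    intermediate results pane; CIPRES itself does not provide this function.
    """
    return any(name.strip() == "start.txt" for name in file_names)


# Hypothetical listings: first only input and scheduling files (job queued),
# then the same listing after start.txt has appeared (job running).
queued = ["infile.phy", "scheduler.conf"]
running = queued + ["start.txt"]

print(job_has_started(queued))   # False
print(job_has_started(running))  # True
```

If the function returns False for longer than the usual wait, the queue-time advice in the next paragraph applies.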

What if the queue time seems long? If the queue time seems long, there are two possible reasons. First, and most likely, the queue may be paused because of scheduled or unscheduled maintenance. Typically, when maintenance occurs, we post a message on the front page: the (normally green) status box turns red and shows an informative message. You can also check for scheduled maintenance events, or just contact us at the link below if you aren't sure. Any job that might still be running when a scheduled maintenance period begins will be held in the queue until maintenance is complete. Once maintenance completes, the job should start within an hour or so. If it does not, please contact us at the link below. Occasionally (but rarely) a job does not come out of maintenance cleanly, so it is possible that your job will hang. In that case, please contact us.

Second, and less likely, the queue might actually be long, and it could take hours for the job to start. This is quite rare, however. If you have questions, please contact us.

Premature Termination of a run:

A job ran for a while, but ended before the configured maximum wall time.
Common reasons would be:

Results return:

At the end of a job run, a job might be highlighted in red, with a View Error button to the right. If you click the button, you might see a message like this:

LOAD_RESULTS : ERROR : Error retrieving results: ****ALERT****, one or more results files are too large for us to return through the browser interface. Your results will be posted automatically at https://object.cloud.sdsc.edu/v1/AUTH_cipres/test5mark/ by a script that runs at hourly. You can contact cipresadmin@sdsc.edu for assistance in retrieving your results.

This happens because the result files are (singly or in aggregate) more than 4 GB in size. The results will appear within an hour at the link provided in the message, which is personalized to your account. If they don't appear, please contact us.
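The size rule above can be illustrated with a short Python sketch. It assumes that "4 GB" means 4 * 1024**3 bytes, which the text does not specify. The names LIMIT_BYTES and too_large_for_browser are hypothetical, and this is not the code CIPRES actually runs.

```python
# Assumption: "4 GB" is read as 4 GiB; the exact threshold is not documented here.
LIMIT_BYTES = 4 * 1024**3


def too_large_for_browser(sizes_in_bytes, limit=LIMIT_BYTES):
    """Return True if any single file, or all files together, exceed the limit.

    sizes_in_bytes is an iterable of non-negative file sizes in bytes.
    """
    sizes = list(sizes_in_bytes)
    # The single-file test is implied by the total for non-negative sizes,
    # but both are spelled out to mirror "singly or in aggregate".
    return any(size > limit for size in sizes) or sum(sizes) > limit


# One 5 GiB file exceeds the limit on its own.
print(too_large_for_browser([5 * 1024**3]))               # True
# Two files under the limit individually still exceed it together.
print(too_large_for_browser([3 * 1024**3, 2 * 1024**3]))  # True
# Small result sets are returned through the browser as usual.
print(too_large_for_browser([1024, 2048]))                # False
```

The second case shows why a job can trigger the alert even when no individual result file looks unusually large.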

...