Troubleshooting Job Runs
There are several stages at which job creation or submission can go wrong.
During Job Creation:
Trouble creating a job: If you see inexplicable behavior while trying to create a job (for example, you try to select a tool and the app opens an irrelevant page), note that CIPRES behavior is unpredictable if you are submitting jobs through multiple tabs. The app detects and warns about this, but if you have pop-ups disabled you won't see the warning.
Submission to CIPRES:
Failure on submission is generally indicated by an error message at the top of the submission page, which appears inside a red box. The failure message should describe why submission to the CIPRES application failed.
Common reasons would be:
- There is not enough time in the account to run the job. If you have running jobs, you can wait to see if the time you need will be given back when your running jobs complete. If you don't have running jobs, you can either decrease the maximum run time requested or purchase a higher-level subscription. See this page for more information.
- You have exceeded the amount of storage space (150 GB) that you are allowed. The available space is noted above the folder list, in the upper left-hand corner of the screen. Try deleting some jobs or data to decrease the amount of data you have. While this adjustment should occur in a moment or two, some users report that they have to log out and log back in to reset the value listed for data storage.
- You tried to clone a job that used to run but won't now. This can happen if some data file for the job has been deleted from your account.
- Your job configuration has triggered a bug in the CIPRES interface. You might see this message: "Your job submission was unsuccessful, most likely because of a bug in the CIPRES interface. Please contact firstname.lastname@example.org for help." Many interfaces are complex, and some combinations of parameter choices may cause an illegal submission. Please let us know if this happens.
Submission to a compute cluster:
These failures will usually cause the job on the Tasks list page to be highlighted in red. Click the View Error button on the right-hand side. Failure on submission to a compute cluster generally means there is an issue at CIPRES; the fastest solution is to report the issue to us.
Common reasons would be:
- The compute cluster is down, or communication to the cluster has been interrupted. This can produce a number of possible messages.
- There is an issue with credentials or the scheduler. For example, you might see a message like this under View Error for a job that is highlighted in red on the Tasks list page:
  SUBMITTING : ERROR : NGBW-JOB-RAXMLHPC8_XSEDE-AA3271F48DEB44D5BFE2C1D6C81B8171 : java.lang.Exception: Error submitting job: Error submitting job, sbatch says: running: sbatch -L cipres:1 ./_batch_command.run 2>> ./_batch_command.status sbatch: error: bank_limit plugin: expired user, can't submit job sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
  This does not pertain to your account; it is a failure of CIPRES credentials. Contact us if you see such a message.
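If you save the text shown under View Error, the following minimal Python sketch may help you pick out the parts worth quoting when you contact us. It assumes the "STAGE : ERROR : detail" layout seen in the example message above, which is not a documented CIPRES format, and the function name parse_view_error is hypothetical.

```python
import re

# Example message copied from the View Error text shown above.
SAMPLE = (
    "SUBMITTING : ERROR : NGBW-JOB-RAXMLHPC8_XSEDE-AA3271F48DEB44D5BFE2C1D6C81B8171 : "
    "java.lang.Exception: Error submitting job: Error submitting job, sbatch says: "
    "running: sbatch -L cipres:1 ./_batch_command.run 2>> ./_batch_command.status "
    "sbatch: error: bank_limit plugin: expired user, can't submit job "
    "sbatch: error: Batch job submission failed: Invalid account or "
    "account/partition combination specified"
)


def parse_view_error(message: str) -> dict:
    """Split a View Error message into stage, level, detail, and sbatch errors.

    Assumption: the message follows the 'STAGE : LEVEL : detail' layout of the
    examples on this page. Messages in another layout are returned unparsed.
    """
    # Collapse line breaks and repeated spaces so pasted text parses the same way.
    text = " ".join(message.split())
    parts = text.split(" : ", 2)
    if len(parts) < 3:
        return {"stage": None, "level": None, "detail": text, "sbatch_errors": []}
    stage, level, detail = parts
    # Each scheduler complaint starts with 'sbatch: error: ' in the example.
    sbatch_errors = re.findall(r"sbatch: error: (.+?)(?= sbatch: error: |$)", detail)
    return {
        "stage": stage,
        "level": level,
        "detail": detail,
        "sbatch_errors": sbatch_errors,
    }


if __name__ == "__main__":
    parsed = parse_view_error(SAMPLE)
    print("Stage:", parsed["stage"])
    for error in parsed["sbatch_errors"]:
        print("Scheduler error:", error)
```

A stage of SUBMITTING together with scheduler errors like these points to a problem on the CIPRES side, so the useful action is to send us the full message.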
Premature Termination of a run:
A job ran for a while, but ended before the configured maximum wall time.
Common reasons would be:
- Your job ran out of memory: Look in the stderr.txt file for reports of a memory issue. Many code interfaces allow you to request more memory. If you need more than you are able to access, contact us. We can help.
- Your job produced a file greater than 8 GB in size. We have a hard-coded limit of 8 GB on files produced by CIPRES. If a file exceeds that size, the job will terminate.
- There was a machine or scheduler error on our side. If you can't figure out why the job ended, the cause may be a node failure or some other machine issue on our side. Please contact us for help and reimbursement.
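To check stderr.txt for memory problems, a short script along these lines can scan a downloaded copy of the file. This is only a sketch: the phrases in MEMORY_HINTS are assumptions about what memory failures commonly look like, not a list of messages that CIPRES tools are known to emit, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed phrases that often accompany memory failures; extend as needed.
MEMORY_HINTS = (
    "out of memory",
    "cannot allocate memory",
    "bad_alloc",
    "oom-kill",
    "memoryerror",
)


def find_memory_hints(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that mention a memory problem."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(hint in lowered for hint in MEMORY_HINTS):
            hits.append((number, line.strip()))
    return hits


def scan_stderr(path: str = "stderr.txt") -> list[tuple[int, str]]:
    """Read a downloaded stderr.txt and report lines that suggest a memory issue."""
    text = Path(path).read_text(errors="replace")
    return find_memory_hints(text)


if __name__ == "__main__":
    if Path("stderr.txt").exists():
        for number, line in scan_stderr():
            print(f"line {number}: {line}")
```

If the scan finds nothing and the job still ended early, the file size limit or a machine issue on our side is the more likely cause.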
At the end of a job run, a job might be highlighted in red, with a View Error button to the right. If you click it, you might see a message like this:
LOAD_RESULTS : ERROR : Error retrieving results: ****ALERT****, one or more results files are too large for us to return through the browser interface. Your results will be posted automatically at https://object.cloud.sdsc.edu/v1/AUTH_cipres/test5mark/ by a script that runs at 6 am, 12 am, 6 pm, and 12 pm Pacific Time. You can contact email@example.com to request the results be made available earlier, or for assistance in retrieving your results.
This happens because the result files are (singly or in the aggregate) more than 4 GB in size. The results will appear within an hour at the link provided in the message, which is personalized to your account. If they don't appear, please contact us.
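If you have the output of an earlier, similar run on your own machine, you can estimate whether a new run is likely to hit the size limits described on this page. The sketch below compares file sizes against the 8 GB per-file limit and the 4 GB browser-return limit. It assumes 1 GB means 10^9 bytes, which this page does not specify, and all names in it are hypothetical.

```python
from pathlib import Path

GB = 10**9  # assumption: decimal gigabytes; the page does not define the unit
MAX_FILE_BYTES = 8 * GB      # per-file limit that terminates a job
MAX_BROWSER_BYTES = 4 * GB   # total size above which results are posted at a link


def collect_sizes(directory: str) -> dict[str, int]:
    """Map each file under a directory to its size in bytes."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def check_sizes(sizes: dict[str, int]) -> dict:
    """Compare file sizes against the two limits described on this page."""
    oversized = sorted(name for name, size in sizes.items() if size > MAX_FILE_BYTES)
    total = sum(sizes.values())
    return {
        "oversized_files": oversized,
        "total_bytes": total,
        # A single file over 4 GB also pushes the total over 4 GB.
        "too_large_for_browser": total > MAX_BROWSER_BYTES,
    }


if __name__ == "__main__":
    report = check_sizes(collect_sizes("."))
    print(report)
```

For example, check_sizes({"a.tre": 3 * GB, "b.log": 2 * GB}) reports no oversized files but flags the 5 GB total as too large to return through the browser.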