
Troubleshooting

Troubleshooting is generally carried out with the assistance of Altair Support.

Checking status of Altair SLC Hub services

Note

In a Windows environment, the hubctl command must be run from a command prompt that has been started 'as Administrator'.

To check that the Altair SLC Hub services are running, use the hubctl command:

hubctl service status

This should print a summary table of all of the Altair SLC Hub services and whether they are active (running).

If any are marked as inactive, try restarting them:

hubctl service start <name>

If they are still inactive, or marked as failed, then view the logs of the service.

Viewing service logs

The logs from the services are captured by systemd/journald, and they can most easily be accessed using the hubctl log command. For more details see Logging.

Note

In most scenarios the logs can be retrieved by the hubctl log command. However, if a service fails unexpectedly, systemd can fail to associate the final log messages with the relevant service. In this case it is necessary to use journalctl to view the systemd output.
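
As a rough illustration of the journalctl fallback described in the note, the sketch below wraps journalctl in a small helper that prints recent journal entries, optionally restricted to one unit. The helper name recent_journal and the unit name example.service are hypothetical; the unit names used by an Altair SLC Hub installation are not given here. Omitting the unit argument searches the whole journal, which is useful when the final messages were not associated with the service.

```shell
# Print journal entries newer than a given time, optionally for one unit.
# recent_journal is a hypothetical helper name, not part of hubctl.
recent_journal() {
    since="$1"      # time expression understood by journalctl, e.g. "10 minutes ago"
    unit="${2:-}"   # optional systemd unit name
    if [ -n "$unit" ]; then
        journalctl --no-pager --since "$since" -u "$unit"
    else
        # No unit filter: show everything logged since the given time.
        journalctl --no-pager --since "$since"
    fi
}

# Usage (example.service is a placeholder unit name):
#   recent_journal "10 minutes ago"
#   recent_journal "10 minutes ago" example.service
```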

Log viewing tips on Linux

By default, hubctl log pipes the log entries through a pager. Redirecting the output to a file and then opening the file in an editor such as vi can be a useful alternative way of viewing the logs.

Rather than limiting the display to a fixed number of entries, the output can be limited based on the timestamp of the log record. To return the log entries for all Altair SLC Hub services that have happened in the last 5 minutes, use the command:

hubctl log --since -5m
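
The two tips above can be combined. The following is a minimal sketch, assuming hubctl log writes plain text when its output is redirected to a file; the helper name save_recent_hub_logs and the output path are illustrative only.

```shell
# Save the Hub log entries from a recent time window to a file.
# save_recent_hub_logs is a hypothetical helper name.
save_recent_hub_logs() {
    window="$1"    # relative time as accepted by --since, e.g. -5m
    outfile="$2"   # file to write the log entries to
    hubctl log --since "$window" > "$outfile"
}

# Usage (the path is a placeholder):
#   save_recent_hub_logs -5m /tmp/hub-recent.log && vi /tmp/hub-recent.log
```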

On Windows, the logs from the services are located in [var directory]\log. Your user must be in the 'Administrators' group on the server. You can then give yourself access to this folder by navigating to it in Windows Explorer.

Missing Nomad logs on worker nodes

Nomad has a garbage collector which, by default, deletes Nomad log files when disk space usage exceeds 80%.

This can lead to Nomad deleting log files as soon as a task completes, making it extremely difficult to diagnose the reason for a task failure.

This is unlikely to occur in a production environment.

If it does occur and there is an urgent need to diagnose a task failure, as a short-term measure add a file named 90-gc-config.hcl to the [etc directory]/nomad.d directory of the Altair SLC Hub installation with this content:

client {
  gc_disk_usage_threshold = 99
}

Then restart Nomad before continuing:

hubctl service restart nomad

A proper remedy is to increase the disk space available, for example, on Linux, by putting the [var directory]/nomad directory of the Altair SLC Hub installation on its own volume.
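
To check whether a worker node is near the 80% threshold mentioned above, disk usage can be read with df. This is a minimal sketch for Linux; the helper name disk_usage_percent is hypothetical, and the path in the usage lines must be replaced with the real location of [var directory]. It assumes the filesystem name reported by df contains no spaces.

```shell
# Print the used-capacity percentage (an integer) of the filesystem holding a path.
# disk_usage_percent is a hypothetical helper name.
disk_usage_percent() {
    # df -P produces one line per filesystem; column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (substitute the real location for [var directory]):
#   used=$(disk_usage_percent "[var directory]/nomad")
#   [ "$used" -gt 80 ] && echo "Disk usage is above the default 80% threshold"
```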

Jobs remain pending with "Dimension 'disk' exhausted"

This error indicates that Nomad cannot schedule the job because it believes there is insufficient disk capacity available.

Error message: Dimension 'disk' exhausted on X nodes

Jobs remain in a Pending state and hubcli job status reports a message similar to the following:

State    Pending
Reason   Resources exhausted on 2 nodes
         Dimension 'disk' exhausted on 2 nodes

This issue can occur if Nomad was started while the disk was already full.

Nomad determines the available disk capacity when the Nomad agent starts and stores this value. It does not monitor disk capacity changes while running. As a result, if disk space is later freed (for example by cleaning up files) or the disk is resized, Nomad does not automatically detect the change.

The updated disk state is only recognized after restarting Nomad.

To resolve this, restart Nomad so that it re-evaluates the available disk capacity:

hubctl service restart nomad

hubctl bootstrap fails with "Child process 'cmd (xxxx)' finished with code -1073741515" on Windows

This error indicates that a required DLL was not found (STATUS_DLL_NOT_FOUND). This usually happens when Microsoft Visual C++ Redistributable is not installed, or when OpenSSL 1.1.1 is not available on the system.

See Initial Windows Installation.
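
The link between the exit code and STATUS_DLL_NOT_FOUND is easier to see in hexadecimal. The sketch below reinterprets the signed 32-bit exit code as unsigned, which gives the value 0xC0000135 documented for STATUS_DLL_NOT_FOUND. It can be run in any POSIX shell, not necessarily on the affected Windows machine; the helper name to_hex32 is illustrative.

```shell
# Convert a signed 32-bit exit code to its unsigned hexadecimal form.
# to_hex32 is a hypothetical helper name.
to_hex32() {
    # Masking with 0xFFFFFFFF keeps the low 32 bits of the value.
    printf '0x%08X\n' "$(( $1 & 0xFFFFFFFF ))"
}

to_hex32 -1073741515   # prints 0xC0000135
```

The same conversion can be applied to other negative exit codes reported by hubctl bootstrap to look them up as NTSTATUS values.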

Nomad fails to start with "Failed to resolve Serf advertise address"

Nomad fails to start because it cannot determine which IP address to use for cluster communication.

Error message: Failed to resolve Serf advertise address ":4648"

This can occur on Windows systems, including disaster recovery (DR) environments, where Nomad cannot automatically detect the correct IP address to use because it can’t find the default private network.

To resolve this, configure Nomad to use the machine's fixed IP address explicitly. Add a file to the [etc directory]/nomad.d directory of the Altair SLC Hub installation with the following content:

addresses {
  http = "<ip>"
  rpc  = "<ip>"
  serf = "<ip>"
}

Replace <ip> with the fixed IP address of the machine. Then restart Nomad:

hubctl service restart nomad

Note

This workaround requires the machine to have a fixed IP address. If the machine's IP address changes, update the configuration accordingly.

For more information on Nomad address configuration, see the Nomad configuration reference.