[hackathon/real life] non-cvmfs version of pilot does not run at RAL-LCG2

During the hackathon, pilot jobs at RAL-LCG2 kept failing. I was not able to retrieve the logs of the failed jobs, but from the running jobs I managed to retrieve the following excerpts:
pilot.log
```
Linking glibmm-2.4-2.66.3-h87e66e5_0
Linking gdk-pixbuf-2.42.12-hb9ae30d_0

error    libmamba response code: -1 error message: Invalid argument
critical libmamba failed to execute pre/post link script for gdk-pixbuf

2024-06-06T12:52:35.119845Z DEBUG [InstallDIRAC] Return code of bash DIRACOS-Linux-x86_64.sh 2>&1: 1
2024-06-06T12:52:35.120660Z ERROR [InstallDIRAC] Could not install DIRACOS [ERROR 1]
2024-06-06T12:52:35.120768Z INFO [InstallDIRAC] Content of pilot.cfg

```
pilot.error
```
https://lbcertifdirac70.cern.ch unreacheable (this is normal!)


Traceback (most recent call last):
  File "dirac-pilot.py", line 115, in <module>
    command.execute()
  File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 81, in wrapper
    return func(self)
  File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 427, in execute
    self._localInstallDIRAC()
  File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotCommands.py", line 330, in _localInstallDIRAC
    self.exitWithError(retCode)
  File "/pool/condor/dir_3079989/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/DIRAC_zNnoXIpilot/pilotTools.py", line 797, in exitWithError
    with open("pilot.cfg") as f:
IOError: [Errno 2] No such file or directory: 'pilot.cfg'
```
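Side note on the traceback above: the DIRACOS install failure is the real error, but it is followed by a second one, because `exitWithError` opens `pilot.cfg` before that file exists, and the resulting `IOError` is what ends up at the bottom of pilot.error. A minimal sketch of a guard is below. It is not the actual pilot code: the helper name `read_cfg_for_error_report` and the placeholder text are hypothetical, and it only assumes that the file content is wanted for logging.

```python
def read_cfg_for_error_report(path="pilot.cfg"):
    """Return the content of `path`, or a placeholder note if it cannot be read.

    Hypothetical helper, not part of the pilot: it keeps a missing
    pilot.cfg from raising while an earlier error is being reported.
    """
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError) as exc:
        # IOError is what Python 2 raises here (as in the traceback);
        # on Python 3 it is an alias of OSError.
        return "<%s not available: %s>" % (path, exc)
```

With something like this, the error path would log the placeholder and then exit with the original return code, so the libmamba failure stays the last thing reported.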
```
Owner = "dteam077"
ActivationDuration = 187
Cmd = "/var/spool/arc/grid08/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm/condorjob.sh"
User = "dteam077@gridpp.rl.ac.uk"
LastMatchTime = 1717678179
StreamOut = false
JobPrio = 0
CumulativeRemoteUserCpu = 0.0
JobStartDate = 1717678179
MachineRalScaling = "$$([ifThenElse(isUndefined(RalScaling), ifThenElse(isUndefined(ScalingFactor), 1.00, ScalingFactor), RalScaling)])"
TargetType = "Machine"
LastPublicClaimId = "<130.246.219.48:9618?addrs=130.246.219.48-9618+[2001-630-54-10-82f6-db30--]-9618&alias=lcg2641.gridpp.rl.ac.uk&noUDP&sock=startd_4774_2538>#1715586785#9783#..."
TransferInputStats = [ CedarFilesCountTotal = 999; CedarFilesCountLastRun = 999 ]
OnExitRemove = true
RalAcctGroup = "group_DTEAM_OPS"
JobCurrentFinishTransferInputDate = 1717678179
OriginalTransferInput = "/var/spool/arc/grid08/CtXNDmGgRZ5nzEJDjqAzJLkqISAWgmABFKDmK2FKDmi0DNDm5RX1Dm"
scan-condor-job: ----- end condor history message -----
scan-condor-job: ----- Information extracted from condor_history -----
scan-condor-job: LastRemoteHost=slot1_36@lcg2641.gridpp.rl.ac.uk
scan-condor-job: RemoteWallClockTime=187
scan-condor-job: RemoteUserCpu=0
scan-condor-job: RemoteSysCpu=0
scan-condor-job: ImageSize=40
scan-condor-job: ExitCode=1
scan-condor-job: ExitStatus=0
scan-condor-job: JobStatus=4
scan-condor-job: JobCurrentStartDate=1717678179
scan-condor-job: EnteredCurrentStatus=1717678366
scan-condor-job: RequestCpus=8
scan-condor-job: -----------------------------------------------------
scan-condor-job: LRMSStartTime=20240606124939Z
scan-condor-job: LRMSEndTime=20240606125246Z
scan-condor-job: Job failed with exit code 1
2024-06-06T12:53:18Z Job state change INLRMS -> FINISHING   Reason: Job failure detected
2024-06-06T12:53:18Z Job state change FINISHING -> FINISHED   Reason: Job failure detected
```

We've seen the same issue on our production instance, and we are working around it by taking the pilot from CVMFS.
Simon thinks this might be related to:
https://github.com/mamba-org/mamba/issues/2501
Note that this behaviour produces several hundred jobs per hour that then fail, and that this is how my DN got banned at RAL before. (Hence killing all user jobs targeting RAL before leaving the hackathon is a necessity.)

[hackathon/real life] non-cvmfs version of pilot does not run at RAL-LCG2 #7657
