If you run any I/O with netCDF, PnetCDF or HDF5 on AWS’s Lustre file system (FSx), your program might crash with the following error:
File locking failed in ADIOI_Set_lock64(fd B,cmd F_SETLKW64/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
ADIOI_Set_lock64:: Function not implemented
ADIOI_Set_lock:offset 0, length 4006
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
In: PMI_Abort(1, application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0)
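The solution is “simple”, but not as straightforward as you might think. Despite what the error message suggests, it’s not about the noac option but about flock. Without flock, the Lustre client rejects the file lock requests that MPI-IO issues, hence the "Function not implemented" line. This mount option is missing by default in AWS’s ParallelCluster when hooked up to FSx; /etc/fstab on the compute nodes does not mount with this option.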
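Before remounting anything, it can help to confirm that flock really is the missing option on a node. Here is a minimal sketch, assuming the file system is mounted at /fsx (as in the remount script later in this post) and that the active options show up in /proc/mounts. The helper name has_mount_option is my own and not part of any AWS or Lustre tooling.

```shell
#!/bin/bash
# has_mount_option OPTION MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: succeeds (exit status 0) if OPTION is one of the
# comma-separated options of MOUNTPOINT, and fails otherwise.
# MOUNTS_FILE defaults to /proc/mounts; it is a parameter only so the
# function can be tried against a sample file.
has_mount_option() {
    local option="$1" mountpoint="$2" mounts_file="${3:-/proc/mounts}"
    awk -v opt="$option" -v mp="$mountpoint" '
        $2 == mp {
            # Field 4 holds the option list, e.g. "rw,noatime,flock".
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$mounts_file"
}
```

Usage on a node: has_mount_option flock /fsx && echo "flock is set" || echo "flock is missing". The comparison is on whole option names, so localflock is not mistaken for flock.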
We just need to remount the FSx partition with flock and all will be good. How can we do it for all compute nodes without tinkering with all the launch templates and giving ourselves more trouble than it’s worth?
Quick and highly dirty solution: Go to the master node’s /etc/fstab and find your AWS FSx partition’s domain name, e.g. fs-abcdef12345678.fsx.us-east-2.amazonaws.com (the placeholder used in the script below).
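If you would rather not read /etc/fstab by eye, the lookup can be sketched as below. This assumes the FSx entry uses the standard fstab layout, with the device in the first column and the type lustre in the third. Both function names are made up for this example.

```shell
#!/bin/bash
# lustre_devices [FSTAB_FILE]
# Hypothetical helper: prints the device field (first column) of every
# non-comment entry whose file system type is "lustre".
lustre_devices() {
    local fstab="${1:-/etc/fstab}"
    awk '$1 !~ /^#/ && $3 == "lustre" { print $1 }' "$fstab"
}

# fsx_domain DEVICE
# Hypothetical helper: strips everything from the first "@" onwards
# (e.g. "@tcp:/fsx"), leaving only the domain name.
fsx_domain() {
    printf '%s\n' "${1%%@*}"
}
```

Running lustre_devices on the master node prints the full device string that the mount command needs, and fsx_domain reduces it to the domain name.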
Create a script named remount_fsx.sh in /shared:
#!/bin/bash
sudo umount /fsx
sudo mount -t lustre -o noatime,flock fs-abcdef12345678.fsx.us-east-2.amazonaws.com@tcp:/fsx /fsx
Edit the above script to reflect the correct domain name. Save it and run chmod +x remount_fsx.sh.
Run this on all the compute nodes. A very hacky way is srun -N 12 /shared/remount_fsx.sh, where -N is the number of compute nodes (12 in my case). You will need to redo this if more compute nodes are added later.
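Since the script has to be rerun whenever nodes are added, a variant that skips nodes already mounted with flock is handy. The following is only a sketch under the same assumptions as above (mount point /fsx, options visible in /proc/mounts). The function name remount_with_flock and the DRY_RUN switch are my own additions, not something ParallelCluster provides.

```shell
#!/bin/bash
# remount_with_flock DEVICE MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: remounts MOUNTPOINT with noatime,flock unless flock
# is already among its mount options. Set DRY_RUN=1 to print the commands
# instead of running them. If MOUNTPOINT is not mounted at all, the umount
# fails and the mount is not attempted.
remount_with_flock() {
    local device="$1" mountpoint="$2" mounts_file="${3:-/proc/mounts}"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # Wrap the option list in commas so that only the exact option "flock"
    # matches (and not, say, "localflock").
    if awk -v mp="$mountpoint" '
            $2 == mp && index("," $4 ",", ",flock,") { found = 1 }
            END { exit(found ? 0 : 1) }
        ' "$mounts_file"; then
        echo "${HOSTNAME:-this node}: $mountpoint already mounted with flock"
        return 0
    fi

    $run sudo umount "$mountpoint" &&
        $run sudo mount -t lustre -o noatime,flock "$device" "$mountpoint"
}

# Example call for the end of /shared/remount_fsx.sh (left commented out here):
# remount_with_flock fs-abcdef12345678.fsx.us-east-2.amazonaws.com@tcp:/fsx /fsx
```

With that call enabled, the same srun command can be repeated after adding nodes, and only the nodes still missing flock are touched. Running it once with DRY_RUN=1 shows what would be executed.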
The right way to fix this: edit the compute node launch template. That would probably not fit on one page, and I know you just want to get things up and running.
This is a quick tip post. These posts are written to help users quickly resolve problems that I ran into in my daily work, and they usually don’t go in depth about the ‘why’. Please feel free to comment or contribute to the knowledge.