The Multicluster.jl Package

Overview

Multicluster.jl is a package for deploying multicluster computing systems in Julia, built on top of the multilevel version of Distributed.jl and MPIClusterManagers.jl. It introduces new functions and methods¹ tailored for multicluster computing.

By comparing the code below with its counterpart implemented using the multilevel version of Distributed.jl, it becomes evident—both visually and by line count—that the configuration of the multicluster environment is significantly simpler with Multicluster.jl. In fact, the resulting code is comparable in complexity to a single-cluster implementation.


Example: Multicluster Computation in Julia

using Multicluster

SUM = Ref(0)

# reduce0 accumulates partial sums at the driver process
function reduce0(x)
    SUM[] += x
end

# create four clusters
cid1 = addcluster(<IP1>, 16; access_node_args = <A1>, compute_node_args = <C1>)
cid2 = addcluster(<IP2>, 32; access_node_args = <A2>, compute_node_args = <C2>)
cid3 = addcluster(<IP3>, 512; access_node_args = <A3>, compute_node_args = <C3>)
cid4 = addcluster(<IP4>, 64; access_node_args = <A4>, compute_node_args = <C4>)

# define the reduce1 function at each entry process (it forwards partial sums to the driver)
@everywhere workers() reduce1(x) = @spawnat role = :worker 1 reduce0(x)

# run the computation logic at each computing process
for cid in clusters()
    @cluster_everywhere cid begin
       MPI.Init()
       size = MPI.Comm_size(MPI.COMM_WORLD)
       rank = MPI.Comm_rank(MPI.COMM_WORLD)
       @info "my info: rank=$rank, size=$size, cluster=$clusterid"
       X = rand(1:10)
       r = MPI.Reduce(X, (x, y) -> x + y, 0, MPI.COMM_WORLD)
       rank == 0 && @spawnat role = :worker 1 reduce1(r)
       MPI.Finalize()
    end
end

@info "The sum across all clusters is $(SUM[])"

Adding Access to a Cluster

The addcluster function receives:

  • The address of the access node of a cluster
  • An integer N representing the number of computing processes
  • Keyword arguments to configure authentication and execution environments for both access and compute nodes

Internally, addcluster:

  1. Creates an entry process on the access node
  2. Launches a team of N computing processes across the compute nodes
  3. Connects those processes via an MPI communicator using MPIWorkerManager
  4. Returns a cluster handle

A cluster handle is a structure with two fields:

  • cid: the process ID of the entry process
  • xid: a context identifier (Union{Nothing, Integer}), initially set to nothing

A context is a set of computing-process PIDs that can communicate through a shared MPI communicator.

Example

julia> ch0 = addcluster(
           "$user_login@$cluster_address",
           16;
           access_node_args = [
               :sshflags => `-i $(homedir())/publickey`,
               :tunnel   => true
           ],
           compute_node_args = [
               :master_tcp_interface => master_tcp_interface,
               :exeflags             => `--threads=32`,
               :threadlevel          => :multiple,
               :mpiflags             => `--hostfile /home/$user_login/hostfile`
           ]
       )
Cluster(2, nothing)

This example creates 16 computing processes distributed across the cluster’s compute nodes. Each process runs with 32 threads, for a total of 512 threads. The returned handle ch0 carries the PID of the entry process (2); its xid field is nothing, indicating that no specific context is selected yet.


Managing Contexts with addworkers

The addworkers function creates a new context within an existing cluster. It:

  • Receives a cluster handle and an integer N
  • Launches N new computing processes
  • Creates a new MPI communicator for those processes
  • Returns a new cluster handle with the same cid and a new xid

Example

julia> addworkers(ch0, 4)
Cluster(2, 2)

julia> addworkers(ch0, 8)
Cluster(2, 3)

Inspecting the contexts:

julia> contexts(ch0)
3-element Vector{Union{Nothing, Vector{Integer}}}:
 Integer[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
 Integer[18, 19, 20, 21]
 Integer[22, 23, 24, 25, 26, 27, 28, 29]

Only processes within the same context share an MPI communicator and can interact via MPI.
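
For instance, assuming ch2 is the handle Cluster(2, 3) returned by addworkers(ch0, 8) above, and assuming @cluster_everywhere accepts a context-bearing handle (mirroring its use in the overview example), MPI code can be restricted to that context. The following is an illustrative sketch, not prescriptive usage:

# run MPI code only on the 8 processes of context 3 (PIDs 22 to 29 above)
@cluster_everywhere ch2 begin
    MPI.Init()
    rank = MPI.Comm_rank(MPI.COMM_WORLD)   # ranks 0..7 within this context's communicator
    @info "context worker rank = $rank"
    MPI.Finalize()
end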


Removing Clusters and Contexts

The rmcluster function removes cluster resources:

  • If xid is specified, only the corresponding context is removed
  • If xid is nothing, the entire cluster is removed

Example

julia> rmcluster(ch0, 2)
julia> rmcluster(ch0)

The first call removes the second context only, while the second call removes all remaining contexts and the cluster itself.


Getting Information About Clusters

Multicluster.jl provides several introspection functions:

  • clusters() → list of all cluster handles
  • nclusters() → number of clusters
  • nodes(cid) → set of node handles for a cluster

A node handle is a record with:

  • cid: PID of the entry process
  • pid: PID of a computing process

Node handles are used as arguments in several Multicluster.jl operations.
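
A brief sketch of how these introspection functions can be combined (the printed count is hypothetical):

julia> nclusters()
4

julia> for cid in clusters()
           @info "cluster $cid has $(length(nodes(cid))) computing processes"
       end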

Extended Distributed.jl Functions

Several Distributed.jl functions have been extended:

  • procs, nprocs applied to cluster handles return entry-process information
  • workers, nworkers applied to cluster handles account for contexts

If a cluster handle includes an xid, only the processes in that context are considered.
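
For example, assuming ch1 is the handle Cluster(2, 2) returned by addworkers(ch0, 4) above:

julia> nworkers(ch1)     # only the computing processes of context 2 are counted
4

julia> workers(ch1)      # returns the four PIDs of context 2 (18 through 21 in the listing above)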


Interacting with Cluster Processes

All cluster-management extensions are implemented in cluster.jl, extending functions originally defined in Distributed.jl. Additional communication helpers are defined in remotecall.jl.

Remote Calls

The following functions now accept node handles and cluster handles:

  • remotecall
  • remotecall_fetch
  • remotecall_wait
  • remote_do

Behavior:

  • Node handle → executes on a single computing process
  • Cluster handle → executes in parallel across computing processes and returns a list of results

A new helper function, cluster_fetch, is provided to retrieve results from these calls. It may also accept a reducer function to aggregate results from multiple futures.
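
A hedged sketch of how these calls could be combined; the argument order (function first, then the handle) follows the Distributed.jl convention, and the reducer-accepting form of cluster_fetch is assumed as described above:

# ask a single computing process for its number of threads
node = first(nodes(cid1))
nt = remotecall_fetch(Threads.nthreads, node)

# ask every computing process of the cluster and aggregate the resulting futures
futures = remotecall(Threads.nthreads, cid1)
total = cluster_fetch(+, futures)     # assumed reducer-first signature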

Macros

New macros extend the familiar Distributed.jl workflow:

  • @cluster_spawnat
  • @node_spawnat
  • fetchfrom_cluster
  • fetchfrom_node
  • @cluster_everywhere
  • @cluster_distributed

These macros apply spawnat, everywhere, and distributed semantics directly to a cluster’s computing processes.
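
A short illustrative sketch, assuming the macro argument order shown in the overview example (handle first, then the expression):

# load a package on every computing process of a cluster
@cluster_everywhere cid1 using LinearAlgebra

# spawn work on the cluster's computing processes...
f_all = @cluster_spawnat cid1 sum(rand(1_000))

# ...or on a single computing process, via a node handle
node = first(nodes(cid1))
f_one = @node_spawnat node sum(rand(1_000))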


Summary

Multicluster.jl simplifies the deployment, management, and programming of multicluster computing systems in Julia. By abstracting away complex orchestration details and extending familiar Distributed.jl constructs, it enables scalable, readable, and maintainable multicluster applications.

Footnotes

  1. In Julia, methods are specific implementations of a function specialized for particular combinations of argument types, enabling multiple dispatch.
