Skip to content

Anytopo#2598

Open
ziyuezzy wants to merge 110 commits intosstsimulator:develfrom
ziyuezzy:anytopo
Open

Anytopo#2598
ziyuezzy wants to merge 110 commits intosstsimulator:develfrom
ziyuezzy:anytopo

Conversation

@ziyuezzy
Copy link
Copy Markdown
Contributor

@ziyuezzy ziyuezzy commented Nov 4, 2025

Hi Devs,

This is a pull request, previously mentioned in issue #2584

There are documentations of this work in
"sst-elements-src/src/sst/elements/merlin/interfaces/endpointNIC/documentations/README.md", and
"sst-elements-src/src/sst/elements/merlin/topology/anytopo_ultility/README.md"

In a few words, this branch extends the Merlin module in SST-element, such that any network topology can be built with an input networkx graph. Another key feature is to support source routing (the endpoint NIC determines the routed paths). And a few popular HPC network topologies (dragonfly, slimfly, polarfly, jellyfish) are already defined in
"sst-elements-src/src/sst/elements/merlin/topology/anytopo_ultility/" dir, together with some tests in "sst-elements-src/src/sst/elements/merlin/topology/anytopo_ultility/tests".

If you would like to accept this pull requests, maybe we can move these tests to the main merlin test dir?

It worth mention that the reorderedlinkcontrol has been modified to fit in the new 'ExtendedRequest' framework, this has been included in tests as well, see EndpointNIC(use_reorderLinkControl=True ... in the tests.

I believe that this pull request will have the following contribution to Merlin:

  • supporting more network topologies (any topology from input graph)
  • supporting source routing, with is increasingly interesting in HPC network traffic engineering
  • Adds a framework of EndpointNIC that allows different NIC/smartNIC functionalities to be plugged in the packet-processing pipeline. For now source routing is implemented through this framework.

Please let me know if there is any feedback/comments/questions for the code.

Best regards,
Z.

sst-autotester and others added 30 commits August 17, 2023 07:57
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
Automatically Merged using SST Master Branch Merger
class ExtendedRequest : public SST::Interfaces::SimpleNetwork::Request {
private:
// Map from plugin name/key to metadata
std::map<std::string, std::shared_ptr<PluginMetadata>> metadata;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using shard_ptr in classes that end up in events is not supported since std::shared_ptr is not serializable for events used in synchronization. This is because pointers are not tracked during event serialization, so nothing will end up being shared outside of the rank it was created on. I know this will cause some memory increase, but, in general, elements in the merlin library are assumed to be able to run across multiple ranks. I'm going to continue to review the code, but I think this will need to be changed before we can merge it.

Comment thread src/sst/elements/merlin/interfaces/endpointNIC/NICPlugin.h
{
public:
// Static shared data accessible by both SourceRoutingPlugin and topology classes
static std::map<int, int> endpoint_to_router_map; // Shared endpoint-to-router mapping
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, static variables shouldn't be used in SubComponents. They should use the SharedObjects provided by the core. Since both of the classes that need access are derived from SubComponent, they will have access to those APIs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, a minor question:
these SharedObjects support array of common data types (int, bool, ...) right?
Can we also share arbitrary objects like mutex of iostream?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SharedObjects pretty much support any Object that is serializable, which means that a mutex would not be supported. If you need to have a shared Mutex, then that will still need to be a static variable. I/O generally doesn't fall under the same restrictions as variables used during the simulation.

@ziyuezzy
Copy link
Copy Markdown
Contributor Author

ziyuezzy commented Jan 6, 2026

Hi, no worries, we are not in any hurry.
And thanks for giving feedbacks on the code, I will try to fix the points you raised.

Recently I have been thinking about the design of this branch.
Of course importing network topology from graph, and making the routers do source-routing, is probably necessary to be implemented in Merlin.
But this whole concept of 'EndpointNIC with plugins', feels suitable to be an individual sst-element, similar to the 'rdmaNic' element?
Implementing these in Merlin, may benefit its extensibility. For example, I recently implemented another small plugin (not in this branch) that dumps traffic traces from Merlin, it was really clean and efficient. I wonder whether it is even feasible to implement circuit-switching within this framework....
But on the other hand, with a lot of plugins, this framework might also make Merlin complicated/confusing to read for people just started to use this simulator... I don't know if that is a concern or not.

@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

sst-autotester and others added 2 commits January 16, 2026 08:10
Automatically Merged using SST Master Branch Merger
…ding ugal routing with the source routing mechanisms.
@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

void serialize_order(SST::Core::Serialization::serializer &ser) override {
Request::serialize_order(ser);

// Manually serialize the metadata map since std::variant isn't directly supported
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::variant is supported by the serializer now, but it was developed primarily for checkpointing and may not work properly with pointer tracking turned off. I'd have to look through that code to know if pointer tracking matters. It may be worth trying to see if you can just serialize the vector directly.

@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@ziyuezzy ziyuezzy requested a review from feldergast January 27, 2026 09:11
@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@feldergast
Copy link
Copy Markdown
Contributor

Just an update. I've worked through about a third of the code changes, but have a deadline this week and then have a week of vacation. I will continue to review code when I get back.

@sst-autotester
Copy link
Copy Markdown
Contributor

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@deanchester
Copy link
Copy Markdown
Contributor

One thing that would really improve the usability of these new components is filling out the ELI parameter documentation. Currently SST_ELI_DOCUMENT_PARAMS() in anytopo.h is empty, and endpointNIC.h only documents plugins despite the components accepting several other parameters.

The ELI documentation is what users see through sst-info and is often the first place people look when trying to configure a component — without it, the only way to figure out valid parameters and their expected values is to read the source. Filling these in would make the new topology and NIC framework much more approachable.

For reference, here's what the anytopo.h SST_ELI_DOCUMENT_PARAMS could look like, following the style used by the existing dragonfly topology:

SST_ELI_DOCUMENT_PARAMS(
    {"num_routers",                "Number of routers in the network."},
    {"num_R2R_ports",              "Number of router-to-router ports on this router."},
    {"num_R2N_ports",              "Number of router-to-endpoint (NIC) ports on this router."},
    {"routing_mode",               "Routing mode per VN [source_routing | dest_tag_routing].", "source_routing"},
    {"source_routing_algo",        "Source routing algorithm per VN [weighted | UGAL | UGAL_THRESHOLD].", "weighted"},
    {"vcs_per_vn",                 "Number of virtual channels per virtual network.", ""},
    {"vn_ugal_num_valiant",        "Number of valiant paths to consider for UGAL routing per VN.", "3"},
    {"adaptive_threshold",         "Threshold for UGAL_THRESHOLD adaptive routing decisions.", "2.0"},
    {"connectivity",               "Router connectivity string. Format: \"dest_id,port0,port1;dest_id,port0,...\""},
    {"simple_routing_entry_string","Serialized routing table for untimed packets. Format: \"dest:weight,hop1,hop2;weight,hop3|dest:...\""},
    {"endpoint_to_port_map",       "Map of global endpoint IDs to local router port numbers."},
    {"port_to_endpoint_map",       "Map of local router port numbers to global endpoint IDs."},
    {"verbose_level",              "Verbosity level for debug output.", "0"},
)

Similarly, endpointNIC.h should document EP_id, networkIF, and plugin_names alongside the existing plugins entry.

Copy link
Copy Markdown
Contributor

@feldergast feldergast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked through all the code. Generally it looks good, with just a few minor code cleanup suggestions. However, there is one big issue that needs to have some thought put into it, which is the fact that the modified code structure has added a series of dynamic_casts() to the heavily used code paths. We have seen noticeable performance issues when dynamic_cast() is used often. We need to give some thought to limiting the use of dynamic casts, which may require some changes to the SimpleNetwork API in the core. I will need to give this some more thought after the SST 16 release goes in the next week or so. I had hoped to get this into the release, but I don't know that we can resolve this issue before the release.

delete rng;
}

std::deque<int> RandomWeightedSelection::selectPath(int dest_router, const std::vector<path_with_weight>& paths) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have to return this by value, causing a copy of the deque? If the contents of the deque are only read after the function is called, can you return a const reference instead?

WeightedRoundRobinSelection::~WeightedRoundRobinSelection() {
}

std::deque<int> WeightedRoundRobinSelection::selectPath(int dest_router, const std::vector<path_with_weight>& paths) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::deque& instead??

namespace SST {
namespace Merlin {

class endpointNIC : public SST::Interfaces::SimpleNetwork
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the class name would capitalized camel case, i.e. EndpointNIC


loadPlugins(params);

std::string networkIF = params.find<std::string>("networkIF");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not used other then in the commented out code below.

out.init(getName() + ": ", 0, 0, Output::STDOUT);
// See if we need to set up a nid map
bool found = false;
EP_id = params.find<int>("EP_id",-1,found);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the anytopo topology, getting the endpoint ID from params makes sense, but if you're using this with any of the other topologies as a way to do a NIC with extra plugins, then the EndpointNIC will need to get the ID from the topology using LinkControl::getEndpointID during init(). This may necessitate a change in some of the plug-in APIs so that the endpoint ID can be assigned later.

I don't think this needs to be done before an initial merge, but some thought should be given to how this could be done to make this new infrastructure more generally useful.

if (!req) return nullptr;

// Convert to ExtendedRequest if not already
ExtendedRequest* ext_req = dynamic_cast<ExtendedRequest*>(req);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamic casts can have a noticeable performance impact. I'm not sure there's a way to avoid this given the core APIs being used, but we should think of enhancements to the SimpleNetwork interface that could remove the need for a dynamic_cast().

// Lookup path in routing table and set it as metadata
std::deque<int> path = selectPath(lookupRtrForEndpoint(ext_req->dest));

// // use verbose to print out this routing decision:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this code isn't needed, it should be removed. If you want to keep it around, it should use the Output object instead cout.

} // namespace Merlin
} // namespace SST

#endif No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line


// Need to put in the sequence number
// Convert to ExtendedRequest if not already
Merlin::ExtendedRequest* ext_req = dynamic_cast<Merlin::ExtendedRequest*>(req);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about potentially having to have multiple dynamic_casts() for every request that is sent. If you were to have en EndpointNIC that needed a ReorderLinkControl that also has a LinkControl, then you end up multiple calls to dynamic_cast(). When I was architecting the event hierarchy for merlin, there was a noticeable performance difference between dynamic casting to determine event type and putting an enum in the base class to query the type. This type of structure is little more difficult given the fact that the base class is a core class, but I'm not comfortable with the dynamic casts that will potentially slow down simulations that don't even use the plug-in capable EndpointNIC.

// uint32_t seq;

~ReorderRequest() {}
// ReorderRequest() :
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to all the additional dynamic casts, I'm not convinced this is a good path to take.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants