123 changes: 123 additions & 0 deletions docs/how-tos/graph-commands/extracting-subsets.rst
@@ -0,0 +1,123 @@
Extracting Graph Subsets
========================

The ``fromager graph subset`` command extracts a focused subgraph containing only the dependencies and dependents of a specific package. This is useful for understanding the impact scope of a particular package, debugging specific dependency issues, or creating smaller, more manageable graphs for analysis.

Basic Usage
-----------

To extract a subset graph for a specific package:

.. code-block:: bash

   fromager graph subset <graph-file> <package-name>

Example
-------

Using the example graph file from the e2e test, let's extract a subset for the ``keyring`` package:

.. code-block:: bash

   fromager graph subset e2e/build-parallel/graph.json keyring

This command will output a JSON graph containing:

- The ``keyring`` package itself
- All packages that depend on ``keyring`` (dependents)
- All packages that ``keyring`` depends on (dependencies)
- The ROOT node if ``keyring`` is a top-level dependency

The resulting subset will include packages like:

- ``keyring==25.6.0`` (the target package)
- ``imapautofiler==1.14.0`` (depends on keyring)
- ``jaraco-classes==3.4.0`` (keyring dependency)
- ``jaraco-context==6.0.1`` (keyring dependency)
- ``jaraco-functools==4.1.0`` (keyring dependency)
- The transitive dependencies of ``keyring`` (the dependencies of its dependencies)
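
The selection described above can be sketched on a toy graph. This is only an illustration of the idea, not fromager's implementation: ``TOY_GRAPH``, ``reachable``, and ``subset_keys`` are hypothetical names, the edges are made up, and the empty string stands in for the ROOT node.

```python
from collections import deque

# Toy graph: each key maps to the packages it depends on.
# "" stands in for the ROOT node; names and edges are illustrative only.
TOY_GRAPH = {
    "": ["imapautofiler", "setuptools"],
    "imapautofiler": ["keyring", "pyyaml"],
    "keyring": ["jaraco-classes", "jaraco-context"],
    "jaraco-classes": ["more-itertools"],
    "jaraco-context": [],
    "more-itertools": [],
    "pyyaml": [],
    "setuptools": [],
}


def reachable(start, edges):
    """Return every node reachable from start by following edges (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def subset_keys(target, children):
    """Collect the target, its transitive dependencies, and its transitive dependents."""
    # Invert the child lists to get a parent lookup.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return {target} | reachable(target, children) | reachable(target, parents)


keys = subset_keys("keyring", TOY_GRAPH)
print(sorted(keys))
```

In this toy graph, ``pyyaml`` and ``setuptools`` are left out: they are neither ancestors nor descendants of ``keyring``, even though ``pyyaml`` shares a parent with it.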

Version Filtering
-----------------

You can limit the subset to a specific version of the target package using the ``--version`` flag:

.. code-block:: bash

   fromager graph subset e2e/build-parallel/graph.json setuptools --version 80.8.0

This is particularly useful when dealing with packages that have multiple versions in the graph, allowing you to focus on the relationships of a specific version.

File Output
-----------

Save the subset graph to a file instead of printing to stdout:

.. code-block:: bash

   fromager graph subset e2e/build-parallel/graph.json jinja2 -o jinja2-subset.json

The output file will be in the same JSON format as the original graph file and can be used as input to other ``fromager graph`` commands.

Use Cases
---------

**Debugging Dependency Issues**
   When a specific package is causing build problems, extract its subset to focus on just the relevant dependencies without the noise of the full graph.

**Impact Analysis**
   Before upgrading or removing a package, understand what other packages would be affected by examining its dependents.

**Creating Focused Build Graphs**
   Generate smaller graphs for specific components of your application, making it easier to understand and manage complex dependency trees.

**Documentation and Communication**
   Create focused dependency diagrams for specific packages when documenting or explaining system architecture to team members.

**Performance Optimization**
   When working with very large dependency graphs, extract subsets to improve performance of analysis tools and reduce memory usage.

Example Workflow
----------------

Here's a typical workflow for investigating a package's dependencies:

.. code-block:: bash

   # Extract subset for a problematic package
   fromager graph subset my-project-graph.json problematic-package -o debug-subset.json

   # Visualize the subset
   fromager graph to-dot debug-subset.json -o debug-subset.dot
   dot -Tpng debug-subset.dot -o debug-subset.png

   # Analyze why specific dependencies appear
   fromager graph why debug-subset.json some-unexpected-dependency

This workflow helps you quickly isolate and understand issues within a complex dependency tree.

Output Format
-------------

The subset command preserves the original graph structure and format. The output is a valid dependency graph that:

- Maintains all edge relationships between included nodes
- Preserves requirement specifications and constraint information
- Can be used as input to other graph commands
- Is compatible with existing fromager workflows
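
The first property, keeping every edge whose endpoints both survive, can be illustrated with a small sketch. It assumes edges are plain ``(parent, child, requirement type)`` tuples, which is a simplification for illustration rather than the real graph model; ``EDGES``, ``INCLUDED``, and ``induced_edges`` are hypothetical names, and the type labels are placeholders.

```python
# Edges as (parent, child, requirement type); "" stands in for ROOT.
# The package names and type labels are illustrative, not read from a real graph.
EDGES = [
    ("", "imapautofiler", "toplevel"),
    ("imapautofiler", "keyring", "install"),
    ("imapautofiler", "pyyaml", "install"),
    ("keyring", "jaraco-classes", "install"),
]


def induced_edges(edges, included):
    """Keep only edges whose parent and child are both in the subset."""
    return [edge for edge in edges if edge[0] in included and edge[1] in included]


INCLUDED = {"", "imapautofiler", "keyring", "jaraco-classes"}
kept = induced_edges(EDGES, INCLUDED)
for parent, child, req_type in kept:
    print(f"{parent or 'ROOT'} -> {child} ({req_type})")
```

The edge to ``pyyaml`` is dropped because ``pyyaml`` is not in the included set, while the requirement type on each surviving edge is carried over unchanged.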

Error Handling
--------------

The command will report an error if:

- The specified package is not found in the graph
- The specified version of a package is not found
- The graph file is invalid or corrupted

Example error output:

.. code-block:: bash

   $ fromager graph subset e2e/build-parallel/graph.json nonexistent-package
   Error: Package nonexistent-package not found in graph
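
The first two error cases come from the same lookup. A minimal sketch of that lookup, assuming a plain dict of version strings instead of the real graph (``find_targets`` and ``VERSIONS`` are hypothetical names; the real command compares parsed version objects, not strings):

```python
def find_targets(versions_by_name, name, version=None):
    """Return matching versions, or raise ValueError when nothing matches."""
    targets = versions_by_name.get(name, [])
    if version is not None:
        targets = [v for v in targets if v == version]
    if not targets:
        version_msg = f" version {version}" if version else ""
        raise ValueError(f"Package {name}{version_msg} not found in graph")
    return targets


# Illustrative data only; a real graph may hold several versions per package.
VERSIONS = {"setuptools": ["80.8.0"], "keyring": ["25.6.0"]}

try:
    find_targets(VERSIONS, "nonexistent-package")
except ValueError as err:
    message = str(err)
    print(f"Error: {message}")
```

When a version is requested and missing, the same message includes it, for example ``Package setuptools version 1.0 not found in graph`` in this sketch.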
5 changes: 3 additions & 2 deletions docs/how-tos/graph-commands/index.rst
@@ -9,17 +9,18 @@ All examples use the sample graph file ``e2e/build-parallel/graph.json`` which c
    :maxdepth: 1
    :glob:
 
-   [uvw]*
+   [euvw]*
 
 Overview of Graph Commands
 --------------------------
 
 The ``fromager graph`` command group provides several subcommands for analyzing dependency graphs:
 
+- ``subset``: Extract a focused subgraph containing only dependencies and dependents of a specific package
 - ``why``: Understand why a package appears in the dependency graph
 - ``to-dot``: Convert graph to DOT format for visualization with Graphviz
 - ``explain-duplicates``: Analyze multiple versions of packages in the graph
 - ``to-constraints``: Convert graph to constraints file format
 - ``migrate-graph``: Convert old graph formats to the current format
 
-These tools help you understand complex dependency relationships, debug unexpected dependencies, and create visual representations of your build requirements.
+These tools help you understand complex dependency relationships, debug unexpected dependencies, create focused subgraphs for analysis, and create visual representations of your build requirements.
184 changes: 184 additions & 0 deletions src/fromager/commands/graph.py
@@ -459,6 +459,190 @@ def why(
find_why(graph, node, depth, 0, requirement_type)


@graph.command()
Contributor
Can we please add documentation for this new command?

Member Author
The documentation for the CLI reference section is generated automatically, but I can add a how-to.

Member Author
I added some additional docs.

@click.option(
    "-o",
    "--output",
    type=clickext.ClickPath(),
    help="Output file path for the subset graph",
)
@click.option(
    "--version",
    type=clickext.PackageVersion(),
    help="Limit subset to specific version of the package",
)
@click.argument(
    "graph-file",
    type=str,
)
@click.argument("package-name", type=str)
@click.pass_obj
def subset(
    wkctx: context.WorkContext,
    graph_file: str,
    package_name: str,
    output: pathlib.Path | None,
    version: Version | None,
) -> None:
    """Extract a subset of a build graph related to a specific package.

    Creates a new graph containing only nodes that depend on the specified package
    and the dependencies of that package. By default includes all versions of the
    package, but can be limited to a specific version with --version.
    """
    try:
        graph = DependencyGraph.from_file(graph_file)
        subset_graph = extract_package_subset(graph, package_name, version)

        if output:
            with open(output, "w") as f:
                subset_graph.serialize(f)
        else:
            subset_graph.serialize(sys.stdout)
    except ValueError as e:
        raise click.ClickException(str(e)) from e


def extract_package_subset(
    graph: DependencyGraph,
    package_name: str,
    version: Version | None = None,
) -> DependencyGraph:
    """Extract a subset of the graph containing nodes related to a specific package.

    Creates a new graph containing:
    - All nodes matching the package name (optionally filtered by version)
    - All nodes that depend on the target package (dependents)
    - All dependencies of the target package

    Args:
        graph: The source dependency graph
        package_name: Name of the package to extract subset for
        version: Optional version to filter target nodes

    Returns:
        A new DependencyGraph containing only the related nodes

    Raises:
        ValueError: If package not found in graph
    """
    # Find target nodes matching the package name
    target_nodes = graph.get_nodes_by_name(package_name)
    if version:
        target_nodes = [node for node in target_nodes if node.version == version]

    if not target_nodes:
        version_msg = f" version {version}" if version else ""
        raise ValueError(f"Package {package_name}{version_msg} not found in graph")

    # Collect all related nodes
    related_nodes: set[str] = set()

    # Add target nodes
    for node in target_nodes:
        related_nodes.add(node.key)

    # Traverse up to find dependents (what depends on our package)
    visited_up: set[str] = set()
    for target_node in target_nodes:
        _collect_dependents(target_node, related_nodes, visited_up)

    # Traverse down to find dependencies (what our package depends on)
    visited_down: set[str] = set()
    for target_node in target_nodes:
        _collect_dependencies(target_node, related_nodes, visited_down)

    # Always include ROOT if any target nodes are top-level dependencies
    for target_node in target_nodes:
        for parent_edge in target_node.parents:
            if parent_edge.destination_node.key == ROOT:
                related_nodes.add(ROOT)
                break
Member
The code we have from line 555 to 560 is most likely dead code. Here is the analysis from an AI agent:

The code is dead. By line 555, _collect_dependents (line 548) has already run. That function walks every ancestor of the target node all the way up to ROOT. ROOT is always added to related_nodes because it's the ultimate ancestor of every node. We proved this empirically earlier:

_collect_dependents(pyyaml==6.0.2):
→ walks to imapautofiler==1.14.0
→ walks to ROOT ("")
→ ROOT.parents is empty, stops

So by the time execution reaches line 555, ROOT is already in related_nodes. The add() on line 559 is a no-op on a set that already contains the value.
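
A toy check of this claim, as a sketch only: `PARENTS` and `collect_dependents` are hypothetical stand-ins for the real node objects and `_collect_dependents`, and the graph is made up. It relies on the assumption that every node is reachable from ROOT; a graph with nodes not connected to ROOT would behave differently.

```python
# Parent lookup for a toy graph; "" stands in for ROOT.
# Assumption: every node has a chain of parents that ends at ROOT.
PARENTS = {
    "": [],
    "imapautofiler==1.14.0": [""],
    "pyyaml==6.0.2": ["imapautofiler==1.14.0"],
    "keyring==25.6.0": ["imapautofiler==1.14.0"],
}


def collect_dependents(key, parents, related, visited):
    """Recursively add every ancestor of key to related."""
    if key in visited:
        return
    visited.add(key)
    for parent in parents[key]:
        related.add(parent)
        collect_dependents(parent, parents, related, visited)


related = set()
collect_dependents("pyyaml==6.0.2", PARENTS, related, set())
root_present = "" in related
print(sorted(related), root_present)
```

Under that assumption the upward walk alone puts ROOT into the set, so a later explicit add of ROOT would not change it.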

Member Author
OK, I'll take a look through and see, thanks for catching that.


    # Create new graph with only related nodes
    subset_graph = DependencyGraph()
    _build_subset_graph(graph, subset_graph, related_nodes)

    return subset_graph


def _collect_dependents(
    node: DependencyNode,
    related_nodes: set[str],
    visited: set[str],
) -> None:
    """Recursively collect all nodes that depend on the given node."""
    if node.key in visited:
        return
    visited.add(node.key)

    for parent_edge in node.parents:
        parent_node = parent_edge.destination_node
        related_nodes.add(parent_node.key)
        _collect_dependents(parent_node, related_nodes, visited)


def _collect_dependencies(
    node: DependencyNode,
    related_nodes: set[str],
    visited: set[str],
) -> None:
    """Recursively collect all dependencies of the given node."""
    if node.key in visited:
        return
    visited.add(node.key)

    for child_edge in node.children:
        child_node = child_edge.destination_node
        related_nodes.add(child_node.key)
        _collect_dependencies(child_node, related_nodes, visited)


def _build_subset_graph(
    source_graph: DependencyGraph,
    target_graph: DependencyGraph,
    included_nodes: set[str],
) -> None:
    """Build the subset graph with only the included nodes and their edges."""
    # First pass: add all included nodes
    for node_key in included_nodes:
        source_node = source_graph.nodes[node_key]
        if node_key == ROOT:
            continue  # ROOT is already created in the new graph

        # Add the node to target graph
        target_graph._add_node(
            req_name=source_node.canonicalized_name,
            version=source_node.version,
            download_url=source_node.download_url,
            pre_built=source_node.pre_built,
            constraint=source_node.constraint,
        )

    # Second pass: add edges between included nodes
    for node_key in included_nodes:
        source_node = source_graph.nodes[node_key]
        for child_edge in source_node.children:
            child_key = child_edge.destination_node.key
            # Only add edge if both parent and child are in the subset
            if child_key in included_nodes:
                child_node = child_edge.destination_node
                target_graph.add_dependency(
                    parent_name=source_node.canonicalized_name
                    if source_node.canonicalized_name
                    else None,
                    parent_version=source_node.version
                    if source_node.canonicalized_name
                    else None,
                    req_type=child_edge.req_type,
                    req=child_edge.req,
                    req_version=child_node.version,
                    download_url=child_node.download_url,
                    pre_built=child_node.pre_built,
                    constraint=child_node.constraint,
                )


def find_why(
    graph: DependencyGraph,
    node: DependencyNode,