Fix zFPKM Calculations #238

JoshLoecker · 2025-11-03T16:55:25Z

This pull request replicates R's zFPKM functionality by rewriting underlying C code from R into Python using NumPy.
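As a rough illustration of the transform being ported, the sketch below applies a zFPKM-style scaling to FPKM values once the peak location `mu` of the log2(FPKM) density is already known. It assumes the commonly described zFPKM formulation (a half-Gaussian fitted to the right of the peak). The name `zfpkm_from_peak` is hypothetical, the density-estimation and peak-finding steps are left out, and this is not the code from this pull request.

```python
import math


def zfpkm_from_peak(fpkm: list[float], mu: float) -> list[float]:
    """Scale log2(FPKM) values around a known density peak `mu` (hypothetical helper)."""
    # Zero FPKM has no finite log2; this sketch simply drops such values.
    log2_fpkm = [math.log2(v) for v in fpkm if v > 0]

    # Mean of the values that lie to the right of the peak.
    right = [v for v in log2_fpkm if v > mu]
    if not right:
        raise ValueError("no values above the peak; cannot estimate the spread")
    upper_mean = sum(right) / len(right)

    # Standard deviation of the half-Gaussian implied by that mean.
    stdev = (upper_mean - mu) * math.sqrt(math.pi / 2)

    # Centre on the peak and scale by the estimated spread.
    return [(v - mu) / stdev for v in log2_fpkm]
```

A value sitting exactly at the peak maps to 0, values below it become negative, and values above it become positive.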

Copilot

Pull Request Overview

This PR replaces NumPy-specific type hints (e.g., np.float64, np.float32, np.int8) with Python built-in types (float, int) across the codebase. It also includes significant refactoring of zFPKM calculations to match R's implementation, adds new helper modules for density estimation and peak finding, updates enum values to lowercase, and removes test/debugging code.

Replaces NumPy-specific dtype references with Python built-in types for better compatibility
Refactors zFPKM transformations to align with R's implementation using custom density and peak-finding functions
Updates enum values from uppercase to lowercase for consistency
Adds new configuration file (ty.toml) and helper modules (density.py, peak_finder.py, approx.py)

Reviewed Changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 20 comments.

Show a summary per file

File	Description
ty.toml	Adds new type checker configuration
tests/unit/test_*.py	Updates test fixtures to use `float`/`int` instead of NumPy types
ruff.toml	Reformats linting configuration and moves F401 to unfixable rules
main/como/utils.py	Updates type hints and adds logic to handle DataFrame index names
main/como/stats/*.py	Replaces NumPy types with built-in types in statistical test classes
main/como/rnaseq_*.py	Major refactoring including zFPKM calculation updates and gene ID handling
main/como/proteomics_*.py	Updates type hints and gene symbol processing logic
main/como/data_types.py	Changes enum values from uppercase to lowercase
main/como/create_context_specific_model.py	Adds taxon parameter, refactors model I/O, and updates type hints
main/como/merge_xomics.py	Updates enum usage and improves batch number extraction
main/como/combine_distributions.py	Refactors z-score combination logic to handle replicates per batch
main/como/peak_finder.py	New module implementing R's findpeaks function
main/como/density.py	New module implementing R's density function for KDE
main/como/approx.py	New module implementing R's approx interpolation function
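To give a sense of what a peak finder with a minimum height and minimum distance does, here is a minimal sketch. It is not R's `findpeaks` algorithm and not the contents of `peak_finder.py`; the function name `find_peaks_simple` and its parameters are assumptions made for illustration.

```python
def find_peaks_simple(y: list[float], min_height: float = 0.0, min_distance: int = 1) -> list[int]:
    """Return indices of local maxima at least `min_height` tall and `min_distance` apart."""
    # A strict local maximum is higher than both of its neighbours.
    candidates = [
        i
        for i in range(1, len(y) - 1)
        if y[i] > y[i - 1] and y[i] > y[i + 1] and y[i] >= min_height
    ]

    # Visit the tallest candidates first and drop any peak too close to one already kept.
    kept: list[int] = []
    for i in sorted(candidates, key=lambda idx: y[idx], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)

    return sorted(kept)
```

With `y = [0, 1, 0, 3, 0, 2, 0]`, raising `min_height` to 1.5 discards the small peak at index 1, and raising `min_distance` to 3 keeps only the tallest peak at index 3.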

Comments suppressed due to low confidence (1)

main/como/create_context_specific_model.py:1

Comment uses incorrect type checker ignore syntax. Should be type: ignore not ty: ignore.

from __future__ import annotations

main/como/utils.py

Copilot · 2025-11-03T16:59:07Z

main/como/rnaseq_gen.py

        A NumPy array of the filtered data.
    """
-    if not isinstance(data, pd.DataFrame | npt.NDArray):
+    if not isinstance(data, pd.DataFrame | np.ndarray):


The isinstance() call is safest with a tuple here: isinstance(data, (pd.DataFrame, np.ndarray)). The union form with | is accepted by isinstance() only on Python 3.10 and later; on earlier versions it raises a TypeError at runtime.

Suggested change

if not isinstance(data, pd.DataFrame | np.ndarray):

if not isinstance(data, (pd.DataFrame, np.ndarray)):
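A small sketch of the two forms compared in this suggestion, using built-in types as stand-ins for `pd.DataFrame` and `np.ndarray` so that it runs without third-party packages (that substitution is an assumption of the example). The tuple form works on every Python 3 version, while `isinstance()` accepts a `|` union only from Python 3.10 onward.

```python
import sys

value = 3.5

# Tuple form: accepted by isinstance() on every Python 3 version.
tuple_check = isinstance(value, (int, float))

# Union form: `int | float` builds a union object that isinstance()
# accepts only on Python 3.10 and later.
if sys.version_info >= (3, 10):
    union_check = isinstance(value, int | float)
else:
    union_check = None  # the `|` expression itself would raise TypeError here
```

Which form is appropriate depends on the Python versions the project supports, which this page does not state.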

Copilot · 2025-11-03T16:59:07Z

main/como/combine_distributions.py

-            "ensembl_gene_id": merged_source_data["ensembl_gene_id"],
-            "combine_z": weighted_matrix,
-        }
+    # alidate alignment between columns and replicate vector


Corrected spelling of 'Validate' (missing 'V').

Suggested change

# alidate alignment between columns and replicate vector

# Validate alignment between columns and replicate vector

Copilot · 2025-11-03T16:59:07Z

main/como/create_context_specific_model.py

            # replace gene_id with activity, using optional whitespace before and after the gene id
            # Do not replace the whitespace (if it exists) before and after the gene ID
-            gene_reaction_rule = re.sub(pattern=rf"\b{gene_id}\b", repl=activity, string=gene_reaction_rule)
+            gene_reaction_rule = re.sub(pattern=rf"\b{gene_id}\b", repl=str(activity), string=str(gene_reaction_rule))


Calling str() on gene_reaction_rule in every iteration is unnecessary if it's already a string. Consider verifying the type once before the loop or using a type assertion if you're certain it's always a string.

Suggested change

gene_reaction_rule = re.sub(pattern=rf"\b{gene_id}\b", repl=str(activity), string=str(gene_reaction_rule))

gene_reaction_rule = re.sub(pattern=rf"\b{gene_id}\b", repl=str(activity), string=gene_reaction_rule)
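The substitution pattern under discussion can be sketched as follows. The helper name `substitute_activity` and the example rule are hypothetical, and the use of `re.escape` is an addition of this sketch rather than something the diff does. The rule is assumed to be a string on entry, so only the replacement value needs converting.

```python
import re


def substitute_activity(rule: str, activities: dict[str, int]) -> str:
    """Replace each gene ID in a rule string with its activity value (hypothetical helper)."""
    for gene_id, activity in activities.items():
        # re.escape guards against IDs containing regex metacharacters;
        # \b keeps "10" from matching inside "100".
        pattern = rf"\b{re.escape(gene_id)}\b"
        # repl must be a string, so the numeric activity is converted here.
        rule = re.sub(pattern, str(activity), rule)
    return rule
```

For example, `substitute_activity("(10 and 100) or 7", {"10": 1, "100": 0, "7": 1})` yields `"(1 and 0) or 1"`. Because replacements are applied one ID at a time, an inserted value that happens to equal a later gene ID would be replaced again; a real implementation would need to account for that.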

Copilot · 2025-11-03T16:59:08Z

main/como/proteomics_gen.py

+                )
+            )
+        df = pd.concat(dataframes, axis="columns")
+        print(df)


Debug print statement left in production code. This should either be removed or replaced with proper logging using logger.debug(df) or logger.trace(df).

Suggested change

print(df)

logger.debug(df)
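A minimal sketch of why a debug-level log call is preferable to `print`: the message is emitted only when the logger is configured for that level. It uses the standard-library `logging` module as an assumption; the project's own `logger` may come from a different library, and the logger name and the dictionary standing in for a DataFrame are hypothetical.

```python
import io
import logging

logger = logging.getLogger("como_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output confined to our handler

stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

table = {"gene": ["a", "b"], "value": [1.0, 2.0]}  # stand-in for a DataFrame

logger.debug("merged table: %s", table)  # suppressed at INFO level
logger.info("merged %d columns", len(table))

captured = stream.getvalue()
```

At the `INFO` level only the second message reaches the stream; lowering the level to `DEBUG` would also emit the table dump.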

Copilot · 2025-11-03T16:59:11Z

main/como/utils.py


 @overload
-async def _read_file(path: pd.DataFrame, h5ad_as_df: bool, **kwargs) -> pd.DataFrame: ...
+async def _read_file(path: pd.DataFrame, h5ad_as_df: Literal[True] | Literal[False], **kwargs) -> pd.DataFrame: ...


This statement has no effect.

Copilot · 2025-11-03T16:59:11Z

main/como/utils.py


 @overload
-async def _read_file(path: sc.AnnData, h5ad_as_df: bool = False, **kwargs) -> sc.AnnData: ...
+async def _read_file(path: sc.AnnData, h5ad_as_df: Literal[False] = False, **kwargs) -> sc.AnnData: ...


This statement has no effect.

Copilot · 2025-11-03T16:59:12Z

main/como/utils.py

-def _num_rows(item: pd.DataFrame | npt.NDArray) -> int:
-    return item.shape[0]
+@overload
+async def _read_file(path: sc.AnnData, h5ad_as_df: Literal[True] = True, **kwargs) -> pd.DataFrame: ...


This statement has no effect.

Copilot · 2025-11-03T16:59:12Z

main/como/utils.py

+
+
+@overload
+async def _read_file(path: Path, h5ad_as_df: Literal[False] = False, **kwargs) -> pd.DataFrame | sc.AnnData: ...


This statement has no effect.

Copilot · 2025-11-03T16:59:12Z

main/como/utils.py


 @overload
-async def _read_file(path: sc.AnnData, h5ad_as_df: bool = True, **kwargs) -> pd.DataFrame: ...
+async def _read_file(path: Path, h5ad_as_df: Literal[True] = True, **kwargs) -> pd.DataFrame: ...


This statement has no effect.
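For context on the five comments above: a bare `...` is the conventional placeholder body for `@overload` stubs, which exist only for type checkers and are superseded at runtime by the final, undecorated definition. The sketch below shows the pattern with a `Literal` flag selecting the return type; the function `load` and its behaviour are hypothetical and unrelated to `_read_file`.

```python
from typing import Literal, Union, overload


@overload
def load(as_text: Literal[True]) -> str: ...
@overload
def load(as_text: Literal[False] = False) -> bytes: ...


def load(as_text: bool = False) -> Union[str, bytes]:
    """Return text or raw bytes depending on the flag (hypothetical function)."""
    payload = b"zFPKM"
    # Only this implementation runs; the stubs above just inform the type checker.
    return payload.decode() if as_text else payload
```

A type checker infers `str` for `load(True)` and `bytes` for `load()` or `load(False)`, while at runtime a single function handles both cases.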

JoshLoecker added 30 commits October 9, 2025 14:30

feat: added log statement and hint to user that an empty dataframe in…

ebd5a39

…dicates an incorrect taxon id Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: do not drop na values; keep as much data as possible

7cab8f3

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: allow empty genomic values

7282a34

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: move from np.float(ing,64,32) to python's float

c903cc8

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: ruff formatting

5b39a1c

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore(dev): ignore, but provide warning for, unused imports

34451a9

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: clipped values should use floats, not integers

c0de528

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: import required modules

526da0d

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: set index name + single column name

512fa4b

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: more explicit variable name

c319e8e

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: update docstring

86ea376

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: do not modify input dataframe inplace

d8179bb

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: use a per-replicate count for weighting instead of a single weig…

838947f

…ht across the study Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: use index value for ensembl ids

ae0797f

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: do not use async

f423d1b

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: do not use async

7f99af7

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: use lowercase identifier names; provide default values

926c139

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: import required modules

4149eb9

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: use .loc[] to prevent copy-on-write warning

f56b853

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: explicitly check column names

6105b98

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: raise error if adjustment method not found

1219cc4

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: validate batch in loop

9c71754

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: use lowercase names

4142156

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: import required modules

89f05aa

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

feat: replicate R's zFPKM module

60a883a

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: do not drop na values, keep as much data for as long as pos…

95dc2db

…sible Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: explicit type cast

bb1449a

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor!: provide default min peak height and distance

5480246

BREAKING-CHANGE: instead of providing a type `PeakIdentificationParameters`, simply provide the min peak height and distance directly Signed-off-by: Josh Loecker <joshloecker@icloud.com>

feat: remove NA values by default (replicates R functionality)

038a5fe

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: do not build a list of None

31cb4d4

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

JoshLoecker added 12 commits November 3, 2025 10:43

fix: plot gaussian distribution with a peak of the fpkm value at mu…

2f17279

…, not at the maximum fpkm distribution Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor!: provide default min peak height and distance

2d6b20f

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: pythonic method to collect merged gene z-scores

8076c76

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: do not drop na values, keep as much data for as long as pos…

9bcda5f

…sible Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: use new min zfpkm peak height/distance

6fde5bd

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: do not drop na values, keep as much data for as long as pos…

74ca5de

…sible Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore: remove __main__ function

666fb84

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

fix: convert identifiers to lowercase

4de5421

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: more robust data handling when creating the gene info file

b917ac1

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

refactor: include more robust error handling for dataframes when coll…

e512740

…ecting gene data Signed-off-by: Josh Loecker <joshloecker@icloud.com>

chore(dev): expand overloaded functions to add more type paths

8d505f7

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

feat(dev): added ty.toml file for type hints

448627b

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

Copilot AI review requested due to automatic review settings November 3, 2025 16:55

JoshLoecker changed the base branch from main to develop November 3, 2025 16:55

JoshLoecker added 2 commits November 3, 2025 10:56

Merge branch 'develop' into fix-zfpkm

3135061

fix(test): lowercase names

cebab44

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

Copilot AI reviewed Nov 3, 2025

View reviewed changes

format: ruff formatting

37be94b

Signed-off-by: Josh Loecker <joshloecker@icloud.com>

JoshLoecker merged commit a609a56 into develop Nov 3, 2025
3 checks passed

JoshLoecker deleted the fix-zfpkm branch November 3, 2025 17:02

JoshLoecker added a commit that referenced this pull request Nov 3, 2025

Fix zFPKM Calculations (#238)

f1e9dcb

JoshLoecker added a commit that referenced this pull request Nov 3, 2025

Fix zFPKM Calculations (#238)

4e889f1

JoshLoecker added a commit that referenced this pull request Nov 3, 2025

Fix zFPKM Calculations (#238)

0df0142

2 participants
