@ajpotts ajpotts commented Jan 12, 2026

Summary

This PR introduces a generic Arkouda pandas dtype, dtype="ak", allowing users to construct Arkouda-backed pandas arrays without specifying a concrete Arkouda dtype (e.g. ak_int64, ak_string) up front.

The new generic dtype improves ergonomics and aligns Arkouda’s pandas integration with standard pandas patterns such as dtype="string" or dtype="category".

Motivation

Prior to this change, users had to explicitly specify a concrete Arkouda dtype when constructing pandas objects:

pd.array(data, dtype="ak_int64")
pd.Series(data, dtype="ak_float64")

This is unnecessarily verbose and diverges from typical pandas usage, where users usually select a backend or family and allow the system to infer the concrete dtype.

With this PR, users can now write:

pd.array(data, dtype="ak")
pd.Series(data, dtype="ak")

and rely on Arkouda to infer the appropriate concrete dtype.

What’s in this PR

1. Generic ArkoudaDtype

A new pandas ExtensionDtype, ArkoudaDtype, is introduced and registered under the name "ak".

Key properties:

  • dtype="ak" resolves to ArkoudaDtype
  • construct_array_type() returns ArkoudaExtensionArray
  • Acts as a dispatcher, not a concrete storage dtype
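
The name-based lookup described above can be pictured with a small, dependency-free sketch. This is not Arkouda's or pandas' code: DTYPE_REGISTRY, register_dtype, resolve_dtype, GenericDtype, and GenericArray are hypothetical stand-ins. They only mimic how a registered extension dtype lets a string such as "ak" resolve to a dtype object, which then names its array class.

```python
# Stand-in model of a name -> dtype registry. Every name here is hypothetical
# and exists only to illustrate the pattern; none of it is the Arkouda API.
DTYPE_REGISTRY = {}


def register_dtype(cls):
    """Class decorator: make the dtype discoverable by its string name."""
    DTYPE_REGISTRY[cls.name] = cls
    return cls


class GenericArray:
    """Placeholder for the array class that performs the real dispatch."""


@register_dtype
class GenericDtype:
    name = "ak"  # the string users pass as dtype=...

    @classmethod
    def construct_array_type(cls):
        # The generic dtype names a dispatcher class, not a storage class.
        return GenericArray


def resolve_dtype(spec):
    """Turn a dtype string or dtype instance into a dtype instance."""
    if isinstance(spec, str):
        try:
            return DTYPE_REGISTRY[spec]()
        except KeyError:
            raise TypeError(f"unknown dtype string: {spec!r}") from None
    return spec


dtype = resolve_dtype("ak")
array_cls = dtype.construct_array_type()
```

In pandas itself, the equivalent roles are played by the register_extension_dtype decorator and the ExtensionDtype base class.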

2. Factory-style dispatch in _from_sequence

ArkoudaExtensionArray._from_sequence has been refactored into a true factory:

  • Normalizes pandas-provided dtypes, treating "ak" / ArkoudaDtype as a request for backend inference
  • Converts Python / NumPy inputs to Arkouda objects once
  • Dispatches based on the resulting Arkouda type:
    • pdarray → ArkoudaArray
    • Strings → ArkoudaStringArray
    • pandas-style Categorical → ArkoudaCategoricalArray

This makes _from_sequence the single construction choke point used by pandas when dtype="ak" is specified.
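
The three steps above (normalize the dtype, convert once, dispatch on the converted type) can be sketched without a running Arkouda server. The classes below (FakePdarray, FakeStrings, BaseArray, NumericArray, StringArray) and the helper to_backend are hypothetical stand-ins, not the classes touched by this PR, and the sketch covers only the generic "ak" request.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for backend objects (not the real Arkouda types).
@dataclass
class FakePdarray:
    values: list


@dataclass
class FakeStrings:
    values: list


def to_backend(data):
    """Convert Python input to a backend object exactly once."""
    if isinstance(data, (FakePdarray, FakeStrings)):
        return data  # already converted; do not iterate it again
    items = list(data)
    if items and all(isinstance(x, str) for x in items):
        return FakeStrings(items)
    return FakePdarray(items)


class BaseArray:
    def __init__(self, data):
        self._data = data

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        # 1. Normalize: "ak" (or no dtype) is a request for inference.
        if dtype not in (None, "ak"):
            raise NotImplementedError("sketch covers only the generic dtype")
        # 2. Convert to a backend object once.
        backend = to_backend(scalars)
        # 3. Dispatch on the type that conversion produced.
        return DISPATCH[type(backend)](backend)


class NumericArray(BaseArray):
    pass


class StringArray(BaseArray):
    pass


# Backend type -> concrete array class, looked up after conversion.
DISPATCH = {FakePdarray: NumericArray, FakeStrings: StringArray}

ints = BaseArray._from_sequence([1, 2, 3], dtype="ak")
strs = BaseArray._from_sequence(["a", "b"], dtype="ak")
```

Dispatching after conversion means the concrete class is chosen from what the backend actually produced rather than from a guess about the input.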

3. Updated documentation

The _from_sequence docstring was updated to accurately reflect:

  • Factory/dispatcher behavior
  • Post-conversion dispatch
  • The semantics of dtype="ak" vs concrete Arkouda dtypes
  • pandas construction context (pd.array(..., dtype="ak"))

4. Comprehensive tests

New tests verify that dtype="ak" correctly dispatches for:

  • Numeric data (int64, float64)
  • Strings
  • Arkouda pandas Categorical
  • Both pd.array and pd.Series construction paths

Tests also document the required construction pattern for categoricals: wrap the data with pd.array(..., dtype="ak") before passing it to pd.Series, so that pandas does not eagerly iterate the categorical.

Non-goals / Follow-ups

  • astype("ak") behavior is intentionally out of scope for this PR
  • No changes to existing concrete Arkouda dtype strings
  • No changes to pandas materialization or iteration semantics

These can be addressed in follow-up PRs if desired.

Impact

  • Significantly improves usability of Arkouda’s pandas ExtensionArray API
  • Reduces boilerplate for exploratory and backend-agnostic code
  • Provides a clean foundation for future dtype-related enhancements

Example

import pandas as pd

pd.array([1, 2, 3], dtype="ak")        # ArkoudaArray (int64)
pd.array(["a", "b"], dtype="ak")       # ArkoudaStringArray
pd.Series([1.0, 2.0], dtype="ak")      # ArkoudaArray (float64)

Closes #5303: Pandas ExtensionArray: allow dtype=ak for generic Arkouda-backed arrays

@ajpotts ajpotts force-pushed the 5303_Pandas_ExtensionArray_allow_dtype=ak branch 3 times, most recently from 76f3b24 to cfc9e7e, January 12, 2026 20:56

codecov-commenter commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@9bedaa2).

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #5304   +/-   ##
========================================
  Coverage        ?   100.00%           
========================================
  Files           ?         4           
  Lines           ?        63           
  Branches        ?         0           
========================================
  Hits            ?        63           
  Misses          ?         0           
  Partials        ?         0           
Flag Coverage Δ
python-coverage 100.00% <ø> (?)

@ajpotts ajpotts marked this pull request as ready for review January 12, 2026 21:58
@ajpotts ajpotts force-pushed the 5303_Pandas_ExtensionArray_allow_dtype=ak branch from cfc9e7e to 11fcad3, January 16, 2026 19:23
@1RyanK 1RyanK left a comment

Looks good!


from numpy.typing import NDArray
from pandas.api.extensions import ExtensionArray
from pandas.core.dtypes.base import ExtensionDtype

Contributor

Would from pandas.api.extensions import ExtensionDtype be better?

Contributor Author

Fixed.

return cls(out)

@classmethod
def _from_sequence(

Contributor

One thing to note: pd.array(ak.pandas.Categorical(...), dtype="ak_int64") currently raises NotImplementedError because pandas routes that path through ArkoudaArray._from_sequence (which tries to iterate the categorical) rather than this dispatcher. This PR only supports categorical construction when using the generic "ak" dtype or ArkoudaCategoricalDtype. That’s probably fine, but it could be good to either (a) document that categorical → concrete dtype casts are unsupported, or (b) add a clearer error/guard for that case.

Contributor Author

I think this is outside the scope of this PR.

In [12]: pd.array(Categorical(ak.array(["a","a","b"])), dtype="ak")
Out[12]: ArkoudaCategoricalArray(['a', 'a', 'b'])

appears to work.

Contributor Author

I did create a ticket for the issue: #5335

@ajpotts ajpotts force-pushed the 5303_Pandas_ExtensionArray_allow_dtype=ak branch from 11fcad3 to 305014d, January 23, 2026 19:12