Commit fa80aa6

docs: enhance joins.rst with details on DataFrame naming and deduplication behavior
1 parent a58a574 commit fa80aa6

1 file changed (+7 -2 lines changed)

docs/source/user-guide/common-operations/joins.rst

Lines changed: 7 additions & 2 deletions
@@ -108,6 +108,8 @@ Disambiguating Columns
 
 When the join key exists in both DataFrames under the same name, the result contains two columns with that name. Assign a name to each DataFrame to use as a prefix and avoid ambiguity.
 
+When you create a DataFrame with a ``name`` argument, that name is used as a prefix in ``col("name.column")`` to reference specific columns.
+
 .. ipython:: python
 
     from datafusion import col
@@ -116,7 +118,9 @@ When the join key exists in both DataFrames under the same name, the result cont
     joined = left.join(right, on="id")
     joined.select(col("l.id"), col("r.id"))
 
-You can remove the duplicate column after joining.
+The columns in the result appear in the order given in the ``select()`` call.
+
+You can remove the duplicate column after joining. Note that ``drop()`` returns a new DataFrame and leaves ``joined`` unchanged, because DataFusion DataFrames are immutable.
 
 .. ipython:: python
 
@@ -126,7 +130,8 @@ Automatic Deduplication
 ----------------------
 
 Use the ``deduplicate`` argument of :py:meth:`DataFrame.join` to automatically
-drop the duplicate join column from the right DataFrame.
+drop the duplicate join column from the right DataFrame. Without ``deduplicate``, both join columns
+are kept under the same name and must be referenced through the DataFrame-name prefix described above.
 
 .. ipython:: python
 
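A note on the two behavioral statements this commit adds: that column order follows ``select()``, and that ``drop()`` returns a new DataFrame. The sketch below is a toy model in plain Python, not DataFusion code. ``MiniFrame`` and its methods are hypothetical names invented for illustration. It only mimics the semantics described in the changed text, so they can be checked without the library installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniFrame:
    """Hypothetical stand-in for an immutable DataFrame (not the DataFusion API)."""

    # Qualified column names such as "l.id", kept as an ordered tuple.
    columns: tuple

    def select(self, *names):
        # Output order follows the order of the arguments, not the stored order.
        missing = [n for n in names if n not in self.columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return MiniFrame(tuple(names))

    def drop(self, *names):
        # Build and return a new frame; self is frozen and is never modified.
        return MiniFrame(tuple(c for c in self.columns if c not in names))


# Inputs named "l" and "r" both contribute an "id" column to the join result.
joined = MiniFrame(("l.id", "l.val", "r.id"))

reordered = joined.select("r.id", "l.id")  # order comes from the call
trimmed = joined.drop("r.id")              # new frame without the duplicate key
```

In this model ``joined`` still lists all three columns after ``drop()``. Only the returned ``trimmed`` frame lacks ``r.id``, which is the behavior the added sentence attributes to DataFusion.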