cope with distinct: true #655

dvv · 2025-03-04T13:09:30Z

An attempt to fix elixir-ecto/ecto#4589
Please consider applying

greg-rychlewski · 2025-03-04T16:48:56Z

Thanks @dvv.

I'm not sure this is actually an issue. When you don't provide a select statement it selects the from struct. So this is akin to writing an invalid query, which should produce an error.

I'm not sure it's something we should change because now we are adding fields to the query that cannot be put into the resulting elixir struct.

dvv · 2025-03-04T17:06:03Z

@greg-rychlewski I see.
Do we have any other means to control select clause without too much intervention?
Say, allowing for |> select_merge([x], %{__just_a_placeholder__: field(x, ^field)})?
Or allowing for |> distinct([bar], [asc: bar.name], on: false) to disable DISTINCT ON while keeping field in select clause?

dvv · 2025-03-04T17:18:51Z

Ecto.Repo.Queryable

        {%{__struct__: name}, %{}} ->
          for {key, _} <- right, key != false and not Map.has_key?(left, key) do
            raise ArgumentError, "struct #{inspect(name)} does not have the key #{inspect(key)}"
          end

          Map.merge(left, right |> Map.drop([false]))

just does the trick

josevalim · 2025-03-04T19:05:33Z

@greg-rychlewski my gut feeling is that it makes sense to make it more convenient to the user and automatically include any order by in the select. The issue, however, is that we will be adding more fields to all queries written up to this day and in some cases, but I guess we could fix those by doing a Enum.uniq on the string values?

greg-rychlewski · 2025-03-04T21:04:08Z

The main potential issue I see is when getting the results from the db and trying to collect them at the end. We will have more fields than asked for and they will potentially be incompatible. For example a field that doesn't belong to a struct. If we try to de-dup I'm not 100% sure if that will cause complications as well. Because we generally iterate through a list of returned columns and assume they are in a certain order.

dvv · 2025-03-05T03:02:53Z

@greg-rychlewski this PR is exactly for that -- just before composing SQL it analyzes order_bys and appends mentioned fields (if they refer to joins) to select fields.

greg-rychlewski · 2025-03-05T03:07:32Z

@dvv That is the query that goes to the database. Now the result comes back to Elixir and it has to be packed inside of the Elixir structure you requested.

dvv · 2025-03-05T03:15:32Z

The extra fields appended at this stage do not spoil the result because packing instructions were collected before.

greg-rychlewski · 2025-03-05T03:31:46Z

I'm not personally sure if that is true in all cases. But if you and Jose have evaluated that thoroughly and know for sure it's not an issue then that is the main thing I was concerned about.

dvv · 2025-03-05T03:51:28Z

@greg-rychlewski I'm not sure too. I just have no sufficient expertise.

That is why I also proposed a lighter way: provide user with possibility to select_merge(%{}) ephemeral fields which would be filtered out when packing the result.

josevalim · 2025-03-05T08:36:15Z

Another option is to add a transform operation to either select or the query, so we can do:

select: {foo, foo.bar} |> transform(&elem(&1, 0))

However, it still sounds weird to return all of this over the wire only to discard it. I think ultimately the problem here is in the query. Do you really need to pass distinct: true? Isn't there a particular column that you could use? If so, consider passing distinct columns to distinct: or using group_by: ... instead.

dvv · 2025-03-05T09:04:47Z

@josevalim I namely do need to filter over association while returning distinct set of primary relation.
Real world example: I need distinct participants over movies filtered by participant type.

Do you see real problem pruning ephemeral keys from select_merge(%{})?

josevalim · 2025-03-05T09:12:20Z

@dvv but if you have primary keys, you can group by the primary key instead, or anything that uniquely identity them.

dvv · 2025-03-05T09:38:22Z

It forces to craft ad-hoc queries while the point was to fix (postgresql-bound only?) deficiency.
I'm making a library that turns a map of parameters into Ecto query given primary schema.
When it sees %{distinct: true} I'd like it to just mangle the query properly without human intervention.
It turned out I can not control select clause by pure Ecto means flexibly enough.
But this limitation seems artificial to me, hence my proposal.

Otherwise, you wrote "make it more convenient to the user and automatically include any order by in the select" and it sounds great. We can skip appending order-bys that come from primary schema, I believe. I drafted the patch and it passed tests AFAICS, but I can not state for sure I did it right.

josevalim · 2025-03-05T09:43:48Z

But this limitation seems artificial to me, hence my proposal.

There is nothing artificial given it is coming from the database. Perhaps if you can argue to PostgreSQL folks that it is artificial, then they will remove the restriction from their side, and it will automatically work on Ecto too. But given the database is the one that is imposing it, we need to balance the trade-offs from the database point of view, and the current solution may cause existing apps to load unnecessary data from the database. :(

Another option, if you want this to work transparently, is to use subqueries. If you use distinct: true, you can probably do something like:

from p in subquery(the_actual_query_with_order_by), distinct: true

dvv · 2025-03-05T09:50:31Z

By artificial limitation I meant only that mentioned here: elixir-ecto/ecto#4589 (comment)
If we'd treat a key as ephemeral it wouldn't break anything: neither currently running codebase, nor useless overfetching from the DB.

dvv · 2025-03-05T09:57:34Z

Or, maybe a special kinda helper would fix:

|> just_select_dont_merge([x], [field(x, ^field)])
|> select_merge([x], %{bebebe: field(x, ^field)}, just_select: [:bebebe])

greg-rychlewski · 2025-03-05T10:59:12Z

~~I'm pretty sure this can be done with lateral join like this~~

~~sql~~ ~~SELECT f0.id ~~ ~~FROM foo AS f0 INNER JOIN LATERAL (SELECT * FROM bar WHERE bar.foo_id = foo.id ORDER BY name LIMIT 1) AS b1 ~~ ~~ON true~~ ~~~~

Actually I am wrong sorry

greg-rychlewski · 2025-03-05T11:13:51Z

@dvv Is this a hypothetical query or is it something you ran into in real life?

The reason I'm asking is because I'm struggling to identify a real use case for combining these:

distinct: true
ordering by a join table
selecting main table

Since you are selecting the main table and also de-duplicating by the entire row of the main table, I'm not sure what is gained by ordering by the join table. No matter which row you pick from the join table you will always return the same results.

You will have something like this

main_table_row1 join_table_row10
main_table_row1 join_table_row13
main_table_row1 join_table_row99
main_table_row1 join_table_row1001

then you pick one of those and remove the join_table_rowX part. you always get main_table_row1

the only time you get a difference is if you do distinct on a subset of the main table or you include columns from the join table in the result

josevalim · 2025-03-05T11:51:59Z

If we'd treat a key as ephemeral it wouldn't break anything: neither currently running codebase, nor useless overfetching from the DB.

Unfortunately the overfetching would still be there, you are just not doing it twice (which is indeed worse). That's why rewriting the query to a subquery or similar should be the best choice. Or, since you have a helper function that builds a query, you could also have a function that runs the query and maps the results, this way you could strip the additional columns added by distinct. I don't understand why the functionality has to be provided by Ecto, you should be able to build abstractions on top.

In any case, it seems we are not going down this path, so we can close this PR. Thank you! I recommend exploring subqueries and rewriting the query for now.

dvv · 2025-03-05T16:16:05Z

@greg-rychlewski master...dvv:ecto_sql:patch-1#diff-df64a1e98eb69a93a67e388be3fb937b476bc769705a2a682d3e8e1ecdc8a38cR603
There's

SELECT DISTINCT s0."x", s1."id" FROM "schema" AS s0 INNER JOIN "schema3" AS s1 ON s1."id" = s0."y" ORDER BY s1."id"

Indeed the order of s0.x depends on s1.id. And if I'd apply limit I'd get different sets.

So in your example's terms my situation is rather:

main_table_row999 join_table_row1
main_table_row9     join_table_row2
main_table_row99   join_table_row3

Still thanks for the heads up

greg-rychlewski · 2025-03-05T16:20:49Z

That is being distinct on a subset of the row then, is it not? What's the reason for using distinct: true in this case?

dvv · 2025-03-05T16:51:43Z

As I said I'm developing a helper, not trying to solve way particular problem.
The helper tries to be composable (just the way the rest of Ecto's trying to be) and as such it does not know the semantics of input query.
If it sees instruction to go distinct it should output query |> distinct(true).
It can not wrap output format in tuple because there's no custom place in %Ecto.Query{} to store the instruction to pipe through |> elem(0) after the fetch.
Seems it just couldn't be done in generic way.

Schultzer · 2025-03-05T17:05:27Z

As I said I'm developing a helper, not trying to solve way particular problem. The helper tries to be composable (just the way the rest of Ecto's trying to be) and as such it does not know the semantics of input query. If it sees instruction to go distinct it should output query |> distinct(true). It can not wrap output format in tuple because there's no custom place in %Ecto.Query{} to store the instruction to pipe through |> elem(0) after the fetch. Seems it just couldn't be done in generic way.

This might be way off topic, but I wonder if something like https://github.com/elixir-dbvisor/sql would be helpful in your situation, you can compose and use parameterized queries with your Ecto.Repo. Although there is currently no support for Ecto.Schema so you would have to tie that together yourself. But it gives you complete control over the query.

dvv force-pushed the patch-1 branch from 5b7fbb7 to 1858dcf Compare March 5, 2025 03:10

cope with distinct: true

5a137a5

dvv force-pushed the patch-1 branch from 1858dcf to 5a137a5 Compare March 5, 2025 03:41

josevalim closed this Mar 5, 2025

cope with distinct: true #655

cope with distinct: true #655

Uh oh!

Conversation

dvv commented Mar 4, 2025

Uh oh!

greg-rychlewski commented Mar 4, 2025

Uh oh!

dvv commented Mar 4, 2025

Uh oh!

dvv commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Mar 4, 2025

Uh oh!

greg-rychlewski commented Mar 4, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

greg-rychlewski commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greg-rychlewski commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

josevalim commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

josevalim commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

josevalim commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

greg-rychlewski commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greg-rychlewski commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Mar 5, 2025

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

greg-rychlewski commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dvv commented Mar 5, 2025

Uh oh!

Schultzer commented Mar 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dvv commented Mar 4, 2025 •

edited

Loading

dvv commented Mar 5, 2025 •

edited

Loading

josevalim commented Mar 5, 2025 •

edited

Loading

greg-rychlewski commented Mar 5, 2025 •

edited

Loading

greg-rychlewski commented Mar 5, 2025 •

edited

Loading

greg-rychlewski commented Mar 5, 2025 •

edited

Loading