Skip to content

Conversation

@dvv
Copy link

@dvv dvv commented Mar 4, 2025

An attempt to fix elixir-ecto/ecto#4589
Please consider applying

@greg-rychlewski
Copy link
Member

Thanks @dvv.

I'm not sure this is actually an issue. When you don't provide a select statement it selects the from struct. So this is akin to writing an invalid query, which should produce an error.

I'm not sure it's something we should change because now we are adding fields to the query that cannot be put into the resulting elixir struct.

@dvv
Copy link
Author

dvv commented Mar 4, 2025

@greg-rychlewski I see.
Do we have any other means to control select clause without too much intervention?
Say, allowing for |> select_merge([x], %{__just_a_placeholder__: field(x, ^field)})?
Or allowing for |> distinct([bar], [asc: bar.name], on: false) to disable DISTINCT ON while keeping field in select clause?

@dvv
Copy link
Author

dvv commented Mar 4, 2025

Ecto.Repo.Queryable

        {%{__struct__: name}, %{}} ->
          for {key, _} <- right, key != false and not Map.has_key?(left, key) do
            raise ArgumentError, "struct #{inspect(name)} does not have the key #{inspect(key)}"
          end

          Map.merge(left, right |> Map.drop([false]))

just does the trick

@josevalim
Copy link
Member

@greg-rychlewski my gut feeling is that it makes sense to make it more convenient to the user and automatically include any order by in the select. The issue, however, is that we will be adding more fields to all queries written up to this day and in some cases, but I guess we could fix those by doing a Enum.uniq on the string values?

@greg-rychlewski
Copy link
Member

The main potential issue I see is when getting the results from the db and trying to collect them at the end. We will have more fields than asked for and they will potentially be incompatible. For example a field that doesn't belong to a struct. If we try to de-dup I'm not 100% sure if that will cause complications as well. Because we generally iterate through a list of returned columns and assume they are in a certain order.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

@greg-rychlewski this PR is exactly for that -- just before composing SQL it analyzes order_bys and appends mentioned fields (if they refer to joins) to select fields.

@greg-rychlewski
Copy link
Member

@dvv That is the query that goes to the database. Now the result comes back to Elixir and it has to be packed inside of the Elixir structure you requested.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

The extra fields appended at this stage do not spoil the result because packing instructions were collected before.

@greg-rychlewski
Copy link
Member

I'm not personally sure if that is true in all cases. But if you and Jose have evaluated that thoroughly and know for sure it's not an issue then that is the main thing I was concerned about.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

@greg-rychlewski I'm not sure too. I just have no sufficient expertise.

That is why I also proposed a lighter way: provide user with possibility to select_merge(%{}) ephemeral fields which would be filtered out when packing the result.

@josevalim
Copy link
Member

Another option is to add a transform operation to either select or the query, so we can do:

select: {foo, foo.bar} |> transform(&elem(&1, 0))

However, it still sounds weird to return all of this over the wire only to discard it. I think ultimately the problem here is in the query. Do you really need to pass distinct: true? Isn't there a particular column that you could use? If so, consider passing distinct columns to distinct: or using group_by: ... instead.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

@josevalim I namely do need to filter over association while returning distinct set of primary relation.
Real world example: I need distinct participants over movies filtered by participant type.

Do you see real problem pruning ephemeral keys from select_merge(%{})?

@josevalim
Copy link
Member

@dvv but if you have primary keys, you can group by the primary key instead, or anything that uniquely identity them.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

It forces to craft ad-hoc queries while the point was to fix (postgresql-bound only?) deficiency.
I'm making a library that turns a map of parameters into Ecto query given primary schema.
When it sees %{distinct: true} I'd like it to just mangle the query properly without human intervention.
It turned out I can not control select clause by pure Ecto means flexibly enough.
But this limitation seems artificial to me, hence my proposal.

Otherwise, you wrote "make it more convenient to the user and automatically include any order by in the select" and it sounds great. We can skip appending order-bys that come from primary schema, I believe. I drafted the patch and it passed tests AFAICS, but I can not state for sure I did it right.

@josevalim
Copy link
Member

josevalim commented Mar 5, 2025

But this limitation seems artificial to me, hence my proposal.

There is nothing artificial given it is coming from the database. Perhaps if you can argue to PostgreSQL folks that it is artificial, then they will remove the restriction from their side, and it will automatically work on Ecto too. But given the database is the one that is imposing it, we need to balance the trade-offs from the database point of view, and the current solution may cause existing apps to load unnecessary data from the database. :(

Another option, if you want this to work transparently, is to use subqueries. If you use distinct: true, you can probably do something like:

from p in subquery(the_actual_query_with_order_by), distinct: true

@dvv
Copy link
Author

dvv commented Mar 5, 2025

By artificial limitation I meant only that mentioned here: elixir-ecto/ecto#4589 (comment)
If we'd treat a key as ephemeral it wouldn't break anything: neither currently running codebase, nor useless overfetching from the DB.

@dvv
Copy link
Author

dvv commented Mar 5, 2025

Or, maybe a special kinda helper would fix:

  • |> just_select_dont_merge([x], [field(x, ^field)])
  • |> select_merge([x], %{bebebe: field(x, ^field)}, just_select: [:bebebe])

@greg-rychlewski
Copy link
Member

greg-rychlewski commented Mar 5, 2025

I'm pretty sure this can be done with lateral join like this

sql~~ ~~SELECT f0.id ~~ ~~FROM foo AS f0 INNER JOIN LATERAL (SELECT * FROM bar WHERE bar.foo_id = foo.id ORDER BY name LIMIT 1) AS b1 ~~ ~~ON true~~ ~~

Actually I am wrong sorry

@greg-rychlewski
Copy link
Member

greg-rychlewski commented Mar 5, 2025

@dvv Is this a hypothetical query or is it something you ran into in real life?

The reason I'm asking is because I'm struggling to identify a real use case for combining these:

  1. distinct: true
  2. ordering by a join table
  3. selecting main table

Since you are selecting the main table and also de-duplicating by the entire row of the main table, I'm not sure what is gained by ordering by the join table. No matter which row you pick from the join table you will always return the same results.

You will have something like this

main_table_row1 join_table_row10
main_table_row1 join_table_row13
main_table_row1 join_table_row99
main_table_row1 join_table_row1001

then you pick one of those and remove the join_table_rowX part. you always get main_table_row1

the only time you get a difference is if you do distinct on a subset of the main table or you include columns from the join table in the result

@josevalim
Copy link
Member

If we'd treat a key as ephemeral it wouldn't break anything: neither currently running codebase, nor useless overfetching from the DB.

Unfortunately the overfetching would still be there, you are just not doing it twice (which is indeed worse). That's why rewriting the query to a subquery or similar should be the best choice. Or, since you have a helper function that builds a query, you could also have a function that runs the query and maps the results, this way you could strip the additional columns added by distinct. I don't understand why the functionality has to be provided by Ecto, you should be able to build abstractions on top.

In any case, it seems we are not going down this path, so we can close this PR. Thank you! I recommend exploring subqueries and rewriting the query for now.

@josevalim josevalim closed this Mar 5, 2025
@dvv
Copy link
Author

dvv commented Mar 5, 2025

@greg-rychlewski master...dvv:ecto_sql:patch-1#diff-df64a1e98eb69a93a67e388be3fb937b476bc769705a2a682d3e8e1ecdc8a38cR603
There's

SELECT DISTINCT s0."x", s1."id" FROM "schema" AS s0 INNER JOIN "schema3" AS s1 ON s1."id" = s0."y" ORDER BY s1."id"

Indeed the order of s0.x depends on s1.id. And if I'd apply limit I'd get different sets.

So in your example's terms my situation is rather:

main_table_row999 join_table_row1
main_table_row9     join_table_row2
main_table_row99   join_table_row3

Still thanks for the heads up

@greg-rychlewski
Copy link
Member

greg-rychlewski commented Mar 5, 2025

That is being distinct on a subset of the row then, is it not? What's the reason for using distinct: true in this case?

@dvv
Copy link
Author

dvv commented Mar 5, 2025

As I said I'm developing a helper, not trying to solve way particular problem.
The helper tries to be composable (just the way the rest of Ecto's trying to be) and as such it does not know the semantics of input query.
If it sees instruction to go distinct it should output query |> distinct(true).
It can not wrap output format in tuple because there's no custom place in %Ecto.Query{} to store the instruction to pipe through |> elem(0) after the fetch.
Seems it just couldn't be done in generic way.

@Schultzer
Copy link
Contributor

As I said I'm developing a helper, not trying to solve way particular problem. The helper tries to be composable (just the way the rest of Ecto's trying to be) and as such it does not know the semantics of input query. If it sees instruction to go distinct it should output query |> distinct(true). It can not wrap output format in tuple because there's no custom place in %Ecto.Query{} to store the instruction to pipe through |> elem(0) after the fetch. Seems it just couldn't be done in generic way.

This might be way off topic, but I wonder if something like https://github.com/elixir-dbvisor/sql would be helpful in your situation, you can compose and use parameterized queries with your Ecto.Repo. Although there is currently no support for Ecto.Schema so you would have to tie that together yourself. But it gives you complete control over the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

for SELECT DISTINCT, ORDER BY expressions must appear in select list

4 participants