The queries in DataFusion scan data from tables and return 0 or more rows. Please be aware that column names in queries are made lower-case, but not on the inferred schema. Accordingly, if you want to query against a capitalized field, make sure to use double quotes. Please see this example for clarification. In this documentation we describe the SQL syntax in DataFusion.
DataFusion supports the following syntax for queries:
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ JOIN join_item [, ...] ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ QUALIFY condition]
[ UNION [ ALL | select ]
[ ORDER BY expression [ ASC | DESC ][, ...] ]
[ LIMIT count ]
[ EXCLUDE | EXCEPT ]
Pipe operators
A with clause allows to give names for queries and reference them by name.
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;Example:
SELECT a, b, a + b FROM tableThe DISTINCT quantifier can be added to make the query return all distinct rows.
By default ALL will be used, which returns all the rows.
SELECT DISTINCT person, age FROM employeesExample:
SELECT t.a FROM table AS tExample:
SELECT a FROM table WHERE a > 10DataFusion supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN,
FULL OUTER JOIN, NATURAL JOIN, CROSS JOIN, LEFT SEMI JOIN,
RIGHT SEMI JOIN, LEFT ANTI JOIN, and RIGHT ANTI JOIN.
Unless you add an ORDER BY clause, joins do not guarantee the order of
the returned rows. DataFusion executes queries in parallel, so the same
join query may produce the same rows in a different order across runs.
The following examples are based on this table:
select * from x;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+The keywords JOIN or INNER JOIN define a join that only shows rows where there is a match in both tables.
SELECT * FROM x INNER JOIN x y ON x.column_1 = y.column_1;
+----------+----------+----------+----------+
| column_1 | column_2 | column_1 | column_2 |
+----------+----------+----------+----------+
| 1 | 2 | 1 | 2 |
+----------+----------+----------+----------+The keywords LEFT JOIN or LEFT OUTER JOIN define a join that includes all rows from the left table even if there
is not a match in the right table. When there is no match, null values are produced for the right side of the join.
SELECT * FROM x LEFT JOIN x y ON x.column_1 = y.column_2;
+----------+----------+----------+----------+
| column_1 | column_2 | column_1 | column_2 |
+----------+----------+----------+----------+
| 1 | 2 | | |
+----------+----------+----------+----------+The keywords RIGHT JOIN or RIGHT OUTER JOIN define a join that includes all rows from the right table even if there
is not a match in the left table. When there is no match, null values are produced for the left side of the join.
SELECT * FROM x RIGHT JOIN x y ON x.column_1 = y.column_2;
+----------+----------+----------+----------+
| column_1 | column_2 | column_1 | column_2 |
+----------+----------+----------+----------+
| | | 1 | 2 |
+----------+----------+----------+----------+The keywords FULL JOIN or FULL OUTER JOIN define a join that is effectively a union of a LEFT OUTER JOIN and
RIGHT OUTER JOIN. It will show all rows from the left and right side of the join and will produce null values on
either side of the join where there is not a match.
SELECT * FROM x FULL OUTER JOIN x y ON x.column_1 = y.column_2;
+----------+----------+----------+----------+
| column_1 | column_2 | column_1 | column_2 |
+----------+----------+----------+----------+
| 1 | 2 | | |
| | | 1 | 2 |
+----------+----------+----------+----------+A NATURAL JOIN defines an inner join based on common column names found between the input tables. When no common
column names are found, it behaves like a CROSS JOIN.
SELECT * FROM x NATURAL JOIN x y;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+A CROSS JOIN produces a cartesian product that matches every row in the left side of the join with every row in the
right side of the join.
SELECT * FROM x CROSS JOIN x y;
+----------+----------+----------+----------+
| column_1 | column_2 | column_1 | column_2 |
+----------+----------+----------+----------+
| 1 | 2 | 1 | 2 |
+----------+----------+----------+----------+The LEFT SEMI JOIN returns all rows from the left table that have at least one matching row in the right table, and
projects only the columns from the left table.
SELECT * FROM x LEFT SEMI JOIN x y ON x.column_1 = y.column_1;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+The RIGHT SEMI JOIN returns all rows from the right table that have at least one matching row in the left table, and
only projects the columns from the right table.
SELECT * FROM x RIGHT SEMI JOIN x y ON x.column_1 = y.column_1;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+The LEFT ANTI JOIN returns all rows from the left table that do not have any matching row in the right table, projecting
only the left table’s columns.
SELECT * FROM x LEFT ANTI JOIN x y ON x.column_1 = y.column_1;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
+----------+----------+The RIGHT ANTI JOIN returns all rows from the right table that do not have any matching row in the left table, projecting
only the right table’s columns.
SELECT * FROM x RIGHT ANTI JOIN x y ON x.column_1 = y.column_1;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
+----------+----------+Example:
SELECT a, b, MAX(c) FROM table GROUP BY a, bGROUP BY determines how rows are grouped for aggregation, but it does not
determine the order of the output rows. If you need a stable row order, add
an ORDER BY clause to the outer query.
Some aggregation functions accept optional ordering requirement, such as ARRAY_AGG. If a requirement is given,
aggregation is calculated in the order of the requirement.
Example:
SELECT a, b, ARRAY_AGG(c, ORDER BY d) FROM table GROUP BY a, bExample:
SELECT a, b, MAX(c) FROM table GROUP BY a, b HAVING MAX(c) > 10Example:
SELECT ROW_NUMBER() OVER (PARTITION BY region) AS rk FROM table QUALIFY rk > 1;Example:
SELECT
a,
b,
c
FROM table1
UNION ALL
SELECT
a,
b,
c
FROM table2Orders the results by the referenced expression. By default it uses ascending order (ASC).
This order can be changed to descending by adding DESC after the order-by expressions.
Without ORDER BY, DataFusion does not guarantee the order of result rows.
This is especially important for queries involving joins, GROUP BY,
UNION, or parallel file scans, where rows may be returned in a different
order between runs even when the data itself has not changed.
Examples:
SELECT age, person FROM table ORDER BY age;
SELECT age, person FROM table ORDER BY age DESC;
SELECT age, person FROM table ORDER BY age, person DESC;Limits the number of rows to be a maximum of count rows. count should be a non-negative integer.
Example:
SELECT age, person FROM table
LIMIT 10Excluded named columns from query results.
Example selecting all columns except for age and person:
SELECT * EXCEPT(age, person)
FROM table;SELECT * EXCLUDE(age, person)
FROM table;Some SQL dialects (e.g. BigQuery) support the pipe operator |>.
The SQL dialect can be set like this:
set datafusion.sql_parser.dialect = 'BigQuery';DataFusion currently supports the following pipe operators:
(pipe_where)=
select * from range(0,10)
|> where value < 2;
+-------+
| value |
+-------+
| 0 |
| 1 |
+-------+(pipe_order_by)=
select * from range(0,3)
|> order by value desc;
+-------+
| value |
+-------+
| 2 |
| 1 |
| 0 |
+-------+(pipe_limit)=
select * from range(0,3)
|> order by value desc
|> limit 1;
+-------+
| value |
+-------+
| 2 |
+-------+(pipe_select)=
select * from range(0,3)
|> select value + 10;
+---------------------------+
| range().value + Int64(10) |
+---------------------------+
| 10 |
| 11 |
| 12 |
+---------------------------+(pipe_extend)=
select * from range(0,3)
|> extend -value AS minus_value;
+-------+-------------+
| value | minus_value |
+-------+-------------+
| 0 | 0 |
| 1 | -1 |
| 2 | -2 |
+-------+-------------+(pipe_as)=
select * from range(0,3)
|> as my_range
|> SELECT my_range.value;
+-------+
| value |
+-------+
| 0 |
| 1 |
| 2 |
+-------+(pipe_union)=
select * from range(0,3)
|> union all (
select * from range(3,6)
);
+-------+
| value |
+-------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+-------+(pipe_intersect)=
select * from range(0,100)
|> INTERSECT DISTINCT (
select 3
);
+-------+
| value |
+-------+
| 3 |
+-------+(pipe_except)=
select * from range(0,10)
|> EXCEPT DISTINCT (select * from range(5,10));
+-------+
| value |
+-------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
+-------+(pipe_aggregate)=
select * from range(0,3)
|> aggregate sum(value) AS total;
+-------+
| total |
+-------+
| 3 |
+-------+(pipe_join)=
(
SELECT 'apples' AS item, 2 AS sales
UNION ALL
SELECT 'bananas' AS item, 5 AS sales
)
|> AS produce_sales
|> LEFT JOIN
(
SELECT 'apples' AS item, 123 AS id
) AS produce_data
ON produce_sales.item = produce_data.item
|> SELECT produce_sales.item, sales, id;
+--------+-------+------+
| item | sales | id |
+--------+-------+------+
| apples | 2 | 123 |
| bananas| 5 | NULL |
+--------+-------+------+