diff --git a/docs/cookbook.md b/docs/cookbook.md index 5f2f680..b55017e 100644 --- a/docs/cookbook.md +++ b/docs/cookbook.md @@ -40,12 +40,13 @@ For this question we will use the data in `./data/city.csv`. Query all columns for a city with the ID 1661. ### Solution + ```haskell dataframe> df |> D.filterWhere (id .== 1661) ----------------------------------------------------- id | name | country_code | district | population ------|--------|--------------|----------|----------- - Int | Text | Text | Text | Int + Int | Text | Text | Text | Int ------|--------|--------------|----------|----------- 1661 | Sayama | JPN | Saitama | 162472 ``` @@ -57,28 +58,31 @@ For this question we will use the data in `./data/city.csv`. Query all columns of every Japanese city. The `country_code` for Japan is "JPN". ### Solution + ```haskell dataframe> df |> D.filterWhere (country_code .== "JPN") -------------------------------------------------------- id | name | country_code | district | population ------|----------|--------------|-----------|----------- - Int | Text | Text | Text | Int + Int | Text | Text | Text | Int ------|----------|--------------|-----------|----------- - 1613 | Neyagawa | JPN | Osaka | 257315 - 1630 | Ageo | JPN | Saitama | 209442 - 1661 | Sayama | JPN | Saitama | 162472 - 1681 | Omuta | JPN | Fukuoka | 142889 + 1613 | Neyagawa | JPN | Osaka | 257315 + 1630 | Ageo | JPN | Saitama | 209442 + 1661 | Sayama | JPN | Saitama | 162472 + 1681 | Omuta | JPN | Fukuoka | 142889 1739 | Tokuyama | JPN | Yamaguchi | 107078 ``` ### Combining Conditions Often you'll need to filter on multiple conditions simultaneously. 
You can combine boolean expressions using logical operators: + - `.&&` for AND (both conditions must be true) - `.||` for OR (either condition can be true) - `.>`, `.>=`, `.<`, `.<=` for comparisons For example, to find cities with large populations in a specific country: + ```haskell df |> D.filterWhere ((population .> 100000) .&& (country_code .== "USA")) ``` @@ -88,10 +92,12 @@ df |> D.filterWhere ((population .> 100000) .&& (country_code .== "USA")) For this question we will use the data in `./data/city.csv`. Query all columns for all American cities in the city dataframe with: -* populations larger than 100000, and -* the CountryCode for America is "USA". + +- populations larger than 100000, and +- the CountryCode for America is "USA". ### Solution + ```haskell dataframe> df <- D.readCsv "./data/city.csv" dataframe> :declareColumns df @@ -99,12 +105,12 @@ dataframe> df |> D.filterWhere ((population .> 100000) .&& (country_code .== "US -------------------------------------------------------------- id | name | country_code | district | population ------|---------------|--------------|------------|----------- - Int | Text | Text | Text | Int + Int | Text | Text | Text | Int ------|---------------|--------------|------------|----------- - 3878 | Scottsdale | USA | Arizona | 202705 - 3965 | Corona | USA | California | 124966 - 3973 | Concord | USA | California | 121780 - 3977 | Cedar Rapids | USA | Iowa | 120758 + 3878 | Scottsdale | USA | Arizona | 202705 + 3965 | Corona | USA | California | 124966 + 3973 | Concord | USA | California | 121780 + 3977 | Cedar Rapids | USA | Iowa | 120758 3982 | Coral Springs | USA | Florida | 117549 ``` @@ -125,17 +131,18 @@ For this question we will use the data in `./data/city.csv`. Show the first 5 rows of the dataframe.
### Solution + ```haskell dataframe> df |> D.take 5 ---------------------------------------------------------------- id | name | country_code | district | population -----|------------------|--------------|-------------------|----------- - Int | Text | Text | Text | Int + Int | Text | Text | Text | Int -----|------------------|--------------|-------------------|----------- - 6 | Rotterdam | NLD | Zuid-Holland | 593321 - 19 | Zaanstad | NLD | Noord-Holland | 135621 - 214 | Porto Alegre | BRA | Rio Grande do Sul | 1314032 - 397 | Lauro de Freitas | BRA | Bahia | 109236 + 6 | Rotterdam | NLD | Zuid-Holland | 593321 + 19 | Zaanstad | NLD | Noord-Holland | 135621 + 214 | Porto Alegre | BRA | Rio Grande do Sul | 1314032 + 397 | Lauro de Freitas | BRA | Bahia | 109236 547 | Dobric | BGR | Varna | 100399 ``` @@ -156,16 +163,17 @@ For this question we will use the data in `./data/city.csv`. Get the first 5 city names. ### Solution + ```haskell dataframe> df |> D.select [F.name name] |> D.take 5 ----------------- - name + name ----------------- - Text + Text ----------------- - Rotterdam - Zaanstad - Porto Alegre + Rotterdam + Zaanstad + Porto Alegre Lauro de Freitas Dobric ``` @@ -187,17 +195,18 @@ For this question we will use the data in `./data/city.csv`. Query the names of all the Japanese cities and show only the first 5 results. ### Solution + ```haskell dataframe> df |> D.filterWhere (country_code .== "JPN") |> D.select [F.name name] |> D.take 5 --------- - name + name --------- - Text + Text --------- Neyagawa - Ageo - Sayama - Omuta + Ageo + Sayama + Omuta Tokuyama ``` @@ -208,18 +217,19 @@ For this question we will use the data in `./data/station.csv`. Show the first five city and state rows.
### Solution + ```haskell dataframe> df |> D.select [F.name city, F.name state] |> D.take 5 --------------------- city | state --------------|------ - Text | Text + Text | Text --------------|------ - Kissee Mills | MO - Loma Mar | CA - Sandy Hook | CT - Tipton | IN - Arlington | CO + Kissee Mills | MO + Loma Mar | CA + Sandy Hook | CT + Tipton | IN + Arlington | CO ``` ## Removing Duplicates @@ -239,32 +249,33 @@ For this question we will use the data in `./data/station.csv`. Query a list of city names for cities that have an even ID number. Show the results in any order, but exclude duplicates from the answer. ### Solution + ```haskell -dataframe> df |> D.filterWhere (F.lift even id) |> D.select [F.name city] |> D.distinct +dataframe> df |> D.filterWhere (F.lift even id) |> D.select [F.name city] |> D.distinct ---------------------- - city + city ---------------------- - Text + Text ---------------------- - Rockton - Forest Lakes - Yellow Pine - Mosca - Rocheport - Millville + Rockton + Forest Lakes + Yellow Pine + Mosca + Rocheport + Millville ... - Lee - Elm Grove - Orange City - Baker + Lee + Elm Grove + Orange City + Baker Clutier ``` ## Sorting and Combining Results Sometimes you need to sort data and then combine results from multiple queries. The `sortBy` function orders rows by specified columns. Much like SQL, you can specify multiple columns to -order by. The results are ordered by the first column, with ties broken by the next column -respectively. +order by. The results are ordered by the first column, with ties broken by the next column +respectively. You can also use the `<>` operator to concatenate dataframes vertically (similar to SQL's UNION). @@ -273,7 +284,7 @@ You can also use the `<>` operator to concatenate dataframes vertically (sim df |> D.sortBy [D.Asc age] -- 1. Sort by ascending age -- 2. Within those who have the same age, sort by reverse alphabetical order of name.
-df |> D.sortBy [D.Asc age, D.Desc name] +df |> D.sortBy [D.Asc age, D.Desc name] ``` You can also derive new columns using `derive` to compute values based on existing columns: @@ -342,14 +353,77 @@ dataframe> df |> D.select [F.name city] |> D.filterWhere (F.lift (\c -> any (`T. Odin ``` +## Reading Parquet with Options + +Parquet reads can be configured so you only load the columns and rows you need. This is useful when files are wide or when you want to filter data at read-time. + +For this section we will use `./data/mtcars.parquet`. + +```haskell +dataframe> df0 <- D.readParquet "./data/mtcars.parquet" +dataframe> :declareColumns df0 +``` + +`ParquetReadOptions` currently supports: + +- `selectedColumns` +- `predicate` +- `rowRange` + +Options are applied in this order: predicate filtering, column projection, then row range. + +**Exercise 11: Parquet projection** + +Read only the `mpg`, `cyl`, and `wt` columns. + +### Solution + +```haskell +dataframe> D.readParquetWithOpts +dataframe| (D.defaultParquetReadOptions{D.selectedColumns = Just ["mpg", "cyl", "wt"]}) +dataframe| "./data/mtcars.parquet" +``` + +**Exercise 12: Row range** + +Read rows `5` to `10` (start inclusive, end exclusive). + +### Solution + +```haskell +dataframe> D.readParquetWithOpts +dataframe| (D.defaultParquetReadOptions{D.rowRange = Just (5, 10)}) +dataframe| "./data/mtcars.parquet" +``` + +**Exercise 13: Predicate and projection** + +Read rows where `cyl >= 6`, but return only the `mpg` column. + +### Solution + +```haskell +dataframe> D.readParquetWithOpts +dataframe| ( D.defaultParquetReadOptions +dataframe| { D.selectedColumns = Just ["mpg"] +dataframe| , D.predicate = Just (cyl .>= 6) +dataframe| } +dataframe| ) +dataframe| "./data/mtcars.parquet" +``` + +When `selectedColumns` is set, columns referenced by `predicate` are automatically read as needed, then projected back to the requested output columns. 
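The three option stages can also be modeled on plain Haskell lists. The sketch below is purely conceptual (it is not the library's implementation; `applyOpts` and the tuple row type are invented for illustration), but it shows why the documented order — predicate, then projection, then row range — matters: the predicate sees full rows, the row range sees already-filtered, already-projected output.

```haskell
-- Conceptual sketch of the ParquetReadOptions pipeline on plain lists.
-- NOT the library's implementation; `applyOpts` and the (mpg, cyl)
-- tuple rows are invented here for illustration.
applyOpts ::
  (row -> Bool)     -- stage 1: predicate filtering
  -> (row -> out)   -- stage 2: column projection
  -> Maybe (Int, Int) -- stage 3: row range, start-inclusive / end-exclusive
  -> [row]
  -> [out]
applyOpts keep project range rows =
  let projected = map project (filter keep rows)
   in case range of
        Nothing -> projected
        Just (start, end) -> take (end - start) (drop start projected)

main :: IO ()
main = do
  -- rows stand in for mtcars records as (mpg, cyl) pairs
  let rows = [(21.0, 6), (22.8, 4), (18.7, 8), (14.3, 8)] :: [(Double, Int)]
  -- keep cyl >= 6, return only mpg
  print (applyOpts (\(_, cyl) -> cyl >= 6) fst Nothing rows)
  -- prints [21.0,18.7,14.3]
  -- same query, restricted to rows 1..2 of the filtered result
  print (applyOpts (\(_, cyl) -> cyl >= 6) fst (Just (1, 3)) rows)
  -- prints [18.7,14.3]
```

Because filtering runs before projection, the predicate can reference columns that are later dropped — mirroring how `predicate` columns are auto-included for decoding and then projected out.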
+ ## Summary You've now learned the fundamental operations for working with dataframes in Haskell: + - **Filtering** with `filterWhere` to select rows based on conditions - **Selecting** with `select` to choose specific columns - **Limiting** with `take` to control output size - **Removing duplicates** with `distinct` - **Sorting** with `sortBy` and combining results with `<>` - **Applying custom functions** with `F.lift` for sophisticated data manipulation +- **Reading Parquet with options** using `readParquetWithOpts` for projection, predicate filtering, and row ranges -These building blocks can be composed together to answer complex data analysis questions in a clear, functional style. \ No newline at end of file +These building blocks can be composed together to answer complex data analysis questions in a clear, functional style. diff --git a/src/DataFrame.hs b/src/DataFrame.hs index 088fc7a..382f507 100644 --- a/src/DataFrame.hs +++ b/src/DataFrame.hs @@ -106,6 +106,8 @@ __I/O__ * @D.writeCsv :: FilePath -> DataFrame -> IO ()@ * @D.readParquet :: FilePath -> IO DataFrame@ * @D.readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame@ + * @D.readParquetFiles :: FilePath -> IO DataFrame@ + * @D.readParquetFilesWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame@ __Exploration__ diff --git a/src/DataFrame/IO/Parquet.hs b/src/DataFrame/IO/Parquet.hs index e8353e5..10b1060 100644 --- a/src/DataFrame/IO/Parquet.hs +++ b/src/DataFrame/IO/Parquet.hs @@ -40,13 +40,40 @@ import System.Directory (doesDirectoryExist) import qualified Data.Vector.Unboxed as VU import System.FilePath ((</>)) +{- | Options for reading Parquet data. + +These options are applied in this order: + +1. predicate filtering +2. column projection +3. row range + +Column selection for @selectedColumns@ uses leaf column names only. +-} data ParquetReadOptions = ParquetReadOptions { selectedColumns :: Maybe [T.Text] + {- ^ Columns to keep in the final dataframe.
If set, only these columns are returned. + Predicate-referenced columns are read automatically when needed and projected out after filtering. + -} , predicate :: Maybe (Expr Bool) + -- ^ Optional row filter expression applied before projection. , rowRange :: Maybe (Int, Int) + -- ^ Optional row slice @(start, end)@ with start-inclusive/end-exclusive semantics. } deriving (Eq, Show) +{- | Default Parquet read options. + +Equivalent to: + +@ +ParquetReadOptions + { selectedColumns = Nothing + , predicate = Nothing + , rowRange = Nothing + } +@ +-} defaultParquetReadOptions :: ParquetReadOptions defaultParquetReadOptions = ParquetReadOptions @@ -65,6 +92,18 @@ ghci> D.readParquet ".\/data\/mtcars.parquet" readParquet :: FilePath -> IO DataFrame readParquet = readParquetWithOpts defaultParquetReadOptions +{- | Read a Parquet file using explicit read options. + +==== __Example__ +@ +ghci> D.readParquetWithOpts +ghci| (D.defaultParquetReadOptions{D.selectedColumns = Just ["id"], D.rowRange = Just (0, 10)}) +ghci| "./tests/data/alltypes_plain.parquet" +@ + +When @selectedColumns@ is set and @predicate@ references other columns, those predicate columns +are auto-included for decoding, then projected back to the requested output columns. +-} readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame readParquetWithOpts opts path = do fileMetadata <- readMetadataFromPath path @@ -173,9 +212,27 @@ readParquetWithOpts opts path = do pure $ applyReadOptions opts (DI.fromNamedColumns orderedColumns) +{- | Read Parquet files from a directory or glob path. + +This is equivalent to calling 'readParquetFilesWithOpts' with 'defaultParquetReadOptions'. +-} readParquetFiles :: FilePath -> IO DataFrame readParquetFiles = readParquetFilesWithOpts defaultParquetReadOptions +{- | Read multiple Parquet files (directory or glob) using explicit options. + +If @path@ is a directory, all non-directory entries are read. +If @path@ is a glob, matching files are read. 
+ +For multi-file reads, @rowRange@ is applied once after concatenation (global range semantics). + +==== __Example__ +@ +ghci> D.readParquetFilesWithOpts +ghci| (D.defaultParquetReadOptions{D.selectedColumns = Just ["id"], D.rowRange = Just (0, 5)}) +ghci| "./tests/data/alltypes_plain*.parquet" +@ +-} readParquetFilesWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame readParquetFilesWithOpts opts path = do isDir <- doesDirectoryExist path diff --git a/tests/Operations/Shuffle.hs b/tests/Operations/Shuffle.hs index 9b609c3..4ccf3e0 100644 --- a/tests/Operations/Shuffle.hs +++ b/tests/Operations/Shuffle.hs @@ -81,7 +81,7 @@ shuffleDoesNotAddOrDropIndices :: Test shuffleDoesNotAddOrDropIndices = let gen = mkStdGen 42 - actual = (Set.fromList [0 .. 10]) + actual = Set.fromList [0 .. 10] computedVector = shuffledIndices gen 11 computed = (Set.fromList $ VU.toList $ shuffledIndices gen 11) in