Delta
With `DeltaSource` you can consume a Delta Lake table to fetch the features. Example Delta data source configuration:
```python
from yummy import DeltaSource

my_stats_csv = DeltaSource(
    path="/home/jovyan/notebooks/dataset/delta/",
    timestamp_field="datetime",
)
```
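Like any Feast data source, the `DeltaSource` can then back a feature view. A minimal sketch, assuming a hypothetical `driver` entity and placeholder column names (adjust to your dataset and Feast version):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32

# Hypothetical entity and schema; not part of the example dataset above.
driver = Entity(name="driver", join_keys=["driver_id"])

my_stats_view = FeatureView(
    name="my_statistics",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=my_stats_csv,  # the DeltaSource defined above
)
```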
Additionally, you can set `s3_endpoint_override: str`.
You can also read from:
* local filesystem
* s3 store (use `s3_endpoint_override` to point at a custom S3-compatible store like MinIO, as sketched below)
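A minimal sketch of such a configuration, assuming a MinIO instance at `http://localhost:9000` and a hypothetical bucket name:

```python
from yummy import DeltaSource

my_stats_minio = DeltaSource(
    path="s3://feature-bucket/delta/",  # hypothetical bucket/path
    timestamp_field="datetime",
    s3_endpoint_override="http://localhost:9000",  # custom S3 endpoint, e.g. MinIO
)
```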
`DeltaSource` will be handled differently by the yummy backends. For `polars`, `dask` and `ray` the Rust delta-rs implementation will be used.
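yummy drives this internally, but a minimal sketch of reading the same table directly, assuming the `deltalake` package (the delta-rs Python bindings) is installed, shows what happens under the hood:

```python
from deltalake import DeltaTable

# Open the Delta table through the Rust delta-rs bindings
# and materialize it as a pandas DataFrame.
dt = DeltaTable("/home/jovyan/notebooks/dataset/delta/")
df = dt.to_pandas()
```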
For `spark` the Spark Delta reader will be used. To use it you will need to include additional Spark packages (e.g. `io.delta:delta-core_2.12:1.1.0`).
For example, to run pyspark with the Jupyter driver:
```bash
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="lab --notebook-dir=/home/jovyan --ip='0.0.0.0' --port=8888 --no-browser --allow-root --NotebookApp.password='' --NotebookApp.token=''"

pyspark \
  --packages io.delta:delta-core_2.12:1.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.driver.memory=5g" \
  --conf "spark.executor.memory=5g"
```