Yummy Quickstart

Install yummy:

pip install yummy

Alternatively, install directly from the GitHub repository:

pip install git+https://github.com/yummyml/yummy.git

Create a feature repository:

feast init feature_repo
cd feature_repo

Offline store:

Polars

To configure the offline store, edit feature_store.yaml:

project: repo
registry: s3://data/registry.db
provider: yummy.YummyProvider
backend: polars
online_store:
    ...
offline_store:
    type: yummy.YummyOfflineStore
    backend: polars

Dask

To configure the offline store, edit feature_store.yaml:

project: repo
registry: s3://data/registry.db
provider: yummy.YummyProvider
backend: dask
online_store:
    ...
offline_store:
    type: yummy.YummyOfflineStore

Ray

To configure the offline store, edit feature_store.yaml:

project: repo
registry: s3://data/registry.db
provider: yummy.YummyProvider
backend: ray
online_store:
    ...
offline_store:
    type: yummy.YummyOfflineStore

Spark

To configure the offline store, edit feature_store.yaml:

project: repo
registry: s3://data/registry.db
provider: yummy.YummyProvider
backend: spark
backend_config:
    spark.master: "local[*]"
    spark.ui.enabled: "false"
    spark.eventLog.enabled: "false"
    spark.sql.session.timeZone: "UTC"
online_store:
    ...
offline_store:
    type: yummy.YummyOfflineStore
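The four configurations above share the same layout and differ mainly in the backend key, plus the backend_config block for Spark. As a rough sketch, the snippet below renders that shared layout as text for a chosen backend. render_feature_store_yaml is a hypothetical helper written for this illustration, not part of yummy or Feast. It keeps the online_store section elided, as in the examples, and does not reproduce the extra backend key that the Polars example places under offline_store.

```python
def render_feature_store_yaml(backend, backend_config=None):
    """Hypothetical helper: build feature_store.yaml text for one backend."""
    lines = [
        "project: repo",
        "registry: s3://data/registry.db",
        "provider: yummy.YummyProvider",
        f"backend: {backend}",
    ]
    # Only the Spark example above carries a backend_config block.
    if backend_config:
        lines.append("backend_config:")
        for key, value in backend_config.items():
            lines.append(f'    {key}: "{value}"')
    lines += [
        "online_store:",
        "    ...",  # left elided, exactly as in the examples above
        "offline_store:",
        "    type: yummy.YummyOfflineStore",
    ]
    return "\n".join(lines) + "\n"


spark_yaml = render_feature_store_yaml(
    "spark",
    {
        "spark.master": "local[*]",
        "spark.ui.enabled": "false",
        "spark.eventLog.enabled": "false",
        "spark.sql.session.timeZone": "UTC",
    },
)
print(spark_yaml)
```

Calling render_feature_store_yaml("dask") or render_feature_store_yaml("ray") yields the same text without the backend_config block.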

Features definition

Example features.py:

from datetime import timedelta
from feast import Entity, Field, FeatureView
from yummy import ParquetSource, CsvSource, DeltaSource
from feast.types import Float32, Int32

my_stats_parquet = ParquetSource(
    path="/home/jovyan/notebooks/ray/dataset/all_data.parquet",
    timestamp_field="datetime",
)

my_stats_delta = DeltaSource(
    path="dataset/all",
    timestamp_field="datetime",
    #range_join=10,
)

my_stats_csv = CsvSource(
    path="/home/jovyan/notebooks/ray/dataset/all_data.csv",
    timestamp_field="datetime",
)

my_entity = Entity(name="entity_id", description="entity id",)

mystats_view_parquet = FeatureView(
    name="my_statistics_parquet",
    entities=[my_entity],
    ttl=timedelta(days=20),
    schema=[
        Field(name="entity_id", dtype=Int32),
        Field(name="p0", dtype=Float32),
        Field(name="p1", dtype=Float32),
        Field(name="p2", dtype=Float32),
        Field(name="p3", dtype=Float32),
        Field(name="p4", dtype=Float32),
        Field(name="p5", dtype=Float32),
        Field(name="p6", dtype=Float32),
        Field(name="p7", dtype=Float32),
        Field(name="p8", dtype=Float32),
        Field(name="p9", dtype=Float32),
        Field(name="y", dtype=Float32),
    ],
    online=True,
    source=my_stats_parquet,
    tags={},
)

mystats_view_delta = FeatureView(
    name="my_statistics_delta",
    entities=[my_entity],
    ttl=timedelta(days=20),
    schema=[
        Field(name="entity_id", dtype=Int32),
        Field(name="d0", dtype=Float32),
        Field(name="d1", dtype=Float32),
        Field(name="d2", dtype=Float32),
        Field(name="d3", dtype=Float32),
        Field(name="d4", dtype=Float32),
        Field(name="d5", dtype=Float32),
        Field(name="d6", dtype=Float32),
        Field(name="d7", dtype=Float32),
        Field(name="d8", dtype=Float32),
        Field(name="d9", dtype=Float32),
    ],
    online=True,
    source=my_stats_delta,
    tags={},
)

mystats_view_csv = FeatureView(
    name="my_statistics_csv",
    entities=[my_entity],
    ttl=timedelta(days=20),
    schema=[
        Field(name="entity_id", dtype=Int32),
        Field(name="c1", dtype=Float32),
        Field(name="c2", dtype=Float32),
    ],
    online=True,
    source=my_stats_csv,
    tags={},
)

Historical fetch

import time
from datetime import datetime

from feast import FeatureStore
from yummy import select_all

# Load the feature repository from the current directory
store = FeatureStore(repo_path=".")

start_time = time.time()

# Fetch the listed features for the entities returned by select_all
training_df = store.get_historical_features(
    entity_df=select_all(datetime(2022, 9, 14, 23, 59, 42)),
    features=[
        "my_statistics_parquet:p1",
        "my_statistics_parquet:p2",
        "my_statistics_delta:d1",
        "my_statistics_delta:d2",
        "my_statistics_csv:c1",
        "my_statistics_csv:c2",
    ],
).to_df()

# Report how long the fetch took
print("--- %s seconds ---" % (time.time() - start_time))

training_df

To fetch historical features, you need to specify two arguments: entity_df and features.
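Each entry in features is a string of the form view_name:feature_name, where the view name is the name given to the FeatureView. As a small illustration, the list used in the example can be assembled per view. feature_refs is a hypothetical helper introduced here for the sketch, not a yummy or Feast function.

```python
def feature_refs(view_name, feature_names):
    """Hypothetical helper: build 'view:feature' strings for one feature view."""
    return [f"{view_name}:{name}" for name in feature_names]


# Same references as in the historical fetch example
features = (
    feature_refs("my_statistics_parquet", ["p1", "p2"])
    + feature_refs("my_statistics_delta", ["d1", "d2"])
    + feature_refs("my_statistics_csv", ["c1", "c2"])
)
print(features)
```

The resulting list can be passed as the features argument of get_historical_features.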

entity_df

entity_df is the list of entities, each with an event_timestamp, for which you want to fetch features. You can build the data frame manually from a list of entities:

import pandas as pd
from datetime import datetime, timedelta

entity_df = pd.DataFrame.from_dict(
    {
        "entity_id": [1, 2, 3],
        "event_timestamp": [
            datetime.now() - timedelta(minutes=1),
            datetime.now() - timedelta(minutes=2),
            datetime.now() - timedelta(minutes=3),
        ],
    }
)

Alternatively, yummy provides select_all:

from yummy import select_all
from datetime import datetime

entity_df = select_all(datetime(2022, 9, 14, 23, 59, 42))

With select_all, all entities are selected from the first feature view.
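To make the role of event_timestamp concrete: in Feast, a historical fetch is a point-in-time lookup, so each entity row receives the most recent feature values recorded at or before its event_timestamp and no older than the view's ttl. The plain-Python sketch below illustrates that idea on made-up rows. It is a simplified illustration, not yummy's implementation; point_in_time_lookup and the sample data are hypothetical, and boundary handling in a real backend may differ.

```python
from datetime import datetime, timedelta


def point_in_time_lookup(entity_rows, feature_rows, ttl):
    """For each (entity_id, event_timestamp), pick the latest feature row
    at or before event_timestamp and within ttl; None if there is none."""
    results = []
    for entity_id, event_ts in entity_rows:
        candidates = [
            row
            for row in feature_rows
            if row["entity_id"] == entity_id
            and event_ts - ttl <= row["datetime"] <= event_ts
        ]
        latest = max(candidates, key=lambda row: row["datetime"], default=None)
        results.append((entity_id, event_ts, latest["p1"] if latest else None))
    return results


# Made-up feature rows keyed by entity_id and the "datetime" timestamp field
feature_rows = [
    {"entity_id": 1, "datetime": datetime(2022, 9, 1), "p1": 0.1},
    {"entity_id": 1, "datetime": datetime(2022, 9, 10), "p1": 0.2},
    {"entity_id": 2, "datetime": datetime(2022, 7, 1), "p1": 0.9},
]
as_of = datetime(2022, 9, 14, 23, 59, 42)
entity_rows = [(1, as_of), (2, as_of)]

rows = point_in_time_lookup(entity_rows, feature_rows, ttl=timedelta(days=20))
for row in rows:
    print(row)
```

In this sketch, entity 1 gets the value from 10 September (the latest row before the timestamp), while entity 2 gets None because its only row is older than the 20-day ttl.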