Use cases

Testing and prototyping.

When implementing logic against a cloud provider backend, it is often useful to have mock data for testing.
Using InMemoryStore for mock data lets you test the logic without connecting to a cloud provider backend or running infrastructure locally.

Example:

from spoonbill.datastores import DynamoDBStore, InMemoryStore
import os

environment = os.getenv("environment", "test")

if environment == "test":
    store = InMemoryStore.open("mock data")
elif environment == "dev":
    store = DynamoDBStore.open("dev table")
else:
    store = DynamoDBStore.open("prod table")
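
Because the stores in these examples are used through a dict-like interface (`.get(key, default)` and item assignment), logic written against that interface can be unit-tested with any mapping. A minimal sketch, assuming a hypothetical `record_visit` helper and using a plain `dict` as a stand-in for the test store so that it runs without any backend:

```python
def record_visit(store, user_id):
    """Increment and return the visit count for a user (hypothetical helper)."""
    count = store.get(user_id, 0) + 1
    store[user_id] = count
    return count


# A plain dict stands in for the mock store here (an assumption for
# illustration); the function only relies on `.get(key, default)` and
# item assignment.
mock_store = {"alice": 2}

first = record_visit(mock_store, "alice")   # existing user
second = record_visit(mock_store, "bob")    # new user, starts from the default
```

In production code the same `record_visit` function would receive whichever store the environment switch above selected.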

Data science with cloud infrastructure.

Simple Example

import numpy as np
import pandas as pd
from spoonbill.datastores import DynamoDBStore

df = pd.DataFrame({'user': [1, 2, 3]})
feature_store = DynamoDBStore.open("features table")  # {1: {"age": 20, "sex": "female"}, ...}


def get_user_details(x):
    default = {"age": 25, "sex": "female"}
    return pd.Series(feature_store.get(x['user'], default).values())


df[['age', 'sex']] = df.apply(get_user_details, axis=1)
"""
   user  age     sex
0     1   20    male
1     2   30  female
2     3   25  female
"""

Online machine learning


from spoonbill.datastores import RedisStore, FilesystemStore
import time


class Model:

    def __init__(self):
        self.version = 123
        self.versions = RedisStore.open("redis://model_versions")
        self.models = FilesystemStore.open("s3://models/")  # can also use a faster store if needed
        self._model = self.models.get(self.version)

    @property
    def model(self):
        # Read the published version once so the check and the reload use the same value
        latest = self.versions.get("version")
        if self.version != latest:
            self.version = latest
            self._model = self.models.get(self.version)
        return self._model

    def partial_fit(self, x, y):
        """
        Fit a single example.
        * Best run as a single process that consumes examples from a queue.
        """
        new_model = self._model.partial_fit(x, y)
        new_model_id = int(time.time())
        self.models[new_model_id] = new_model
        self.versions["version"] = new_model_id

    # can scale horizontally by adding more workers 
    def predict(self, x):
        return self.model.predict(x)
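
The reload-on-version-change pattern used by `Model` can be exercised without Redis or S3. A minimal sketch, assuming plain dicts as stand-ins for the two stores and a hypothetical `VersionedLoader` class; the "models" are just strings for illustration:

```python
class VersionedLoader:
    """Return the current model, reloading only when the published version changes."""

    def __init__(self, versions, models):
        self.versions = versions  # stand-in for the version store, e.g. {"version": 1}
        self.models = models      # stand-in for the model store, e.g. {1: model}
        self.version = versions.get("version")
        self._model = models.get(self.version)
        self.loads = 1  # counts how many times a model was fetched

    @property
    def model(self):
        latest = self.versions.get("version")
        if latest != self.version:
            self.version = latest
            self._model = self.models.get(latest)
            self.loads += 1
        return self._model


versions = {"version": 1}
models = {1: "model-v1"}
loader = VersionedLoader(versions, models)

before = loader.model  # version unchanged, so nothing is reloaded

# A trainer stores the new model first, then bumps the version pointer.
models[2] = "model-v2"
versions["version"] = 2
after = loader.model  # version changed, so the new model is fetched
```

Writing the model before updating the version pointer, as `partial_fit` does, means a reader never sees a version whose model has not been stored yet.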

Advanced machine learning example

Let’s say we have a model that predicts the probability of a user watching a video based on their history.

  • We train user-video embeddings every day.

  • We update the model every hour.

  • Each user's last 3 watched movies are updated in real time.

import numpy as np
from datetime import datetime as dt
from spoonbill.datastores import RedisStore, SafetensorsStore, FilesystemStore

# Whenever a user watches a video, we update the last 3 movies list.
recently_watched_store = RedisStore.open("redis://last_3_movies")  # {1: [1, 2, 3]}
video_embedding = SafetensorsStore.open("video_embedding.db")  # {1: [0.1, 0.2, 0.3]}
models = FilesystemStore.open("s3://models/22-02-2022/")  # {"v1": SuperNN(),...}

# updates
model = models.get(dt.now().hour, "default_model")  # update the model every hour
video_embedding.load('s3://video_embeddings/day')  # load the video embeddings every day


def get_user_embedding(user):
    # online feature engineering 
    default_embedding, most_popular_movies = [0.1, 0.1, 0.1], [1, 2, 3]
    last_3_movies = recently_watched_store.get(user, most_popular_movies)
    return np.mean([video_embedding.get(movie, default_embedding) for movie in last_3_movies], axis=0)


def predict(user):
    return model.predict(get_user_embedding(user))

  • This can be made even more efficient by storing the precomputed average embeddings directly in Redis, but this example should get the point across.
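
The caching idea in the last point can be sketched as follows. This is a minimal illustration, assuming plain dicts as stand-ins for the Redis and embedding stores; `embedding_cache`, `on_watch`, `average`, and the sample data are hypothetical and not part of the library:

```python
# Plain dicts stand in for the stores (assumption for illustration only).
video_embedding = {1: [0.1, 0.2, 0.3], 2: [0.3, 0.2, 0.1], 3: [0.2, 0.2, 0.2]}
recently_watched = {}  # user -> last 3 movie ids
embedding_cache = {}   # user -> precomputed average embedding

DEFAULT_EMBEDDING = [0.1, 0.1, 0.1]


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def on_watch(user, movie):
    """Update the user's last 3 movies and refresh the cached average embedding."""
    last_3 = (recently_watched.get(user, []) + [movie])[-3:]
    recently_watched[user] = last_3
    embedding_cache[user] = average(
        [video_embedding.get(m, DEFAULT_EMBEDDING) for m in last_3]
    )


def get_user_embedding(user):
    """Prediction-time lookup: a single read, no averaging."""
    return embedding_cache.get(user, DEFAULT_EMBEDDING)


# Simulate four watch events; only the last three movies are kept.
for movie in (1, 2, 3, 1):
    on_watch("u1", movie)
```

The averaging cost moves from every prediction to every watch event, which pays off when predictions are much more frequent than watch events.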