Feature Store - Feast

미니스탑 2024. 7. 3. 17:02

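Feature Store: Concept and Role

A Feature Store is a tool for storing, managing, retrieving, and serving the feature data used by machine learning models.

Feature Store
- Data management
  - Training machine learning models requires large amounts of data, so storing and managing that data well is essential.
  - A Feature Store stores and manages this data efficiently.
- Data consistency
  - Training machine learning models requires consistent data.
  - A Feature Store keeps data consistent by storing and managing all of it in a uniform way.
- Data reuse
  - With a Feature Store, the same data can be reused across multiple models.
  - This makes model development and maintenance easier.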

Feature Pipeline

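Why a Feature Store Is Needed

Pain points (Data Scientist)
- Excessive time is spent transforming data before model development can begin.
- Data is not in a form suitable for training.
- The data in use is not managed or tracked.
- Batch and real-time data are difficult to handle.
- When each model is built from its own copy of the data, the same data ends up duplicated across multiple analysis systems.

Reasons to Use a Feature Store

compute once, used by many
Once the data science team defines features through Feature Engineering, multiple ML engineering teams can load those features and develop models more easily.
shared expertise
For example, in an organization building a recommender system, the specialist teams responsible for product, review, user account, and so on each analyze their own data and publish the results to the Feature Store, so experience and domain knowledge are shared across areas.
easier data quality guarantees
Continuing from the point above, because the experts for each dataset perform the Feature Engineering, data quality is assured to a reasonable degree, and the ML engineering team only needs to make good use of the features produced by the data science teams.
focus on respective roles
Data scientists can focus on data analysis and Feature Engineering, while ML engineers can focus on model training and deployment even without deep domain knowledge of the data.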

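Feature Store Components and How It Works

Feature Store Components

Feature Vector (or Feature Group): data consisting of an entity identifier and the set of attributes or characteristics that describe that entity at a given point in time.
For example, the entity identifier could be a user ID, and the attributes could include values such as time since sign-up, number of purchases, lifetime value, free-trial status, average monthly purchases, cumulative purchases, and time of last purchase.

Offline Store: an analytical database that stores and serves feature vectors for offline workloads such as data science experiments or batch production jobs.
Typically, each row holds a feature vector uniquely identified by an entity ID and a timestamp.
Common choices are S3, Redshift, BigQuery, and Hive.

Online Store: also called hot data, this storage layer is built to serve features to low-latency prediction services.
The database is used to fetch features at millisecond latency.
Redis, DynamoDB, and Cassandra are common choices for this role. Key-value databases are the best fit because complex queries and joins are rarely needed at serving time.

Feature Catalog or Registry: ideally provided as a UI for discovering features and training datasets.

Feature Store SDK: a Python library that abstracts the access patterns for the online and offline stores.

Metadata Management: used to track usage by different users and pipelines, ingestion processes, schema changes, and similar information.

Offline and Online Serving API: a proxy service that sits between the SDK and the online and offline feature storage to facilitate feature access.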

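Online / Offline Feature Store

- online store
  - Stores and manages the real-time feature data needed for model prediction.
  - Holds data that is needed at prediction time and is updated in real time.
  - Used, for example, when a user's location or click events are fetched in real time and fed to the model.
  - Plays an important role in keeping real-time prediction latency low.
  - A database holding the most recent feature values.
  - The latest feature values are updated by the feast materialize command.
  - Intended for serving features and running inference quickly in streaming environments.
  - e.g. Cloud DB for Redis
- offline store
  - Stores and manages the static feature data needed for model training.
  - Used at training time; holds data that is unlikely to change.
  - Stores data such as customer gender, age, and education level.
  - Plays an important role in improving model accuracy.
  - Not a Feast component itself, but Feast uses it as a data source.
  - The offline store is the feature data store for large-scale batch processing, such as model training and data analysis, before the model is served in production.
  - e.g. ObjectStorage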
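
The split between the two stores can be illustrated without Feast. The following is a minimal sketch in plain Python, not Feast's implementation: the offline store is modeled as a list of timestamped rows, and a hypothetical `materialize_latest` helper copies the newest row per entity into a dict that stands in for a key-value online store. The row layout and the names `user_id` and `purchase_count` are illustrative assumptions.

```python
from datetime import datetime

# Offline store stand-in: full history, one row per (entity id, timestamp).
offline_rows = [
    {"user_id": 1, "event_timestamp": datetime(2024, 7, 1), "purchase_count": 3},
    {"user_id": 1, "event_timestamp": datetime(2024, 7, 2), "purchase_count": 4},
    {"user_id": 2, "event_timestamp": datetime(2024, 7, 1), "purchase_count": 7},
]


def materialize_latest(rows, key="user_id", ts="event_timestamp"):
    """Return {entity id: newest row}, mimicking what an online store holds."""
    online = {}
    for row in rows:
        current = online.get(row[key])
        if current is None or row[ts] > current[ts]:
            online[row[key]] = row
    return online


# Online store stand-in: one entry per entity, fetched by a single key lookup.
online_store = materialize_latest(offline_rows)
latest_for_user_1 = online_store[1]["purchase_count"]
```

The history stays in the offline rows for building training sets, while the dict answers "what is the current value for this entity" with one lookup, which is why key-value databases suit the online role.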
Reference
- Feature Store for ML : https://www.featurestore.org/

Related open-source Feature Stores
- Feast, Hopsworks

Feature Store Key Concepts
- Feature Key

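Feast Registry: an object-store-based registry that stores the feature definitions registered in the feature store. Systems interact with the registry through the Feast SDK to discover feature data.

Feast Python SDK/CLI: the primary user-facing interface. It is used to:

- manage version-controlled feature definitions
- materialize (load) feature values into the online store
- build and retrieve training datasets from the offline store
- retrieve online features

Stream Processor: ingests feature data from streams into the online and offline stores.

Batch Materialization Engine: the process that loads data from the offline store into the online store.

Online Store: a database that stores only the latest feature values for each entity.

Offline Store: stores the batch data ingested into Feast. This data is used to build training datasets.

Feast does not manage the offline store directly, but it runs queries against it to retrieve features and perform materialization.

The offline store can be configured to support writes when Feast's logging of served features is enabled.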

Feast Data Source

hive (https://github.com/baineng/feast-hive)
objectstorage
...

Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.
Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:
- Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.
- [Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.

Feast Online Store

Feast uses the online store to serve features at low latency. Feature values are loaded from data sources into the online store with the materialize command.

Online stores supported by Feast (by functionality)

Functionality	SQLite	Redis	DynamoDB	Snowflake	Datastore	Postgres	HBase	Cassandra
write feature values to the online store	yes	yes	yes	yes	yes	yes	yes	yes
read feature values from the online store	yes	yes	yes	yes	yes	yes	yes	yes
update infrastructure (e.g. tables) in the online store	yes	yes	yes	yes	yes	yes	yes	yes
teardown infrastructure (e.g. tables) in the online store	yes	yes	yes	yes	yes	yes	yes	yes
generate a plan of infrastructure changes	yes	no	no	no	no	no	no	yes
support for on-demand transforms	yes	yes	yes	yes	yes	yes	yes	yes
readable by Python SDK	yes	yes	yes	yes	yes	yes	yes	yes
readable by Java	no	yes	no	no	no	no	no	no
readable by Go	yes	yes	no	no	no	no	no	no
support for entityless feature views	yes	yes	yes	yes	yes	yes	yes	yes
support for concurrent writing to the same key	no	yes	no	no	no	no	no	no
support for ttl (time to live) at retrieval	no	yes	no	no	no	no	no	no
support for deleting expired data	no	yes	no	no	no	no	no	no
collocated by feature view	yes	no	yes	yes	yes	yes	yes	yes
collocated by feature service	no	no	no	no	no	no	no	no
collocated by entity key	no	yes	no	no	no	no	no	no

Example NCP products

- Cloud DB for Redis

- Cloud DB for Postgres

Feast Offline Store

Offline stores supported by Feast (by functionality)

Functionality	File	BigQuery	Snowflake	Redshift	Postgres	Spark	Trino
get_historical_features	yes	yes	yes	yes	yes	yes	yes
pull_latest_from_table_or_query	yes	yes	yes	yes	yes	yes	yes
pull_all_from_table_or_query	yes	yes	yes	yes	yes	yes	yes
offline_write_batch	yes	yes	yes	yes	no	no	no
write_logged_features	yes	yes	yes	yes	no	no	no

Functionality	File	BigQuery	Snowflake	Redshift	Postgres	Spark	Trino
export to dataframe	yes	yes	yes	yes	yes	yes	yes
export to arrow table	yes	yes	yes	yes	yes	yes	yes
export to arrow batches	no	no	no	yes	no	no	no
export to SQL	no	yes	yes	yes	yes	no	yes
export to data lake (S3, GCS, etc.)	no	no	yes	no	yes	no	no
export to data warehouse	no	yes	yes	yes	yes	no	no
export as Spark dataframe	no	no	yes	no	no	yes	no
local execution of Python-based on-demand transforms	yes	yes	yes	yes	yes	no	yes
remote execution of Python-based on-demand transforms	no	no	no	no	no	no	no
persist results in the offline store	yes	yes	yes	yes	yes	yes	no
preview the query plan before execution	yes	yes	yes	yes	yes	yes	yes
read partitioned data	yes	yes	yes	yes	yes	yes	yes

Feast Functionality

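Create batch features: ELT/ETL systems such as Spark and SQL are used to transform data in the batch store.

Create stream features: stream features are produced from streaming services such as Kafka or Kinesis and can be pushed directly into Feast through the Push API.

Feast Apply: the user (or CI) publishes version-controlled feature definitions with feast apply. This CLI command updates infrastructure and persists the definitions in the object-store registry.

Feast Materialize: the user (or a scheduler) runs feast materialize to load features from the offline store into the online store.

Model Training: a model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset for training the model.

Get Historical Features: Feast exports a point-in-time correct training dataset based on the list of features and the entity dataframe provided by the model training pipeline.

Deploy Model: the trained model binary (and its list of features) is deployed to a model serving system. This step is not executed by Feast.

Predict: a backend system makes a prediction request to the model serving service.

Get Online Features: the model serving service uses the Feast SDK to request online features from the Feast Online Serving service.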
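
What "point-in-time correct" means in the historical retrieval step can be sketched in plain Python. This illustrates the idea only and is not Feast's retrieval code; `point_in_time_lookup` and the sample rows are hypothetical. For each entity row, it picks the newest feature row whose timestamp is not later than the entity row's timestamp, so no future values leak into the training data.

```python
from datetime import datetime

# Feature history as it might sit in an offline store (illustrative values).
feature_rows = [
    {"patient_id": 0, "event_timestamp": datetime(2024, 7, 1), "mean radius": 17.9},
    {"patient_id": 0, "event_timestamp": datetime(2024, 7, 3), "mean radius": 18.4},
    {"patient_id": 1, "event_timestamp": datetime(2024, 7, 2), "mean radius": 20.5},
]


def point_in_time_lookup(entity_id, as_of, rows):
    """Return the newest row for entity_id with event_timestamp <= as_of, or None."""
    candidates = [
        r for r in rows
        if r["patient_id"] == entity_id and r["event_timestamp"] <= as_of
    ]
    return max(candidates, key=lambda r: r["event_timestamp"], default=None)


# Entity dataframe stand-in: which entity, and as of which moment.
entity_rows = [
    {"patient_id": 0, "event_timestamp": datetime(2024, 7, 2)},
    {"patient_id": 1, "event_timestamp": datetime(2024, 7, 1)},
]
training_rows = [
    point_in_time_lookup(e["patient_id"], e["event_timestamp"], feature_rows)
    for e in entity_rows
]
```

Patient 0 gets the July 1 row because the July 3 row lies in the future relative to the requested timestamp, and patient 1 gets no row because nothing had been recorded yet.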

Reference Architecture

ML Pipeline with Feast

SDK : Kubeflow notebook

Offline Store : Cloud Hadoop (Trino, Hive, etc.)

Online Store : Cloud DB for Redis

Registry : ObjectStorage, Cloud DB for Postgres

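A Simple Feast Example

Steps in the Feast example

Install the Feast SDK
Configure Feast
Define the feature store (Feature View definitions)
Register with Feast
Train a model using the Online Store

1. Install the Feast SDK and run init

Install feast with pip, then run init. The redis extra is included because the sample configuration in step 2 uses Redis as the online store.

The -m option of init creates only the default config YAML file, without the example feature definition files that are generated by default.

pip install "feast[aws,redis]"
feast init -m feast_test

Or install with Helm (Kubernetes)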

helm repo add feast-charts https://feast-helm-charts.storage.googleapis.com
helm repo update
helm install feast-release feast-charts/feast-feature-server \
    --set feature_store_yaml_base64=$(base64 feature_store.yaml)

2. Configure Feast

Settings

project: the Feature Store project (repository) name
registry: the file that holds the definitions and metadata of all features
provider: the provider (local, aws, gcp) that determines which offline store, online store, infrastructure, and compute are used
online_store: configuration of the online store
offline_store: configuration of the offline store

Sample feature_store.yaml

project: feast_test
provider: local
registry:
    type : s3
    path : s3://xxxx/registry.db
    region_name: KR
    endpoint_url: https://kr.object.ncloudstorage.com
online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING}
entity_key_serialization_version: 2
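
The `${REDIS_CONNECTION_STRING}` placeholder keeps the connection secret out of the file, so the variable presumably has to be set in the environment before Feast reads the config. The sketch below only illustrates how such a placeholder resolves, using Python's standard `string.Template`; it is not Feast's loader, and the host, port, and password are made-up values.

```python
import os
from string import Template

# Hypothetical value; in practice this would be exported in the shell
# or injected by the scheduler or container runtime.
os.environ["REDIS_CONNECTION_STRING"] = "redis.example.internal:6379,password=changeme"

yaml_fragment = """online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING}
"""

# Template.substitute raises KeyError when a referenced variable is not set,
# which surfaces a missing secret early instead of at connection time.
resolved = Template(yaml_fragment).substitute(os.environ)
```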

3. Define the feature store (Feature View definitions)

Feast holds at least one Feature View per dataset.

A Feature View must contain features (schema), an Entity, and a Data Source.

Feature Views make it possible to model feature data consistently across both the offline (training) and online (inference) environments.

The example uses the Breast Cancer dataset.

- Load the data

Data Loading

from sklearn import datasets
import pandas as pd
 
# Loading a toy dataset into a DataFrame
data = datasets.load_breast_cancer()
data_df = pd.DataFrame(data=data.data, columns=data.feature_names)
data_df.head()

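- Prepare the data
  - To keep the focus on Feast functionality, the 30 features (everything except the target column) are used as they are. However, to follow the idea that data science teams in different areas each register features in the Feature Store after their own Feature Engineering, the original breast_cancer DataFrame is split into four DataFrames.
  - Feast uses timestamps to make sure features from different data sources are joined in the correct chronological order. The main purpose is to prevent stale data from being used in training and prediction; later, this timestamp is also used to pull the most recently added data, or data in a given interval, from the online store. Because the breast cancer dataset has no timestamp, an event_timestamp column with one-day intervals is added.
  - Because the breast cancer dataset has no column that can serve as a key, each row is treated as one patient, and a patient_id column is created to give every row a key value.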

Splitting data

# Splitting the dataset into arbitrary sets of features
data_df1 = data_df[data.feature_names[:5]]
data_df2 = data_df[data.feature_names[5:10]]
data_df3 = data_df[data.feature_names[10:17]]
data_df4 = data_df[data.feature_names[17:30]]
target_df = pd.DataFrame(data=data.target, columns=["target"])
 
# Creating timestamps for the data
timestamps = pd.date_range(
    end=pd.Timestamp.now(),
    periods=len(data_df),
    freq='D').to_frame(name="event_timestamp", index=False)
 
# Adding the timestamp column to each DataFrame
data_df1 = pd.concat(objs=[data_df1, timestamps], axis=1)
data_df2 = pd.concat(objs=[data_df2, timestamps], axis=1)
data_df3 = pd.concat(objs=[data_df3, timestamps], axis=1)
data_df4 = pd.concat(objs=[data_df4, timestamps], axis=1)
target_df = pd.concat(objs=[target_df, timestamps], axis=1)
 
# Creating a list of arbitrary IDs for feature rows
patient_ids = pd.DataFrame(data=list(range(len(data_df))), columns=["patient_id"])
 
# Adding the patient_id column to each DataFrame
data_df1 = pd.concat(objs=[data_df1, patient_ids], axis=1)
data_df2 = pd.concat(objs=[data_df2, patient_ids], axis=1)
data_df3 = pd.concat(objs=[data_df3, patient_ids], axis=1)
data_df4 = pd.concat(objs=[data_df4, patient_ids], axis=1)
target_df = pd.concat(objs=[target_df, patient_ids], axis=1)
 
data_df1.head()
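
The FileSource definitions in the next step read Parquet files such as data_df1.parquet from the repository's data folder, so the DataFrames need to be written there first. A possible helper is sketched below, assuming pandas with a Parquet engine (pyarrow or fastparquet) is installed; `save_feature_frames` is a hypothetical name and not part of Feast.

```python
from pathlib import Path


def save_feature_frames(frames, data_dir):
    """Write each DataFrame to <data_dir>/<name>.parquet and return the paths.

    frames maps a file stem (e.g. "data_df1") to a pandas DataFrame.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, frame in frames.items():
        path = data_dir / f"{name}.parquet"
        # pandas DataFrame.to_parquet; the row index is not needed because
        # patient_id already identifies each row.
        frame.to_parquet(path, index=False)
        paths[name] = path
    return paths
```

For this example the call would look like `save_feature_frames({"data_df1": data_df1, "data_df2": data_df2, "data_df3": data_df3, "data_df4": data_df4, "target_df": target_df}, "/home/jovyan/feast_test/feature_repo/data")`.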

View Definition

feature view

# Importing dependencies
from feast import Entity, FeatureView, Field, FileSource
from datetime import timedelta
from feast.types import Float32, Int64
 
# Declaring an entity for the dataset
patient = Entity(
    name="patient_id",
    join_keys=["patient_id"],
    description="The ID of the patient")
 
# Declaring the source of the first set of features
f_source1 = FileSource(
    path="/home/jovyan/feast_test/feature_repo/data/data_df1.parquet",
    timestamp_field="event_timestamp"
)
 
# Defining the first set of features
df1_fv = FeatureView(
    name="df1_feature_view",
    ttl=timedelta(days=1),
    entities=[patient],
    schema=[
        Field(name="mean radius", dtype=Float32),
        Field(name="mean texture", dtype=Float32),
        Field(name="mean perimeter", dtype=Float32),
        Field(name="mean area", dtype=Float32),
        Field(name="mean smoothness", dtype=Float32)
        ],   
    source=f_source1
)
 
# Declaring the source of the second set of features
f_source2 = FileSource(
    path="/home/jovyan/feast_test/feature_repo/data/data_df2.parquet",
    timestamp_field="event_timestamp"
)
 
# Defining the second set of features
df2_fv = FeatureView(
    name="df2_feature_view",
    ttl=timedelta(days=1),
    entities=[patient],
    schema=[
        Field(name="mean compactness", dtype=Float32),
        Field(name="mean concavity", dtype=Float32),
        Field(name="mean concave points", dtype=Float32),
        Field(name="mean symmetry", dtype=Float32),
        Field(name="mean fractal dimension", dtype=Float32)
        ],   
    source=f_source2
)
 
# Declaring the source of the third set of features
f_source3 = FileSource(
    path="/home/jovyan/feast_test/feature_repo/data/data_df3.parquet",
    timestamp_field="event_timestamp"
)
 
# Defining the third set of features
df3_fv = FeatureView(
    name="df3_feature_view",
    ttl=timedelta(days=1),
    entities=[patient],
    schema=[
        Field(name="radius error", dtype=Float32),
        Field(name="texture error", dtype=Float32),
        Field(name="perimeter error", dtype=Float32),
        Field(name="area error", dtype=Float32),
        Field(name="smoothness error", dtype=Float32),
        Field(name="compactness error", dtype=Float32),
        Field(name="concavity error", dtype=Float32)
        ],   
    source=f_source3
)
 
# Declaring the source of the fourth set of features
f_source4 = FileSource(
    path="/home/jovyan/feast_test/feature_repo/data/data_df4.parquet",
    timestamp_field="event_timestamp"
)
 
# Defining the fourth set of features
df4_fv = FeatureView(
    name="df4_feature_view",
    ttl=timedelta(days=1),
    entities=[patient],
    schema=[
        Field(name="concave points error", dtype=Float32),
        Field(name="symmetry error", dtype=Float32),
        Field(name="fractal dimension error", dtype=Float32),
        Field(name="worst radius", dtype=Float32),
        Field(name="worst texture", dtype=Float32),
        Field(name="worst perimeter", dtype=Float32),
        Field(name="worst area", dtype=Float32),
        Field(name="worst smoothness", dtype=Float32),
        Field(name="worst compactness", dtype=Float32),
        Field(name="worst concavity", dtype=Float32),
        Field(name="worst concave points", dtype=Float32),
        Field(name="worst symmetry", dtype=Float32),
        Field(name="worst fractal dimension", dtype=Float32),       
        ],   
    source=f_source4
)
 
# Declaring the source of the targets
target_source = FileSource(
    path="/home/jovyan/feast_test/feature_repo/data/target_df.parquet",
    timestamp_field="event_timestamp"
)
 
# Defining the targets
target_fv = FeatureView(
    name="target_feature_view",
    entities=[patient],
    ttl=timedelta(days=1),
    schema=[
        Field(name="target", dtype=Int64)       
        ],   
    source=target_source
)

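name: the Feature View name, used by functions such as feast.FeatureStore.get_feature_view.
entities: the entities that act as keys for the Feature View. If a FeatureView contains only features that are not tied to any particular entity, it can be defined without entities (entities=[]).
ttl: when training data is retrieved, feature rows are accepted only if they fall within the ttl period before the requested timestamp.
schema: lists every feature (Field) to be registered. Feature names must match the column names exactly.
source: the FileSource that defines where the feature data lives. FileSource must be given the path and timestamp_field parameters.
Because the data for the Feature Store lives in the data folder inside the project, a local path is given for path, and timestamp_field is set to the timestamp column, event_timestamp.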
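
How ttl bounds the lookup can be sketched as a simple window check. This is an illustration under the assumption that a feature row is usable when its timestamp lies between the requested timestamp minus ttl and the requested timestamp; `within_ttl` is a hypothetical helper, not a Feast function.

```python
from datetime import datetime, timedelta


def within_ttl(feature_ts, request_ts, ttl):
    """True if feature_ts falls in the window [request_ts - ttl, request_ts]."""
    return request_ts - ttl <= feature_ts <= request_ts


ttl = timedelta(days=1)
request_ts = datetime(2024, 7, 3, 12, 0)

fresh = within_ttl(datetime(2024, 7, 3, 0, 0), request_ts, ttl)   # 12 hours old
stale = within_ttl(datetime(2024, 7, 1, 0, 0), request_ts, ttl)   # 2.5 days old
future = within_ttl(datetime(2024, 7, 4, 0, 0), request_ts, ttl)  # after the request
```

With ttl=timedelta(days=1), as in the views above, a request timestamp more than one day after a patient's row would be expected to return no values for that patient, so the entity timestamps used for retrieval should stay close to the event_timestamp values in the data.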

4. Register with Feast (Apply)

The apply command must be run from inside the project directory.

feast apply

5. Train a model using the Online Store
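
A sketch of how the registered feature views might be used from here: build a training set with get_historical_features, load the online store with materialize_incremental, and read values back with get_online_features. These three are Feast Python SDK methods, but the wrapper functions, the repo_path argument, and the feature reference strings are assumptions based on the feature views defined in step 3, and the code has not been run against this repository. Feast is imported lazily so that the pure helper `feature_refs` works without it.

```python
from datetime import datetime


def feature_refs(view_name, field_names):
    """Build Feast feature references of the form '<feature view>:<feature>'."""
    return [f"{view_name}:{name}" for name in field_names]


def build_training_frame(repo_path, entity_df, refs):
    """Point-in-time join of the requested features onto entity_df (offline store)."""
    from feast import FeatureStore  # lazy import: only needed when actually called

    store = FeatureStore(repo_path=repo_path)
    return store.get_historical_features(entity_df=entity_df, features=refs).to_df()


def load_and_read_online(repo_path, refs, patient_id):
    """Materialize up to now, then fetch the latest values for one patient."""
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    store.materialize_incremental(end_date=datetime.now())
    response = store.get_online_features(
        features=refs, entity_rows=[{"patient_id": patient_id}]
    )
    return response.to_dict()


# References for two features of the first view plus the target.
refs = feature_refs("df1_feature_view", ["mean radius", "mean texture"])
refs += feature_refs("target_feature_view", ["target"])
```

Here entity_df would be a DataFrame with patient_id and event_timestamp columns, and repo_path would point at the directory that contains feature_store.yaml.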
