I am working on RecSys to generate product recommendations for ABI's B2B platform BEES. Some of the challenges involved in the project include building AutoML for best hyper-parameter selection, distributed model training, feature store integration, building a Python library for curated ML models with default configs, deployment of models in cloud-native compute, and many more. Super excited to work in this work stream with an amazing team.
Algorithm-related challenges:
- Cross validation: How to perform cross validation for RecSys. How to link statistical metrics with business KPIs. Determining the weighting between model goodness of fit and business KPIs. How to create a scoring function that can compare different models during cross validation. Managing the splitting strategy to ensure that models are comparable.
- Model selection: A single model, a market-based model, or a hybrid model combining two or more models? Time/sequence-based models (LSTM/GRU)?
- Hyper-parameter tuning: What would be a preferable hyper-parameter tuning framework that can support GPU (Wide & Deep), Spark (ALS), and CPU (SAR, etc.) workloads?
- KPIs: Evaluate existing KPIs such as MAP@K and NDCG@K, and improve them if possible.
- Hybrid model or mixture of models: Also, what type of hybrid: sequential, parallel, or weighted? As of now, two use cases (conceptually).
- AutoML: An example of AutoML for a multi-country setup (including hybrid models and hyper-parameter tuning) with a recommended tech stack.
- Model drift, data drift, retraining and model monitoring: How to build a framework that can be integrated with the Python library to detect model drift, data drift, and retraining requirements, and to monitor generated results for both online and offline models.
- Others: Backtesting, A/B testing, linking online and offline evaluation.
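To make the scoring-function question above concrete, here is a minimal sketch of blending MAP@K with a normalized business KPI into one comparable score. The function names and the blending weight `alpha` are assumptions for illustration, not our production code:

```python
def map_at_k(recommended, relevant, k=5):
    """Mean Average Precision at K over a batch of users."""
    scores = []
    for recs, rel in zip(recommended, relevant):
        rel = set(rel)
        hits, precision_sum = 0, 0.0
        for rank, item in enumerate(recs[:k], start=1):
            if item in rel:
                hits += 1
                precision_sum += hits / rank
        scores.append(precision_sum / min(len(rel), k) if rel else 0.0)
    return sum(scores) / len(scores)

def combined_score(statistical_metric, business_kpi, alpha=0.7):
    """Blend model goodness of fit with a business KPI scaled to [0, 1].

    alpha is a hypothetical weight; choosing it is exactly the open
    question of weighting fit against business value.
    """
    return alpha * statistical_metric + (1 - alpha) * business_kpi
```

During cross validation, each candidate model would get one `combined_score` on the same folds, so SAR, ALS, and Wide & Deep runs become directly comparable.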
Programming & infra-related challenges:
- Codespaces: How a developer can use Codespaces for a CPU-based workflow in day-to-day development. Managing multiple environments based on Spark/GPU/CPU dependencies using devcontainers. Can the same image be used in AML/ADB?
- AML + VS Code/Codespaces integration: Attaching AML compute to VS Code as a terminal and Jupyter kernel. Running experiments in AML without leaving VS Code/Codespaces. Triggering multiple concurrent jobs in AML from VS Code (not always the same as distributed model training; some of our models are classical models which we run multiple times as an embarrassingly parallel workload) that can scale across multiple nodes to run different models and return results to cloud storage in a fan-out/fan-in pattern. (One additional note: we want to leverage all the cores within a node using joblib, hence the auto-scaling we expect is at the node level for a given threshold.) MLflow integration with VS Code and AML.
- ADB + VS Code/Codespaces integration: Running experiments in ADB without leaving VS Code/Codespaces.
- Debugging: Using the VS Code visual debugger in a distributed workflow in AML & ADB.
- Observability: Monitoring aggregated logs from different nodes in VS Code.
- Testing: How to run property-based testing for ML models in distributed compute environments.
- Library: Managing multiple dependencies such as pyspark, GPU, and CPU-level system dependencies. Usage of JIT within and across models, taking the execution infra into account. Making the library infra-agnostic.
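For the fan-out/fan-in pattern mentioned above, here is a minimal single-node sketch with joblib. The model names, params, and `train_and_score` stand-in are hypothetical, and the real fan-in would write results to cloud storage rather than return them in memory:

```python
from joblib import Parallel, delayed

def train_and_score(model_name, params):
    # Hypothetical stand-in for fitting one classical model (e.g. SAR)
    # and returning its offline evaluation score.
    score = params["similarity"] * (1 - params["decay"])
    return {"model": model_name, "score": score}

configs = {
    "sar_baseline": {"similarity": 0.3, "decay": 0.1},
    "sar_tuned": {"similarity": 0.5, "decay": 0.2},
}

# Fan out: one job per model config; n_jobs=-1 uses every core on the node.
results = Parallel(n_jobs=-1)(
    delayed(train_and_score)(name, params) for name, params in configs.items()
)

# Fan in: gather the per-model results and keep the best one.
best = max(results, key=lambda r: r["score"])
```

At cluster scale, each AML node would run one such joblib batch over its share of configs, which is why the auto-scaling threshold we care about is per node rather than per task.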
If you are excited about solving the above-mentioned challenges, feel free to reach out to me.