I am doing link prediction in Neptune ML but facing an error in the model training step


I am doing link prediction in Neptune ML but facing an error in the model training step. The model-hpo-configuration.json generated in the data processing step is:

{
  "models": [
    {
      "model": "rgcn",
      "task_type": "link_predict",
      "eval_metric": { "metric": "mrr", "global_ranking_metrics": true, "include_retrieval_metrics": false },
      "eval_frequency": { "type": "evaluate_every_pct", "value": 0.05 },
      "1-tier-param": [
        { "param": "num-hidden", "range": [16, 128], "type": "int", "inc_strategy": "power2" },
        { "param": "num-epochs", "range": [3, 100], "inc_strategy": "linear", "inc_val": 1, "type": "int", "edge_strategy": "perM" },
        { "param": "lr", "range": [0.001, 0.01], "type": "float", "inc_strategy": "log" },
        { "param": "num-negs", "range": [4, 32], "type": "int", "inc_strategy": "power2" }
      ],
      "2-tier-param": [
        { "param": "dropout", "range": [0.0, 0.5], "inc_strategy": "linear", "type": "float", "default": 0.3 },
        { "param": "layer-norm", "type": "bool", "default": true },
        { "param": "regularization-coef", "range": [0.0001, 0.01], "type": "float", "inc_strategy": "log", "default": 0.001 }
      ],
      "3-tier-param": [
        { "param": "batch-size", "range": [128, 512], "inc_strategy": "power2", "type": "int", "default": 256 },
        { "param": "sparse-lr", "range": [0.001, 0.01], "inc_strategy": "log", "type": "float", "default": 0.001 },
        { "param": "fanout", "type": "int", "options": [[10, 30], [15, 30], [15, 30]], "default": [10, 15, 15] },
        { "param": "num-layer", "range": [1, 3], "inc_strategy": "linear", "inc_val": 1, "type": "int", "default": 2 },
        { "param": "num-bases", "range": [0, 8], "inc_strategy": "linear", "inc_val": 2, "type": "int", "default": 0 }
      ],
      "fixed-param": [
        { "param": "neg-share", "type": "bool", "default": true },
        { "param": "use-self-loop", "type": "bool", "default": true },
        { "param": "low-mem", "type": "bool", "default": true },
        { "param": "enable-early-stop", "type": "bool", "default": true },
        { "param": "window-for-early-stop", "type": "bool", "default": 3 },
        { "param": "concat-node-embed", "type": "bool", "default": true },
        { "param": "per-feat-name-embed", "type": "bool", "default": true },
        { "param": "use-edge-features", "type": "bool", "default": false },
        { "param": "edge-num-hidden", "type": "int", "default": 16 },
        { "param": "weighted-link-prediction", "type": "bool", "default": false },
        { "param": "link-prediction-remove-targets", "type": "bool", "default": false },
        { "param": "l2norm", "type": "float", "default": 0 }
      ]
    }
  ]
}

The error is:

Training is finished
{
  "processingJob": {
    "name": "socialux-autotrainer-2024-07-10-17-42-7780000",
    "arn": "arn:aws:sagemaker:us-east-1:975049964909:processing-job/socialux-autotrainer-2024-07-10-17-42-7780000",
    "status": "Failed",
    "outputLocation": "s3://neptunefoml/neptune-ml-social-network-recommendation/training/socialux-autotrainer-2024-07-10-17-42-7780000/autotrainer-output",
    "failureReason": "AlgorithmError: , exit code: 1"
  },
  "hpoJob": {
    "name": "socialux-neptune-ml-240710-1744",
    "arn": "arn:aws:sagemaker:us-east-1:975049964909:hyper-parameter-tuning-job/socialux-neptune-ml-240710-1744",
    "status": "Failed",
    "failureReason": "No objective metrics found after running 2 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided."
  },
  "id": "social-link-prediction-1720633344",
  "status": "Failed"
}
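
The "AlgorithmError: , exit code: 1" and "No objective metrics found" messages are generic; the underlying cause is usually only visible in the individual SageMaker training jobs that the HPO job launched. Below is a minimal sketch, using plain boto3 SageMaker and CloudWatch Logs calls (nothing Neptune ML-specific), for pulling each job's failure reason and the tail of its training log. The tuning-job name is taken from the hpoJob section above.

    import boto3

    # Inspect the individual training jobs behind the failed HPO job.
    sm = boto3.client("sagemaker", region_name="us-east-1")
    logs = boto3.client("logs", region_name="us-east-1")

    tuning_job = "socialux-neptune-ml-240710-1744"

    summaries = sm.list_training_jobs_for_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job
    )["TrainingJobSummaries"]

    for summary in summaries:
        name = summary["TrainingJobName"]
        detail = sm.describe_training_job(TrainingJobName=name)
        print(name, detail["TrainingJobStatus"], detail.get("FailureReason", ""))

        # Tail the job's CloudWatch log streams for the underlying stack trace.
        streams = logs.describe_log_streams(
            logGroupName="/aws/sagemaker/TrainingJobs",
            logStreamNamePrefix=name,
        )["logStreams"]
        for stream in streams:
            events = logs.get_log_events(
                logGroupName="/aws/sagemaker/TrainingJobs",
                logStreamName=stream["logStreamName"],
                limit=50,
                startFromHead=False,
            )["events"]
            for event in events:
                print(event["message"])

The FailureReason on the training job, or the last few log lines, normally contains the actual Python traceback that the "exit code: 1" summary hides.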

  • The model-hpo-configuration.json is auto-generated as part of the data processing stage of the Neptune ML workflow. Are you doing something differently leading up to the data processing stage that could be causing issues here? An example of the workflow for link prediction can be found here: https://github.com/aws/graph-notebook/blob/main/src/graph_notebook/notebooks/03-Neptune-ML/01-Gremlin/04-Introduction-to-Link-Prediction-Gremlin.ipynb

  • The model-hpo-configuration.json is auto-generated, yes. I am not doing anything differently; I am following the same process described in the reference link you provided, and the jobs are still failing. I noticed that the same model-hpo-configuration.json is generated for different graph data, but how can one configuration fit every dataset? Is it constant for all kinds of data, or do we need to change the values according to the data and export parameters? If so, which values should be varied and how? Can you point me to a doc or blog that explains this in detail?
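
For anyone who wants to experiment with overriding the auto-generated settings, here is a minimal sketch of downloading model-hpo-configuration.json from S3, changing one of the tunable ranges, and re-uploading it before starting the model training step. The bucket and key below are placeholders, not the exact paths from this job; the file lives in the processed-data output location of the data processing job.

    import json
    import boto3

    s3 = boto3.client("s3")

    # Placeholders: point these at the processed-data output location of your
    # data processing job (the prefix that contains model-hpo-configuration.json).
    bucket = "neptunefoml"
    key = "neptune-ml-social-network-recommendation/processed-data/model-hpo-configuration.json"

    obj = s3.get_object(Bucket=bucket, Key=key)
    config = json.loads(obj["Body"].read())

    model_cfg = config["models"][0]

    # Example tweak: narrow the epoch search range so each HPO trial finishes faster.
    for param in model_cfg["1-tier-param"]:
        if param["param"] == "num-epochs":
            param["range"] = [3, 30]

    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(config, indent=2).encode())

After re-uploading, rerun only the model training step; the training containers read whatever configuration is present at that S3 location.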

1 Answer

Not sure if this is the exact issue, but you have link_predict for task_type and that should be link_prediction.
https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-customizing-hyperparams.html

AWS
answered a month ago
  • No, it's not working