I am doing link prediction in Neptune ML but facing an error in the model training step


I am doing link prediction in Neptune ML but facing an error in the model training step. The model-hpo-configuration.json generated in the data processing step is:

{
  "models": [
    {
      "model": "rgcn",
      "task_type": "link_predict",
      "eval_metric": { "metric": "mrr", "global_ranking_metrics": true, "include_retrieval_metrics": false },
      "eval_frequency": { "type": "evaluate_every_pct", "value": 0.05 },
      "1-tier-param": [
        { "param": "num-hidden", "range": [16, 128], "type": "int", "inc_strategy": "power2" },
        { "param": "num-epochs", "range": [3, 100], "inc_strategy": "linear", "inc_val": 1, "type": "int", "edge_strategy": "perM" },
        { "param": "lr", "range": [0.001, 0.01], "type": "float", "inc_strategy": "log" },
        { "param": "num-negs", "range": [4, 32], "type": "int", "inc_strategy": "power2" }
      ],
      "2-tier-param": [
        { "param": "dropout", "range": [0.0, 0.5], "inc_strategy": "linear", "type": "float", "default": 0.3 },
        { "param": "layer-norm", "type": "bool", "default": true },
        { "param": "regularization-coef", "range": [0.0001, 0.01], "type": "float", "inc_strategy": "log", "default": 0.001 }
      ],
      "3-tier-param": [
        { "param": "batch-size", "range": [128, 512], "inc_strategy": "power2", "type": "int", "default": 256 },
        { "param": "sparse-lr", "range": [0.001, 0.01], "inc_strategy": "log", "type": "float", "default": 0.001 },
        { "param": "fanout", "type": "int", "options": [[10, 30], [15, 30], [15, 30]], "default": [10, 15, 15] },
        { "param": "num-layer", "range": [1, 3], "inc_strategy": "linear", "inc_val": 1, "type": "int", "default": 2 },
        { "param": "num-bases", "range": [0, 8], "inc_strategy": "linear", "inc_val": 2, "type": "int", "default": 0 }
      ],
      "fixed-param": [
        { "param": "neg-share", "type": "bool", "default": true },
        { "param": "use-self-loop", "type": "bool", "default": true },
        { "param": "low-mem", "type": "bool", "default": true },
        { "param": "enable-early-stop", "type": "bool", "default": true },
        { "param": "window-for-early-stop", "type": "bool", "default": 3 },
        { "param": "concat-node-embed", "type": "bool", "default": true },
        { "param": "per-feat-name-embed", "type": "bool", "default": true },
        { "param": "use-edge-features", "type": "bool", "default": false },
        { "param": "edge-num-hidden", "type": "int", "default": 16 },
        { "param": "weighted-link-prediction", "type": "bool", "default": false },
        { "param": "link-prediction-remove-targets", "type": "bool", "default": false },
        { "param": "l2norm", "type": "float", "default": 0 }
      ]
    }
  ]
}

The error is:

Training is finished
{
  "processingJob": {
    "name": "socialux-autotrainer-2024-07-10-17-42-7780000",
    "arn": "arn:aws:sagemaker:us-east-1:975049964909:processing-job/socialux-autotrainer-2024-07-10-17-42-7780000",
    "status": "Failed",
    "outputLocation": "s3://neptunefoml/neptune-ml-social-network-recommendation/training/socialux-autotrainer-2024-07-10-17-42-7780000/autotrainer-output",
    "failureReason": "AlgorithmError: , exit code: 1"
  },
  "hpoJob": {
    "name": "socialux-neptune-ml-240710-1744",
    "arn": "arn:aws:sagemaker:us-east-1:975049964909:hyper-parameter-tuning-job/socialux-neptune-ml-240710-1744",
    "status": "Failed",
    "failureReason": "No objective metrics found after running 2 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided."
  },
  "id": "social-link-prediction-1720633344",
  "status": "Failed"
}
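
The "AlgorithmError: , exit code: 1" and "No objective metrics found" messages are generic; the underlying cause is usually only visible in the individual SageMaker training jobs that the HPO job launched. Below is a minimal sketch, using plain boto3 SageMaker and CloudWatch Logs calls (nothing Neptune ML-specific), for pulling each job's failure reason and the tail of its training log. The tuning-job name is taken from the hpoJob section above.

    import boto3

    # Inspect the individual training jobs behind the failed HPO job.
    sm = boto3.client("sagemaker", region_name="us-east-1")
    logs = boto3.client("logs", region_name="us-east-1")

    tuning_job = "socialux-neptune-ml-240710-1744"

    summaries = sm.list_training_jobs_for_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job
    )["TrainingJobSummaries"]

    for summary in summaries:
        name = summary["TrainingJobName"]
        detail = sm.describe_training_job(TrainingJobName=name)
        print(name, detail["TrainingJobStatus"], detail.get("FailureReason", ""))

        # Tail the job's CloudWatch log streams for the underlying stack trace.
        streams = logs.describe_log_streams(
            logGroupName="/aws/sagemaker/TrainingJobs",
            logStreamNamePrefix=name,
        )["logStreams"]
        for stream in streams:
            events = logs.get_log_events(
                logGroupName="/aws/sagemaker/TrainingJobs",
                logStreamName=stream["logStreamName"],
                limit=50,
                startFromHead=False,
            )["events"]
            for event in events:
                print(event["message"])

The FailureReason on the training job, or the last few log lines, normally contains the actual Python traceback that the "exit code: 1" summary hides.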

  • The model-hpo-configuration.json is auto-generated as part of the data processing stage of the Neptune ML workflow. Are you doing something differently leading up to the data processing stage that could be causing issues here? An example of the workflow for link prediction can be found here: https://github.com/aws/graph-notebook/blob/main/src/graph_notebook/notebooks/03-Neptune-ML/01-Gremlin/04-Introduction-to-Link-Prediction-Gremlin.ipynb

  • The model-hpo-configuration.json is auto-generated, yes. I am not doing anything differently; I am following the same process described in the reference link you provided, and the jobs are still failing. I noticed that the same model-hpo-configuration.json is generated for different graph data, but how can one configuration fit every dataset? Is it constant for all kinds of data, or do we need to change the values according to the data and export parameters? If so, which values should be varied and how? Can you point me to a doc or blog that explains this in detail?
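
For anyone who wants to experiment with overriding the auto-generated settings, here is a minimal sketch of downloading model-hpo-configuration.json from S3, changing one of the tunable ranges, and re-uploading it before starting the model training step. The bucket and key below are placeholders, not the exact paths from this job; the file lives in the processed-data output location of the data processing job.

    import json
    import boto3

    s3 = boto3.client("s3")

    # Placeholders: point these at the processed-data output location of your
    # data processing job (the prefix that contains model-hpo-configuration.json).
    bucket = "neptunefoml"
    key = "neptune-ml-social-network-recommendation/processed-data/model-hpo-configuration.json"

    obj = s3.get_object(Bucket=bucket, Key=key)
    config = json.loads(obj["Body"].read())

    model_cfg = config["models"][0]

    # Example tweak: narrow the epoch search range so each HPO trial finishes faster.
    for param in model_cfg["1-tier-param"]:
        if param["param"] == "num-epochs":
            param["range"] = [3, 30]

    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(config, indent=2).encode())

After re-uploading, rerun only the model training step; the training containers read whatever configuration is present at that S3 location.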

1 Answer

Not sure if this is the exact issue, but you have link_predict for task_type and that should be link_prediction.
https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-customizing-hyperparams.html

AWS
answered a month ago
  • No, it's not working