Why doesn't my SageMaker Studio Classic notebook in VPC only mode connect with my KernelGateway app?

There are connectivity issues between my Amazon SageMaker Studio Classic notebook in VPC only mode and my KernelGateway app.

Short description

When you use SageMaker Studio Classic in VPC only mode and can't launch a kernel, you might see one of these error messages:

  • "SageMaker Studio Classic is unable to connect KernelGateway App. In VPCOnly mode, please ensure that security groups allow TCP traffic within the security group."
  • "Failed to start kernelFailed to launch app [None]. SageMaker Studio Classic is unable to reach SageMaker endpoint. Please ensure your VPC has connectivity to SageMaker via Internet or VPC Endpoint. If you are using VPC Endpoints, please ensure Security Groups allows traffic between Studio Classic and VPC endpoints."

These error messages appear because there are connection issues between SageMaker instances, your VPC domain endpoints, or the internet. To resolve the issue, confirm the following:

  • The security groups for SageMaker Studio Classic are configured correctly.
  • Your subnet has the correct VPC endpoints.
  • Your domain is connected to a private subnet, with an active NAT gateway added to your route table.
  • You set up your SageMaker Studio Classic to connect to private subnets.

Resolution

Configure the security groups for SageMaker Studio Classic

You must allow Network File System (NFS) traffic between the domain and the Amazon Elastic File System (Amazon EFS) volume over TCP on port 2049. SageMaker Studio Classic stores its data on Amazon EFS, so your security group rules must allow both the inbound and the outbound NFS connections.

To allow inbound traffic from SageMaker Studio Classic to Amazon EFS, complete the following steps:

  1. Open the Amazon VPC console.

  2. In the navigation pane, choose Security Groups.

  3. Select the security group that's attached to your Amazon EFS mount target.

  4. Choose Actions, and then choose Edit inbound rules.

  5. Choose Add rule, and then complete these actions:

    For Type, choose NFS.

    For Source, choose Custom, and then enter the security group that's attached to the SageMaker Studio Classic domain.

  6. Choose Save rules.
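
The same inbound rule can also be added from code. The following is a minimal sketch that assumes the AWS SDK for Python (Boto3). The function names and the security group IDs in the usage note are hypothetical, and the EC2 client is passed in as an argument so that the rule can be inspected before any API call is made.

```python
def nfs_ingress_rule(domain_sg_id):
    """Build an inbound rule that allows NFS (TCP 2049) from the domain's security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 2049,
        "ToPort": 2049,
        "UserIdGroupPairs": [{"GroupId": domain_sg_id}],
    }


def allow_nfs_from_domain(ec2_client, efs_sg_id, domain_sg_id):
    """Add the NFS rule to the security group of the Amazon EFS mount target.

    ec2_client is assumed to be a Boto3 EC2 client, such as boto3.client("ec2").
    """
    return ec2_client.authorize_security_group_ingress(
        GroupId=efs_sg_id,
        IpPermissions=[nfs_ingress_rule(domain_sg_id)],
    )
```

A call might look like allow_nfs_from_domain(boto3.client("ec2"), "sg-efs-example", "sg-domain-example"), where both IDs are placeholders. If the rule already exists, then the API call returns an error instead of adding a duplicate.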

You must allow TCP traffic within the security group for connectivity between the JupyterServer and KernelGateway apps. Because you created the Studio Classic domain in VPC only mode, you must specify at least one security group for your SageMaker Studio Classic domain resources. This security group must allow inbound traffic over TCP on ports 8192-65535 and all outbound traffic to 0.0.0.0/0.

To allow connectivity between the JupyterServer and KernelGateway apps, complete the following steps:

  1. Open the Amazon VPC console.

  2. In the navigation pane, choose Security Groups.

  3. Select the security group that you want to update.

  4. Choose Actions, and then choose Edit inbound rules.

  5. Choose Add rule, and then complete these steps:

    For Type, choose Custom TCP.

    For Port range, enter 8192-65535.

    For Source, choose Custom, and then enter the security group ID of the security group that you're editing.

  6. Choose Save rules.
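
To confirm that a security group already has this self-referencing rule, you can inspect the output of the Amazon EC2 DescribeSecurityGroups API. The following is a minimal sketch in Python. The function names are hypothetical, and the check is deliberately simple: it looks for a single rule that covers the whole port range and doesn't combine several narrower rules.

```python
def self_referencing_tcp_rule(sg_id, low=8192, high=65535):
    """Build an inbound TCP rule whose source is the security group itself."""
    return {
        "IpProtocol": "tcp",
        "FromPort": low,
        "ToPort": high,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }


def allows_kernel_traffic(security_group, low=8192, high=65535):
    """Return True if one inbound rule allows TCP low-high from the group itself.

    security_group is one entry of the "SecurityGroups" list that
    DescribeSecurityGroups returns.
    """
    group_id = security_group["GroupId"]
    for perm in security_group.get("IpPermissions", []):
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if group_id not in sources:
            continue  # the rule doesn't reference the group itself
        if perm.get("IpProtocol") == "-1":
            return True  # all protocols and all ports
        if (perm.get("IpProtocol") == "tcp"
                and perm.get("FromPort", 0) <= low
                and perm.get("ToPort", 0) >= high):
            return True
    return False
```

With Boto3, the input could come from boto3.client("ec2").describe_security_groups(GroupIds=[sg_id])["SecurityGroups"][0], where sg_id is the ID of the security group for your domain.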

When you access resources in your Amazon VPC from your SageMaker Studio Classic notebook, SageMaker service account traffic goes through your elastic network interface. Both JupyterServer and KernelGateway apps are in your SageMaker service account VPC. They communicate with each other through the elastic network interfaces that are attached to your VPC.

These apps are part of the SageMaker Studio Classic domain service account, but they run on different Amazon Elastic Compute Cloud (Amazon EC2) instances. The apps use ephemeral ports to connect to each other, so there's no single port that you can open for them. It's a best practice to allow all TCP ports in a self-referencing security group rule. For more information, see Dive deep into Amazon SageMaker Studio Classic notebooks architecture.

Confirm that your subnet has the correct VPC endpoints

If your SageMaker Studio Classic resources don't require access to the internet, then you don't need to add a NAT gateway. However, a Studio Classic notebook requires the following endpoints to run. Note that for these examples, your-aws-region represents the AWS Region that you're using:

  • SageMaker API: com.amazonaws.your-aws-region.sagemaker.api
  • SageMaker runtime: com.amazonaws.your-aws-region.sagemaker.runtime

Create the following endpoints to access Amazon Simple Storage Service (Amazon S3) and Project templates. Note that for these examples, your-aws-region represents the AWS Region that you're using:

  • For Amazon S3: com.amazonaws.your-aws-region.s3
  • For Amazon SageMaker Project templates: com.amazonaws.your-aws-region.servicecatalog
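
To see which of these endpoints are missing from a VPC, you can compare the service names above with the endpoints that already exist. The following is a minimal sketch in Python. The function and constant names are hypothetical, and the list of existing service names is assumed to come from the ServiceName field of the Amazon EC2 DescribeVpcEndpoints output.

```python
REQUIRED_SUFFIXES = ("sagemaker.api", "sagemaker.runtime")
OPTIONAL_SUFFIXES = ("s3", "servicecatalog")


def endpoint_service_names(region, suffixes=REQUIRED_SUFFIXES + OPTIONAL_SUFFIXES):
    """Expand the service suffixes into full endpoint service names for a Region."""
    return [f"com.amazonaws.{region}.{suffix}" for suffix in suffixes]


def missing_endpoints(region, existing_service_names,
                      suffixes=REQUIRED_SUFFIXES + OPTIONAL_SUFFIXES):
    """Return the expected service names that have no endpoint in the VPC yet."""
    existing = set(existing_service_names)
    return [name for name in endpoint_service_names(region, suffixes)
            if name not in existing]
```

For example, if a VPC in us-east-1 has endpoints only for the SageMaker API and Amazon S3, then missing_endpoints returns the SageMaker runtime and Service Catalog service names.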

To associate the security groups for your VPC with these VPC endpoints, complete these steps:

  1. Open the Amazon VPC console.

  2. In the navigation pane, choose Endpoints.

  3. Choose the endpoint that you want to update.

  4. Choose Actions, and then choose Manage security groups.

  5. Select the security group that must be associated with this endpoint.

  6. Choose Save.
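
The console steps above can also be sketched in code. The following minimal example assumes a Boto3 EC2 client and an endpoint description taken from the DescribeVpcEndpoints output. The function name is hypothetical. Security groups apply to interface endpoints, so this sketch isn't meant for a gateway endpoint such as the gateway type of the Amazon S3 endpoint.

```python
def attach_security_group(ec2_client, endpoint, sg_id):
    """Associate sg_id with an interface endpoint if it isn't attached yet.

    endpoint is one entry of the "VpcEndpoints" list from DescribeVpcEndpoints.
    Returns True if a change was requested, False if nothing had to be done.
    """
    attached = {group["GroupId"] for group in endpoint.get("Groups", [])}
    if sg_id in attached:
        return False
    ec2_client.modify_vpc_endpoint(
        VpcEndpointId=endpoint["VpcEndpointId"],
        AddSecurityGroupIds=[sg_id],
    )
    return True
```

Checking the current Groups list first keeps the call from being sent when the security group is already associated with the endpoint.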

For more information, see Give SageMaker training jobs access to resources in your Amazon VPC and VPC only communication with the internet.

Connect your domain to a private subnet and active NAT gateway

If your SageMaker Studio Classic resources require access to the internet, first configure your SageMaker Studio Classic to connect to private subnets. Then, create a NAT gateway and route internet-bound traffic from your private subnet to the NAT gateway through the subnet's route table. For more information, see How do I set up a NAT gateway for a private subnet in Amazon VPC?

Note: In VPC only mode, a SageMaker Studio Classic domain that's connected to a public subnet can't connect to the internet.
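
To check whether a private subnet's route table sends internet-bound traffic to a NAT gateway, you can inspect the output of the Amazon EC2 DescribeRouteTables API. The following is a minimal sketch in Python with a hypothetical function name. It looks only for an active IPv4 default route (0.0.0.0/0) that targets a NAT gateway.

```python
def has_nat_default_route(route_table):
    """Return True if the route table has an active 0.0.0.0/0 route to a NAT gateway.

    route_table is one entry of the "RouteTables" list from DescribeRouteTables.
    """
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("NatGatewayId", "").startswith("nat-")
                and route.get("State") == "active"):
            return True
    return False
```

A subnet that has no explicit route table association uses the main route table of the VPC, so check that table when a filter on the subnet ID returns no results.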

Confirm that your VPC meets the requirements

If you launched your SageMaker Studio Classic in VPC only mode, then your VPC must meet the following requirements:

  • Subnets must have enough available IP addresses for the instance.
  • To allow internet access, associate your SageMaker domain with a private subnet during domain creation. Use a NAT gateway for internet access.
  • If you use a VPC endpoint to run SageMaker APIs, then the attributes Enable DNS hostnames and Enable DNS Support must be set to true for your VPC. This is required for your VPC to connect to the SageMaker API endpoint when it runs the kernel.
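
Two of these requirements can be checked from code. The following is a minimal sketch that assumes a Boto3 EC2 client for the DNS attributes and the Subnets list from the DescribeSubnets output for the IP address check. The function names and the minimum threshold are hypothetical.

```python
def dns_attributes_enabled(ec2_client, vpc_id):
    """Read the two DNS attributes of a VPC and return them as booleans."""
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport")
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames")
    return {
        "enableDnsSupport": support["EnableDnsSupport"]["Value"],
        "enableDnsHostnames": hostnames["EnableDnsHostnames"]["Value"],
    }


def subnets_low_on_addresses(subnets, minimum=1):
    """Return the IDs of subnets with fewer than `minimum` available IP addresses.

    subnets is the "Subnets" list from DescribeSubnets.
    """
    return [subnet["SubnetId"] for subnet in subnets
            if subnet["AvailableIpAddressCount"] < minimum]
```

Both DNS values must be True when you use a VPC endpoint for the SageMaker API. The right value for minimum depends on how many apps you expect to launch in each subnet.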

Use AWS Command Line Interface (AWS CLI) commands to make sure that the correct security groups are attached to the domain. To update your Studio Classic domain's DefaultUserSettings to use the new security group, run the following update-domain command. Replace your-domain-id and your-security-group-id with your values:

aws sagemaker update-domain --domain-id your-domain-id --default-user-settings SecurityGroups=your-security-group-id

Note: Before you run this command, you must delete all the apps with InService status from your user profiles.
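
To find the apps that still have InService status, you can list the apps in the domain. The following is a minimal sketch that assumes a Boto3 SageMaker client is passed in. The function names are hypothetical. Deleting an app stops its running kernels, so review the list before you call the delete function.

```python
def list_all_apps(sm_client, domain_id):
    """Collect every app in the domain, following NextToken pagination."""
    apps, token = [], None
    while True:
        kwargs = {"DomainIdEquals": domain_id}
        if token:
            kwargs["NextToken"] = token
        page = sm_client.list_apps(**kwargs)
        apps.extend(page.get("Apps", []))
        token = page.get("NextToken")
        if not token:
            return apps


def in_service_user_apps(apps):
    """Keep only the user profile apps that are currently InService."""
    return [app for app in apps
            if app.get("Status") == "InService" and "UserProfileName" in app]


def delete_apps(sm_client, domain_id, apps):
    """Request deletion of each app and return the (profile, type, name) tuples."""
    deleted = []
    for app in apps:
        sm_client.delete_app(
            DomainId=domain_id,
            UserProfileName=app["UserProfileName"],
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
        deleted.append((app["UserProfileName"], app["AppType"], app["AppName"]))
    return deleted
```

A typical sequence is in_service_user_apps(list_all_apps(client, domain_id)) to review the apps, followed by delete_apps for the ones that you want to remove. App deletion is asynchronous, so wait until the apps no longer show InService status before you update the domain.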

Alternatively, to reconfigure the domain, recreate it with the necessary security groups attached.

After the update-domain command is successful, use the describe-domain command to check your domain:

aws sagemaker describe-domain --domain-id d-xyzxyz

In the output, the SecurityGroups parameter lists all the security groups for the VPC that Studio Classic uses for communication.
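
To compare the describe-domain output with the security groups that you expect, you can read the SecurityGroups list under DefaultUserSettings. The following is a minimal sketch in Python with a hypothetical function name. The domain argument is assumed to be the parsed JSON output of describe-domain, or the response of the Boto3 describe_domain call.

```python
def missing_domain_security_groups(domain, expected_sg_ids):
    """Return the expected security group IDs that the domain doesn't list.

    domain is the DescribeDomain response as a dictionary.
    """
    attached = set(domain.get("DefaultUserSettings", {}).get("SecurityGroups", []))
    return sorted(set(expected_sg_ids) - attached)
```

An empty list means that every expected security group is attached to the domain's default user settings.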

Then, launch SageMaker Studio Classic and confirm that the notebook runs correctly. To test the internet connectivity, run !curl amazon.com from within a notebook cell.

Note: If you get errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Make sure that you're using the most recent AWS CLI version.

For the updated settings to take effect, delete the JupyterServer app, and then start a new one. After you update the Amazon VPC settings, open SageMaker Studio Classic again from your user profile. For more information, see Connect SageMaker Studio Classic notebooks in a VPC to external resources.

Additional troubleshooting

If only one user experiences this issue, then check whether the default app launched before the VPC updates completed. In this case, the default JupyterServer app didn't automatically update to use the new VPC configuration.

If the default JupyterServer app launched weeks or months ago, then there might be large log files and temporary files in the app. Recreate the default app to free up space and to make sure that the app uses the updated VPC configuration.

If SageMaker Studio Classic users are configured with a different execution role, then the connectivity issue can also occur. Be sure that the execution role permissions for the users include the required policies. These policies must allow the execution role to run the DescribeApp action that's required to create Studio Classic notebooks. After you update these permissions for the execution role, provision Studio Classic notebooks in VPC only mode.
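
One way to check the role's permissions is the IAM policy simulator. The following is a minimal sketch that assumes a Boto3 IAM client is passed in and that the caller is allowed to run the simulation. The function name is hypothetical, and the simulation result might not reflect every control that applies in your account.

```python
def denied_actions(iam_client, role_arn, actions=("sagemaker:DescribeApp",)):
    """Return the actions that the policy simulation doesn't report as allowed."""
    response = iam_client.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
    )
    return [result["EvalActionName"]
            for result in response["EvaluationResults"]
            if result["EvalDecision"] != "allowed"]
```

An empty list suggests that the role's policies allow the listed actions. If sagemaker:DescribeApp appears in the result, then update the role's policies before you launch the notebook again.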

AWS OFFICIAL
Updated a month ago