Trying to set up a VPN with a Site-to-Site connection and Client VPN access to the site

Here is the current state of the VPN - Config of VPN

Here are the issues I have to deal with -

The AWS Transfer family needs to still have clear access to the internet for other sites not currently using a VPN tunnel (if I can get this working, maybe we swap to a transit gateway)
The Client connection for the VPN needs to allow traffic to the On-Premise network and the internet
The On-Premise network should route all traffic (internet or AWS service) through the VPN.

The HMI uses SES for email and the AWS Transfer Family FTP service for delivering log data to S3. The local HMI, PLC, and Web cams all route through the on-prem router with static local IPs in terms of devices that would need to communicate with the internet. There are a few other devices that route through the local network and don't communicate with the internet.

The on-prem router is a Trendnet TI-W100 (which has woefully bad documentation), and I can get the IPsec connection up. So far, though, I am having a hard time forcing the on-prem traffic through the VPN, or even getting any of the local connections to work over it, or at least I think I am. The customer's expectation is that the router's external IP is not reachable by users over the internet unless they are on the VPN client. The AWS services still need to work in this setup as well, and so far FTP and SES stop working whenever I do get the IPsec tunnel on the router connected.

Here are the router screens and what I see: Static Routing Config

As far as I can tell Destination IP should be the Inside IP Next Hop and the Gateway is the inside IP Customer Gateway.

First Part of Tunnel Setup

This is the first portion of the tunnel setup and I think that is correct.

Second Part of Tunnel

This is the second part of the tunnel setup. Local Subnet is set to the local LAN IP range, Remote is set to the VPC subnet range (is that right?) the gateway is the tunnel outside IP address (which is the VPG IP).

Here are the tunnels showing as up: Tunnels Up

What I can't seem to achieve, and maybe I am wrong about how this works or how this router works, is blocking direct access: I can still remote into the router via its internet-facing static IP without being on the AWS VPN client, which I wouldn't expect to be able to do. And when I am on the Client VPN, I have no connection to the internet and cannot seem to connect to any of the local devices behind the router or to the router itself.

I suspect I have something in the router wrong and/or some route table issues. The VPC route table for the public subnet does have the on prem IP range propagated to the VGW, an IGW with 0.0.0.0/0 and the local IP range. The private has a NAT with 0.0.0.0/0 and the local CIDR. Security group is allowing 21 (For the FTP LB)/443/80/819-8200(FTP control plane) inbound and outbound 0.0.0.0/0 to ALL.

Here are the various IP CIDR's:

Client Endpoint - 192.0.0.0/16
Local IPv4 network CIDR - 172.20.104.0/24 (On Prem LAN Range)
Remote IPv4 network CIDR - 172.31.0.0/16 (VPC Range)

I have watched about 20 hours of Youtube and at least double that of googling for examples. Maybe there is a tactical crayon version of this out there that can help or someone can explain it in such a way that I can figure out why this isn't working the way I expect.


Mav

asked 2 months ago, 370 views

7 Answers


Before starting to think the routing topic through, to be sure I understand this right, is the symbol between the private and public subnets a NAT gateway? Or does it depict the transit gateway? If it's a NAT gateway, in which subnet is it located?

The AWS client VPN endpoint has a route table attached to it (https://docs.aws.amazon.com/vpn/latest/clientvpn-admin/cvpn-working-routes.html). Which routes does it contain, and do all the client VPN endpoint routes have the public subnet in the diagram as their next hop?

VPC routing has some special rules, but in your case, all the VPC routes (meaning the ones in the route tables attached to your subnets, not the TGW) are evaluated based on the longest prefix match rule. 0.0.0.0/0 will not match any packets for which a more specific route, like a /16, /24, or /27, exists. Route metrics are hardwired into the VPC routing rules and cannot be modified, but they also won't prevent anything in your setup from working.
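To make the longest-prefix-match rule concrete, here is a minimal Python sketch. It is only an illustration: the three CIDRs are the ones mentioned in this thread, while the target labels ("local", "vpn-attachment", "igw") are placeholders and not an export of the actual route table.

```python
import ipaddress

# Illustrative routes only: CIDRs come from this thread, targets are placeholder labels.
ROUTES = {
    "0.0.0.0/0": "igw",
    "172.31.0.0/16": "local",
    "172.20.104.0/24": "vpn-attachment",
}


def lookup(destination, routes=ROUTES):
    """Return the target of the most specific route that contains destination."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The longest prefix wins; the order in which routes are listed is irrelevant.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


for dest in ("172.31.5.10", "172.20.104.25", "8.8.8.8"):
    print(dest, "->", lookup(dest))
```

The default route is listed first on purpose: it still only catches 8.8.8.8, because both private destinations have a more specific match.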

EXPERT

Leo K

answered 2 months ago

Mav
2 months ago
There is a NAT Gateway in the Private Subnet. It has a 0.0.0.0/0 route that points to it. Nothing from the VPN goes to the private subnet. The client endpoint has the following routes: Default - 172.31.0.0/16, add-route - 172.20.104.0/24 (I added this for the on-prem network), add-route - 0.0.0.0/0 (without this there seems to be no ability to get to the internet). All three are NAT type. I am currently connected to the client endpoint as I type this.

The TGW ALSO has a RTB. There is only the VPC CIDR listed as propagated. I can add the on prem CIDR as a static route and I have done that which seems to work as stated for about 24 hours.
Mav
2 months ago
DPD is another potential issue. I am wondering if the DPD settings on the tunnels aren't allowing the reconnection after traffic is not flowing. I still have to address FTP/SES communication once I get this stable.

Currently, the DPD is set to default settings for AWS and I matched the on prem router for timeout of 40 and delay of 10. I have no control over the number of retries in the on prem router though. I am wondering if I need to change the startup action to Start instead of Add or the DPD timeout action to Restart.
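For anyone wanting to script the change being considered here, the following is a rough Python sketch using boto3's EC2 `modify_vpn_tunnel_options` call. The timeout of 40 comes from the comment above; the "restart" and "start" values are the options being weighed, not a confirmed fix, and the connection ID and outside IP passed to `apply_tunnel_options` would be your own. The validation limits reflect my understanding of the EC2 API and should be checked against the current documentation.

```python
def build_tunnel_options(dpd_timeout_seconds=40,
                         dpd_timeout_action="restart",
                         startup_action="start"):
    """Build the TunnelOptions dict for EC2 ModifyVpnTunnelOptions."""
    # Constraints below are as I understand the EC2 API; verify before relying on them.
    if dpd_timeout_seconds < 30:
        raise ValueError("DPD timeout must be at least 30 seconds")
    if dpd_timeout_action not in {"clear", "none", "restart"}:
        raise ValueError("unknown DPD timeout action: %r" % dpd_timeout_action)
    if startup_action not in {"add", "start"}:
        raise ValueError("unknown startup action: %r" % startup_action)
    return {
        "DPDTimeoutSeconds": dpd_timeout_seconds,
        "DPDTimeoutAction": dpd_timeout_action,
        "StartupAction": startup_action,
    }


def apply_tunnel_options(vpn_connection_id, tunnel_outside_ip, options):
    """Send the options to AWS. Needs boto3 and credentials; not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.modify_vpn_tunnel_options(
        VpnConnectionId=vpn_connection_id,
        VpnTunnelOutsideIpAddress=tunnel_outside_ip,
        TunnelOptions=options,
    )


print(build_tunnel_options())
```

Each tunnel is modified separately, identified by its outside IP address, so the call would be made once per tunnel.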
Mav
2 months ago
DPD seems to have been the issue; I got that set up correctly and have now kept the tunnels for 24 hours without losing them. I added an SES endpoint in the VPC but am still not seeing emails. I am going to add 587/465 to the security group as an extra measure, although I already allow all traffic from within that security group. Maybe the VPN isn't counted as inside, so we'll see.
Mav
2 months ago
We made it 48 hours this time; last night I wasn't able to connect to the on-prem network. This morning it is back and I am able to connect. I have gone through the tunnel logs and there is this (right about the time I tried to connect):

2024-06-17T18:27:52.000-07:00 {"event_timestamp":1718674072,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":true,"ike_phase1_state":"established","ike_phase2_state":"established"}

2024-06-17T18:27:52.000-07:00 {"event_timestamp":1718674072,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":false,"ike_phase1_state":"down","ike_phase2_state":"down"}

Then again

2024-06-17T18:28:12.000-07:00 {"event_timestamp":1718674092,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":true,"ike_phase1_state":"established","ike_phase2_state":"established"}

2024-06-17T18:28:12.000-07:00 {"event_timestamp":1718674092,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":false,"ike_phase1_state":"down","ike_phase2_state":"down"}

This happens between 1827 and 1829 local yesterday evening. Nothing I can see as to why yet.
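When scanning a longer log export for these events, a small parser can help. This is a minimal Python sketch that assumes every line has the same shape as the four entries quoted above (a timestamp, a space, then a JSON object); the helper names are made up for illustration.

```python
import json

# The four entries quoted above, copied verbatim.
SAMPLE = """\
2024-06-17T18:27:52.000-07:00 {"event_timestamp":1718674072,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":true,"ike_phase1_state":"established","ike_phase2_state":"established"}
2024-06-17T18:27:52.000-07:00 {"event_timestamp":1718674072,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":false,"ike_phase1_state":"down","ike_phase2_state":"down"}
2024-06-17T18:28:12.000-07:00 {"event_timestamp":1718674092,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":true,"ike_phase1_state":"established","ike_phase2_state":"established"}
2024-06-17T18:28:12.000-07:00 {"event_timestamp":1718674092,"details":"AWS tunnel is sending DPD request","dpd_enabled":true,"nat_t_detected":false,"ike_phase1_state":"down","ike_phase2_state":"down"}
"""


def parse_line(line):
    """Split one log line into its timestamp and JSON payload."""
    logged_at, _, payload = line.strip().partition(" ")
    record = json.loads(payload)
    record["logged_at"] = logged_at
    return record


def down_events(text):
    """Return the records in which either IKE phase is reported as down."""
    records = [parse_line(line) for line in text.splitlines() if line.strip()]
    return [
        record for record in records
        if "down" in (record["ike_phase1_state"], record["ike_phase2_state"])
    ]


for event in down_events(SAMPLE):
    print(event["logged_at"], event["ike_phase1_state"], event["ike_phase2_state"])
```

Run over a full day of logs, the same filter would show whether the "down" entries cluster around particular times.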

If I'm understanding your setup correctly, there are two site-to-site VPN tunnels between the on-premises location and the VGW in AWS. In the diagram, I believe it should show as the on-prem router being the "customer gateway" with the tunnels leading to the virtual private gateway (VGW) in AWS.

It won't be possible to route traffic from the on-premises network to the internet via AWS this way. Even if routing were set up to send traffic from on-prem to the internet via the VPN tunnel (which it doesn't seem to be), the packets sent towards the internet would carry the 172.20.104.0/24 private IPs as their source addresses. When they arrived at the IGW, they wouldn't get any further, because there is nothing along the way to translate the private source IPs to internet-routable public IPv4 addresses.

It's possible to attach a VPC route table to a VGW to determine where it sends traffic coming in through the VPN tunnels, but it doesn't seem to support NAT gateways as targets.

I think you can make your setup work by getting rid of the VGW and setting up a transit gateway (TGW) instead. Attach the VPN tunnel to the TGW. Set the TGW's VPN attachment to use a TGW route table that has its default route pointed to a VPC attachment, which is connected to private subnets in some appropriate VPC (let's call it "outbound VPC") and where the route tables of those private subnets have the default route pointed to NAT gateways. The NAT gateways will need to be in public subnets, with the default route pointed to an IGW, and private IPs (including the on-prem private IPs) pointed towards the TGW attachment. Make sure to configure routes in the same TGW route table for the private IPs in AWS needing on-prem connectivity (like those containing the AWS Transfer service endpoints and NLB) towards the VPC attachments in the VPC shown in your diagram.

This way, routing on AWS's side should have everything properly configured. If a packet to an internal VPC destination arrives through the VPN, the TGW route table will send it to the VPC attachment in the VPC in your diagram. If a packet to a public internet destination arrives, the TGW will send it to one of the private subnets of the "outbound VPC". The VPC route table of the private subnet of the outbound VPC will take over, sending the packet to the NAT gateway in the public subnet. After the NAT gateway has translated the packet's source IP to the public Elastic IP (EIP) of the NAT gateway, the packet will be sent by the public subnet's route table through the IGW to the public internet. This address translation is the specific part that was missing from the original design altogether.
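The packet walk described above can be traced with a toy model. This Python sketch is purely illustrative: the table contents mirror the design in this answer, the attachment and gateway names are placeholder labels, and the Elastic IP 198.51.100.7 is a made-up documentation address.

```python
import ipaddress

# Placeholder route tables mirroring the design described above.
TGW_VPN_TABLE = {"172.31.0.0/16": "main-vpc", "0.0.0.0/0": "outbound-vpc"}
OUTBOUND_PRIVATE_TABLE = {"0.0.0.0/0": "nat-gateway"}
OUTBOUND_PUBLIC_TABLE = {
    "0.0.0.0/0": "internet-gateway",
    "172.20.104.0/24": "tgw",
    "172.31.0.0/16": "tgw",
}
NAT_EIP = "198.51.100.7"  # hypothetical Elastic IP of the NAT gateway


def lookup(table, destination):
    """Longest-prefix match of destination against one route table."""
    address = ipaddress.ip_address(destination)
    best = None
    for cidr, target in table.items():
        network = ipaddress.ip_network(cidr)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, target)
    return best[1] if best else None


def trace(source, destination):
    """Return (source IP after any NAT, list of (hop, chosen target))."""
    hops = [("tgw route table", lookup(TGW_VPN_TABLE, destination))]
    if hops[0][1] != "outbound-vpc":
        return source, hops  # internal destination: no NAT involved
    hops.append(("outbound private subnet", lookup(OUTBOUND_PRIVATE_TABLE, destination)))
    source = NAT_EIP  # the NAT gateway rewrites the private source address
    hops.append(("outbound public subnet", lookup(OUTBOUND_PUBLIC_TABLE, destination)))
    return source, hops


print(trace("172.20.104.25", "8.8.8.8"))
print(trace("172.20.104.25", "172.31.0.10"))
```

The first trace leaves with the NAT gateway's address as its source, which is the translation step the original design lacked; the second never touches the outbound VPC.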

That leaves the on-prem side still to be configured. Currently, you're only sending the VPC's internal IPs to the VPN tunnel, so internet traffic won't get sent to the tunnel at all, as you observed. I'm not sure if your on-prem router will accept 0.0.0.0/0 as a remote network for the tunnel, but you can try. It may not accept it or may not work even if it does, because the VPN tunnel itself has to stay outside the tunnel. If setting 0.0.0.0/0 as a remote network doesn't work, you could try splitting all internet IPs to smaller CIDRs, starting with 0.0.0.0/1 (=0.0.0.0 through 127.255.255.255, assuming you aren't using any private IPs under 10.0.0.0/8), 128.0.0.0/3 next, 160.0.0.0/5, and so on. 169.254.0.0/16 might, again, need to be carved out, since it's used in the /30 subnets for the tunnel. If that configuration is accepted but also doesn't work, that may be because the VPN tunnel endpoint IPs might attempt to get sent to the tunnel. You might be able to override that by configuring static /32 routes for AWS's VPN tunnel endpoint IPs pointed towards the on-prem default gateway (probably the ISP's CPE or edge router), or by subtracting them from the "all public IPs" CIDRs configured as remote networks for the tunnel. Whether any of these workarounds are needed would depend on how your on-prem device is (or isn't) designed to work.
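Working out the split CIDRs by hand is error-prone, so here is a small Python sketch that computes them with the standard `ipaddress` module. The link-local range 169.254.0.0/16 is the carve-out mentioned above; the tunnel endpoint 203.0.113.10 is a made-up documentation address standing in for the real AWS tunnel outside IPs.

```python
import ipaddress


def subtract(base, exclusions):
    """Return sorted CIDRs covering base minus every network in exclusions."""
    remaining = [ipaddress.ip_network(base)]
    for excluded in map(ipaddress.ip_network, exclusions):
        kept = []
        for network in remaining:
            if excluded.subnet_of(network):
                # Split the network into the pieces that surround the exclusion.
                kept.extend(network.address_exclude(excluded))
            elif network.subnet_of(excluded):
                continue  # entirely inside the exclusion: drop it
            else:
                kept.append(network)  # disjoint: keep unchanged
        remaining = kept
    return sorted(remaining)


# Everything except the link-local range used by the tunnel inside addresses.
internet = subtract("0.0.0.0/0", ["169.254.0.0/16"])
for network in internet:
    print(network)

# Hypothetical tunnel endpoint IP (documentation range), also carved out.
with_endpoint = subtract("0.0.0.0/0", ["169.254.0.0/16", "203.0.113.10/32"])
print(len(with_endpoint), "CIDRs once the endpoint /32 is excluded as well")
```

Excluding only 169.254.0.0/16 yields 16 CIDRs, which is manageable; each additional /32 carve-out adds many more, so static /32 routes for the endpoints may be the less painful option if the router supports them.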

On another note, you should use IKEv2 and not IKEv1 for the tunnel. IKEv2 is simpler, more secure, and performs better. IKEv1 is less secure, overly complicated, ambiguously defined, and performs more poorly than IKEv2.

EXPERT

Leo K

answered 2 months ago

Mav
2 months ago
Well, the site-to-site setup has Tunnel 1 and Tunnel 2 because it says that is the best way to set up HA. The router I am using on prem (customer gateway) has a terrible interface and worse documentation, but I did (seemingly) get those tunnels up into AWS. I have a client endpoint up and working. The reason the client endpoint's connected devices need to reach the internet is that I am remote, in another state, and there are times when I need to be able to remote in to the on-prem network while maintaining an internet connection. I am hoping that I can add an auth rule for client to 0.0.0.0/0 and it will only let client devices using the VPN client connect to the internet and not the on-prem. After that, I just have to figure out how to get traffic from the on-prem to the load balancer.

I'd love to use v2, but I cannot seem to get that option to be available. Everything I create is stuck in ipsec.1; nor can I find a good guide to setting it up that way. A lot of the documentation/guides must assume that the person setting up the VPN already knows how to do it.

As for my other questions - you're right there is no real feasible way to not allow connection to the router device from the internet and it shouldn't even be required.

I have redrawn the drawing though and I think I only have to figure out if I can allow internet access to all devices or if they are going to require I isolate the on-prem network to only allow traffic to the VPN for any of the local network devices.

New Setup Diagram

I am guessing that they will require that traffic only go through the VPN and not be able to touch the public internet even through the VPN. I will continue to work on trying to find a guide that outlines this kind of setup.

The questions:

Is there a way to set up the authorization rule such that anything coming from the customer gateway cannot access the internet, but anything not coming from that IP range can?
If I do that, how do I get the traffic to the NLB (and then to Transfer Family) from a VPN connection that has no internet route?

Mav

answered 2 months ago

Mav
2 months ago
Ok, I got the IKEv2 working. So that is good; I now have to figure out the configuration of the on-premises devices to route the traffic correctly to the VPN.

Is the AWS Transfer service supposed to be accessible only by the on-premises devices and client VPN users, or also from everywhere on the public internet? I'm just asking since it's a major source of complexity in this setup and seems odd to have plaintext FTP permitted from the internet at the same time as requiring elaborate VPN protections for the on-premises and client VPN access.

There's a simple but unfortunately somewhat costly option. You'd set up a second AWS Transfer server for your internal users. You'd configure it with the VPC endpoint type, the same as the existing Transfer server, but instead of configuring the public IP address of the NLB as the PassiveIp parameter, as you probably had to do for the existing server, you'd use the default AUTO configuration for PassiveIp to have clients establish the FTP data channels to the same private IPs of the Transfer server endpoints that would be used for the FTP control channel.

This would make the new Transfer server accessible only inside your VPC, and it would be reachable from both the on-premises network and the client VPN with what you've set up already. Internal users could either connect to the native DNS name of the Transfer service endpoint, or if they're using DNS names to connect currently, you could set up an internal DNS alias for the existing name that you would point to the private IP(s) of the new Transfer server.

The major drawback of this simple solution/workaround would be that the per-hour fixed fee is charged separately for each Transfer server/protocol combination, so this would double your fixed Transfer service costs.

If the Transfer service is only used by the clients on premises and behind client VPN, then you should get rid of the NLB, set the Transfer server to use AUTO for PassiveIp, and point the clients (or the DNS name they might be using) to the native, private IPs inside the VPC of the Transfer server's VPC endpoints rather than the public IP of the NLB.

Restricting access from the on-premises network to the public internet would have to be done on the Trendnet device. I haven't got access to one and as you said, documentation is nonexistent, but since it has IPsec VPN capability, I would expect there to be some form of ACL support or local firewalling feature as well.

EXPERT

Leo K

answered 2 months ago

Mav
2 months ago
Let me cover this in bullet points:

The HMI that are being used in this industrial setup have the ability to ONLY communicate with a plain FTP. It's a sore spot for me about these, but these are inherited systems that I have no control to change. CMore EA9 is the HMI.

The other systems which use the same FTP server all have different routers, but the same HMI. I could check and see if we can put them all on the VPN and use a transit gateway instead of a single virtual gateway. This kind of setup would probably require a much more involved VPC and Subnet setup in order to keep traffic separated between the systems. The reason they need to be separated is because they are different companies upstream but the same operator. If we do that then we can probably not have an NLB and externally facing FTP.

I'll make a new drawing after I see what I can see about multi-site VPN setups.

If you configure the CIDR range of your VPC, 172.31.0.0/16, as the "remote" CIDR range (=the AWS side of the tunnel) for each of the different companies' VPN tunnels, they will only be able to send traffic to and receive it from that single CIDR range. Traffic from one tunnel won't be able to reach any of the other tunnels. This setup will work with a VGW as well as a TGW (transit gateway).

Since a VPC can only have one VGW attached to it, terminating all the tunnels to one VGW isn't very scalable. Each VPN can only have one CIDR configured for each side of the tunnel, so if you ever needed to have more than the one /16 in AWS, the VGW-only design wouldn't allow that.

Since you're already thinking of switching to a TGW, I'd suggest switching to it from the start. It adds massive scalability in allowing up to 5,000 attachments (VPCs, VPNs, and Direct Connects combined). Traffic from each VPN tunnel or any type of attachment is also routed based on the TGW route table that is associated with the VPN attachment. Even if you allowed 0.0.0.0/0 in the VPN configuration for both sides of all the VPN tunnels but configured only one route for 172.31.0.0/16 in the TGW route table associated with all the VPN attachments, all the tunnels would only be able to reach that one VPC CIDR, despite 0.0.0.0/0 being allowed at the IPsec level for all the tunnels. Packets received from tunnel A with a destination behind tunnel B would never reach B. While IPsec would allow them to pass through the tunnel, they'd be dropped by the TGW, because all destinations other than 172.31.0.0/16 wouldn't have a matching route in the TGW route table associated with the source VPN.

With a TGW, you could also allow certain tunnels, like those belonging to the same customer, to talk to one another by creating a TGW route table that has routes for all the destinations that the company's tunnels are allowed to talk with and associating that route table with all the VPN tunnels of that customer. Or, if you wanted some or all VPNs to reach more than one VPC in AWS, you could associate all those tunnels with a TGW route table that recognises those destinations. Neither of these variations would be possible at all with a VGW.

The TGW implementation is rather simple. You just create the TGW and two route tables for it. One TGW route table is for your central VPC attachment, containing routes to all the site-to-site VPNs that need to be able to receive traffic from the VPC. This route table you'll attach to the VPC attachment leading to the central VPC. The second TGW route table is for all the site-to-site VPNs and will only contain a single route to the central VPC. You can attach this second TGW route table to all the VPN attachments of the different customers that only need to be able to talk to and receive traffic over the VPN from the 172.31.0.0/16 VPC.

Even if you connected dozens or hundreds of different customers' VPNs to the TGW, as long as they don't need to have connectivity with the other VPNs and only need to talk to the central VPC containing the Transfer server, you'd only need these two TGW route tables for the whole setup, regardless of the number of customer VPNs, up to the maximum of 5,000. Even that limit could be overcome by peering another TGW with the first one. So, no overly complex routing design or configuration is needed.
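The two-route-table design can be modelled in a few lines to show why sites stay isolated from each other. This Python sketch is a toy model only: the attachment names are placeholders, and 172.20.105.0/24 for a second site is a hypothetical CIDR (only 172.20.104.0/24 and 172.31.0.0/16 appear in this thread).

```python
import ipaddress

# Route table associated with every site-to-site VPN attachment.
VPN_TABLE = {"172.31.0.0/16": "central-vpc"}

# Route table associated with the central VPC attachment.
# The site-b CIDR is hypothetical and only here for illustration.
VPC_TABLE = {
    "172.20.104.0/24": "vpn-site-a",
    "172.20.105.0/24": "vpn-site-b",
}

# Which TGW route table each attachment is associated with.
ASSOCIATIONS = {
    "vpn-site-a": VPN_TABLE,
    "vpn-site-b": VPN_TABLE,
    "central-vpc": VPC_TABLE,
}


def forward(source_attachment, destination):
    """Pick the next attachment using the table tied to the source attachment."""
    table = ASSOCIATIONS[source_attachment]
    address = ipaddress.ip_address(destination)
    best = None
    for cidr, target in table.items():
        network = ipaddress.ip_network(cidr)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, target)
    return best[1] if best else "drop"  # no matching route: packet is discarded


print(forward("vpn-site-a", "172.31.0.10"))   # site A to the central VPC
print(forward("vpn-site-a", "172.20.105.9"))  # site A to site B
print(forward("central-vpc", "172.20.105.9")) # central VPC back to site B
```

Adding another customer in this model means one new entry in `VPC_TABLE` and one new association to `VPN_TABLE`; no new route table is needed.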

EXPERT

Leo K

answered 2 months ago

Mav
2 months ago
There would also be a client endpoint to deal with because there is the necessity to remotely control these HMIs, so the client endpoint probably needs a bunch of different route authorizations. The good thing is that anyone connecting to the client endpoint is expected to be able to connect to all the sites anyway. Though I suppose it might be a good idea to create multiple client endpoints or somehow figure out how to create a group for each site so that you have a different client profile for each site. I will start setting up the Transit gateway model.

Ok, Here is where I am right now with this

Latest VPN

I changed from the VGW to the TGW. I set up all the tunnels again, firewall/security group rules, etc. Yesterday I was able to connect to the VPN client and then ping and connect using the programming software between my local machine and the remote on-premise device. However, today I am not able to, and I have made absolutely no changes. I am a little at a loss as to why, but I suspect that the problem must revolve around the fact that I have an IGW in the same subnet and as part of the route table for the subnet. So, I think that the routing order in the subnet matters in this case, but I don't know how to set the routing metric. I would think that it should be Metric 1 - 172.31.0.0/16 -> Local, Metric 2 - 172.20.104.0/24 -> TGW, Metric 3 - 0.0.0.0/0 -> IGW in order to get traffic to route properly. I don't see a way to set the metric in a route table though. The TGW does not have 0.0.0.0/0 in its route table at this point. Even if I were to move the TGW to the Private Subnet in the same VPC, the NAT Gateway would still have the same issue of 0.0.0.0/0 routed to the NAT Gateway.

Mav

answered 2 months ago

I wanted to come back to this and post the answer: asymmetric VPN tunnel traffic. The on-site router, when both tunnels were up, would sometimes send and receive traffic on one tunnel but sometimes split it between the tunnels, and that would make it fail. So, using only one tunnel at a time is necessary with this particular router. It is likely I never had anything set up wrong; I just didn't know the particulars of this router.

Mav

answered 20 days ago
