Fixing my NFS Proxy

Posted by Brad on Fri 17 February 2023

You may or may not have read that I set up a k8s VM on my NAS. One of the primary reasons for the VM is to proxy nfs and http connections to my NAS. This means I essentially get free (as in no convoluted config) tls encryption and connectivity from any device that is connected to my Nebula vpn. And I don't have to worry at all about sideloading software or exposing the Synology to the internet.

For nfs I had to do a few special things with the nginx ingress controller; see the nginx ingress docs for more info on that.

At some point over the past twelve months, the nfs proxy stopped working. I'm not really sure when, since I haven't actually done much with it since proving it worked early last year.

root@rke1:~# ss -ltnp | grep 2049
root@rke1:~#

As you can see, it's no longer listening on the nfs port. I'm torn between tracking down the cause and simply putting the config in code so I can redeploy if it happens again. As a starting point, I built a test VM to prove that my original config really did work. It does; in fact, I configured the test VM by dumping the configs from the original server into it.
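For the record, "dumping the configs" amounted to roughly the following. This is a sketch, not a transcript: the paths are the standard rke2 locations, and the exact version pin is illustrative.

# copy the rke2 config and custom manifests from the original server,
# then install rke2 pinned to the matching release
mkdir -p /etc/rancher/rke2 /var/lib/rancher/rke2/server/manifests
scp root@rke1:/etc/rancher/rke2/config.yaml /etc/rancher/rke2/
scp root@rke1:/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx-config.yaml \
    /var/lib/rancher/rke2/server/manifests/
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.24.2+rke2r1 sh -
systemctl enable --now rke2-server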

Investigation

I suspect it might be related to the CIS profile. I applied the CIS-1.6 profile to my test server and the proxy kept working, so that's probably not the culprit.
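For reference, enabling the profile is a one-line change to the rke2 config. A minimal sketch, assuming config.yaml is otherwise already in place:

# append the CIS profile to the rke2 config and restart to apply
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
profile: "cis-1.6"
EOF
systemctl restart rke2-server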

Note to self: put these sysctl settings and the etcd user creation in code (rough sketch after the list).

kernel.panic = 10
kernel.panic_on_oops = 1
vm.overcommit_memory = 1
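Something like the snippet below is what I mean by "in code". The file name and the useradd flags are my choices, not anything rke2 mandates.

# CIS prerequisites: kernel settings plus a dedicated etcd user
cat <<'EOF' > /etc/sysctl.d/60-rke2-cis.conf
kernel.panic = 10
kernel.panic_on_oops = 1
vm.overcommit_memory = 1
EOF
sysctl --system

# system account for etcd, no login shell, no home directory
useradd -r -c "etcd user" -s /sbin/nologin -M etcd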

The rke2 versions are slightly different, so I'm going to upgrade test from 1.24.2 to 1.24.10 to see if it makes a difference. The cluster still seems operational, but kubectl commands are throwing an error about metrics.k8s.io/v1beta1.

E0217 22:54:52.400025  917397 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Just kidding; that error went away after about a minute. It must have been a component mid-upgrade.
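The upgrade on the test box was just a re-run of the installer pinned to the newer release, roughly like this (the exact +rke2r suffix is from memory):

# upgrade rke2 in place on the test server
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.24.10+rke2r1 sh -
systemctl restart rke2-server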

Success! The test server is no longer listening on the nfs port.

brad@testrke1:~$ ss -ltnp | grep 2049
brad@testrke1:~$

I guess that means I need to go looking through release notes and bug reports.

The ingress-nginx chart is at 4.1.0 in both rke2 1.24.10 and 1.24.2, but maybe there have been minor changes not captured in the rke2 release notes. This change in v1.24.3+rke2r1 looks interesting. Adding port 2049 to the network policy in prod did not make an immediate difference, and restarting rke2-server didn't help either.
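For completeness, this is the shape of the rule I was adding. I patched the existing policy, but since network policies are additive, dropping in a separate allow policy like this has the same effect; the name and selector below are placeholders, not the real rke2-managed objects.

kubectl -n kube-system apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nfs-ingress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: rke2-ingress-nginx
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 2049
EOF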

Downgrading testrke1 to 1.24.2 brought the tcp ports back. I guess I'll upgrade one patch release at a time.

That was sooner than expected: it broke in 1.24.3. It looks like the network policies don't get added until 1.24.4, which rules them out. Staying on that train of thought, I see this documentation update as well. I thought hostNetwork was set to true as an rke2 default. Rolling back to 1.24.2 to check the hostNetwork setting on the nginx ingress controller. And sure enough, it was true in 1.24.2 but false in 1.24.3.
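A quick way to check what the controller is actually running with. The daemonset name below is what the rke2 chart deploys on my clusters; list the daemonsets first if yours differs.

# find the ingress controller daemonset, then print its hostNetwork setting
kubectl -n kube-system get ds
kubectl -n kube-system get ds rke2-ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.hostNetwork}{"\n"}'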

It looks like flipping that back to true did the trick in test. Prod is taking a really long time to terminate the original nginx pod, so I'm impatiently restarting rke2. Bigger hammer: running the rke2-killall.sh script that I love so much.
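For anyone following along, the impatient sequence looks like this; the script path is the default install location, adjust if yours is elsewhere.

systemctl restart rke2-server        # first attempt
/usr/local/bin/rke2-killall.sh       # bigger hammer: kill everything rke2 started
systemctl start rke2-server          # bring it back up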

Winning!

root@rke1:~# ss -ltnp | grep 2049
LISTEN    0         511                0.0.0.0:2049             0.0.0.0:*        users:(("nginx",pid=603263,fd=20),("nginx",pid=603262,fd=20),("nginx",pid=603261,fd=20),("nginx",pid=603260,fd=20),("nginx",pid=603256,fd=20))
LISTEN    0         511                   [::]:2049                [::]:*        users:(("nginx",pid=603263,fd=21),("nginx",pid=603262,fd=21),("nginx",pid=603261,fd=21),("nginx",pid=603260,fd=21),("nginx",pid=603256,fd=21))

So this is another piece I need to put in code, but here is the magic that makes the above possible (at least the ingress controller part):

root@rke1:~# cat /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx-config.yaml
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      extraArgs:
        tcp-services-configmap: "kube-system/tcp-services"
      hostNetwork: true
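The other half is the tcp-services ConfigMap that the extraArgs line points at. I haven't pasted my real one here, so the data entry below is a placeholder showing the format: the key is the exposed port and the value is "<namespace>/<service>:<port>", where the service in my case fronts the NAS.

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: kube-system
data:
  # exposed port -> namespace/service:port (placeholder values)
  "2049": "default/nfs-backend:2049"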

Closing Notes

This turned out to be a relatively straightforward issue. It would have started a few months ago when I set up the rke2 update controller to automatically upgrade the cluster as new releases came out. Now that I have a test server, I can start putting more of this config in code. I'm going with fluxcd; I think the gitops style will be easier to manage than the infrequently used ansible code I have today.