# DevOps / Deploy Saga

This document explains how we deploy Galactus (GitHub Actions → cPanel users) and why certain relay/PM2/SSH choices exist after a difficult redeploy. Use it as a troubleshooting reference and to avoid removing fixes that make the relay work under PM2.

---

## Architecture (short)

- **Branch mapping:** `dev` → dv.gbuff.com (cPanel user dvgbuff, relay port 8888); `main` → kt.gbuff.com (cPanel user ktgbuff, relay port 8787).
- **Dashboard:** Static files in `~/public_html/dashboard/` per user.
- **Relay:** Source and runtime in `~/galactus-relay/` per user; started with PM2 via `ecosystem.config.cjs`.
- **Proxy:** nginx proxies relay paths (e.g. `/dashboard/relay/http`, `/dashboard/relay/ws`) to `localhost:PORT` for that user.
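
The branch mapping above can be read as a small lookup table. The sketch below is illustrative only: `DEPLOY_TARGETS` and `targetFor` are hypothetical names, not something the workflow defines, and the values are copied from the list above.

```javascript
// Hypothetical lookup table mirroring the branch mapping described above.
const DEPLOY_TARGETS = {
  dev: { host: 'dv.gbuff.com', user: 'dvgbuff', relayPort: 8888 },
  main: { host: 'kt.gbuff.com', user: 'ktgbuff', relayPort: 8787 },
};

// Return the deploy target for a branch, or throw for branches that do not deploy.
function targetFor(branch) {
  if (!Object.prototype.hasOwnProperty.call(DEPLOY_TARGETS, branch)) {
    throw new Error(`No deploy target for branch "${branch}"`);
  }
  return DEPLOY_TARGETS[branch];
}
```

Failing loudly on an unknown branch keeps a mistyped branch name from silently deploying to the wrong account.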

---

## Saga summary (what happened and why it's fixed)

### Workflow paths

Initially, the workflow paths assumed a parent repo containing a `galactus-one` directory. We made all paths repo-root-relative, because this repo _is_ galactus-one: there is no `galactus-one/` prefix in rsync or step paths, and no extra `working-directory` for the main build/deploy steps.

### SSH auth

"Permission denied (publickey)" led us to try `StrictHostKeyChecking=no`, an explicit key file, and similar client-side workarounds, none of which addressed the problem. The root cause was **server-side**: incorrect permissions on the home directory and `.ssh` (fixed with e.g. `chmod 700 ~/.ssh`, `chmod 600 ~/.ssh/authorized_keys`), plus a key format the server did not accept (fixed by switching to an **RSA PEM** key). After fixing server permissions and key format, we reverted to `webfactory/ssh-agent` and dropped the `StrictHostKeyChecking`/`UserKnownHostsFile` overrides. For the exact one-time steps (key format, permissions, and full server checklist), see **Server prerequisites (one-time)** in [deployment.md](deployment.md).

### Relay deploy

We rsync relay source + `packages/shared` + built `dist`; we **do not** rsync `node_modules` or `.env`. We create `~/galactus-relay/packages` before rsync so `packages/shared` exists. We run `npm install ./packages/shared` then `npm install` (no `--production`) so dotenv and all deps are present.

### Remote .env

`.env` is never committed to the repo. The workflow **injects** it on the server from GitHub Secrets (`RELAY_ENV_DEV` / `RELAY_ENV_PROD`) and appends `PORT=$RELAY_PORT`, so the relay gets the right port for that account along with the CORS and other settings held in the secret.
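
As an illustration of the "secret body plus appended port" idea, here is a minimal sketch. It is not the workflow's actual step (see `deploy.yml` for that); `buildRelayEnv` is a hypothetical helper and `CORS_ORIGIN` in the usage note is an assumed variable name.

```javascript
// Hypothetical helper: combine the secret's contents with the per-account port.
// secretBody stands in for the value of RELAY_ENV_DEV or RELAY_ENV_PROD.
function buildRelayEnv(secretBody, relayPort) {
  // Make sure the secret ends with a newline so PORT lands on its own line.
  const body = secretBody.endsWith('\n') ? secretBody : secretBody + '\n';
  return `${body}PORT=${relayPort}\n`;
}
```

For example, `buildRelayEnv('CORS_ORIGIN=https://dv.gbuff.com', 8888)` yields a two-line file whose last line is `PORT=8888`. The newline check matters because a secret pasted without a trailing newline would otherwise glue `PORT=` onto its last value.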

### Relay under PM2 – the main pain

**Symptom:** PM2 showed relay "online" but no logs, nothing listening on 8888/8787, health check "connection refused", dashboard 502.

**Causes and fixes:**

1. **Working directory:** PM2 uses the shell cwd at `pm2 start` time. If the process was ever started from another directory (e.g. home), that wrong cwd was saved, so `.env` and `node_modules` didn't resolve. **Fix:** Use an **ecosystem file** with `cwd: __dirname` so the app always runs with cwd = `~/galactus-relay`.

2. **Entry guard:** The ESM "run only when executed as entry script" check compared `process.argv[1]` with the module's own `__filename`. Under PM2, `argv[1]` can differ, so the guard evaluated to false, `main()` was never called, and the process exited (or appeared stuck) without ever binding the port. **Fix:** Also run `main()` when `process.env.NODE_APP_INSTANCE !== undefined` (PM2 sets this).

3. **.env loading:** Relying only on `import 'dotenv/config'` from inside the app tied loading to process cwd. **Fix:** In the ecosystem file, set `node_args: '-r dotenv/config'` and `env: { DOTENV_CONFIG_PATH: path.join(__dirname, '.env') }` so the env file is loaded by path, independent of cwd.
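
The entry guard from item 2 can be sketched as a small pure function. This is an assumption-laden illustration, not the code in `apps/relay/src/index.ts`: `shouldRunMain` and its parameters are hypothetical names chosen here.

```javascript
const path = require('node:path');

// Hypothetical guard: decide whether main() should run.
//   argv1     - process.argv[1], the script path Node was launched with
//   entryFile - absolute path of the entry module itself
//   env       - process.env (passed in so the function stays testable)
function shouldRunMain(argv1, entryFile, env) {
  // Direct run: `node dist/index.js` makes argv[1] resolve to the entry file.
  const isDirectRun =
    argv1 !== undefined && path.resolve(argv1) === path.resolve(entryFile);
  // PM2 run: argv[1] may not match, but PM2 sets NODE_APP_INSTANCE.
  const isUnderPm2 = env.NODE_APP_INSTANCE !== undefined;
  return isDirectRun || isUnderPm2;
}

// In an ESM entry file the call site might look like:
//   import { fileURLToPath } from 'node:url';
//   const entryFile = fileURLToPath(import.meta.url);
//   if (shouldRunMain(process.argv[1], entryFile, process.env)) main();
```

Keeping the decision in a function that takes its inputs as arguments makes both branches easy to unit test without actually launching PM2.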

We **always** start the relay with `pm2 start ecosystem.config.cjs` (never `pm2 start dist/index.js --name galactus-relay` alone).

### PM2 persistence

For the relay to survive reboots, run `pm2 startup` (and the printed command) once per account, then `pm2 save` after starting the relay (see [apps/relay/README.md](../apps/relay/README.md)).

---

## References

- [deployment.md](deployment.md) – build/deploy checklist and env vars.
- [.github/workflows/deploy.yml](../.github/workflows/deploy.yml) – actual CI steps.
- [apps/relay/ecosystem.config.cjs](../apps/relay/ecosystem.config.cjs) – required for PM2; do not strip `cwd`, `node_args`, or `env.DOTENV_CONFIG_PATH`.
- [apps/relay/src/index.ts](../apps/relay/src/index.ts) – entry guard; do not remove the `NODE_APP_INSTANCE` check.
