---
title: "Formatting AWS Security Groups for a VMware Migration"
date: "2025-02-05"
tags:
- terraform
- aws
- migration
categories:
- selfhosted
---

# The Problem

At work we're in the middle of a large lift-and-shift migration from VMware to AWS (for the same reason everyone is). Hundreds of servers across multiple departments, moved in waves.

The firewall rules for these servers come from everywhere: Palo Alto firewalls, host-based firewalls, department-specific switches, department-specific IT teams, random appliances that predate much of the current staff. Years of accumulated rules from multiple sources, and now they all need to become AWS security groups.

I needed to figure out how to format these rules in Terraform so that:

1. Coworkers completely new to IaC could read them
2. I could maintain them without losing my mind as rule counts climbed
3. PRs were reviewable

This is how the format evolved over three iterations.

# Iteration 1: Inline Rules

The most straightforward way to write a security group: everything in one block.

```hcl
resource "aws_security_group" "web_server" {
  name        = "web-server"
  description = "SG for web-server"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from campus"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/24"]
  }

  ingress {
    description = "SSH from admin subnet"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.100.0.0/24"]
  }

  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

This works fine for a server with 3-4 rules, and it's the first example you usually come across if you search for "ec2 firewalls". It's easy to read and easy to explain to someone who's never seen Terraform before.

The problem is that any change to any inline rule forces Terraform to evaluate the entire security group. Add a CIDR to one ingress block and the plan output gets noisy. It also doesn't play well with `for_each` if you want to loop over CIDRs for a single port.

# Iteration 2: Separate Rule Resources

Breaking the rules out into their own resources using `aws_vpc_security_group_ingress_rule` and `aws_vpc_security_group_egress_rule`:

```hcl
resource "aws_security_group" "web_server" {
  description = "SG for web-server"
  vpc_id      = var.vpc_id

  tags = {
    Name   = "web-server"
    Source = "Palo Alto Firewall"
  }
}

# Egress
resource "aws_vpc_security_group_egress_rule" "web_server_allow_all_outbound" {
  security_group_id = aws_security_group.web_server.id
  ip_protocol       = "-1"
  cidr_ipv4         = "0.0.0.0/0"

  tags = {
    Name = "allow-all-outbound"
  }
}

# HTTPS from campus
resource "aws_vpc_security_group_ingress_rule" "web_server_https_443" {
  for_each = var.https_443_cidrs

  security_group_id = aws_security_group.web_server.id
  cidr_ipv4         = each.key
  description       = each.value
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443

  tags = {
    Name = "HTTPS-443-${replace(each.key, "/", "-")}"
    Rule = "tcp-443"
  }
}

# SSH from admin subnet
resource "aws_vpc_security_group_ingress_rule" "web_server_ssh_22" {
  for_each = var.ssh_22_cidrs

  security_group_id = aws_security_group.web_server.id
  cidr_ipv4         = each.key
  description       = each.value
  ip_protocol       = "tcp"
  from_port         = 22
  to_port           = 22

  tags = {
    Name = "SSH-22-${replace(each.key, "/", "-")}"
    Rule = "tcp-22"
  }
}
```

With variables like:

```hcl
variable "https_443_cidrs" {
  type = map(string)
  default = {
    "10.0.0.0/24"   = "Campus network"
    "10.100.0.0/24" = "Admin subnet"
  }
}

variable "ssh_22_cidrs" {
  type = map(string)
  default = {
    "10.100.0.0/24" = "Admin subnet"
  }
}
```

This is better. Each rule is its own resource, so Terraform plans are cleaner. Adding a CIDR to a port only shows that one rule changing. The `for_each` over a map of CIDR-to-description means you can see at a glance what each IP range is for.

I used this format for the second wave, and it worked. But by the next few waves we were moving more servers per wave, and each server had its own set of variables. The variable files were getting long and hard to cross-reference with the rules.

Everything also moved into a `$WORKSPACE/modules/security-groups/` directory to keep things organized: one file per server's rules, one file per server's variables.
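
For a sense of what that looks like on disk, the module layout ended up roughly like this (file names here are illustrative, not the actual ones):

```text
modules/security-groups/
├── web-server.tf        # SG + rule resources for web-server
├── web-server-vars.tf   # CIDR maps for web-server
├── db-server.tf
├── db-server-vars.tf
└── ...
```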
# Iteration 3: Locals with Structured Data
|
|
|
|
By the time we were moving double digit servers per wave, the variable-per-port approach was getting hard to maintain. Too many variable files, too much scrolling back and forth to understand what a server's rules actually looked like.
|
|
|
|
I switched to using `locals` with a structured list. All the rules for a server live in one block. Each entry defines the port, protocol, and every CIDR that needs access on that port.
|
|
|
|

```hcl
locals {
  web_server_ports = [
    # HTTPS
    {
      protocol = "tcp"
      from     = 443
      to       = 443
      name     = "https-443"
      cidrs = {
        "10.0.0.0/24"   = "Campus network"
        "10.100.0.0/24" = "Admin subnet"
      }
    },
    # SSH
    {
      protocol = "tcp"
      from     = 22
      to       = 22
      name     = "ssh-22"
      cidrs = {
        "10.100.0.0/24" = "Admin subnet"
      }
    },
    # RDP
    {
      protocol = "tcp"
      from     = 3389
      to       = 3389
      name     = "rdp-3389"
      cidrs = {
        "10.100.0.0/24" = "Admin subnet"
      }
    },
    # HTTP
    {
      protocol = "tcp"
      from     = 80
      to       = 80
      name     = "http-80"
      cidrs = {
        "10.0.0.0/24" = "Campus network"
      }
    },
  ]

  # Flatten into individual rules
  web_server_rules = flatten([
    for port_config in local.web_server_ports : [
      for cidr, description in port_config.cidrs : {
        key         = "${port_config.name}-${replace(cidr, "/", "-")}"
        protocol    = port_config.protocol
        from_port   = port_config.from
        to_port     = port_config.to
        cidr        = cidr
        description = description
        rule_name   = port_config.name
      }
    ]
  ])

  # How many rules total
  web_server_total_rule_count = length(local.web_server_rules)

  # How many SGs needed (AWS has a rules-per-SG limit)
  web_server_sg_count = max(1, ceil(local.web_server_total_rule_count / var.max_rules_per_sg))

  # Chunk rules across SGs
  web_server_rules_chunked = {
    for sg_index in range(local.web_server_sg_count) : sg_index => [
      for rule_index in range(
        sg_index * var.max_rules_per_sg,
        min((sg_index + 1) * var.max_rules_per_sg, local.web_server_total_rule_count)
      ) : local.web_server_rules[rule_index]
    ]
  }
}
```

The security group itself handles overflow automatically. If a server has more rules than AWS allows per SG, additional SGs are created and the rules are distributed across them. Neither I nor anyone on my team had to count rules to make sure they were split evenly across security groups; it all gets generated dynamically.

```hcl
# Primary SG
resource "aws_security_group" "web_server" {
  name        = "web-server"
  description = "SG for web-server"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name = "web-server"
  }
}

# Overflow SGs (created only if needed)
resource "aws_security_group" "web_server_overflow" {
  for_each = { for idx in range(1, local.web_server_sg_count) : idx => idx }

  name        = "web-server-overflow-${each.value}"
  description = "SG for web-server (Overflow ${each.value})"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name = "web-server-overflow-${each.value}"
  }
}

# Egress (primary SG only)
resource "aws_vpc_security_group_egress_rule" "web_server_allow_all_outbound" {
  security_group_id = aws_security_group.web_server.id
  ip_protocol       = "-1"
  cidr_ipv4         = "0.0.0.0/0"

  tags = {
    Name = "allow-all-outbound"
  }
}

# Ingress for primary SG
resource "aws_vpc_security_group_ingress_rule" "web_server_ingress" {
  for_each = {
    for rule in local.web_server_rules_chunked[0] :
    rule.key => rule
  }

  security_group_id = aws_security_group.web_server.id
  cidr_ipv4         = each.value.cidr
  description       = each.value.description
  ip_protocol       = each.value.protocol
  from_port         = each.value.protocol == "-1" ? null : each.value.from_port
  to_port           = each.value.protocol == "-1" ? null : each.value.to_port

  tags = {
    Name = each.value.key
    Rule = each.value.rule_name
  }
}

# Ingress for overflow SGs
resource "aws_vpc_security_group_ingress_rule" "web_server_overflow_ingress" {
  for_each = merge([
    for sg_index, sg in aws_security_group.web_server_overflow : {
      for rule in local.web_server_rules_chunked[sg_index] :
      "${sg_index}-${rule.key}" => {
        sg_id       = sg.id
        cidr        = rule.cidr
        description = rule.description
        protocol    = rule.protocol
        from_port   = rule.from_port
        to_port     = rule.to_port
        key         = rule.key
        rule_name   = rule.rule_name
      }
    }
  ]...)

  security_group_id = each.value.sg_id
  cidr_ipv4         = each.value.cidr
  description       = each.value.description
  ip_protocol       = each.value.protocol
  from_port         = each.value.protocol == "-1" ? null : each.value.from_port
  to_port           = each.value.protocol == "-1" ? null : each.value.to_port

  tags = {
    Name = each.value.key
    Rule = each.value.rule_name
  }
}
```
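
The flatten-and-chunk arithmetic is easier to trust after checking it outside Terraform. Here's a rough Python mirror of those expressions; the port list and the 60-rule limit are made up for illustration (60 happens to be AWS's default inbound-rules-per-SG quota):

```python
import math

# Hypothetical inputs mirroring the Terraform locals
max_rules_per_sg = 60

ports = [
    {"name": "https-443", "cidrs": {f"10.{i}.0.0/24": f"Subnet {i}" for i in range(40)}},
    {"name": "ssh-22",    "cidrs": {f"10.{i}.0.0/24": f"Subnet {i}" for i in range(35)}},
]

# flatten([...]) equivalent: one rule per (port, CIDR) pair
rules = [
    {"key": f"{p['name']}-{cidr.replace('/', '-')}", "cidr": cidr, "description": desc}
    for p in ports
    for cidr, desc in p["cidrs"].items()
]

total = len(rules)                                      # 75 rules
sg_count = max(1, math.ceil(total / max_rules_per_sg))  # ceil(75/60) = 2 SGs

# Chunk rules across SGs, same index math as the range()-based Terraform expression
chunked = {
    i: rules[i * max_rules_per_sg : min((i + 1) * max_rules_per_sg, total)]
    for i in range(sg_count)
}

print(total, sg_count, [len(v) for v in chunked.values()])  # 75 2 [60, 15]
```

So 75 rules land as 60 in the primary SG and 15 in one overflow SG, without anyone counting by hand.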

Adding a new server means copying the template, doing a find-and-replace on the server name, and filling in the `ports` list. The SG resource, egress, overflow, and ingress logic are all identical across servers. The only thing that changes is the data in `locals`.

The big win for PR reviews is that the `ports` local reads like a table. You can look at it and immediately see what ports are open and to whom, without having to mentally reconstruct it from scattered variable files.

# Standard Security Groups

While all of the above handles per-server rules, we noticed early on that a lot of rules were the same across every server: RDP from the admin subnet, SSH from the admin subnet, ICMP from campus, and so on. Every single server had these, and we were duplicating them everywhere.

So we created a separate shared module, `$ROOT_OF_MONOREPO/modules/standard-securitygroups`. It takes only a `vpc_id` as input and creates a set of reusable security groups that any server can reference.
It does stuff like create our 3 admin groups:
|
|
|
|
- **default_admin** — ICMP and monitoring/backup access. No remote access.
|
|
- **linux_admin** - SSH mostly
|
|
- **windows_admin** - All the lovely SCCM/WSUS/SMB cruft from admin networks.
|
|
|
|
The key difference from per-server groups is that it uses managed prefix lists to centralize the IP ranges. Instead of hardcoding CIDRs in every rule, the rules reference a prefix list.
|
|
|
|

```hcl
resource "aws_ec2_managed_prefix_list" "server_admin_access" {
  name           = "server-admin-access"
  address_family = "IPv4"
  max_entries    = 5

  entry {
    cidr        = "10.0.0.0/24"
    description = "Dept A Linux admin"
  }

  entry {
    cidr        = "10.100.0.0/24"
    description = "Dept B Linux admin"
  }
}
```

Then the rules reference the prefix list instead of individual CIDRs:

```hcl
resource "aws_vpc_security_group_ingress_rule" "linux_admin_ssh" {
  security_group_id = aws_security_group.linux_admin.id
  prefix_list_id    = aws_ec2_managed_prefix_list.server_admin_access.id
  ip_protocol       = "tcp"
  from_port         = 22
  to_port           = 22

  tags = {
    Name = "SSH-22-admin-access"
  }
}
```

When a new admin subnet needs access, you add one entry to the prefix list and every security group that references it picks the change up. No touching individual server rules.

A server ends up with its per-server SG for application-specific rules and one or more standard SGs for the common stuff:

```hcl
vpc_security_group_ids = [
  module.security_groups.web_server_sg_id,
  module.standard_securitygroups.windows_admin_security_group_id
]
```

This keeps the per-server rule files focused on what's actually unique to that server.
# What's Next
|
|
|
|
The standard module handles the baseline admin access that every server gets. The next step is creating standard service-level and department-service-level SGs.
|
|
|
|
A generic `db-sg` would cover common database ports that most database servers need. But a `math-db-sg` would layer on department-specific rules for the math department's network ranges, their specific application servers, and their particular inter-database communication patterns. Same idea for web servers, app servers, etc.
|
|
|
|
The goal is to get to a point where standing up a new server means picking from a menu of standard SGs rather than writing rules from scratch every time.
|
|
|
|
# What I'd Do Differently
|
|
|
|
Not much honestly. The progression made sense given the constraints. We didn't know how many servers we'd be moving per wave at the start and the format evolved as the workload scaled. The template approach with find-and-replace is simple enough that even the folks brand new to Terraform are following along.
|