Security: add flags for TCP connection limits and timeouts #7518

Open
SungJin1212 wants to merge 4 commits into cortexproject:master from SungJin1212:feat/memberlist-tcp-security-limits

Conversation

@SungJin1212
Member

@SungJin1212 SungJin1212 commented May 14, 2026

This PR adds TCP connection limit flags to the memberlist transport to address security issues:

  • `-memberlist.packet-read-timeout`: Read deadline applied to every inbound packet connection. Connections that do not complete within this window are closed.
  • `-memberlist.max-packet-size`: Maximum size of a single inbound gossip packet, enforced via io.LimitReader before io.ReadAll to prevent heap exhaustion from oversized payloads. Applies to packet-type messages only.
  • `-memberlist.max-concurrent-connections`: Maximum number of concurrent inbound TCP connections. Connections exceeding this limit are rejected immediately.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags

Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
@SungJin1212 force-pushed the feat/memberlist-tcp-security-limits branch from 6c101cc to ac76ac2 on May 14, 2026 at 12:23
Member

@friedrichg friedrichg left a comment


I think the fix is not working as expected, though. The following test reproduces the behavior I am seeing:

// TestTCPTransport_StreamHoldsSlotUntilClose asserts that
// -memberlist.max-concurrent-connections bounds the number of *live* inbound
// TCP connections: once a stream conn has been handed off to memberlist via
// StreamCh(), its slot stays held until the conn is actually closed.
//
// NOTE: this test is expected to FAIL against the current implementation,
// which releases the slot as soon as handleConnection returns (i.e. on
// handoff, not on close). It documents the intended semantics of the flag.
func TestTCPTransport_StreamHoldsSlotUntilClose(t *testing.T) {
	logger := log.NewNopLogger()

	const maxConns = 2

	cfg := TCPTransportConfig{}
	flagext.DefaultValues(&cfg)
	cfg.BindAddrs = []string{"127.0.0.1"}
	cfg.BindPort = 0
	cfg.PacketReadTimeout = 5 * time.Second
	cfg.MaxConcurrentConnections = maxConns

	transport, err := NewTCPTransport(cfg, logger)
	require.NoError(t, err)
	defer transport.Shutdown() //nolint:errcheck

	port := transport.GetAutoBindPort()

	// Consumer goroutine: drains StreamCh and holds conns alive (never closes
	// them) — simulating memberlist actively using streams.
	var heldMu sync.Mutex
	var held []net.Conn
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-done:
				return
			case c := <-transport.StreamCh():
				heldMu.Lock()
				held = append(held, c)
				heldMu.Unlock()
			}
		}
	}()
	defer func() {
		close(done)
		heldMu.Lock()
		for _, c := range held {
			c.Close() //nolint:errcheck
		}
		heldMu.Unlock()
	}()

	openStreamConn := func() net.Conn {
		c, err := net.Dial("tcp", fmt.Sprintf("127.0.0.1:%d", port))
		require.NoError(t, err)
		_, err = c.Write([]byte{byte(stream)})
		require.NoError(t, err)
		return c
	}

	// Fill the semaphore with maxConns live stream handoffs.
	clients := make([]net.Conn, 0, maxConns+1)
	defer func() {
		for _, c := range clients {
			c.Close() //nolint:errcheck
		}
	}()
	for i := 0; i < maxConns; i++ {
		clients = append(clients, openStreamConn())
	}

	// Wait until memberlist side has observed all maxConns streams.
	require.Eventually(t, func() bool {
		return testutil.ToFloat64(transport.incomingStreams) == float64(maxConns)
	}, 2*time.Second, 10*time.Millisecond)

	// One extra stream conn. If the slot is correctly held for the conn's
	// real lifetime, the transport must reject this one because all slots
	// are still occupied by the held streams above.
	clients = append(clients, openStreamConn())

	require.Eventually(t, func() bool {
		return testutil.ToFloat64(transport.rejectedConnections) >= 1
	}, 2*time.Second, 10*time.Millisecond,
		"expected extra stream conn to be rejected while %d prior streams are held open, "+
			"but the transport released the slot on handoff — flag does not bound live connections",
		maxConns)
}

@SungJin1212
Member Author

@friedrichg
Thanks for the review. I've fixed these in the latest commit.

Member

@friedrichg friedrichg left a comment


Just one more nit. Pre-approved!

@dosubot added the lgtm label on May 15, 2026
Comment on lines +85 to +87
f.DurationVar(&cfg.PacketReadTimeout, prefix+"memberlist.packet-read-timeout", 5*time.Second, "Timeout for reading packet data from inbound connections. 0 = no limit.")
f.Int64Var(&cfg.MaxPacketSize, prefix+"memberlist.max-packet-size", 1*1024*1024 /*1MB*/, "Maximum size in bytes of an inbound gossip packet. 0 = no limit.")
f.IntVar(&cfg.MaxConcurrentConnections, prefix+"memberlist.max-concurrent-connections", 100, "Maximum number of concurrent inbound TCP connections. 0 = no limit.")
Contributor


Should we add metrics for those? I feel it will be hard to know the correct values to set without them.


Labels

component/memberlist, lgtm (This PR has been approved by a maintainer), size/L, type/security
