Bug #9094

Varnish 6.0 package on Alpine 3.8 needs more patches

Added by Han Hui Teoh 9 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: 3.8.1
Start date: 07/16/2018
Due date:
% Done: 100%
Estimated time:
Affected versions:
Security IDs:

Description

Docker Compose setup for a reproducible crash here: https://github.com/teohhanhui/api-platform/tree/alpine-varnish-crash

Example log from within the Docker container:

Debug: Platform: Linux,4.17.5-1-MANJARO,x86_64,-junix,-smalloc,-sdefault,-hcritbit
Debug: Child (19) Started
Error: Child (19) not responding to CLI, killed it.
Error: Child (19) Pushing vcls failed:
CLI communication error (hdr)
Debug: Stopping Child
CLI result = 400
Error: Child (19) died signal=11 (core dumped)
Info: Child (19) said Child starts
Debug: Child cleanup complete

Associated revisions

Revision c6d52d06 (diff)
Added by Natanael Copa 9 months ago

main/varnish: fix stack overflow in waiter epoll

ref #9094

Revision a09f7ff0 (diff)
Added by Natanael Copa 9 months ago

main/varnish: fix stack overflow in waiter epoll

fixes #9094

(cherry picked from commit c6d52d062a5bbe29c59b6b49a4e823a423212581)

History

#1 Updated by Han Hui Teoh 9 months ago

To use the Docker Compose setup, just run:

docker-compose pull
docker-compose build cache-proxy
docker-compose up -d

#2 Updated by Natanael Copa 9 months ago

GitHub appears to be down. Can you please paste or attach the Dockerfile here?

#3 Updated by Natanael Copa 9 months ago

  • Target version set to 3.8.1

#4 Updated by Han Hui Teoh 9 months ago

Dockerfile

FROM alpine:3.8

RUN apk add --no-cache varnish

COPY default.vcl /etc/varnish/default.vcl

WORKDIR /var/lib/varnish
RUN chown -R varnish:varnish /var/lib/varnish
VOLUME /var/lib/varnish

COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint
ENTRYPOINT ["docker-entrypoint"]

EXPOSE 80
CMD ["varnishd"]

docker-entrypoint.sh

#!/bin/sh
set -e

# first arg is `-f` or `--some-option`
if [ "${1#-}" != "$1" ]; then
    set -- varnishd "$@" 
fi

# allow the container to be started with `--user`
if [ "$1" = 'varnishd' ] && [ "$(id -u)" = '0' ]; then
    chown -R varnish:varnish .
fi

# default args
if [ "$1" = 'varnishd' ] && [ $# -eq 1 ]; then
    set -- varnishd \
        -F \
        -a :"${VARNISH_LISTEN:-80}" \
        -f "${VARNISH_VCL:-/etc/varnish/default.vcl}" \
        -s malloc,"${VARNISH_MEMORY:-100M}" \
        ${VARNISH_DAEMON_OPTS:-}
fi

exec "$@" 

default.vcl

vcl 4.0;

import std;

backend default {
  .host = "api";
  .port = "80";
  # Health check
  #.probe = {
  #  .url = "/";
  #  .timeout = 5s;
  #  .interval = 10s;
  #  .window = 5;
  #  .threshold = 3;
  #}
}

# Hosts allowed to send BAN requests
acl invalidators {
  "localhost";
  "php";
}

sub vcl_backend_response {
  # Ban lurker friendly header
  set beresp.http.url = bereq.url;

  # Add a grace in case the backend is down
  set beresp.grace = 1h;
}

sub vcl_deliver {
  # Don't send cache tags related headers to the client
  unset resp.http.url;
  # Uncomment the following line to NOT send the "Cache-Tags" header to the client (prevent using CloudFlare cache tags)
  #unset resp.http.Cache-Tags;
}

sub vcl_recv {
  # Remove the "Forwarded" HTTP header if exists (security)
  unset req.http.forwarded;

  # To allow API Platform to ban by cache tags
  if (req.method == "BAN") {
    if (client.ip !~ invalidators) {
      return(synth(405, "Not allowed"));
    }

    if (req.http.ApiPlatform-Ban-Regex) {
      ban("obj.http.Cache-Tags ~ " + req.http.ApiPlatform-Ban-Regex);

      return(synth(200, "Ban added"));
    }

    return(synth(400, "ApiPlatform-Ban-Regex HTTP header must be set."));
  }
}

sub vcl_hit {
  if (obj.ttl >= 0s) {
    # A pure unadulterated hit, deliver it
    return (deliver);
  }
  if (std.healthy(req.backend_hint)) {
    # The backend is healthy
    # Fetch the object from the backend
    return (miss);
  }
  # No fresh object and the backend is not healthy
  if (obj.ttl + obj.grace > 0s) {
    # Deliver graced object
    # Automatically triggers a background fetch
    return (deliver);
  }
  # No valid object to deliver
  # No healthy backend to handle request
  # Return error
  return (synth(503, "API is down"));
}

#5 Updated by Han Hui Teoh 9 months ago

You will of course have to change the hostnames in the .vcl file for it to actually work. It worked just fine with Varnish 4.1 on Alpine 3.6 and with Varnish 5.2 on Alpine 3.7.

#6 Updated by Natanael Copa 9 months ago

I tried to reproduce this, and at some point I managed to make it run when running it as root.

When running it with su-exec as varnish:varnish, I did manage to make it crash and generate a core dump. The backtrace of the dump:

Core was generated by `/usr/sbin/varnishd -F -a :8080 -f /etc/varnish/defaul.vcl -s malloc,100M'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005565c02b7dc4 in vwe_thread (priv=0x5565c0a78fc0)
    at waiter/cache_waiter_epoll.c:83
83      waiter/cache_waiter_epoll.c: No such file or directory.
[Current thread is 1 (LWP 894)]
(gdb) bt
#0  0x00005565c02b7dc4 in vwe_thread (priv=0x5565c0a78fc0)
    at waiter/cache_waiter_epoll.c:83
#1  0x00007f96e41546c2 in start (p=0x7f96de665ae8)
    at src/thread/pthread_create.c:150
#2  0x00007f96e41565b1 in __clone () at src/thread/x86_64/clone.s:21
Backtrace stopped: frame did not save the PC

#7 Updated by Han Hui Teoh 9 months ago

But I always run it as root, as you can see from my setup. It might run fine once in a while, but usually it crashes.

#8 Updated by Natanael Copa 9 months ago

I know what the problem is. It is a regression in Varnish.

The problem is that they allocate a buffer of sizeof(struct epoll_event) * 8192 on the stack. Since struct epoll_event is larger than 10 bytes (on x86_64 it is 12 bytes, making the buffer 96 KiB), the buffer exceeds the magic 80 KiB limit (the musl default pthread stack size) and the application crashes.

https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/waiter/cache_waiter_epoll.c#L72

This was introduced with https://github.com/varnishcache/varnish-cache/commit/433e86f030648db20a7a8f7d43f20ce9be7581e6 where NEEV was raised from 100 to 8192.
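
For illustration, here is a minimal standalone C sketch (not Varnish code; only the NEEV value mirrors the commit above) of how an epoll_event array of that size blows past a small default thread stack. On musl, whose default pthread stack is around 80 KiB, touching the array faults; on glibc, with its ~8 MiB default, it runs fine:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

#define NEEV 8192   /* same value as in cache_waiter_epoll.c after the commit above */

static void *thread_fn(void *arg)
{
    /* On x86_64, sizeof(struct epoll_event) is 12 bytes, so this array
     * is 12 * 8192 = 96 KiB -- larger than musl's ~80 KiB default stack. */
    struct epoll_event ev[NEEV];

    (void)arg;
    memset(ev, 0, sizeof ev);   /* touching the whole array faults on musl */
    printf("event array: %zu bytes on the stack\n", sizeof ev);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    return 0;
}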

We can fix this in three different ways:
- reduce the NEEV number
- malloc the array instead of allocating it on the stack
- explicitly set the stack size in pthread_create

I think we should simply allocate the buffer on the heap, using malloc, instead of placing it on the stack.
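
For reference, a minimal sketch of that approach (this is not the exact Alpine patch; the function name vwe_thread_sketch and the elided epoll loop are placeholders):

#include <stdlib.h>
#include <sys/epoll.h>

#define NEEV 8192

void *vwe_thread_sketch(void *priv)
{
    struct epoll_event *ev;

    (void)priv;
    /* was: struct epoll_event ev[NEEV];  -- ~96 KiB on the thread stack */
    ev = calloc(NEEV, sizeof *ev);   /* heap allocation, independent of the stack size */
    if (ev == NULL)
        return NULL;

    /* ... run the epoll_wait(epfd, ev, NEEV, -1) loop here, as before ... */

    free(ev);
    return NULL;
}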

#9 Updated by Natanael Copa 9 months ago

  • % Done changed from 0 to 70

I have pushed a fix to Alpine edge. Can you please help me test it?

FROM alpine:edge

RUN apk upgrade -U -a && apk add varnish
...

I will backport the fix to alpine:3.8 once you confirm that it actually fixes the issue.

#10 Updated by Han Hui Teoh 9 months ago

I can confirm that it fixes the problem. :)

#11 Updated by Natanael Copa 9 months ago

  • Status changed from New to Resolved
  • % Done changed from 70 to 100

#12 Updated by Natanael Copa 8 months ago

  • Status changed from Resolved to Closed
