KubeSense

Traces & Metrics

KubeSense OpenTelemetry Collector for EC2

This guide covers collecting traces and metrics from ECS workloads running on EC2 container instances, using the OpenTelemetry Collector deployed as a daemon service.

Installation

Step 1: Store Collector Configuration in Parameter Store

Store this configuration in AWS Parameter Store at /ecs/kubesense/otelcol-daemon.yaml:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  # App telemetry (OTLP)
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # ECS task & container metrics
  awsecscontainermetrics:
    collection_interval: 30s

  # Host (EC2) metrics
  hostmetrics:
    collection_interval: 30s
    root_path: /rootfs
    scrapers:
      cpu:
      load:
      memory:
      network:
      disk:
      filesystem:
        metrics:
          system.filesystem.usage:
            enabled: true
        exclude_mount_points:
          match_type: regexp
          mount_points:
            - /rootfs/boot/.*
            - /rootfs/proc/.*
            - /rootfs/sys/.*

  # Docker container metrics
  docker_stats:
    endpoint: unix:///var/run/docker.sock
    collection_interval: 30s
    timeout: 20s

processors:
  # Safety
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256

  # ECS + EC2 metadata enrichment
  resourcedetection:
    detectors: [env, ecs, ec2]
    override: false

  # Docker metrics tag
  attributes/docker:
    actions:
      - key: kubesense.metric_source
        value: docker_stats
        action: insert

  # Custom labels
  resource:
    attributes:
      - key: kubesense.cluster
        value: <YOUR_CLUSTER_NAME>
        action: insert
      - key: kubesense.env_type
        value: <YOUR_ENV_TYPE>
        action: insert

  batch:
    timeout: 10s
    send_batch_size: 1000

exporters:
  # Traces
  otlphttp/kubesense-traces:
    endpoint: http://<KUBESENSE_ENDPOINT>:33443
    timeout: 30s
    tls:
      insecure: true

  # Metrics
  prometheusremotewrite:
    endpoint: http://<KUBESENSE_ENDPOINT>:30060/api/v1/write
    timeout: 30s
    resource_to_telemetry_conversion:
      enabled: true
    send_metadata: true

service:
  extensions: [health_check]

  pipelines:
    # Traces
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - resource
        - batch
      exporters: [otlphttp/kubesense-traces]

    # Metrics (ECS + Host + Docker)
    metrics:
      receivers:
        - otlp
        - hostmetrics
        - awsecscontainermetrics
        - docker_stats
      processors:
        - memory_limiter
        - resourcedetection
        - attributes/docker
        - resource
        - batch
      exporters: [prometheusremotewrite]

Placeholder Values

Replace the following placeholders in the configuration:

  • <KUBESENSE_ENDPOINT> - KubeSense ingestion endpoint hostname (provided by KubeSense platform)
  • <YOUR_CLUSTER_NAME> - Your ECS cluster identifier (provided by KubeSense platform)
  • <YOUR_ENV_TYPE> - Environment designation like production or staging (provided by KubeSense platform)
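The upload in Step 1 can be scripted with the AWS CLI. A minimal sketch, assuming the YAML above is saved locally as otelcol-daemon.yaml and the CLI is configured for the target account and region:

```shell
# Upload the collector config to SSM Parameter Store.
# Assumes the YAML above is saved locally as otelcol-daemon.yaml.
aws ssm put-parameter \
  --name "/ecs/kubesense/otelcol-daemon.yaml" \
  --type "String" \
  --value "file://otelcol-daemon.yaml" \
  --overwrite

# Spot-check that the stored value round-trips correctly:
aws ssm get-parameter \
  --name "/ecs/kubesense/otelcol-daemon.yaml" \
  --query "Parameter.Value" \
  --output text | head -n 5
```

Note that standard-tier parameters are limited to 4 KB; if the collector config grows beyond that, create the parameter with --tier Advanced.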

Configuration

Step 2: Create Daemon Service Task Definition

Create an ECS task definition for the collector daemon service that runs on every EC2 instance:

{
  "family": "ecs-otel-daemon-service",
  "networkMode": "host",
  "requiresCompatibilities": ["EC2"],
  "cpu": "1024",
  "memory": "2048",
  "pidMode": "host",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/otelCollectorTaskRole",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "otel-collector",
      "image": "otel/opentelemetry-collector-contrib:0.142.0",
      "cpu": 1024,
      "memory": 2048,
      "essential": true,
      "user": "0",
      "command": [
        "--config=env:OTEL_CONFIG"
      ],
      "environment": [
        {
          "name": "ECS_ENABLE_CONTAINER_METADATA",
          "value": "true"
        }
      ],
      "secrets": [
        {
          "name": "OTEL_CONFIG",
          "valueFrom": "arn:aws:ssm:<AWS_REGION>:<ACCOUNT_ID>:parameter/ecs/kubesense/otelcol-daemon.yaml"
        }
      ],
      "mountPoints": [
        {
          "sourceVolume": "proc",
          "containerPath": "/rootfs/proc",
          "readOnly": true
        },
        {
          "sourceVolume": "dev",
          "containerPath": "/rootfs/dev",
          "readOnly": true
        },
        {
          "sourceVolume": "al1_cgroup",
          "containerPath": "/rootfs/cgroup",
          "readOnly": true
        },
        {
          "sourceVolume": "al2_cgroup",
          "containerPath": "/rootfs/sys/fs/cgroup",
          "readOnly": true
        },
        {
          "sourceVolume": "boot",
          "containerPath": "/rootfs/boot/efi",
          "readOnly": true
        },
        {
          "sourceVolume": "docker-sock",
          "containerPath": "/var/run/docker.sock",
          "readOnly": true
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/otel-collector",
          "awslogs-create-group": "true",
          "awslogs-region": "<AWS_REGION>",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "portMappings": [
        {
          "containerPort": 4317,
          "hostPort": 4317,
          "protocol": "tcp"
        },
        {
          "containerPort": 4318,
          "hostPort": 4318,
          "protocol": "tcp"
        }
      ],
      "systemControls": [],
      "ulimits": [
        {
          "name": "nofile",
          "softLimit": 65535,
          "hardLimit": 65535
        }
      ],
      "volumesFrom": []
    }
  ],
  "volumes": [
    {
      "name": "proc",
      "host": {
        "sourcePath": "/proc"
      }
    },
    {
      "name": "dev",
      "host": {
        "sourcePath": "/dev"
      }
    },
    {
      "name": "al1_cgroup",
      "host": {
        "sourcePath": "/cgroup"
      }
    },
    {
      "name": "al2_cgroup",
      "host": {
        "sourcePath": "/sys/fs/cgroup"
      }
    },
    {
      "name": "boot",
      "host": {
        "sourcePath": "/boot/efi"
      }
    },
    {
      "name": "docker-sock",
      "host": {
        "sourcePath": "/var/run/docker.sock"
      }
    }
  ]
}

Placeholder Values:

  • <AWS_REGION> - Your AWS region
  • <ACCOUNT_ID> - Your AWS account ID

Key Configuration Details:

  • networkMode: host - Access EC2 host metrics and Docker socket
  • pidMode: host - Host process namespace for metrics collection
  • cpu / memory - The task definition reserves 1024 CPU units and 2048 MiB for the collector; tune these to your instance capacity and telemetry volume
  • user: 0 - Run as root to access host metrics and Docker socket
  • ECS_ENABLE_CONTAINER_METADATA: true - Requests ECS container metadata; note this is primarily an agent-level setting, normally configured in /etc/ecs/ecs.config on the container instance
  • ulimits - Increase file descriptor limits (nofile: 65535) for high-volume metric collection
  • systemControls: [] - Empty for standard configuration
  • volumesFrom: [] - No volumes inherited from other containers
  • Mount Points:
    • /rootfs/proc - Process information
    • /rootfs/dev - Device information
    • /rootfs/cgroup and /rootfs/sys/fs/cgroup - Cgroup metrics (AL1 and AL2 compatibility)
    • /rootfs/boot/efi - EFI boot information
    • /var/run/docker.sock - Docker daemon socket (read-only) for container metrics
  • Port mappings with hostPort specified (important for daemon mode)
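With the placeholders filled in, the task definition can be registered from the CLI. A sketch assuming the JSON above is saved locally as otel-daemon-taskdef.json:

```shell
# Register the daemon task definition from the JSON above.
aws ecs register-task-definition \
  --cli-input-json file://otel-daemon-taskdef.json

# Confirm the new revision is active:
aws ecs describe-task-definition \
  --task-definition ecs-otel-daemon-service \
  --query "taskDefinition.status"
```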

Deployment

Step 3: Create ECS Service as Daemon

Create an ECS service with the daemon scheduling strategy to run the collector on every EC2 instance in your cluster.

The daemon scheduling strategy automatically deploys one collector instance per EC2 node. When new nodes join the cluster, the collector automatically starts on them.
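This can be done from the AWS CLI; a sketch assuming the cluster and task definition from the previous steps:

```shell
# Create the collector service with the DAEMON scheduling strategy,
# so one task runs on every container instance in the cluster.
aws ecs create-service \
  --cluster <YOUR_CLUSTER_NAME> \
  --service-name ecs-otel-daemon-service \
  --task-definition ecs-otel-daemon-service \
  --scheduling-strategy DAEMON

# Daemon services take no --desired-count; ECS sizes the service
# to the number of active container instances automatically.
```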

Step 4: Instrument Your Application

Add OpenTelemetry instrumentation to your application for your programming language, then rebuild your application container image with the instrumentation included.

Step 5: Configure Collector Endpoint

Update your application container's entry point to discover the EC2 instance IP address at runtime:

{
  "name": "your-application",
  "image": "your-image:latest",
  "essential": true,
  "entryPoint": [
    "sh",
    "-c",
    "export OTEL_EXPORTER_OTLP_ENDPOINT=\"http://$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4):4318\"; <YOUR_APPLICATION_START_COMMAND>"
  ],
  "environment": [
    {
      "name": "OTEL_SERVICE_NAME",
      "value": "<SERVICE_NAME>"
    },
    {
      "name": "OTEL_EXPORTER_OTLP_PROTOCOL",
      "value": "http/protobuf"
    },
    {
      "name": "OTEL_RESOURCE_ATTRIBUTES",
      "value": "kubesense.env_type=<YOUR_ENV_TYPE>,kubesense.cluster=<YOUR_CLUSTER_NAME>"
    }
  ],
  "portMappings": [
    {
      "containerPort": 3000,
      "protocol": "tcp"
    }
  ]
}

Placeholder Values:

  • <YOUR_APPLICATION_START_COMMAND> - Command to start your application (e.g., node app.js or python app.py)
  • <SERVICE_NAME> - Your service/application name (e.g., frontend-nodejs, api-service)
  • <YOUR_ENV_TYPE> - Environment designation (e.g., production, staging, legacy)
  • <YOUR_CLUSTER_NAME> - Your ECS cluster identifier (provided by KubeSense platform)

How It Works:

  1. curl http://169.254.169.254/latest/meta-data/local-ipv4 queries the EC2 metadata service to get the instance's private IP address
  2. The endpoint is constructed as http://<INSTANCE_IP>:4318
  3. Application connects to the collector running on the same EC2 host
  4. The sh -c entrypoint executes the command after setting the environment variable
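The entry point above uses an unauthenticated IMDSv1 call. If your instances enforce IMDSv2 (HttpTokens=required), the metadata request needs a session token first; a hedged variant:

```shell
# IMDSv2 variant of the endpoint discovery: fetch a session token,
# then use it to read the instance's private IP.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/local-ipv4")
export OTEL_EXPORTER_OTLP_ENDPOINT="http://${IP}:4318"
```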

Verify Setup

After deploying the daemon service, verify it's working correctly:

Task Status Check

  1. Navigate to ECS → Clusters → [Your Cluster] → Services → ecs-otel-daemon-service
  2. Check the following:
    • Task Status: Should be RUNNING (one task per EC2 instance)
    • Health Status: HEALTHY if a container health check is configured (e.g., against the health_check extension on port 13133); otherwise ECS reports no health status
    • Container Status: All containers should be RUNNING
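You can also probe the collector directly from an EC2 instance (for example via SSM Session Manager). Since the task uses host networking, the health_check extension from the collector config is reachable on the instance itself:

```shell
# Query the health_check extension (port 13133, host network mode).
# HTTP 200 indicates the collector is up.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:13133/
```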

CloudWatch Logs Verification

  1. Navigate to CloudWatch → Log Groups → /ecs/otel-collector
  2. Verify you see logs from each EC2 instance
  3. Look for messages indicating:
    • Collector started successfully
    • Components loaded (receivers, processors, exporters)
    • Traces received from applications
    • Metrics collected from host and containers
    • Data exported to KubeSense
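The same logs can be followed from the AWS CLI:

```shell
# Tail the collector log group across all instances.
# Assumes the log group name from the task definition above.
aws logs tail /ecs/otel-collector --follow --since 15m
```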

Metrics Collection Verification

Check that the collector is sending telemetry to KubeSense:

  1. Verify in KubeSense dashboard that you see traces from your applications
  2. Confirm metrics are flowing (host, container, ECS metrics)
  3. Check that metadata enrichment is working (cluster, env_type labels present)

Troubleshooting Installation

Common Issues

Task Not Starting:

  • Check ECS cluster has available capacity
  • Verify the container image can be pulled from the registry
  • Review CloudWatch logs for the failed tasks
  • Ensure EC2 instances are in ACTIVE state

Parameter Store Access Issues:

  • Ensure the IAM role has ssm:GetParameter permissions
  • Verify the parameter name matches exactly: /ecs/kubesense/otelcol-daemon.yaml
  • Check the parameter is in the same region as your ECS cluster
  • Confirm the IAM role is properly attached to the task

Docker Socket Access Issues:

  • Verify EC2 instances have Docker daemon running
  • Check mount point for Docker socket: /var/run/docker.sock
  • Ensure collector container runs as user: 0 (root)
  • Review CloudWatch logs for socket permission errors

Host Metrics Not Collecting:

  • Verify root filesystem is mounted at /rootfs
  • Check all required mount points exist on EC2 instances
  • Verify hostmetrics receiver is enabled in collector config
  • Review CloudWatch logs for metric collection errors

Host Metadata Enrichment Missing:

  • Ensure resourcedetection processor includes ec2 detector
  • Verify EC2 instances have proper IAM instance profile
  • Check IAM instance profile has EC2 describe permissions
  • Review CloudWatch logs for resource detection errors

Traces Not Appearing in KubeSense:

  • Verify applications are sending OTLP telemetry to the collector endpoint configured in Step 5 (http://<EC2_INSTANCE_IP>:4318)
  • Check the application container can reach the collector on the EC2 host's private IP
  • Verify collector exporter configuration has correct KubeSense endpoint
  • Review CloudWatch logs for export errors
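To check reachability independently of your application's SDK, you can POST an empty (but valid) OTLP/HTTP payload to the receiver from the application's EC2 host; a sketch:

```shell
# Send an empty OTLP/HTTP trace request to the local collector.
IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST "http://${IP}:4318/v1/traces" \
  -H "Content-Type: application/json" \
  -d '{}'
# 200 means the receiver accepted the request; connection refused or a
# timeout points at networking or the daemon task not running.
```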