
Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators


Spark jobs running on Amazon EMR on EKS generate logs that are very useful in identifying issues with Spark processes and also as a way to see Spark outputs. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. You also have the flexibility to push logs into an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs. In each method, these logs are linked to the specific job in question. The common practice of log management in DevOps culture is to centralize logging through the forwarding of logs to an enterprise log aggregation system like Splunk or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). This enables you to see all the relevant log data in a single place. You can identify key trends, anomalies, and correlated events, troubleshoot problems faster, and notify the appropriate people in a timely fashion.

EMR on EKS Spark logs are generated by Spark and can be accessed via the Kubernetes API and kubectl CLI. Therefore, although it's possible to install log forwarding agents in the Amazon Elastic Kubernetes Service (Amazon EKS) cluster to forward all Kubernetes logs, which include Spark logs, this can become quite expensive at scale because you get information about Kubernetes that may not be necessary for Spark users. In addition, from a security perspective, the EKS cluster logs and access to kubectl may not be available to the Spark user.

To solve this problem, this post proposes using pod templates to create a sidecar container alongside the Spark job pods. The sidecar containers are able to access the logs contained in the Spark pods and forward these logs to the log aggregator. This approach allows the logs to be managed separately from the EKS cluster and uses a small amount of resources because the sidecar container is only launched during the lifetime of the Spark job.

Implementing Fluent Bit as a sidecar container

Fluent Bit is a lightweight, highly scalable, and high-speed logging and metrics processor and log forwarder. It collects event data from any source, enriches that data, and sends it to any destination. Its lightweight and efficient design, coupled with its many features, makes it very attractive to those working in the cloud and in containerized environments. It has been deployed extensively and is trusted by many, even in large and complex environments. Fluent Bit has zero dependencies and requires only 650 KB of memory to operate, compared to Fluentd, which needs about 40 MB of memory. Therefore, it's an ideal option as a log forwarder for logs generated by Spark jobs.

When you submit a job to EMR on EKS, there are at least two Spark containers: the Spark driver and the Spark executor. The number of Spark executor pods depends on your job submission configuration. If you specify more than one for spark.executor.instances, you get the corresponding number of Spark executor pods. What we want to do here is run Fluent Bit as sidecar containers with the Spark driver and executor pods. Diagrammatically, it looks like the following figure. The Fluent Bit sidecar container reads the indicated logs in the Spark driver and executor pods, and forwards these logs to the target log aggregator directly.

Architecture of Fluent Bit sidecar

Pod templates in EMR on EKS

A Kubernetes pod is a group of one or more containers with shared storage, network resources, and a specification for how to run the containers. Pod templates are specifications for creating pods. They are part of the desired state of the workload resources used to run the application. Pod template files can define the driver or executor pod configurations that aren't supported in standard Spark configuration. That being said, Spark is opinionated about certain pod configurations, and some values in the pod template are always overwritten by Spark. Using a pod template only allows Spark to start with a template pod instead of an empty pod during the pod-building process. Pod templates are enabled in EMR on EKS when you configure the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile. Spark downloads these pod templates to construct the driver and executor pods, as illustrated in the small sketch that follows.
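For illustration only, the following fragment shows where these two properties typically go among a job's Spark submit parameters; the bucket and prefix are placeholders rather than values from this post's scripts:

# Hypothetical example: point Spark at driver and executor pod templates stored in S3.
TEMPLATE_PREFIX="s3://my-bucket/templates"   # placeholder bucket/prefix

SPARK_SUBMIT_PARAMETERS="--conf spark.kubernetes.driver.podTemplateFile=${TEMPLATE_PREFIX}/emr_driver_template.yaml \
--conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PREFIX}/emr_executor_template.yaml \
--conf spark.executor.instances=1"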

Forward logs generated by Spark jobs in EMR on EKS

A log aggregating system like Amazon OpenSearch Service or Splunk should always be available to accept the logs forwarded by the Fluent Bit sidecar containers. If you don't have one, this post provides scripts to help you launch a log aggregating system like Amazon OpenSearch Service or Splunk installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

We use several services to create and configure EMR on EKS. We use an AWS Cloud9 workspace to run all the scripts and to configure the EKS cluster. To prepare to run a job script that requires certain Python libraries absent from the generic EMR images, we use Amazon Elastic Container Registry (Amazon ECR) to store the customized EMR container image.

Create an AWS Cloud9 workspace

The first step is to launch and configure the AWS Cloud9 workspace by following the instructions in Create a Workspace in the EKS Workshop. After you create the workspace, we create AWS Identity and Access Management (IAM) resources. Create an IAM role for the workspace, attach the role to the workspace, and update the workspace IAM settings.

Prepare the AWS Cloud9 workspace

Clone the following GitHub repository and run the following script to prepare the AWS Cloud9 workspace to install and configure Amazon EKS and EMR on EKS. The shell script prepare_cloud9.sh installs all the necessary components for the AWS Cloud9 workspace to build and manage the EKS cluster. These include the kubectl command line tool, the eksctl CLI tool, and jq, and it updates the AWS Command Line Interface (AWS CLI).

$ sudo yum -y install git
$ cd ~ 
$ git clone https://github.com/aws-samples/aws-emr-eks-log-forwarding.git
$ cd aws-emr-eks-log-forwarding
$ cd emreks
$ bash prepare_cloud9.sh

All the necessary scripts and configuration to run this solution are found in the cloned GitHub repository.

Create a key pair

As part of this particular deployment, you need an EC2 key pair to create an EKS cluster. If you already have an existing EC2 key pair, you can use that key pair. Otherwise, you can create a key pair, for example with the CLI command shown next.
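A minimal sketch using the AWS CLI; the key pair name is a placeholder:

# Create a new EC2 key pair and save the private key locally (placeholder name).
aws ec2 create-key-pair \
  --key-name emreksdemo-keypair \
  --query 'KeyMaterial' \
  --output text > emreksdemo-keypair.pem
chmod 400 emreksdemo-keypair.pem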

Install Amazon EKS and EMR on EKS

After you configure the AWS Cloud9 workspace, in the same folder (emreks), run the following deployment script:

$ bash deploy_eks_cluster_bash.sh 
Deployment Script -- EMR on EKS
-----------------------------------------------

Please provide the following information before deployment:
1. Region (If your Cloud9 desktop is in the same region as your deployment, you can leave this blank)
2. Account ID (If your Cloud9 desktop is running in the same Account ID as where your deployment will be, you can leave this blank)
3. Name of the S3 bucket to be created for the EMR S3 storage location
Region: [xx-xxxx-x]: < Press enter for default or enter region > 
Account ID [xxxxxxxxxxxx]: < Press enter for default or enter account # > 
EC2 Public Key name: < Provide your key pair name here >
Default S3 bucket name for EMR on EKS (do not add s3://): < bucket name >
Bucket created: XXXXXXXXXXX ...
Deploying CloudFormation stack with the following parameters...
Region: xx-xxxx-x | Account ID: xxxxxxxxxxxx | S3 Bucket: XXXXXXXXXXX

...

EKS Cluster and Virtual EMR Cluster have been installed.

The last line indicates that the installation was successful.

Log aggregation options

There are a number of log aggregation and management tools on the market. This post covers two of the more popular ones in the industry: Splunk and Amazon OpenSearch Service.

Option 1: Install Splunk Enterprise

To manually install Splunk on an EC2 instance, complete the following steps:

  1. Launch an EC2 instance.
  2. Install Splunk.
  3. Configure the EC2 instance security group to permit access to ports 22, 8000, and 8088.

This post, however, provides an automated way to install Splunk on an EC2 instance:

  1. Download the RPM installation file and upload it to an accessible Amazon S3 location.
  2. Upload the following YAML script into AWS CloudFormation.
  3. Provide the necessary parameters, as shown in the screenshots below.
  4. Choose Next and complete the steps to create your stack.

Splunk CloudFormation screen - 1

Splunk CloudFormation screen - 2

Splunk CloudFormation screen - 3

Alternatively, run an AWS CLI script like the following:

aws cloudformation create-stack \
--stack-name "splunk" \
--template-body file://splunk_cf.yaml \
--parameters ParameterKey=KeyName,ParameterValue="< Name of EC2 Key Pair >" \
  ParameterKey=InstanceType,ParameterValue="t3.medium" \
  ParameterKey=LatestAmiId,ParameterValue="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2" \
  ParameterKey=VPCID,ParameterValue="vpc-XXXXXXXXXXX" \
  ParameterKey=PublicSubnet0,ParameterValue="subnet-XXXXXXXXX" \
  ParameterKey=SSHLocation,ParameterValue="< CIDR Range for SSH access >" \
  ParameterKey=VpcCidrRange,ParameterValue="172.20.0.0/16" \
  ParameterKey=RootVolumeSize,ParameterValue="100" \
  ParameterKey=S3BucketName,ParameterValue="< S3 Bucket Name >" \
  ParameterKey=S3Prefix,ParameterValue="splunk/splunk-8.2.5-77015bc7a462-linux-2.6-x86_64.rpm" \
  ParameterKey=S3DownloadLocation,ParameterValue="/tmp" \
--region < region > \
--capabilities CAPABILITY_IAM
  1. After you build the stack, navigate to the stack's Outputs tab on the AWS CloudFormation console and note the internal and external DNS for the Splunk instance.

You use these later to configure the Splunk instance and log forwarding.

Splunk CloudFormation output screen

  1. To configure Splunk, go to the Resources tab for the CloudFormation stack and locate the physical ID of EC2Instance.
  2. Choose that link to go to the specific EC2 instance.
  3. Select the instance and choose Connect.

Connect to Splunk Instance

  1. On the Session Manager tab, choose Connect.

Connect to Instance

You're redirected to the instance's shell.

  1. Install and configure Splunk as follows:
$ sudo /opt/splunk/bin/splunk start --accept-license
…
Please enter an administrator username: admin
Password must contain at least:
   * 8 total printable ASCII character(s).
Please enter a new password: 
Please confirm new password:
…
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available......... Done
The Splunk web interface is at http://ip-xx-xxx-xxx-x.us-east-2.compute.internal:8000
  1. Enter the Splunk website using the SplunkPublicDns value from the stack outputs (for example, http://ec2-xx-xxx-xxx-x.us-east-2.compute.amazonaws.com:8000). Note the port number of 8000.
  2. Log in with the user name and password you provided.

Splunk Login

Configure HTTP Event Collector

To configure Splunk to be able to receive logs from Fluent Bit, configure the HTTP Event Collector data input:

  1. Go to Settings and choose Data inputs.
  2. Choose HTTP Event Collector.
  3. Choose Global Settings.
  4. Select Enabled, keep port number 8088, then choose Save.
  5. Choose New Token.
  6. For Name, enter a name (for example, emreksdemo).
  7. Choose Next.
  8. For Available item(s) for Indexes, add at least the main index.
  9. Choose Review and then Submit.
  10. In the list of HTTP Event Collector tokens, copy the token value for emreksdemo.

You use it when configuring the Fluent Bit output.

splunk-http-collector-list
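Optionally, you can confirm that the token works before wiring it into Fluent Bit. The following is a minimal sketch using Splunk's standard HEC endpoint; the DNS name and token are placeholders:

# Send a test event to the HTTP Event Collector (placeholder DNS name and token).
curl -k "https://<SplunkInternalDns>:8088/services/collector/event" \
  -H "Authorization: Splunk <your-HEC-token>" \
  -d '{"event": "emreksdemo HEC connectivity test"}'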

Option 2: Set up Amazon OpenSearch Service

Your other log aggregation option is to use Amazon OpenSearch Service.

Provision an OpenSearch Service area

Provisioning an OpenSearch Service domain is very straightforward. In this post, we provide a simple script and configuration to provision a basic domain. To do it yourself, refer to Creating and managing Amazon OpenSearch Service domains.

Before you start, get the ARN of the IAM role that you use to run the Spark jobs. If you created the EKS cluster with the provided script, go to the CloudFormation stack emr-eks-iam-stack. On the Outputs tab, locate the IAMRoleArn output and copy this ARN. We also modify the IAM role later on, after we create the OpenSearch Service domain.

iam_role_emr_eks_job

If you're using the provided opensearch.sh installer, modify the file before you run it.

From the root folder of the GitHub repository, cd to opensearch and modify opensearch.sh (you can also use your preferred editor):

[../aws-emr-eks-log-forwarding] $ cd opensearch
[../aws-emr-eks-log-forwarding/opensearch] $ vi opensearch.sh

Configure opensearch.sh to fit your environment, for example:

# name of our Amazon OpenSearch cluster
export ES_DOMAIN_NAME="emreksdemo"

# Elasticsearch version
export ES_VERSION="OpenSearch_1.0"

# Instance Type
export INSTANCE_TYPE="t3.small.search"

# OpenSearch Dashboards admin user
export ES_DOMAIN_USER="emreks"

# OpenSearch Dashboards admin password
export ES_DOMAIN_PASSWORD='< ADD YOUR PASSWORD >'

# Region
export REGION='us-east-1'

Run the script:

[../aws-emr-eks-log-forwarding/opensearch] $ bash opensearch.sh
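If you prefer not to use the script, a domain with these settings can be provisioned roughly like the following AWS CLI sketch. Treat it as an assumption-laden illustration: the exact options that opensearch.sh passes (EBS size, encryption settings, and so on) may differ.

# Hedged sketch: create a small OpenSearch Service domain with fine-grained access control.
aws opensearch create-domain \
  --domain-name emreksdemo \
  --engine-version OpenSearch_1.0 \
  --cluster-config InstanceType=t3.small.search,InstanceCount=1 \
  --ebs-options EBSEnabled=true,VolumeType=gp2,VolumeSize=10 \
  --node-to-node-encryption-options Enabled=true \
  --encryption-at-rest-options Enabled=true \
  --domain-endpoint-options EnforceHTTPS=true \
  --advanced-security-options "Enabled=true,InternalUserDatabaseEnabled=true,MasterUserOptions={MasterUserName=emreks,MasterUserPassword=< ADD YOUR PASSWORD >}" \
  --region us-east-1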

Configure your OpenSearch Service domain

After you set up your OpenSearch Service domain and it's active, make the following configuration changes to allow logs to be ingested into Amazon OpenSearch Service:

  1. On the Amazon OpenSearch Service console, on the Domains page, choose your domain.

Opensearch Domain Console

  1. On the Security configuration tab, choose Edit.

Opensearch Security Configuration

  1. For Access Policy, select Only use fine-grained access control.
  2. Choose Save changes.

The access policy should look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:xx-xxxx-x:xxxxxxxxxxxx:domain/emreksdemo/*"
    }
  ]
}
  1. When the domain is active again, copy the domain ARN.

We use it to configure the Amazon EMR job IAM role we mentioned earlier.

  1. Choose the link for OpenSearch Dashboards URL to enter Amazon OpenSearch Service Dashboards.

Opensearch Main Console

  1. In Amazon OpenSearch Service Dashboards, use the user name and password that you configured earlier in the opensearch.sh file.
  2. Choose the options icon and choose Security under OpenSearch Plugins.

opensearch menu

  1. Choose Roles.
  2. Choose Create role.

opensearch-create-role-button

  1. Enter the new role's name, cluster permissions, and index permissions. For this post, name the role fluentbit_role and give cluster permissions to the following:
    1. indices:admin/create
    2. indices:admin/template/get
    3. indices:admin/template/put
    4. cluster:admin/ingest/pipeline/get
    5. cluster:admin/ingest/pipeline/put
    6. indices:data/write/bulk
    7. indices:data/write/bulk*
    8. create_index

opensearch-create-role-button

  1. In the Index permissions section, give write permission to the index fluent-*.
  2. On the Mapped users tab, choose Manage mapping.
  3. For Backend roles, enter the Amazon EMR job execution IAM role ARN to be mapped to the fluentbit_role role.
  4. Choose Map.

opensearch-map-backend

  1. To complete the security configuration, go to the IAM console and add the following inline policy to the EMR on EKS IAM role entered in the backend role. Replace the resource ARN with the ARN of your OpenSearch Service domain.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-2:XXXXXXXXXXXX:domain/emreksdemo"
        }
    ]
}
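If you prefer the AWS CLI to the console, attaching this inline policy looks roughly like the following; the role name, policy name, and file name are assumptions for illustration:

# Hedged sketch: attach the inline policy above (saved as opensearch-access.json)
# to the EMR on EKS job execution role. Role and policy names are placeholders.
aws iam put-role-policy \
  --role-name emr-eks-job-execution-role \
  --policy-name emreksdemo-opensearch-access \
  --policy-document file://opensearch-access.json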

The configuration of Amazon OpenSearch Service is complete and ready for ingestion of logs from the Fluent Bit sidecar container.

Configure the Fluent Bit sidecar container

We need to write two configuration files to configure the Fluent Bit sidecar container. The first is the Fluent Bit configuration itself, and the second is the Fluent Bit sidecar subprocess configuration that makes sure the sidecar operation ends when the main Spark job ends. The suggested configuration provided in this post is for Splunk and Amazon OpenSearch Service. However, you can configure Fluent Bit with other third-party log aggregators. For more information about configuring outputs, refer to Outputs.

Fluent Bit ConfigMap

The following sample ConfigMap is from the GitHub repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-sidecar-config
  namespace: sparkns
  labels:
    app.kubernetes.io/name: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-application.conf
    @INCLUDE input-event-logs.conf
    @INCLUDE output-splunk.conf
    @INCLUDE output-opensearch.conf

  input-application.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/user/*/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  input-event-logs.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/apps/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  output-splunk.conf: |
    [OUTPUT]
        Name            splunk
        Match           *
        Host            < INTERNAL DNS of Splunk EC2 Instance >
        Port            8088
        TLS             On
        TLS.Verify      Off
        Splunk_Token    < Token as provided by the HTTP Event Collector in Splunk >

  output-opensearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            < HOST NAME of the OpenSearch Domain | No HTTP protocol >
        Port            443
        TLS             On
        AWS_Auth        On
        AWS_Region      < Region >
        Retry_Limit     6

In your AWS Cloud9 workspace, modify the ConfigMap accordingly. Provide the values for the placeholder text by running the following commands to enter the VI editor mode. If preferred, you can use PICO or a different editor:

[../aws-emr-eks-log-forwarding] $  cd kube/configmaps
[../aws-emr-eks-log-forwarding/kube/configmaps] $ vi emr_configmap.yaml

# Modify the emr_configmap.yaml as above
# Save the file once it is complete

Complete either the Splunk output configuration or the Amazon OpenSearch Service output configuration.

Next, run the following commands to add the two Fluent Bit sidecar and subprocess ConfigMaps:

[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_configmap.yaml
[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_entrypoint_configmap.yaml

You don't need to modify the second ConfigMap because it's the subprocess script that runs inside the Fluent Bit sidecar container. To verify that the ConfigMaps are installed, run the following command:

$ kubectl get cm -n sparkns
NAME                         DATA   AGE
fluent-bit-sidecar-config    6      15s
fluent-bit-sidecar-wrapper   2      15s

Set up a customized EMR container image

To run the sample PySpark script, the script requires the Boto3 package, which is not available in the standard EMR container images. If you want to run your own script and it doesn't require a customized EMR container image, you can skip this step.

Run the following script:

[../aws-emr-eks-log-forwarding] $ cd ecr
[../aws-emr-eks-log-forwarding/ecr] $ bash create_custom_image.sh <region> <EMR container image account number>

The EMR container image account number can be obtained from How to select a base image URI. This documentation also provides the appropriate ECR registry account number. For example, the registry account number for us-east-1 is 755674844232.
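For reference, create_custom_image.sh follows the standard EMR on EKS image customization flow. The following sketch shows the general shape under the assumption of a us-east-1 base image and a repository named emr-6.5.0-custom; the script in the repository may differ in its details:

# Hedged sketch: customize the EMR 6.5.0 base image to add Boto3 and push it to your own ECR repository.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
BASE_REGISTRY=755674844232.dkr.ecr.${REGION}.amazonaws.com   # EMR base image registry for us-east-1

# Log in to the EMR base image registry and build a derived image.
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${BASE_REGISTRY}

cat > Dockerfile <<'EOF'
FROM 755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-6.5.0:latest
USER root
RUN pip3 install boto3
USER hadoop:hadoop
EOF

docker build -t emr-6.5.0-custom .

# Create a repository in your own account and push the customized image.
aws ecr create-repository --repository-name emr-6.5.0-custom --region ${REGION}
aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com
docker tag emr-6.5.0-custom:latest ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/emr-6.5.0-custom:latest
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/emr-6.5.0-custom:latest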

To verify the repository and image, run the following commands:

$ aws ecr describe-repositories --region < region > | grep emr-6.5.0-custom
            "repositoryArn": "arn:aws:ecr:xx-xxxx-x:xxxxxxxxxxxx:repository/emr-6.5.0-custom",
            "repositoryName": "emr-6.5.0-custom",
            "repositoryUri": " xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/emr-6.5.0-custom",

$ aws ecr describe-images --region < region > --repository-name emr-6.5.0-custom | jq .imageDetails[0].imageTags
[
  "latest"
]

Prepare pod templates for Spark jobs

Upload the two Spark driver and Spark executor pod templates to an S3 bucket and prefix. The two pod templates can be found in the GitHub repository:

  • emr_driver_template.yaml – Spark driver pod template
  • emr_executor_template.yaml – Spark executor pod template

The pod templates provided here should not be modified.

Submitting a Spark job with a Fluent Bit sidecar container

This Spark job example uses the bostonproperty.py script. To use this script, upload it to an accessible S3 bucket and prefix and complete the preceding steps to use an EMR customized container image. You also need to upload the CSV file from the GitHub repo, which you need to download and unzip. Upload the unzipped file to the following location: s3://<your chosen bucket>/<first level folder>/data/boston-property-assessment-2021.csv. Example upload commands follow.
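For example, assuming the script and unzipped data file are in your current directory, the uploads might look like this; the bucket and folder names are placeholders:

# Placeholder bucket and prefix; adjust to your own values.
aws s3 cp bostonproperty.py s3://emreksdemo-123456/app_code/scripts/
aws s3 cp boston-property-assessment-2021.csv s3://emreksdemo-123456/app_code/data/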

The following commands assume that you launched your EKS cluster and virtual EMR cluster with the parameters indicated in the GitHub repo, and that the following variables are set in your shell:

Variable                 Where to Find the Information or the Value Required
EMR_EKS_CLUSTER_ID       Amazon EMR console virtual cluster page
EMR_EKS_EXECUTION_ARN    IAM role ARN
EMR_RELEASE              emr-6.5.0-latest
S3_BUCKET                The bucket you create in Amazon S3
S3_FOLDER                The preferred prefix you want to use in Amazon S3
CONTAINER_IMAGE          The URI in Amazon ECR where your container image is
SCRIPT_NAME              emreksdemo-script or a name you prefer
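With those variables set, the underlying submission is essentially a call to start-job-run. The following is a hedged sketch; the exact sparkSubmitParameters, entry point arguments, and monitoring configuration used by the provided script may differ:

# Hedged sketch: submit the Spark job with pod templates that attach the Fluent Bit sidecar.
aws emr-containers start-job-run \
  --virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
  --name ${SCRIPT_NAME} \
  --execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
  --release-label ${EMR_RELEASE} \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://'${S3_BUCKET}'/'${S3_FOLDER}'/scripts/bostonproperty.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.kubernetes.container.image='${CONTAINER_IMAGE}' --conf spark.kubernetes.driver.podTemplateFile=s3://'${S3_BUCKET}'/'${S3_FOLDER}'/emr_driver_template.yaml --conf spark.kubernetes.executor.podTemplateFile=s3://'${S3_BUCKET}'/'${S3_FOLDER}'/emr_executor_template.yaml"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": { "logUri": "s3://'${S3_BUCKET}'/logs/" }
    }
  }'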

Alternatively, use the provided script to run the job. Change directory to the scripts folder in emreks and run the script as follows:

[../aws-emr-eks-log-forwarding] cd emreks/scripts
[../aws-emr-eks-log-forwarding/emreks/scripts] bash run_emr_script.sh < S3 bucket name > < ECR container image > < script path >

Example: bash run_emr_script.sh emreksdemo-123456 12345678990.dkr.ecr.us-east-2.amazonaws.com/emr-6.5.0-custom s3://emreksdemo-123456/scripts/scriptname.py

After you submit the Spark job successfully, you get a JSON response like the following:

{
    "id": "0000000305e814v0bpt",
    "name": "emreksdemo-job",
    "arn": "arn:aws:emr-containers:xx-xxxx-x:XXXXXXXXXXX:/virtualclusters/upobc00wgff5XXXXXXXXXXX/jobruns/0000000305e814v0bpt",
    "virtualClusterId": "upobc00wgff5XXXXXXXXXXX"
}

What happens when you submit a Spark job with a sidecar container

After you submit a Spark job, you can see what is happening by viewing the pods that are generated and the corresponding logs. First, using kubectl, get a list of the pods generated in the namespace where the EMR virtual cluster runs. In this case, it's sparkns. The first pod in the following code is the job controller for this particular Spark job. The second pod is the Spark executor; there can be more than one pod depending on how many executor instances are requested in the Spark job settings; we asked for one here. The third pod is the Spark driver pod.

$ kubectl get pods -n sparkns
NAME                                        READY   STATUS    RESTARTS   AGE
0000000305e814v0bpt-hvwjs                   3/3     Running   0          25s
emreksdemo-script-1247bf80ae40b089-exec-1   0/3     Pending   0          0s
spark-0000000305e814v0bpt-driver            3/3     Running   0          11s

To view what happens in the sidecar container, follow the logs in the Spark driver pod and refer to the sidecar. The sidecar container launches with the Spark pods and persists until the file /var/log/fluentd/main-container-terminated is no longer available. For more information about how Amazon EMR controls the pod lifecycle, refer to Using pod templates. The subprocess script ties the sidecar container to this same lifecycle and shuts itself down as part of the EMR managed pod lifecycle process (a conceptual sketch of that wrapper follows the log output below).

$ kubectl logs spark-0000000305e814v0bpt-driver -n sparkns  -c custom-side-car-container --follow=true

Waiting for file /var/log/fluentd/main-container-terminated to appear...
AWS for Fluent Bit Container Image Version 2.24.0Start wait: 1652190909
Elapsed Wait: 0
Not found count: 0
Waiting...
Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/05/10 13:55:09] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=11
[2022/05/10 13:55:09] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/10 13:55:09] [ info] [cmetrics] version=0.3.1
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #0 started
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #1 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #0 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #1 started
[2022/05/10 13:55:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/10 13:55:09] [ info] [sp] stream processor started
Waiting for file /var/log/fluentd/main-container-terminated to appear...
Last heartbeat: 1652190914
Elapsed Time since after heartbeat: 0
Found count: 0
list files:
-rw-r--r-- 1 saslauth 65534 0 May 10 13:55 /var/log/fluentd/main-container-terminated
Last heartbeat: 1652190918

…

[2022/05/10 13:56:09] [ info] [input:tail:tail.0] inotify_fs_add(): inode=58834691 watch_fd=6 name=/var/log/spark/user/spark-0000000305e814v0bpt-driver/stdout-s3-container-log-in-tail.pos
[2022/05/10 13:56:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=54644346 watch_fd=1 name=/var/log/spark/apps/spark-0000000305e814v0bpt
Outside of loop, main-container-terminated file does not exist
ls: cannot access /var/log/fluentd/main-container-terminated: No such file or directory
The file /var/log/fluentd/main-container-terminated doesn't exist anymore;
TERMINATED PROCESS
Fluent-Bit pid: 11
Killing process after sleeping for 15 seconds
root        11     8  0 13:55 ?        00:00:00 /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
root       114     7  0 13:56 ?        00:00:00 grep fluent
Killing process 11
[2022/05/10 13:56:24] [engine] caught signal (SIGTERM)
[2022/05/10 13:56:24] [ info] [input] pausing tail.0
[2022/05/10 13:56:24] [ info] [input] pausing tail.1
[2022/05/10 13:56:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/05/10 13:56:25] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/10 13:56:25] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=54644346 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917120 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917121 watch_fd=2
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834690 watch_fd=3
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834692 watch_fd=4
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834689 watch_fd=5
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834691 watch_fd=6
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopped
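The wait-and-terminate behavior shown in the log above comes from the subprocess wrapper delivered in the second ConfigMap. Conceptually, it is a loop along the lines of the following sketch; the actual script in the repository is more thorough, and the file path and timings here simply mirror the log output above:

#!/bin/bash
# Hedged sketch of the sidecar wrapper logic: start Fluent Bit, then keep the
# sidecar alive only while the main container's heartbeat file exists.
/fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf &
FLUENT_BIT_PID=$!

# Wait for the heartbeat file that EMR on EKS creates for the main container.
while [ ! -f /var/log/fluentd/main-container-terminated ]; do
  echo "Waiting for file /var/log/fluentd/main-container-terminated to appear..."
  sleep 5
done

# While the file is present, the main Spark container is still running.
while [ -f /var/log/fluentd/main-container-terminated ]; do
  sleep 5
done

# The file is gone: give Fluent Bit time to flush, then stop it so the pod can exit.
echo "The file /var/log/fluentd/main-container-terminated doesn't exist anymore; terminating"
sleep 15
kill "${FLUENT_BIT_PID}"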

View the forwarded logs in Splunk or Amazon OpenSearch Service

To view the forwarded logs, do a search in Splunk or on the Amazon OpenSearch Service console. If you're using a shared log aggregator, you may have to filter the results. In this configuration, the logs tailed by Fluent Bit are in /var/log/spark/*. The following screenshots show the logs generated specifically by the Kubernetes Spark driver stdout that were forwarded to the log aggregators. You can compare the results with the logs provided using kubectl:

kubectl logs < Spark Driver Pod > -n < namespace > -c spark-kubernetes-driver --follow=true

…
root
 |-- PID: string (nullable = true)
 |-- CM_ID: string (nullable = true)
 |-- GIS_ID: string (nullable = true)
 |-- ST_NUM: string (nullable = true)
 |-- ST_NAME: string (nullable = true)
 |-- UNIT_NUM: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- ZIPCODE: string (nullable = true)
 |-- BLDG_SEQ: string (nullable = true)
 |-- NUM_BLDGS: string (nullable = true)
 |-- LUC: string (nullable = true)
…

|02108|RETAIL CONDO           |361450.0            |63800.0        |5977500.0      |
|02108|RETAIL STORE DETACH    |2295050.0           |988200.0       |3601900.0      |
|02108|SCHOOL                 |1.20858E7           |1.20858E7      |1.20858E7      |
|02108|SINGLE FAM DWELLING    |5267156.561085973   |1153400.0      |1.57334E7      |
+-----+-----------------------+--------------------+---------------+---------------+
only showing top 50 rows

The following screenshot shows the Splunk logs.

splunk-result-driver-stdout

The following screenshots show the Amazon OpenSearch Service logs.

opensearch-result-driver-stdout

Optional: Include a buffer between Fluent Bit and the log aggregators

If you expect to generate a lot of logs because of highly concurrent Spark jobs creating multiple individual connections that may overwhelm your Amazon OpenSearch Service or Splunk log aggregation clusters, consider employing a buffer between the Fluent Bit sidecars and your log aggregator. One option is to use Amazon Kinesis Data Firehose as the buffering service.

Kinesis Data Firehose has built-in delivery to both Amazon OpenSearch Service and Splunk. If using Amazon OpenSearch Service, refer to Loading streaming data from Amazon Kinesis Data Firehose. If using Splunk, refer to Configure Amazon Kinesis Firehose to send data to the Splunk platform and Choose Splunk for Your Destination.

To configure Fluent Bit to send to Kinesis Data Firehose, add the following to your ConfigMap output. Refer to the GitHub ConfigMap example and add the @INCLUDE under the [SERVICE] section:

     @INCLUDE output-kinesisfirehose.conf
…

  output-kinesisfirehose.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           *
        region          < region >
        delivery_stream < Kinesis Firehose Stream Name >

Optional: Use data streams for Amazon OpenSearch Service

If you're in a scenario where the number of documents grows rapidly and you don't need to update older documents, you need to manage the OpenSearch Service cluster. This involves steps like creating a rollover index alias, defining a write index, and defining common mappings and settings for the backing indexes. Consider using data streams to simplify this process and implement a setup that best suits your time series data. For instructions on implementing data streams, refer to Data streams.

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stacks that were created with this script. This removes the EKS cluster. However, before you do that, remove the EMR virtual cluster first by running the delete-virtual-cluster command. Then delete all the CloudFormation stacks generated by the deployment script.

If you launched an OpenSearch Service domain, you can delete the domain from the OpenSearch Service console. If you used the script to launch a Splunk instance, you can go to the CloudFormation stack that launched the Splunk instance and delete the CloudFormation stack. This removes the Splunk instance and associated resources.

You can also use the cleanup scripts provided in the GitHub repository to clean up these resources.

Conclusion

EMR on EKS facilitates running Spark jobs on Kubernetes to achieve very fast and cost-efficient Spark operations. This is made possible through scheduling transient pods that are launched and then deleted once the jobs are complete. To log all these operations in the same lifecycle as the Spark jobs, this post provides a solution using pod templates and Fluent Bit that is lightweight and powerful. This approach offers a decoupled way of log forwarding based at the Spark application level and not at the Kubernetes cluster level. It also avoids routing through intermediaries like CloudWatch, reducing cost and complexity. In this way, you can address security concerns and ease of management for DevOps and system administration while providing Spark users with insights into their Spark jobs in a cost-efficient and functional way.

If you have questions or suggestions, please leave a comment.


About the Author

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing solutions with AWS Analytics services on their analytics workloads.
