
Apache Spark Structured Streaming and AWS EMR memory issues

  1. Externalize all logs to an S3 bucket so they do not accumulate on the cluster.

  2. Reduce the checkpoint interval (see the checkpointing sketch at the end of this page).

  3. If the driver program performs operations that collect and hold data in memory, change the logic to keep the work on the executors (see the sketch at the end of this page).

  4. See the Spark monitoring configuration options: http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options

  5. Set your log level in /etc/spark/log4j.properties (a programmatic alternative is sketched at the end of this page).

  6. Event logs are kept by default on AWS EMR under hdfs:///var/log/spark/apps; to disable them, set spark.eventLog.enabled to false (see the sketch at the end of this page):

                   spark.eventLog.enabled = false
    
  7. Make sure any state you keep (using functions like mapWithState) does not grow in an unbounded fashion (see the bounded-state sketch at the end of this page), and periodically clean up logs under /tmp/spark.

  8. Disk usage: you can run out of space on HDFS (thereby crashing your app) when you have a cluster up for a long time.

For example, logs under /var/log/spark may pile up, especially if you have loose logging settings and/or print a lot of stuff to STDOUT.

You can check your current disk usage using commands such as

      $ hadoop fs -df -h /

      $ hadoop fs -du -h /

Configuring a RollingFileAppender and setting the log file location to YARN's log directory will avoid disk overflow caused by large log files, and the logs can be accessed using YARN's log utility.
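
The sketches below illustrate a few of the tips above in Scala; the bucket names, paths, schemas, and intervals in them are assumptions, not values from this page. For tip 2, this is a minimal Structured Streaming job showing where the checkpoint location and trigger interval are set; a checkpoint is committed at every trigger, so the trigger interval effectively controls how often checkpoints are written.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()

    // Hypothetical JSON source; the schema and S3 paths are placeholders.
    val events = spark.readStream
      .format("json")
      .schema("id STRING, ts TIMESTAMP")
      .load("s3://my-bucket/input/")

    events.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/output/")
      // Keep the checkpoint on durable storage (HDFS or S3), not local disk.
      .option("checkpointLocation", "s3://my-bucket/checkpoints/")
      // A checkpoint is committed every trigger, so this interval also
      // controls checkpoint frequency.
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()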
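
For tip 3, a sketch of keeping results on the executors instead of collecting them into driver memory; the input path and grouping column are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("driver-memory-sketch").getOrCreate()

    // Placeholder input path.
    val df = spark.read.parquet("s3://my-bucket/input/")

    // Anti-pattern: collect() pulls every row into the driver JVM.
    // val rows = df.collect()

    // Prefer keeping the work distributed and writing results out directly.
    df.groupBy("id").count()
      .write
      .mode("overwrite")
      .parquet("s3://my-bucket/aggregates/")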
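
For tip 5, a sketch of lowering the log level from inside the application as an alternative to editing /etc/spark/log4j.properties; this affects only the running application, not the EMR daemons.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    // Raise the threshold so INFO noise stops piling up in the logs.
    spark.sparkContext.setLogLevel("WARN")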
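
For tip 6, the same setting can also be applied per application when the session is built (equivalent to passing --conf spark.eventLog.enabled=false to spark-submit); it has to be set before the SparkContext starts.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("no-event-log")
      // Disable event logging so nothing accumulates under hdfs:///var/log/spark/apps.
      .config("spark.eventLog.enabled", "false")
      .getOrCreate()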
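
For tip 7, mapWithState belongs to the older DStream API; the Structured Streaming analogue is mapGroupsWithState, and a processing-time timeout is one way to keep that state from growing without bound. The Event schema, source path, and one-hour timeout below are assumptions for the sketch.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class Event(user: String, count: Long)

    val spark = SparkSession.builder().appName("bounded-state-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical streaming source of Event rows.
    val events = spark.readStream
      .format("json")
      .schema("user STRING, count BIGINT")
      .load("s3://my-bucket/events/")
      .as[Event]

    // Running count per user; idle keys expire so the state store stays bounded.
    val counts = events
      .groupByKey(_.user)
      .mapGroupsWithState[Long, (String, Long)](GroupStateTimeout.ProcessingTimeTimeout) {
        (user: String, rows: Iterator[Event], state: GroupState[Long]) =>
          if (state.hasTimedOut) {
            state.remove()                     // drop state for keys that went idle
            (user, 0L)
          } else {
            val total = state.getOption.getOrElse(0L) + rows.map(_.count).sum
            state.update(total)
            state.setTimeoutDuration("1 hour") // reset the idle timer for this key
            (user, total)
          }
      }

The resulting Dataset would then be written out with writeStream in update output mode.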
