Apache Spark structured streaming and AWS EMR Memory issue
-
Externalize all logs to an S3 bucket.
-
Reduce the checkpoint interval.
-
If the driver program performs operations and keeps the results in memory, change that logic.
-
http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options
-
Set your log level in /etc/spark/log4j.properties.
-
Event logs are kept by default on AWS EMR under "hdfs:/var/log/spark/apps". To disable them, set spark.eventLog.enabled = false.
-
Make sure any state you keep (using functions like mapWithState) does not grow without bound, and keep deleting old logs under /tmp/spark.
-
Disk usage: you can run out of space on HDFS (thereby crashing your app) when you have a cluster up for a long time.
For example, logs under /var/log/spark may pile up, especially if you have loose logging settings and/or print a lot of stuff to STDOUT.
You can check your current disk usage using commands such as
$ hadoop fs -df -h /
$ hadoop fs -du -h /
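When local log directories such as /tmp/spark or /var/log/spark keep growing, a periodic cleanup is one option. The following is only a minimal sketch in Python, not part of Spark or EMR: the function name `prune_old_files` and the age threshold are hypothetical, and it assumes the files live on the local filesystem (not HDFS) and are safe to delete once they are old enough.

```python
import os
import time


def prune_old_files(root, max_age_seconds, now=None):
    """Delete regular files under `root` whose mtime is older than `max_age_seconds`.

    Returns the list of deleted paths. `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    deleted.append(path)
            except OSError:
                # The file may have been rotated or removed concurrently; skip it.
                continue
    return deleted
```

A call such as `prune_old_files("/tmp/spark", 7 * 24 * 3600)` would remove files older than seven days; the retention period is an assumption and should match how long you need the logs.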
Configuring RollingFileAppender and setting the file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
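As an illustration of the RollingFileAppender setup, the sketch below renders a log4j 1.x properties fragment from Python so the size limits can be parameterized. It assumes a log4j 1.x style log4j.properties (newer Spark releases use log4j2, with a different syntax). The appender name `rolling`, the helper name, and the default sizes are hypothetical. `${spark.yarn.app.container.log.dir}` is the property Spark on YARN provides for the container log directory.

```python
def rolling_appender_properties(level="WARN", max_file_size="50MB", max_backups=5):
    """Render a log4j 1.x properties fragment that uses a RollingFileAppender."""
    lines = [
        # Route the root logger to the hypothetical appender named "rolling".
        f"log4j.rootCategory={level}, rolling",
        "log4j.appender.rolling=org.apache.log4j.RollingFileAppender",
        # Write into YARN's container log directory so YARN can aggregate the file.
        "log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log",
        # Cap each file and the number of rolled files to bound disk usage.
        f"log4j.appender.rolling.MaxFileSize={max_file_size}",
        f"log4j.appender.rolling.MaxBackupIndex={max_backups}",
        "log4j.appender.rolling.layout=org.apache.log4j.PatternLayout",
        "log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(rolling_appender_properties(), end="")
```

With these illustrative defaults, each container keeps at most the active file plus five rolled files of about 50 MB each; the printed fragment would go into the log4j.properties file that your Spark driver and executors load.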