I know the recommended strategy is to use EMR or EMR Serverless. However, I have a particular use case where I only need to run a fairly small PySpark job and need quick results. I've already gotten my job working on EMR Serverless. I want to configure a Lambda to perform the same functionality, given the needs above. I've used the Spark on AWS Lambda example code as a guide. I'm trying to make my Lambda compatible with EMR/EMR Serverless 7.1.0, using the same library versions as specified there.
I've been able to create a container image Lambda and have gotten most of the runtime issues out of the way, but I'm now hitting a "too many open files" error:
2024-06-07T17:47:36.185Z ERROR StatusLogger Error creating converter for d
2024-06-07T17:47:36.185Z java.lang.reflect.InvocationTargetException
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-06-07T17:47:36.185Z at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.createConverter(PatternParser.java:590)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.finalizeConverter(PatternParser.java:657)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:420)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:177)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout$SerializerBuilder.build(PatternLayout.java:473)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:139)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:60)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:766)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.config.AbstractConfiguration.setToDefault(AbstractConfiguration.java:745)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:47)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:84)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:218)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:140)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:230)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getContext(LogManager.java:176)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:666)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getRootLogger(LogManager.java:700)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:114)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:108)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:76)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:84)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-07T17:47:36.185Z Caused by: java.lang.Error: java.io.FileNotFoundException: /usr/lib/jvm/java-17-amazon-corretto.x86_64/lib/tzdb.dat (Too many open files)
I think the problem is due to my including both AWS SDK for Java versions noted in the documentation above (v2 2.23.18 and v1 1.12.656), which results in over a thousand jar files in my PySpark jars directory. I would like to cull this list, but determining which jar files are actually needed would be a tedious process. Unfortunately, the EMR/EMR Serverless code is not public, so I don't know exactly which libraries are required. Does anyone know which libraries are needed, or whether I can limit the set of jar files I include? I cannot increase the number of open file descriptors because Lambda caps it at 1024.
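To see where the jar count is coming from before trying to cull anything, I've been inventorying the jars directory and grouping them by artifact prefix, which makes the heavy dependency families (like the per-service aws-java-sdk-* jars from SDK v1) obvious. A rough sketch, assuming Spark is installed under $SPARK_HOME in the image (adjust the path to wherever your Dockerfile puts it):

```python
import os
from collections import Counter

def summarize_jars(jar_dir):
    """Count the jars in jar_dir and group them by artifact prefix so the
    heaviest dependency families (e.g. aws-java-sdk-*) stand out."""
    jars = [f for f in os.listdir(jar_dir) if f.endswith(".jar")]
    # Strip the trailing -<version>.jar segment to get the artifact name.
    families = Counter(j.rsplit("-", 1)[0] for j in jars)
    # AWS SDK for Java v1 artifacts are conventionally named aws-java-sdk-*.
    v1_jars = [j for j in jars if j.startswith("aws-java-sdk")]
    return len(jars), families, len(v1_jars)

if __name__ == "__main__":
    # Hypothetical install location; point this at your image's jars dir.
    jar_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "jars")
    if os.path.isdir(jar_dir):
        total, families, v1_count = summarize_jars(jar_dir)
        print(f"total jars: {total}, AWS SDK v1 jars: {v1_count}")
        for name, count in families.most_common(10):
            print(f"{count:4d}  {name}")
```

If the v1 per-service jars dominate the count, deleting the ones for services the job never touches (in a Dockerfile `RUN rm` step) is one way to shrink the list without guessing at Spark's own dependencies.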
Or is there another issue I should be aware of?
Hi Yokesh, I definitely understand that I cannot achieve full EMR Spark functionality on Lambda. My original aim was just to get PySpark running on Lambda. The problem is that I'm getting the aforementioned "too many open files" error, which could be due to my adding the Delta Lake libraries, etc. Is there any way of overcoming that?
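One thing I've found useful while chasing this is logging the descriptor limit and the current open-descriptor count from inside the handler, to confirm the process really is bumping against Lambda's 1024 cap rather than something else. A small Linux-only diagnostic sketch (Lambda runs on Amazon Linux, so /proc/self/fd is available):

```python
import os
import resource

def fd_report():
    """Return (soft limit, hard limit, current open fd count) for this
    process. Counting entries in /proc/self/fd is Linux-specific."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return soft, hard, open_fds

if __name__ == "__main__":
    soft, hard, open_fds = fd_report()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}, currently open={open_fds}")
```

Calling this just before and just after launching spark-submit shows how close the JVM startup (which maps every jar on the classpath) pushes the process toward the limit.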