Issues running PySpark on AWS Lambda


I know the recommended strategy is to use EMR or EMR Serverless. However, I have a particular use case where I only need to run a fairly small PySpark job and need quick results. I've already gotten the job working on EMR Serverless, and I now want to configure a Lambda to perform the same functionality. I've used the Spark on AWS Lambda example code as a guide, and I'm trying to make my Lambda compatible with EMR/EMR Serverless 7.1.0 by using the same library versions as specified here.

I've been able to create a container image Lambda and have gotten most of the runtime issues out of the way, but I'm now hitting a "too many open files" error:

2024-06-07T17:47:36.185Z	ERROR StatusLogger Error creating converter for d
2024-06-07T17:47:36.185Z	java.lang.reflect.InvocationTargetException
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-06-07T17:47:36.185Z	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.createConverter(PatternParser.java:590)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.finalizeConverter(PatternParser.java:657)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:420)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:177)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout$SerializerBuilder.build(PatternLayout.java:473)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:139)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:60)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:766)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.config.AbstractConfiguration.setToDefault(AbstractConfiguration.java:745)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:47)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:84)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:218)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:140)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:230)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:176)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:666)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getRootLogger(LogManager.java:700)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:114)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:108)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:76)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:84)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-07T17:47:36.185Z	Caused by: java.lang.Error: java.io.FileNotFoundException: /usr/lib/jvm/java-17-amazon-corretto.x86_64/lib/tzdb.dat (Too many open files)

I think the problem is that I'm including both AWS SDK for Java v2 and v1, as called for in the documentation above:

AWS SDK for Java v2 2.23.18, v1 1.12.656

This results in over a thousand jar files in my PySpark jars directory. I would like to cull that list, but determining which jar files are actually needed would be a tedious process, and since the EMR/EMR Serverless code is not public, I don't know exactly which libraries are required. Does anyone know which libraries are needed, or whether I can limit the jar files I include? I cannot increase the number of open file descriptors, because Lambda caps it at 1024.
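
For reference, a quick check along these lines can confirm the descriptor ceiling and the jar count from inside the handler (a minimal sketch; the jars path is an assumption and depends on how the image is laid out):

import glob
import os
import resource

def describe_fd_pressure(jars_dir="/opt/spark/jars"):  # illustrative path
    # Every process in the Lambda sandbox, including the spark-submit JVM,
    # is subject to this file descriptor cap.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    jars = glob.glob(os.path.join(jars_dir, "*.jar"))
    open_now = len(os.listdir("/proc/self/fd"))  # fds held by this process
    print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")
    print(f"{len(jars)} jars under {jars_dir}; {open_now} fds open here")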

Or is there another issue I should know about?

asked 2 months ago · 677 views
1 Answer

Hello,

AFAIK, you cannot achieve the EMR Spark functionalities in Lambda, as EMR has its own customizations that are compatible only with EMR-flavored services. The given spark-on-aws-lambda example is not meant for EMR Spark. However, the PySpark code that you execute in Lambda will also work in EMR; an example is mentioned here.
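
For illustration, a minimal in-process PySpark handler along those lines might look like the following (a sketch only; the master setting, scratch directory, and tuning values are assumptions chosen for Lambda's single-node, /tmp-only environment, not an official example):

from pyspark.sql import SparkSession

def lambda_handler(event, context):
    # Run the driver and executors in one local JVM; Lambda has no cluster.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("lambda-pyspark")
        .config("spark.ui.enabled", "false")          # no UI sockets or files
        .config("spark.sql.shuffle.partitions", "4")  # fewer shuffle files
        .config("spark.local.dir", "/tmp/spark")      # /tmp is the only writable path
        .getOrCreate()
    )
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "val"])
    rows = df.groupBy("val").count().collect()
    spark.stop()
    return {"counts": [r.asDict() for r in rows]}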

AWS SUPPORT ENGINEER
answered 2 months ago
  • Hi Yokesh, I definitely understand that I cannot achieve EMR Spark functionality on Lambda; my original aim was simply to get PySpark running on Lambda. The problem is that I'm hitting the aforementioned "too many open files" error, which could be due to my adding the Delta Lake libraries, etc. Is there any way of overcoming that, for example by culling the jar list as sketched below?
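
The kind of culling pass I have in mind would run while building the image, something like this (a rough, untested sketch; the keep-list is a guess, so I would dry-run it before deleting anything):

import os

JARS_DIR = "/opt/spark/jars"  # illustrative; depends on the image layout

# Guess: keep the SDK bundle jar and flag the hundreds of per-service
# AWS SDK v1 jars, which account for most of the thousand-plus files.
KEEP_PREFIXES = ("aws-java-sdk-bundle",)

for name in sorted(os.listdir(JARS_DIR)):
    if name.startswith("aws-java-sdk-") and not name.startswith(KEEP_PREFIXES):
        print(f"candidate to drop: {name}")  # dry run; os.remove() once verified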