
Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

0 votes
0 answers
3 views

Configure Spark to read from HDFS by default

I've installed HDFS and Spark. How can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load("...
Trần Quang Đạt
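A minimal sketch of one way to get this behaviour, assuming a single-node HDFS namenode at localhost:9000: forward fs.defaultFS into Spark's Hadoop configuration via a spark.hadoop.* property so that unqualified paths resolve to HDFS (the read path below is only a placeholder).

    from pyspark.sql import SparkSession

    # Point Spark's Hadoop configuration at the local HDFS namenode so that
    # unqualified paths resolve to hdfs://localhost:9000/ instead of file://.
    spark = (
        SparkSession.builder
        .appName("hdfs-default-fs")
        .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
        .getOrCreate()
    )

    # Reads from HDFS by default now; the path is a hypothetical example.
    df = spark.read.parquet("/data/events.parquet")

The same property can also live in core-site.xml on the machine running Spark, in which case no per-session config is needed.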
0 votes
0 answers
7 views

Spark History Server log parsing issues - version 3.5.1

We have upgraded Spark from 3.3.2 to 3.5.1, and the Spark History Server URL is not able to parse the logs completely. We have tried using MinIO and Dell ECS storage, where one log file is available. The same ...
Bharath Reddy
0 votes
0 answers
8 views

Do I need Apache Spark to execute my Airflow DAG tasks?

I have a workflow with multiple DAGs, and every DAG has multiple tasks. These are simple ETL tasks involving geo data in the form of KMLs and CSVs. An example task: we have metadata of road ...
ShariqHameed
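For small file-based ETL like this, Spark is usually unnecessary; a plain Python task inside the DAG is enough. A rough sketch using a recent Airflow 2.x PythonOperator with pandas; the DAG id, schedule, and file paths are made-up placeholders.

    from datetime import datetime

    import pandas as pd
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def transform_roads():
        # Hypothetical ETL step: read a small CSV of road metadata,
        # filter it, and write the result back out -- no Spark involved.
        roads = pd.read_csv("/data/roads.csv")             # assumed input path
        roads = roads[roads["status"] == "active"]         # assumed filter
        roads.to_csv("/data/roads_clean.csv", index=False)


    with DAG(
        dag_id="geo_etl_example",            # placeholder DAG id
        start_date=datetime(2024, 1, 1),
        schedule=None,                       # Airflow 2.4+ syntax; trigger manually
        catchup=False,
    ) as dag:
        PythonOperator(task_id="transform_roads", python_callable=transform_roads)

Spark only starts to pay off when a single task's data no longer fits comfortably on one worker.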
0 votes
0 answers
12 views

Tuning Kafka for high throughput

I'm producing my PySpark DataFrame into a Kafka cluster and have been trying to optimize the time it takes to send the records. My tests are done with a PySpark DataFrame that has 2300 ...
Nabil Hadji
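A sketch of one common tuning approach for the batch Kafka sink, assuming df is the existing PySpark DataFrame: serialize each row into a value column and pass Kafka producer settings through the kafka.-prefixed options (the broker address, topic, and numbers are placeholders to tune for your cluster).

    from pyspark.sql import functions as F

    # The Kafka sink expects the payload in a "value" column (string/binary).
    out = df.select(F.to_json(F.struct(*df.columns)).alias("value"))

    (out.write
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
        .option("topic", "events")                           # placeholder topic
        # Producer configs are forwarded via the "kafka." prefix:
        .option("kafka.linger.ms", "50")          # let the producer batch records
        .option("kafka.batch.size", "262144")     # larger producer batches
        .option("kafka.compression.type", "lz4")  # compress on the wire
        .save())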
0 votes
0 answers
8 views

Sentiment analysis of Twitter data using Spark Streaming

If I use the X API v2 (Basic), can I extract data from Twitter through Spark Streaming? When I use the free API with Python code, I get this error: Status Failed On, 403 ...
user26634592
0 votes
1 answer
31 views

Catalyst rule returns wrong LogicalPlan

def apply(plan: LogicalPlan): LogicalPlan = { plan transform { case unresolvedRelation: UnresolvedRelation => val tblSchemaName: Array[String] = unresolvedRelation.tableName.split(...
user2289345
1 vote
1 answer
23 views

Value of another column in the same row as my last lag value

I have a time-series dataset. I am looking to make a new column that represents the last reported (non-null) value. I think I have this part figured out using a combination of lag and last; I would ...
smurphy · 335
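One sketch of the usual trick for pulling a companion column from the same row as the last non-null value: wrap both columns in a struct, take last(..., ignorenulls=True) over a running window, then unpack the struct. The grouping, ordering, and column names below are assumptions.

    from pyspark.sql import Window, functions as F

    # Running window from the start of each series up to the current row.
    w = (Window.partitionBy("series_id")      # assumed grouping column
               .orderBy("event_time")         # assumed ordering column
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # Keep the value and its companion column together so both come from
    # the same (last non-null) row.
    last_reported = F.last(
        F.when(F.col("value").isNotNull(), F.struct("value", "other_col")),
        ignorenulls=True,
    ).over(w)

    df2 = (df.withColumn("last_struct", last_reported)
             .withColumn("last_value", F.col("last_struct.value"))
             .withColumn("last_other", F.col("last_struct.other_col"))
             .drop("last_struct"))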
0 votes
0 answers
14 views

Do you still need to cache() before checkpoint()?

Going off the docs and other posts online, you should cache() before checkpoint() because checkpoint() is computed afterwards with a separate action. However, looking at the Spark query plan, this doesn't seem to be ...
kyl · 549
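A minimal sketch of the pattern the docs describe, assuming an active session named spark and an existing DataFrame df: cache before checkpoint so the separate job launched by checkpoint() reuses the cached data instead of recomputing the lineage. Whether the resulting query plan visibly shows the cached scan can depend on the Spark version.

    # Checkpoint files must go to reliable storage (HDFS, S3, ...).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder path

    expensive = df.groupBy("id").count()       # stand-in for an expensive plan

    expensive = expensive.cache()              # reused by the checkpoint job
    checkpointed = expensive.checkpoint()      # eager by default; truncates lineage

    checkpointed.explain()                     # plan should start from the checkpoint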
2 votes
1 answer
55 views

Convert DataFrame to nested JSON records

I have a Spark DataFrame with the following columns: type | lctNbr | itmNbr | lastUpdatedDate | lctSeqId | ...
Teodoro Abarca
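A sketch of the usual approach, assuming the goal is one JSON record per (type, lctNbr, itmNbr) with the remaining fields nested as an array; the choice of parent keys and child fields here is a guess based on the excerpt.

    from pyspark.sql import functions as F

    nested = (
        df.groupBy("type", "lctNbr", "itmNbr")                  # assumed parent keys
          .agg(F.collect_list(
                   F.struct("lctSeqId", "lastUpdatedDate")       # assumed child fields
               ).alias("locations"))
    )

    # Write as JSON lines, or keep each record as a JSON string column.
    nested.write.mode("overwrite").json("/tmp/nested_records")   # placeholder path
    as_strings = nested.select(F.to_json(F.struct(*nested.columns)).alias("json"))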
0 votes
0 answers
26 views

orderBy in Spark with a variable

I am trying to use orderBy on my DataFrame, but I want to use a variable to refer to the column by which I want to order the DataFrame. The variable that I'm using is: city_col = "city_col"...
Dylan García
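A small sketch: orderBy accepts either the plain string held in the variable or a Column built from it with col(), as long as the variable contains the actual column name.

    from pyspark.sql import functions as F

    city_col = "city"   # assumed: the variable must hold the real column name

    df.orderBy(city_col).show()                  # ascending by that column
    df.orderBy(F.col(city_col).desc()).show()    # descending via a Column object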
0 votes
0 answers
22 views

partitionBy("INSERT_DATE") i'm doing this but it is not overwriting data without creating partition

# partitionBy("INSERT_DATE") I'm doing this but it is not overwriting data without creating partition // Enable dynamic partitioning spark.conf.set("hive.exec.dynamic.partition", &...
Rajasekhar KM
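A sketch of dynamic partition overwrite, which replaces only the partitions present in the incoming DataFrame instead of the whole table; the table name and path below are placeholders, and df is assumed to be the existing DataFrame.

    # Replace only the partitions that appear in the incoming data.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Needed when inserting dynamic partitions into a Hive table.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # Option A: insert into an existing partitioned table
    # (column order must match the table; no partitionBy() here).
    df.write.mode("overwrite").insertInto("db.my_table")         # placeholder table

    # Option B: path-based write, overwriting only the matching partitions.
    (df.write.mode("overwrite")
       .partitionBy("INSERT_DATE")
       .parquet("hdfs:///warehouse/my_table"))                   # placeholder path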
0 votes
0 answers
20 views

java.lang.NoSuchFieldError: LZ4 : spark.sql()

I am trying to run one of my methods, which has spark.sql("DROP TABLE IF EXISTS " + dbNameString + "." + tableNameString). When I run the method, the code breaks on the ...
ujjawal
0 votes
1 answer
23 views

AQEShuffleRead in Spark creates too few partitions even though advisoryPartitionSizeInBytes and initialPartitionNum are provided

I have set spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, the data size for each partition is more than 256 MB. I see the DAG where the ...
user3858193 · 1,448
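For reference, these are the coalescing-related settings that usually interact here: with the default parallelismFirst=true, Spark favours parallelism over the advisory size, so the advisory value is largely ignored when shuffle partitions are coalesced. The numbers below are examples, not recommendations.

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Respect the advisory size rather than maximising parallelism.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")        # 256 MB
    # Lower bound for partitions produced by coalescing.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "67108864")  # 64 MB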
1 vote
0 answers
15 views

Delta compatibility problem with PySpark in Dataproc GKE env

I've created a Dataproc cluster using GKE and a custom image with PySpark 3.4.0, but I can't get it to work with Delta. The custom image Dockerfile is this: FROM us-central1-docker.pkg.dev/cloud-...
Pedro · 11
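A minimal sketch of a Delta-enabled session for comparison, assuming the Delta Lake release is matched to the Spark minor version (the 2.4.x line is the one built against Spark 3.4.x):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-example")
        # Delta release must match the Spark minor version (2.4.x <-> Spark 3.4.x).
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_demo")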
1 vote
2 answers
47 views

PySpark column ambiguous depending on the order of the join

With this example: data = [{'name': 'Alice', 'age': 1}, {'name': 'Casper', 'age': 2}, {'name': 'Agatha', 'age': 3}] df = spark.createDataFrame(data) df_1 = df.select("name", df["age"...
Erik Garcia
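A sketch of the usual workaround, assuming an active session named spark: alias both sides of the join and qualify every column through the aliases, so each reference is unambiguous regardless of join order. The derived frame below is only a guess at the question's setup.

    from pyspark.sql import functions as F

    data = [{"name": "Alice", "age": 1},
            {"name": "Casper", "age": 2},
            {"name": "Agatha", "age": 3}]
    df = spark.createDataFrame(data)
    df_1 = df.select("name", (df["age"] + 1).alias("age"))   # assumed derived frame

    joined = (
        df.alias("a")
          .join(df_1.alias("b"), F.col("a.name") == F.col("b.name"))
          # Qualify every reference so "age" can never be ambiguous.
          .select(F.col("a.name").alias("name"),
                  F.col("a.age").alias("age"),
                  F.col("b.age").alias("age_plus_one"))
    )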
