Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.
apache-spark
82,637
questions
0
votes
0
answers
3
views
Configure Spark to read from HDFS by default
I've installed HDFS and Spark. However, how can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load("...
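One common approach is to set `fs.defaultFS` on the session so that path-only URIs resolve against HDFS. This is a minimal sketch, assuming a single-node NameNode listening on localhost:9000 (the same default can also be set cluster-wide in Hadoop's core-site.xml):

```python
from pyspark.sql import SparkSession

# Assumption: an HDFS NameNode is listening on localhost:9000.
# With fs.defaultFS set, scheme-less paths resolve against HDFS, so
# spark.read.load("/data/file.parquet") reads
# hdfs://localhost:9000/data/file.parquet.
spark = (
    SparkSession.builder
    .appName("hdfs-default")
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

df = spark.read.load("/data/file.parquet")  # hypothetical path, resolved on HDFS
```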
0
votes
0
answers
7
views
Spark History Server log parsing issues - version 3.5.1
We have upgraded Spark from 3.3.2 to 3.5.1, and the Spark History Server UI is not able to parse the logs completely. We have tried using MinIO and Dell ECS storage, where one log file is available. The same ...
0
votes
0
answers
8
views
Do I need Apache Spark to execute my Airflow DAG tasks?
I have a workflow with multiple DAGs. Every DAG has multiple tasks.
These tasks are simple ETL tasks involving geo data in the form of KMLs and CSVs.
An example task:
We have metadata of road ...
0
votes
0
answers
12
views
Tune Kafka for high throughput
I'm producing my PySpark dataframe into a Kafka cluster and have been trying to optimize the time performance of sending the records. My tests are done with a PySpark dataframe that has 2300 ...
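For throughput tuning, Spark's Kafka sink forwards any option prefixed with `kafka.` straight to the underlying producer, so batching and compression can be tuned from the writer. A sketch, assuming a broker at localhost:9092, a hypothetical topic name `events`, and the spark-sql-kafka connector package on the classpath (not runnable without a broker):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(2300).selectExpr("id", "id * 2 AS payload")  # stand-in data

(
    df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
    .option("topic", "events")                            # hypothetical topic
    # Producer tuning, passed through via the "kafka." prefix: larger
    # batches per request, a short linger so batches fill before sending,
    # and compression to cut bytes on the wire.
    .option("kafka.batch.size", 262144)
    .option("kafka.linger.ms", 50)
    .option("kafka.compression.type", "lz4")
    .save()
)
```

Which values pay off depends on record size and broker settings; these are starting points to benchmark, not recommendations.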
0
votes
0
answers
8
views
Sentiment analysis of Twitter using Spark Streaming
If I use the X API v2 (Basic), can I extract data from Twitter through Spark Streaming?
I ask because when I use the free API with Python code, I get this problem: Status Failed On, 403 ...
0
votes
1
answer
31
views
Catalyst rule returns wrong LogicalPlan
def apply(plan: LogicalPlan): LogicalPlan = {
  plan transform {
    case unresolvedRelation: UnresolvedRelation =>
      val tblSchemaName: Array[String] = unresolvedRelation.tableName.split(...
1
vote
1
answer
23
views
Value of another column in the same row as my last lag value
I have a timeseries dataset. I am looking to make a new column that represents the last reported (not null) value. I think I have this part figured out, using a combination of lag and last
I would ...
0
votes
0
answers
14
views
Do you still need to cache() before checkpoint()?
Going off the docs and other posts online, you should cache() before checkpoint() because checkpoint() is computed afterwards with a separate action. However, looking at the Spark query plan, this doesn't seem to be ...
2
votes
1
answer
55
views
Convert dataframe to nested json records
I have a spark dataframe as follows:
----------------------------------------------------------------------------------------------
| type | lctNbr | itmNbr | lastUpdatedDate | lctSeqId| ...
0
votes
0
answers
26
views
orderBy in Spark with a variable
I am trying to use orderBy on my dataframe, but I want to use a variable to refer to the columns by which I want to order the dataframe.
The variable that I'm using is:
city_col = "city_col"...
0
votes
0
answers
22
views
partitionBy("INSERT_DATE") is not overwriting data without creating a partition
# partitionBy("INSERT_DATE")
I'm doing this, but it is not overwriting data without creating a partition.
// Enable dynamic partitioning
spark.conf.set("hive.exec.dynamic.partition", "...
0
votes
0
answers
20
views
java.lang.NoSuchFieldError: LZ4 : spark.sql()
I am trying to run one of my methods, which has:
spark.sql("DROP TABLE IF EXISTS " + dbNameString + "." + tableNameString)
When I run the method, the code breaks on the ...
0
votes
1
answer
23
views
AQEShuffleRead in Spark creating few partitions even though advisoryPartitionSizeInBytes and initialPartitionNum are provided
I have set spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, my data size for each partition is more than 256 MB. I see in the DAG that the ...
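Two points about these knobs that may explain the behavior: AQE's partition coalescing only *merges* small shuffle partitions toward the advisory size, it never splits large ones, so if the initial shuffle partition count is too low each partition can exceed 256 MB regardless of the advisory setting. Also, with `parallelismFirst=true` (the default) AQE deliberately ignores the advisory size in favor of parallelism. A config sketch of the settings involved:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")
    # Make AQE respect the advisory size instead of maximizing parallelism.
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    # Raise the starting partition count so coalescing has small partitions
    # to merge; AQE cannot split partitions that start out too large.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
    .getOrCreate()
)
```

The value 2000 here is illustrative; it should be sized from the shuffle data volume divided by the target partition size.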
1
vote
0
answers
15
views
Delta compatibility problem with PySpark in Dataproc GKE env
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.4.0, but I can't get it to work with Delta.
The custom image Dockerfile is this:
FROM us-central1-docker.pkg.dev/cloud-...
1
vote
2
answers
47
views
Pyspark column ambiguous depending on the order of the join
With this example:
data = [{'name': 'Alice', 'age': 1}, {'name': 'Casper', 'age': 2}, {'name': 'Agatha', 'age': 3}]
df = spark.createDataFrame(data)
df_1 = df.select("name", df["age"...