Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.
apache-spark
82,637
questions
0
votes
0
answers
3
views
Configure Spark to read from HDFS by default
I've installed HDFS and Spark. However, how can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load("...
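One common approach is to set `fs.defaultFS` on the session so that path-only URIs resolve against HDFS. This is a minimal sketch, assuming a single-node NameNode listening on localhost:9000 (the same default can also be set cluster-wide in Hadoop's core-site.xml):

```python
from pyspark.sql import SparkSession

# Assumption: an HDFS NameNode is listening on localhost:9000.
# With fs.defaultFS set, scheme-less paths resolve against HDFS, so
# spark.read.load("/data/file.parquet") reads
# hdfs://localhost:9000/data/file.parquet.
spark = (
    SparkSession.builder
    .appName("hdfs-default")
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

df = spark.read.load("/data/file.parquet")  # hypothetical path, resolved on HDFS
```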
0
votes
0
answers
7
views
Spark History Server log parsing issues - version 3.5.1
We have upgraded Spark from 3.3.2 to 3.5.1, and the Spark History Server UI is not able to parse the logs completely. We have tried using MinIO and Dell ECS storage, where one log file is available. The same ...
0
votes
0
answers
8
views
Do I need Apache Spark to execute my Airflow DAG tasks?
I have a workflow with multiple DAGs. Every DAG has multiple tasks.
These tasks are simple ETL tasks involving geo data in the form of KMLs and CSVs.
An example task:
We have metadata of road ...
0
votes
0
answers
12
views
Tune Kafka for high throughput
I'm producing my PySpark dataframe into a Kafka cluster and have been trying to optimize the time performance of sending the records. My tests are done with a PySpark dataframe that has 2300 ...
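For throughput tuning, Spark's Kafka sink forwards any option prefixed with `kafka.` straight to the underlying producer, so batching and compression can be tuned from the writer. A sketch, assuming a broker at localhost:9092, a hypothetical topic name `events`, and the spark-sql-kafka connector package on the classpath (not runnable without a broker):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(2300).selectExpr("id", "id * 2 AS payload")  # stand-in data

(
    df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
    .option("topic", "events")                            # hypothetical topic
    # Producer tuning, passed through via the "kafka." prefix: larger
    # batches per request, a short linger so batches fill before sending,
    # and compression to cut bytes on the wire.
    .option("kafka.batch.size", 262144)
    .option("kafka.linger.ms", 50)
    .option("kafka.compression.type", "lz4")
    .save()
)
```

Which values pay off depends on record size and broker settings; these are starting points to benchmark, not recommendations.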
0
votes
0
answers
8
views
Sentiment analysis of Twitter using Spark Streaming
If I use the X API v2 (Basic), can I extract data from Twitter through Spark Streaming?
I ask because when I use the free API with Python code, I get this problem: Status Failed On, 403 ...
0
votes
1
answer
31
views
Catalyst rule returns wrong LogicalPlan
def apply(plan: LogicalPlan): LogicalPlan = {
  plan transform {
    case unresolvedRelation: UnresolvedRelation =>
      val tblSchemaName: Array[String] = unresolvedRelation.tableName.split(...
1
vote
1
answer
23
views
Value of another column in the same row as my last lag value
I have a timeseries dataset. I am looking to make a new column that represents the last reported (not null) value. I think I have this part figured out, using a combination of lag and last
I would ...
0
votes
0
answers
14
views
Do you still need to cache() before checkpoint()?
Going off the docs and other posts online, you should cache() before checkpoint() because checkpoint() is computed afterwards with a separate action. However, looking at the Spark query plan, this doesn't seem to be ...
2
votes
1
answer
55
views
Convert dataframe to nested json records
I have a spark dataframe as follows:
----------------------------------------------------------------------------------------------
| type | lctNbr | itmNbr | lastUpdatedDate | lctSeqId| ...
0
votes
0
answers
26
views
orderBy in Spark with a variable
I am trying to use orderBy on my dataframe, but I want to use a variable to refer to the columns by which I want to order the dataframe.
The variable that I'm using is:
city_col = "city_col"...
0
votes
0
answers
22
views
partitionBy("INSERT_DATE") is not overwriting data without creating a partition
# partitionBy("INSERT_DATE")
I'm doing this, but it is not overwriting data without creating a partition.
// Enable dynamic partitioning
spark.conf.set("hive.exec.dynamic.partition", "...
0
votes
0
answers
20
views
java.lang.NoSuchFieldError: LZ4 : spark.sql()
I am trying to run one of my methods, which has:
spark.sql("DROP TABLE IF EXISTS " + dbNameString + "." + tableNameString)
When I run the method, the code breaks on the ...
0
votes
1
answer
23
views
AQEShuffleRead in Spark creating few partitions even though advisoryPartitionSizeInBytes and initialPartitionNum are provided
I have set spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, my data size for each partition is more than 256 MB. I see in the DAG that the ...
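Two points about these knobs that may explain the behavior: AQE's partition coalescing only *merges* small shuffle partitions toward the advisory size, it never splits large ones, so if the initial shuffle partition count is too low each partition can exceed 256 MB regardless of the advisory setting. Also, with `parallelismFirst=true` (the default) AQE deliberately ignores the advisory size in favor of parallelism. A config sketch of the settings involved:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")
    # Make AQE respect the advisory size instead of maximizing parallelism.
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    # Raise the starting partition count so coalescing has small partitions
    # to merge; AQE cannot split partitions that start out too large.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "2000")
    .getOrCreate()
)
```

The value 2000 here is illustrative; it should be sized from the shuffle data volume divided by the target partition size.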
1
vote
0
answers
15
views
Delta compatibility problem with PySpark in Dataproc GKE env
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.4.0, but I can't get it to work with Delta.
The custom image Dockerfile is this:
FROM us-central1-docker.pkg.dev/cloud-...
1
vote
2
answers
47
views
Pyspark column ambiguous depending on the order of the join
With this example:
data = [{'name': 'Alice', 'age': 1}, {'name': 'Casper', 'age': 2}, {'name': 'Agatha', 'age': 3}]
df = spark.createDataFrame(data)
df_1 = df.select("name", df["age"...