Questions tagged [parquet]
Apache Parquet is a columnar storage file format for the Hadoop ecosystem.
parquet
4,097
questions
0
votes
1
answer
38
views
Why do I get the "is not a Parquet file" error when creating a parquet reader
Trying to create an AvroParquetReader for a parquet file by reading a block blob in an Azure storage account, but getting an error - Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a ...
2
votes
1
answer
47
views
Avoid writing partition column in Parquet file
I'm trying to export a DuckDB table to a Parquet file using hive partitioning. Unfortunately, DuckDB writes the partition both as a hive partition and as a column in the file. This doesn't seem correct, ...
0
votes
1
answer
45
views
Pandas parquet file pyarrow.lib.ArrowMemoryError: malloc of size 106255424 failed
I am trying to run a Python script in the cPanel terminal. I am getting an error as the script tries to open a parquet file that is 46.65 MB in size. This worked on my home computer.
df = pd.read_parquet(...
0
votes
0
answers
40
views
How to reduce storage space used in Redshift serverless?
I have an S3 bucket which contains parquet files that are no more than 20 MB in size (all parquets combined).
I'm loading these files into AWS Redshift serverless tables using the COPY command.
But after ...
0
votes
0
answers
33
views
Best way to find multiple ids in list of files in spark scala
I have a list of IDs that I want to find in my parquet files. For each ID I have an idea of which files it could be present in, i.e. I have a mapping like
ID1 -> file1, ...
0
votes
0
answers
16
views
Inferring schema from a pyarrow table created from a parquet file gives the wrong data type for a column
I have a parquet file which is loaded into a pyarrow table as below.
df = pq.read_table("full_file.parquet")
and if I check the column types in the schema, as below, for some columns it is ...
0
votes
0
answers
15
views
Greenplum PXF server: create external table based on parquet HDFS file with partition in select
I have a parquet HDFS file with partitions and I want to create an external table in Greenplum with the partition columns in it.
This is the HDFS file:
CREATE TABLE productshelf.funnel (
system_product_id ...
0
votes
0
answers
23
views
How does pyarrow handle date partitions?
I store files on s3 in the following format:
../country/state/city/date=2024-08-02/12-00-57/time_series.parquet
and the table contains multiple columns, one of which is named date and is of type pa.date64(). ...
0
votes
0
answers
29
views
pd.to_parquet not working, but also is?
So I am trying to update a parquet file as well as a Google Sheet at the same time. It works, but also doesn't: reading the file back in, it does seem to update at first glance, but only if ...
0
votes
1
answer
34
views
Reading an Iceberg table in Dremio fails due to "is not a Parquet file" and "expected magic number"
I've got a Spark Structured Streaming job that reads data from Kafka and writes them to S3 (NetApp StorageGRID appliance, on-prem) as an Apache Iceberg table (via Nessie catalog).
Afterwards I access ...
1
vote
1
answer
16
views
Writing Apache Parquet files using Parquet-GLib
Does anyone have any pointers towards a somewhat complete example or representative source code as to how to actually use parquet-glib (the C bindings to reading and writing Apache Parquet files)? The ...
0
votes
1
answer
32
views
How to automatically get a table create statement for Redshift serverless from a Pandas dataframe
I have an S3 bucket which contains parquet files.
I need to analyse that parquet file and create the required table in Redshift serverless.
import pyarrow.parquet as pq
df = pq.read_table(f"s3://{...
0
votes
3
answers
50
views
How to update a Parquet file after reading from it - refreshByPath not working
I need to persist certain information into a parquet file to be accessed and updated during one batch job or the next (e.g. average values, slopes etc.).
I created a little test for a prototype:
class ...
0
votes
1
answer
63
views
Spark: large single Parquet file to Delta failure with Spark SQL
Cluster details
spark 3.4
5 executors
nodes with x16 cores and 112GB RAM
Parquet file details
provided via 3rd party
source file in adls
single 20GB .parquet file
68 million rows
1,599 columns
5034 ...
0
votes
1
answer
33
views
Open with Python an R data.table saved as metadata in a Parquet file
With R, I created a Parquet file containing a data.table as main data, and another data.table as metadata.
library(data.table)
library(arrow)
dt = data.table(x = c(1, 2, 3), y = c("a", "...