Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

0 votes
1 answer
38 views

Why do I get the "is not a Parquet file" error when creating a Parquet reader

Trying to create an AvroParquetReader for a Parquet file, reading a block blob in an Azure storage account, but getting an error - Caused by: java.lang.RuntimeException: InputBuffer@7a70b9e9 is not a ...
Developer208
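This error almost always means the reader was handed bytes that are not a complete Parquet file (the same symptom as the Dremio/Iceberg "expected magic number" question further down this page). A minimal, library-free sanity check, assuming a local copy of the blob ("sample.parquet" is a placeholder path):

```python
# A valid Parquet file starts AND ends with the 4-byte magic "PAR1".
# If either check fails, the blob is truncated, still being written,
# or not Parquet at all (e.g. Avro, CSV, or a compressed wrapper).
MAGIC = b"PAR1"

with open("sample.parquet", "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)  # 2 == os.SEEK_END: 4 bytes before the end of file
    tail = f.read(4)

print("looks like Parquet:", head == MAGIC and tail == MAGIC)
```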
2 votes
1 answer
47 views

Avoid writing partition column in Parquet file

I'm trying to export a DuckDB table to a Parquet file using hive partitioning. Unfortunately, DuckDB writes the partition both as a hive partition and as a column in the file. This doesn't seem correct, ...
Erica • 1,708
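Recent DuckDB releases expose a WRITE_PARTITION_COLUMNS option on COPY for exactly this; a minimal sketch through the Python client, assuming a DuckDB version that supports the option (table name and output path are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE sales AS SELECT i % 3 AS region, i AS amount FROM range(9) t(i)")

# With WRITE_PARTITION_COLUMNS false, 'region' should appear only in the
# hive-style directory names (region=0/...), not inside the data files.
con.execute("""
    COPY sales TO 'sales_out'
    (FORMAT PARQUET, PARTITION_BY (region), WRITE_PARTITION_COLUMNS false)
""")
```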
0 votes
1 answer
45 views

Pandas parquet file pyarrow.lib.ArrowMemoryError: malloc of size 106255424 failed

I am trying to run a Python script in a cPanel terminal. I am getting an error when the script tries to open a parquet file that is 46.65 MB in size. This worked on my home computer. df = pd.read_parquet(...
Shane S • 2,173
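Shared cPanel hosts typically cap per-process memory far below a desktop machine, and a 46 MB Parquet file can decompress to several times that size, so the malloc failure is plausible even though the file is small. One mitigation (a sketch; the path and batch size are placeholders) is to stream the file in record batches rather than loading it whole:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("big.parquet")
for batch in pf.iter_batches(batch_size=50_000):
    chunk = batch.to_pandas()  # one small DataFrame at a time
    # ...filter / aggregate / append results here...
```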
0 votes
0 answers
40 views

How to reduce storage space used in Redshift Serverless?

I have an S3 bucket which contains parquet files that are no more than 20 MB in size (all parquets combined). I'm loading these files into AWS Redshift Serverless tables using the COPY command. But after ...
Poreddy Siva Sukumar Reddy US
0 votes
0 answers
33 views

Best way to find multiple ids in list of files in spark scala

I have a list of IDs that I want to find in my parquet files. For each of the IDs I have an idea of which files they could be present in, i.e. I would have a mapping where I have ID1 -> file1, ...
Utkarsh Goel
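The question is tagged Scala, but the shape of one common answer is quick to sketch in PySpark: invert the ID→file mapping so each file is scanned once, with an isin filter that Parquet can push down (the column name id and the mapping are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical mapping from the question: which files may contain each ID.
id_to_files = {"ID1": ["file1.parquet"], "ID2": ["file1.parquet", "file2.parquet"]}

# Invert it so every file is read exactly once.
file_to_ids = {}
for id_, files in id_to_files.items():
    for path in files:
        file_to_ids.setdefault(path, []).append(id_)

parts = [
    spark.read.parquet(path).where(col("id").isin(ids))
    for path, ids in file_to_ids.items()
]
found = parts[0]
for part in parts[1:]:
    found = found.unionByName(part)
```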
0 votes
0 answers
16 views

Inferring schema from a pyarrow table created from a parquet file gives the wrong data type for a column

I have a parquet file which is loaded into a dataframe as below: df = pq.read_table("full_file.parquet") and if I check the column types in the schema, as below, for some columns it is ...
Poreddy Siva Sukumar Reddy US
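Without the file it's hard to say which type is wrong, but the usual remedy is to inspect the schema the file actually stores and cast the offending column explicitly; a sketch, where the column name amount and the target type are assumptions:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("full_file.parquet")
print(table.schema)  # the types physically stored in the file

# Replace the column with an explicitly cast version.
idx = table.schema.get_field_index("amount")
table = table.set_column(idx, "amount", table.column("amount").cast(pa.float64()))
```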
0 votes
0 answers
15 views

Greenplum PXF server: create external table based on a partitioned Parquet HDFS file, with partition columns in SELECT

I have a partitioned Parquet HDFS file and I want to create an external table in Greenplum with the partition columns in it. This is the HDFS file: CREATE TABLE productshelf.funnel ( system_product_id ...
user_Dima
0 votes
0 answers
23 views

How does pyarrow handle date partitions?

I store files on S3 in the following format: ../country/state/city/date=2024-08-02/12-00-57/time_series.parquet and the table contains multiple columns, one of which is named date of type pa.date64(). ...
zipp • 23
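pyarrow only types a hive-style date=... directory the way you tell it to; a sketch of declaring the partitioning schema explicitly when opening the dataset (the base path is a placeholder, and date32 is an assumption to adjust if you need date64):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare how the "date=YYYY-MM-DD" directory segment should be typed,
# instead of letting pyarrow infer it (or collide with the file column).
part = ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive")

dataset = ds.dataset("data/", partitioning=part)
print(dataset.schema)
```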
0 votes
0 answers
29 views

pd.to_parquet not working, but also is?

So I am trying to update a parquet file as well as a Google Sheet at the same time. It works, but also doesn't: it does seem to update the file at first glance when reading the file back in, but only if ...
kkkasza02
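Hard to diagnose from the excerpt, but the first thing worth ruling out is comparing against a cached in-memory object instead of what's actually on disk; a minimal check (placeholder path):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_parquet("data.parquet", index=False)

# Re-read from the path itself, not from any variable kept in memory.
check = pd.read_parquet("data.parquet")
print(check.equals(df))  # False here would point at the write, not the read
```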
0 votes
1 answer
34 views

Reading an Iceberg table in Dremio fails due to "is not Parquet file" and "expected magic number"

I've got a Spark Structured Streaming job that reads data from Kafka and writes them to S3 (NetApp StorageGRID appliance, on-prem) as an Apache Iceberg table (via Nessie catalog). Afterwards I access ...
chris922 • 366
1 vote
1 answer
16 views

Writing Apache Parquet files using Parquet-GLib

Does anyone have any pointers to a somewhat complete example or representative source code showing how to actually use parquet-glib (the C bindings for reading and writing Apache Parquet files)? The ...
CptPicard • 326
0 votes
1 answer
32 views

How to automatically get a table create statement for Redshift Serverless from a Pandas dataframe

I have an S3 bucket which contains parquet files. I need to analyse those parquet files and create the required table in Redshift Serverless. import pyarrow.parquet as pq df = pq.read_table(f"s3://{...
Poreddy Siva Sukumar Reddy US
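One way to approach this is to read only the Parquet schema and translate it into DDL; a rough sketch, where the Arrow→Redshift type map and all names are assumptions to adapt (decimals, dates, and other edge cases will need handling):

```python
import pyarrow.parquet as pq

# Partial, deliberately conservative type map; unknown types fall back
# to VARCHAR. Extend as needed.
TYPE_MAP = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "double": "DOUBLE PRECISION",
    "string": "VARCHAR(65535)",
    "bool": "BOOLEAN",
    "timestamp[us]": "TIMESTAMP",
}

schema = pq.read_schema("local_copy.parquet")  # placeholder path
cols = ",\n  ".join(
    f'"{field.name}" {TYPE_MAP.get(str(field.type), "VARCHAR(65535)")}'
    for field in schema
)
print(f"CREATE TABLE my_table (\n  {cols}\n);")
```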
0 votes
3 answers
50 views

How to update a Parquet file after reading from it - refreshByPath not working

I need to persist certain information into a parquet file to be accessed and updated during one batch job or the next (e.g. average values, slopes, etc.). I created a little test for a prototype: class ...
dermoritz • 12.8k
0 votes
1 answer
63 views

Spark: converting a large single Parquet file to Delta fails with Spark SQL

Cluster details: Spark 3.4, 5 executor nodes with 16 cores and 112 GB RAM each. Parquet file details: provided via a 3rd-party source, file in ADLS, single 20 GB .parquet file, 68 million rows, 1,599 columns, 5034 ...
Brian • 139
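With 1,599 columns in a single 20 GB file, the usual first move is to give the write side more, evenly sized tasks; a sketch (paths and the partition count are placeholders to tune):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("abfss://.../big.parquet")  # placeholder path

# Spread the 68M wide rows across many tasks before the Delta write,
# so no single executor has to buffer a huge slice of the file.
(df.repartition(400)
   .write.format("delta")
   .mode("overwrite")
   .save("abfss://.../delta/big_table"))  # placeholder path
```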
0 votes
1 answer
33 views

Open with Python an R data.table saved as metadata in a Parquet file

With R, I created a Parquet file containing a data.table as the main data, and another data.table as metadata. library(data.table) library(arrow) dt = data.table(x = c(1, 2, 3), y = c("a", "...
julien.leroux5
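On the Python side, Arrow's file-level key/value metadata is exposed on the table schema; a sketch (the file name is a placeholder, and the exact key the R arrow package uses, b"r" by convention, is an assumption to verify):

```python
import pyarrow.parquet as pq

table = pq.read_table("from_r.parquet")

# schema.metadata is a {bytes: bytes} dict (or None); the R arrow
# package serializes R attributes into one of these entries.
meta = table.schema.metadata or {}
for key, value in meta.items():
    print(key, value[:80])
```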
