Why Apache Doris is worth a look as a #log analysis solution ❓ 🏠 Storage efficiency: needs only 144 GB of space to store 1 TB of raw log data. ✍️ Write throughput: reaches a write speed of 500 MB/s when ingesting 1 TB of log data on a cluster of 3 machines (16 cores, 64 GB each). (Dataset from the Log and Telemetry Analytics #Benchmark by Microsoft #Azure) 📄 Text search: provides inverted indexes that are fine-grained to the row, enabling efficient full-text search. 🥪 Aggregation: a C++-based vectorized execution engine and an MPP distributed architecture deliver high performance. 🧑‍💼 Well-established distributed cluster management 🕸️ Seamless online scaling 🦑 High cluster availability #opensource #Elasticsearch #ClickHouse #database #bigdataanalytics
Apache Doris’ Post
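The row-level inverted index idea behind Doris's full-text search can be illustrated with a toy sketch in plain Python (hypothetical log lines, not Doris code): map each token to the set of row IDs that contain it, so a search only touches matching rows instead of scanning everything.

```python
from collections import defaultdict

# Hypothetical log table: row ID -> raw log line.
logs = {
    1: "connection timeout from 10.0.0.5",
    2: "user login success",
    3: "connection reset by peer",
}

# Build the inverted index: token -> set of row IDs containing it.
index = defaultdict(set)
for row_id, line in logs.items():
    for token in line.split():
        index[token].add(row_id)

def search(*tokens):
    """Return IDs of rows containing every query token."""
    ids = [index.get(t, set()) for t in tokens]
    return sorted(set.intersection(*ids)) if ids else []

print(search("connection"))           # [1, 3]
print(search("connection", "reset"))  # [3]
```

A real engine adds tokenization rules, compression, and posting-list skipping on top, but the lookup shape is the same.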
More Relevant Posts
-
🚀 Introducing Apache XTable 🚀 Apache XTable is transforming data management with its high-performance, scalable architecture for distributed systems. Key features include: - Distributed Architecture: Seamlessly scales for large datasets. - High Performance: Optimized for efficient data handling and retrieval. - Flexible Schema Management: Adapts to evolving data needs. - Advanced Query Capabilities: Supports complex analytics. Microsoft and Google have significantly contributed to XTable's development, enhancing its scalability and performance. XTable also excels in handling modern table formats, including: - Apache Iceberg: For managing large-scale tables with high performance and reliability. - Delta Lake: Provides ACID transactions and scalable metadata handling. Discover how Apache XTable, backed by industry leaders, can elevate your data management strategy! #ApacheXTable #BigData #DataManagement #Microsoft #Google #OpenSource #DataProcessing #Iceberg #DeltaLake #Analytics
-
Efficient data management tip: Apache Iceberg with AWS S3 1️⃣ Schema Evolution: Add, delete, or update columns without a sweat. No more full rewrites! 🔄 2️⃣ Boost Performance: Make your queries zoom like never before 🚀 3️⃣ Incremental Data Processing: Only mess with the data that's changed, not the whole pile ⏳ 4️⃣ Compatibility: Iceberg's cool with Spark, Flink, and friends 🔗 5️⃣ Rock-Solid Transactions: Handle ACID like a pro, keeping things smooth and tight 🔒 6️⃣ Scalability: Go big or go home, without slowing down 📈 7️⃣ Cost effective: Cut costs like a ninja with smart data pruning and partitioning 💰 Let me know what you think. #apacheiceberg #dataengineering
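The schema-evolution point (1️⃣) can be mimicked in plain Python: old records are never rewritten; a newly added column is simply resolved to a default when reading older rows. This is an illustrative sketch of the idea only, not Iceberg's metadata implementation.

```python
# Rows written under schema v1 (no "region" column) stay on disk untouched.
v1_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 7.5}]
# Rows written after the column was added under schema v2.
v2_rows = [{"id": 3, "amount": 3.2, "region": "eu"}]

current_schema = ["id", "amount", "region"]  # the evolved schema

def read(rows, schema):
    # Resolve missing columns to None at read time -- no full rewrite needed.
    return [{col: r.get(col) for col in schema} for r in rows]

for row in read(v1_rows + v2_rows, current_schema):
    print(row)
```

Iceberg does this by tracking column IDs in table metadata, so adds/renames/drops are metadata-only operations.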
-
Want to build a real-time data pipeline with Kafka and Snowflake? Blue Orange shows you how! This three-part series dives into building a powerful Change Data Capture (CDC) pipeline to replicate your Postgres databases into Snowflake in real-time. Learn how to: 1. Deploy Confluent Kafka on Kubernetes (even on your local machine!) 2. Automate table replication, schema evolution, and deduplication 3. Turn your data into a high-performance data lake Don't miss this step-by-step guide for building a robust and scalable data pipeline! https://lnkd.in/eTfVUdjC #Kafka #Snowflake #CDC #DataPipeline #Kubernetes #Postgres #RealTimeData #StreamFlake
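The deduplication in step 2 can be sketched in plain Python: for each primary key, keep only the most recent change event by timestamp, and drop keys whose latest event is a delete. The event shape here is hypothetical, not the series' actual code.

```python
# Hypothetical CDC change events for a users table.
events = [
    {"id": 1, "op": "INSERT", "ts": 100, "name": "alice"},
    {"id": 1, "op": "UPDATE", "ts": 105, "name": "alicia"},
    {"id": 2, "op": "INSERT", "ts": 101, "name": "bob"},
    {"id": 2, "op": "DELETE", "ts": 110, "name": None},
]

def deduplicate(events):
    latest = {}
    for e in events:
        prev = latest.get(e["id"])
        if prev is None or e["ts"] > prev["ts"]:
            latest[e["id"]] = e  # keep only the newest event per key
    # A key whose latest operation is a delete has no surviving row.
    return [e for e in latest.values() if e["op"] != "DELETE"]

print(deduplicate(events))  # only id 1 survives, with its latest values
```

In practice the same logic is expressed as a windowed "latest per key" query when merging the CDC stream into Snowflake.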
-
Accelerating Data Processing: Leveraging Apache Hudi with DynamoDB for Faster Commit Time Retrieval Code https://lnkd.in/eQa3Rysk #DataProcessing #ApacheHudi #DynamoDB #Efficiency #StayTuned
-
🔥 𝗔𝘇𝘂𝗿𝗲/𝗦𝗤𝗟 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀 𝗼𝗳 𝗮 𝗛𝗮𝘀𝗵 𝗦𝗽𝗶𝗹𝗹 with Hugo Kornelis 🔥 Join us on 𝗔𝗽𝗿𝗶𝗹 𝟮𝟱, 𝟮𝟬𝟮𝟰, 𝗮𝘁 𝟭𝟮:𝟬𝟬 𝗣𝗠 (𝗘𝗦𝗧) for an enlightening session on the “Five Stages of Grief - Internals of a Hash Spill” with SQL Server expert Hugo Kornelis. Do you know how SQL Server and Azure SQL run queries behind the scenes? Let's uncover the mysteries of the execution plan's Hash Match operator, learn about dynamic role reversal, grace hash join, bail-out, bit-vector filtering, and more. Whether you’re into E-Learning, Database Development, Machine Learning, SQL Server, or SQL Azure, this session is a must-attend! See you there! 🚀 #DataDrivenCommunity #SQLServer #Azure #AzureSQL Remember to RSVP here: https://lnkd.in/dBhXQ5G5
Five stages of grief - internals of a hash spill - Hugo Kornelis, Thu, Apr 25, 2024, 12:00 PM | Meetup
meetup.com
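The grace hash join mentioned above can be sketched in plain Python: when the build side won't fit in memory, both inputs are first partitioned by a hash of the join key, and each partition pair is then hash-joined independently so only one small partition's hash table is in memory at a time. This is an illustration of the algorithm, not SQL Server's internals.

```python
def grace_hash_join(build, probe, key, n_partitions=4):
    """Partition both inputs by hash(key), then hash-join each partition pair."""
    b_parts = [[] for _ in range(n_partitions)]
    p_parts = [[] for _ in range(n_partitions)]
    for row in build:
        b_parts[hash(row[key]) % n_partitions].append(row)
    for row in probe:
        p_parts[hash(row[key]) % n_partitions].append(row)

    out = []
    for b, p in zip(b_parts, p_parts):
        # Matching keys always hash to the same partition, so each pair
        # can be joined with a small in-memory hash table.
        table = {}
        for row in b:
            table.setdefault(row[key], []).append(row)
        for row in p:
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

custs = [{"cust": 1, "name": "ann"}, {"cust": 2, "name": "ben"}]
orders = [{"cust": 1, "total": 20}, {"cust": 2, "total": 35}]
print(grace_hash_join(custs, orders, "cust"))
```

A real spill writes the partitions to tempdb and recursively re-partitions any partition that still doesn't fit, which is what the "bail-out" and role-reversal details in the session cover.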
-
Working with big data tends to be challenging; Apache Arrow makes analytics workloads more efficient on modern CPU and GPU hardware. In this article, InfoWorld explains how Apache Arrow does for OLAP workloads what ODBC/JDBC did for OLTP workloads: it creates a common interface for different systems working with analytics data. https://bit.ly/47lb5So #ApacheArrow #InfluxDB #analytics
How Apache Arrow speeds big data processing
infoworld.com
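Arrow's columnar advantage for OLAP can be sketched in plain Python: storing each column as its own contiguous array means an aggregation scans one tightly packed array rather than hopping through every field of every row. This is a toy illustration of the layout idea, not Arrow's actual memory format.

```python
# Row-oriented layout: each record is a dict; summing one column means
# chasing a pointer into every row object (poor cache locality).
rows = [{"id": i, "price": i * 1.5, "qty": i % 3} for i in range(5)]
row_sum = sum(r["price"] for r in rows)

# Column-oriented layout: each column is one contiguous sequence; the
# aggregation touches only the "price" array (friendly to vectorized CPUs).
columns = {
    "id":    [r["id"] for r in rows],
    "price": [r["price"] for r in rows],
    "qty":   [r["qty"] for r in rows],
}
col_sum = sum(columns["price"])

assert row_sum == col_sum  # same answer, very different memory access pattern
print(col_sum)
```

Arrow standardizes that columnar layout in shared memory, which is why different engines can exchange analytics data without serialization, the way ODBC/JDBC standardized row-based access.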
-
Data Engineer || Data Analyst || Python || DataBricks || Spark-SQL || ML || Big Data || PySpark || PowerBI || DP-203 Certified
Caching in Spark: if we run multiple queries on the same table, caching is important to avoid reading the data from disk for every query. Syntax: CACHE LAZY TABLE <cache_table_name> AS SELECT * FROM employee_db.new_employee_data_table NOTE: LAZY is optional. Without LAZY, the statement creates the view and loads the table into memory immediately; with LAZY, view creation and data loading are deferred until the table is used for the first time. KEEP LEARNING 🙂 #spark #sparksql #dataengineering #azure
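The LAZY behavior described above can be mimicked in plain Python with deferred loading: an eager cache materializes up front, while a lazy cache postpones the expensive read until the first access and serves every later access from memory. This is a conceptual sketch, not Spark's cache manager.

```python
class LazyCache:
    """Defer an expensive load until the value is first requested."""
    def __init__(self, loader):
        self._loader = loader
        self._value = None
        self.loaded = False

    def get(self):
        if not self.loaded:            # first access triggers the load
            self._value = self._loader()
            self.loaded = True
        return self._value             # later accesses hit the cached value

def load_table():
    print("reading table from disk...")  # runs only once
    return [("alice", 1), ("bob", 2)]

cache = LazyCache(load_table)
assert not cache.loaded   # like CACHE LAZY TABLE: nothing read yet
first = cache.get()       # first query loads and caches
second = cache.get()      # served from memory
assert first is second
```

Eager caching (no LAZY) corresponds to calling `cache.get()` immediately after construction.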
-
In the world of big data and distributed computing, we all know the crucial role of Apache Spark, one of the most important frameworks. However, when dealing with large amounts of data, it is crucial to understand the concept of skewness. Data skewness in Apache Spark refers to a condition where data is not uniformly distributed across all partitions, which results in slower processing, inefficient use of resources, and even out-of-memory errors. Common causes are an inadequate partitioning strategy and skew-prone join or groupBy operations. Below you can find strategies for handling it: 1. Custom partitioning: instead of relying on default partitioning, implement a custom partitioning strategy. 2. Salting: a technique where a random value (salt) is appended to the key. 3. Dynamic partition pruning: optimizes join operations by skipping the scanning of irrelevant partitions in both datasets. 4. Splitting skewed data: identify the skewed keys and redistribute the data associated with them. 5. Avoid groupByKey on large datasets: on large datasets with non-unique keys, alternatives such as reduceByKey can be more efficient. Keep these strategies in your toolkit to significantly improve the performance of your Spark applications. #apachespark #bigdata #dataengineer #dataengineering
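Strategy 2 (salting) can be sketched in plain Python: appending a random salt to a hot key spreads its records across several buckets for the expensive shuffle stage, and a second pass strips the salt and combines the partial results. This is a toy illustration of the idea, not PySpark code.

```python
import random

random.seed(0)
N_SALTS = 4

# A skewed dataset: one hot key dominates.
records = [("hot", 1)] * 12 + [("cold", 1)] * 3

# Salting: append a random suffix so "hot" spreads over N_SALTS buckets.
salted = [((k, random.randrange(N_SALTS)), v) for k, v in records]

# Stage 1: partial aggregation per salted key (runs in parallel in Spark,
# so no single task gets all 12 "hot" records).
partial = {}
for key, v in salted:
    partial[key] = partial.get(key, 0) + v

# Stage 2: strip the salt and combine the small partial sums.
final = {}
for (k, _salt), v in partial.items():
    final[k] = final.get(k, 0) + v

print(final)  # {'hot': 12, 'cold': 3} -- same totals, far less skew per bucket
```

The trade-off is a second shuffle over the (much smaller) partial results, which is usually cheap compared to one task grinding through the entire hot key.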
-
Need to handle petabytes of data? 🚀 Elasticsearch's distributed nature lets it scale across hundreds of servers with ease. 💪 It's a perfect match for businesses managing big data. 🏢 Plus, its RESTful operations support makes it a breeze to integrate with your existing applications, boosting operational efficiency. 💼 And let's not forget its powerful analytics capabilities - a game-changer for driving innovation and growth from your data insights. 📈 #Elasticsearch #BigData #BusinessGrowth #DataManagement #OperationalEfficiency #DataInsights #Innovation