Database Systems
In 2024, the focus around databases is on their ability to scale and perform in modern data architectures. It's no longer centered solely on distributed, cloud-centric data environments, but rather on databases built and managed in a way that allows them to be used optimally in advanced applications. This modernization of database architectures allows developers and organizations to be more flexible with their data. With the advancements in automation and the proliferation of artificial intelligence, the way data capabilities and databases are built, managed, and scaled has evolved at an exponential rate. This Trend Report explores database adoption and advancements, including how to leverage time series databases for analytics, why developers should use PostgreSQL, modern real-time streaming architectures, database automation techniques for DevOps, how to take an AI-focused pivot within database systems practices, and more. The goal of this Trend Report is to equip developers and IT professionals with tried-and-true practices alongside forward-looking industry insights so they can modernize and future-proof their database architectures.
Speak in your natural language, ask questions about your data, and have the answers returned to you in your natural language as well: that's the objective, and it's what I'll show in this quick blog while, as always, providing the full source repos. I'll leave the use cases up to you from there. You can learn more about these Oracle Database features here for the free cloud version and here for the free container/image version. Also, you can check out the Develop with Oracle AI and Database Services: Gen, Vision, Speech, Language, and OML workshop, which explains how to create this application and numerous other examples, as well as the GitHub repos that contain all the source code. Now, let's get into it. First, I'll show the setup for the Select AI database side (which, in turn, calls the Gen AI service), then the OCI Real-time Speech AI Transcription service, and finally the front-end Python app that brings it all together.

Oracle Database NL2SQL/Select AI (With Gen AI)

While Oracle Database version 23ai contains a number of AI features such as vector search, RAG, Spatial AI, etc., NL2SQL/Select AI was introduced in version 19. We have a stateless Python application, so we'll be making calls to:

DBMS_CLOUD_AI.GENERATE(
    prompt => :prompt,
    profile_name => :profile_name,
    action => :action)

Let's look at each of these three arguments. The prompt is the natural language string, starting with "select ai." The profile_name is the name of the AI profile created in the database for OCI Generative AI (or whatever AI service is being used) with the credential info and, optionally, an object_list with meta info about the data. Oracle Autonomous Database supports models from OCI Generative AI, Azure OpenAI, OpenAI, and Cohere. In our sample app, we use the Llama 3 model provided by OCI Generative AI. Here is example code to create the profile:

PLSQL
dbms_cloud_admin.enable_resource_principal(username => 'MOVIESTREAM');

dbms_cloud_ai.create_profile(
    profile_name => 'genai',
    attributes =>
        '{"provider": "oci",
          "credential_name": "OCI$RESOURCE_PRINCIPAL",
          "comments": "true",
          "object_list": [
              {"owner": "MOVIESTREAM", "name": "GENRE"},
              {"owner": "MOVIESTREAM", "name": "CUSTOMER"},
              {"owner": "MOVIESTREAM", "name": "PIZZA_SHOP"},
              {"owner": "MOVIESTREAM", "name": "STREAMS"},
              {"owner": "MOVIESTREAM", "name": "MOVIES"},
              {"owner": "MOVIESTREAM", "name": "ACTORS"}
          ]
         }'
);

Finally, the action is one of four options for the interaction/prompt and the type/format of the answer returned by Oracle Database's Select AI feature. In our sample app we use narrate; however, we could use any of the others:

narrate returns the reply as a narration in natural language.
chat returns the reply as a chat exchange in natural language.
showsql returns the raw SQL for the answer/query.
runsql generates the SQL, runs it, and returns the raw query results.

OCI Real-Time Speech Transcription

OCI Real-time Speech Transcription is expected to be released within the month and includes Whisper-model multilingual support with diarization capabilities. Using this service simply requires that certain policies are created to provide access for a given user/compartment/group/tenancy. These can be specified at various levels and would generally be more restricted than the following, but this gives a list of the resources needed.
Plain Text
allow any-user to manage ai-service-speech-family in tenancy
allow any-user to manage object-family in tenancy
allow any-user to read tag-namespaces in tenancy
allow any-user to use ons-family in tenancy
allow any-user to manage cloudevents-rules in tenancy
allow any-user to use virtual-network-family in tenancy
allow any-user to manage function-family in tenancy

The options for accessing the service from an external client are essentially the same as for accessing any OCI/cloud service. In this case, we use an OCI config file and generate a security_token using the following:

Shell
oci session authenticate ; oci iam region list --config-file /Users/YOURHOMEDIR/.oci/config --profile MYSPEECHAIPROFILE --auth security_token

From there, it's just a matter of using the preferred SDK client libraries to call the speech service. In our case, we are using Python.

The Python App

Here is the output of our application, where we can see:

A printout of the words (natural language) spoken into the microphone and transcribed by the real-time transcription service.
The trigger of a Select AI command, with the "narrate" action, in response to the user saying "select ai."
The results of the call to the Oracle Database Select AI function, returned in natural language.

Let's take the application step by step. First, we see the Python imports:

asyncio for the event processing loop
getpass to get the database and wallet/ewallet.pem passwords from the application prompt
pyaudio for processing microphone events/sound
oracledb, the thin driver for accessing the Oracle database and making Select AI calls
oci SDK core and speech libraries for real-time speech transcription calls

Python
import asyncio
import getpass
import pyaudio
import oracledb
import oci
from oci.config import from_file
from oci.auth.signers.security_token_signer import SecurityTokenSigner
from oci.ai_speech_realtime import (
    RealtimeClient,
    RealtimeClientListener,
    RealtimeParameters,
)

Then we see the main loop, where sound from the microphone is fed to the OCI real-time speech transcription API client and on to the cloud service via WebSocket. The client is created by specifying the OCI config mentioned earlier along with the URL of the speech service and the compartment ID.
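The main loop shown next assumes a config dictionary and an authenticator() helper that are not part of the listing. As a minimal sketch (not from the original article), they could be built with the OCI Python SDK's session-token flow roughly like this, where the profile name and file paths are placeholders for your own setup:

Python
# Hedged sketch: build the OCI config and token signer that the main loop below expects.
# MYSPEECHAIPROFILE and the config path are placeholders for your own environment.
import oci
from oci.config import from_file
from oci.auth.signers.security_token_signer import SecurityTokenSigner

config = from_file("~/.oci/config", "MYSPEECHAIPROFILE")

def authenticator():
    # Read the session token created earlier by "oci session authenticate"
    with open(config["security_token_file"], "r") as f:
        token = f.read()
    # Load the private key referenced by the profile
    private_key = oci.signer.load_private_key_from_file(config["key_file"])
    # The signer authenticates the WebSocket requests made by RealtimeClient
    return SecurityTokenSigner(token, private_key)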
Python
def message_callback(message):
    print(f"Received message: {message}")

realtime_speech_parameters: RealtimeParameters = RealtimeParameters()
realtime_speech_parameters.language_code = "en-US"
realtime_speech_parameters.model_domain = (
    realtime_speech_parameters.MODEL_DOMAIN_GENERIC
)
realtime_speech_parameters.partial_silence_threshold_in_ms = 0
realtime_speech_parameters.final_silence_threshold_in_ms = 2000
realtime_speech_parameters.should_ignore_invalid_customizations = False
realtime_speech_parameters.stabilize_partial_results = (
    realtime_speech_parameters.STABILIZE_PARTIAL_RESULTS_NONE
)

realtime_speech_url = "wss://realtime.aiservice.us-phoenix-1.oci.oraclecloud.com"
client = RealtimeClient(
    config=config,
    realtime_speech_parameters=realtime_speech_parameters,
    listener=SpeechListener(),
    service_endpoint=realtime_speech_url,
    signer=authenticator(),
    compartment_id="ocid1.compartment.oc1..MYcompartmentID",
)

loop = asyncio.get_event_loop()
loop.create_task(send_audio(client))
loop.create_task(check_idle())
loop.run_until_complete(client.connect())

if stream.is_active():
    stream.close()

If the transcribed speech contains "select ai", the application waits for 2 seconds, and if there is no further speech, it takes the command from "select ai" onward and sends it to the database server using the Oracle Python driver. The following is the code for creating the connection and executing the command using the DBMS_CLOUD_AI.GENERATE(prompt, profile_name, action) call described earlier.

Python
pw = getpass.getpass("Enter database user password:")

# Use this when making a connection with a wallet
connection = oracledb.connect(
    user="moviestream",
    password=pw,
    dsn="selectaidb_high",
    config_dir="/Users/pparkins/Downloads/Wallet_SelectAIDB",
    wallet_location="/Users/pparkins/Downloads/Wallet_SelectAIDB"
)

def executeSelectAI():
    global cummulativeResult
    print(f"executeSelectAI called cummulative result: {cummulativeResult}")
    # for example prompt => 'select ai I am looking for the top 5 selling movies for the latest month please',
    query = """SELECT DBMS_CLOUD_AI.GENERATE(
                   prompt => :prompt,
                   profile_name => 'openai_gpt35',
                   action => 'narrate')
               FROM dual"""
    with connection.cursor() as cursor:
        cursor.execute(query, prompt=cummulativeResult)
        result = cursor.fetchone()
        if result and isinstance(result[0], oracledb.LOB):
            text_result = result[0].read()
            print(text_result)
        else:
            print(result)
    # Reset cummulativeResult after execution
    cummulativeResult = ""

Video

A walkthrough of this content can also be viewed here:

Concluding Notes

The next logical step, of course, is to add text-to-speech (TTS) functionality for the reply, and OCI has a new service for that as well. I'll post an updated example including this in the near future. Thank you for reading, and please do not hesitate to contact me with any questions or feedback you may have. I'd love to hear from you.
During my 10+ years of experience in Agile product development, I have seen the difficulties of meeting the rapid requirements of the digital market. Manual procedures can slow down highly flexible software engineering and delivery teams, resulting in missed chances and postponed launches. With AI and Large Language Models (LLMs) becoming more prevalent, we are on the verge of a major change. Gartner points out a 25% increase in project success rates for those using predictive analytics (Gartner, 2021). These technologies are changing the way agile product development is optimized - by automating tasks, improving decision-making, and forecasting future trends. As stated in a report from McKinsey, companies using AI experience a 20% decrease in project costs (McKinsey & Company, 2023). In this article, I discuss how agile product development, including experiences and user journeys, can be improved with AI and LLM integrations across the development lifecycle.

Also Read: "The Foundation of AI and Analytics Success: Why Architecture Matters"

AI and LLM Integration Phases for Agile Product Development

Automating User Story Generation

Creating user stories is crucial for Agile development, although it can be time-consuming. LLMs such as OpenAI's GPT-4, for example, can streamline the process by creating comprehensive user stories from available documentation and feedback. This speeds up the process while also enhancing precision and relevance.

Application Scenario

For example, I focus on utilizing AI- or LLM-based methods for streamlining, optimizing, and automating the creation of user stories. Integrating such methods with a comprehensive backlog has allowed me to improve product development lifecycles and engineering prioritization. This significantly reduces user story creation time, which is also helpful for solutions architects, and increases user satisfaction because feature development becomes more relevant and accurate.

Significance and Advantages

The automation of generating user stories is essential as it reduces the monotonous job of creating stories by hand, enabling product managers and software engineers to concentrate on more strategic tasks. This process guarantees that user stories are created uniformly and in line with user requirements, resulting in improved prioritization and quicker development cycles, assisting agile teams in sustaining their progress and releasing features that better align with user needs. Additionally, organizations that adopt AI for generating user stories usually see a 50% reduction in story creation time (Menzies & Zimmermann, 2022).

Also Read: "User Story Reflections"

Optimizing Backlog Prioritization

Key to swift value delivery is effective prioritization of the backlog. AI algorithms analyze user feedback, market trends, and technical dependencies to forecast the most valuable features. This data-driven approach assists product managers in making well-informed choices.

Application Scenario

For example, during the development of a digital healthcare consumer platform, I utilized AI tools to review user feedback and determine which backlog items to focus on first. This was mapped across different prioritization techniques as well as how engineering would execute them based on complexity. As a result, there was a 40% rise in feature utilization and a 20% decrease in feature development duration, which also helped the software engineering team improve their metrics.
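To make this concrete, here is a minimal, hypothetical sketch of how an LLM can be asked to score backlog items against a feedback summary. It assumes the openai Python SDK; the model name, backlog items, and scoring rubric are illustrative placeholders, not the actual tooling used in the scenario above.

Python
# Hedged sketch: score backlog items by expected user value using an LLM.
# Assumes the openai SDK is installed and OPENAI_API_KEY is set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

backlog = [
    "Add dark mode to the patient dashboard",
    "Export lab results as PDF",
    "Reduce appointment-booking steps from 5 to 2",
]
feedback_summary = "Users complain most about how long it takes to book an appointment."

def score_item(item: str) -> str:
    prompt = (
        "You are a product manager. Given this user feedback summary:\n"
        f"{feedback_summary}\n"
        f"Rate the backlog item below from 1 (low value) to 10 (high value) "
        f"and give a one-sentence rationale.\nItem: {item}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for item in backlog:
    print(item, "->", score_item(item))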
Significance and Advantages

It is crucial to prioritize backlog optimization in order to make informed decisions that improve the value of the product and customer satisfaction. Utilizing AI for prioritization aids agile teams in determining which features will yield the greatest benefit, enabling them to utilize resources effectively and concentrate on tasks with significant impact. Companies that have implemented AI for prioritizing their backlog have seen a 40% growth in feature adoption (Buch & Pokiya, 2020).

Leveraging Predictive Analytics

Predictive analytics offers insight to help shape development tactics. AI models can predict risks and estimate delivery times by examining historical data, helping teams address issues and align development efforts with market changes. Further, this can help agile product development teams assess how to staff across sprints and ensure workforce optimization to improve feature velocity.

Application Scenario

For example, I use predictive analytics in collaboration with engineering development and delivery teams to predict how new features would affect sprint planning, sprint allocation, and user engagement. The information assisted in determining which updates were most important, as well as which needed to be executed in upcoming sprints, and has allowed me to optimize MVPs, resulting in a ~25% rise in user retention and a ~15% increase in new user acquisition across two different products.

Significance and Advantages

Predictive analytics offers practical insights that steer strategic choices in flexible product development. Teams can prioritize new features that will have the greatest impact on user engagement and retention by predicting their effects. Businesses that use predictive analytics have observed a 25% rise in customer retention (Forrester, 2019).

Improving Product Experiences and User Journeys

AI and LLMs improve user journeys and product experiences through a more user-focused approach to development. Automated creation of user stories guarantees that features are developed according to genuine user requirements, resulting in products that are more intuitive and engaging. This alignment improves user satisfaction and involvement by customizing features to meet specific needs and desires.

Use Case

For example, I used LLMs to analyze user feedback and create features that directly addressed user pain points. This resulted in streamlining and optimizing how different product features are lined up, along with tech debt, for engineering execution. I have seen a ~35% increase in user engagement and a significant reduction in user churn rates.

Significance and Advantages

Improving product experiences and user journeys with AI and LLMs ensures a user-focused approach in product development, resulting in more user-friendly and personalized experiences. Aligning with user needs not only boosts satisfaction but also enhances engagement and retention. After incorporating AI-driven improvements, companies have experienced a 35% rise in user engagement (Ransbotham, Kiron, Gerbert, & Reeves, 2018).

Supporting Agile Product Development and Product Management

Incorporating AI and LLMs into agile product development changes how teams tackle and carry out projects, providing numerous advantages. To begin with, these technologies simplify the process of developing user stories, cutting down on manual work and allowing more time for strategic duties. This results in enhanced precision and relevance in feature development.
Also, by using AI to prioritize the backlog, teams can concentrate on important tasks, leading to better use of resources and increased overall productivity. Predictive analytics adds value by forecasting feature performance, allowing teams to make educated decisions that increase user retention and engagement. From my own experience, I've noticed that these advancements not only speed up the process of development but also make products better suited to user requirements, resulting in a more agile and adaptable development setting. The integration of AI in agile product development leads to improved product management, faster iterations, and enhanced user experience. For example, the global AI-assisted custom application development market is expected to grow to as much as $61B, with adoption rising from 21% to 28%, by 2024 (Deloitte Insights, 2020).

As a product manager working across multiple software engineering teams, AI and LLMs have helped me simplify decision-making by automating routine tasks and providing actionable insights. Automated user story generation and backlog prioritization free up time to focus on strategic aspects, while predictive analytics offers data-driven forecasts and trend analysis. This results in a more agile and responsive product management process, where decisions are guided by comprehensive data and real-time insights, ultimately leading to more successful product outcomes and better market alignment.

Benefits of AI and LLMs for Agile Product Development

Conclusion and Next Steps

The incorporation of AI and LLMs in agile product development amounts to a dynamic revolution. In my opinion, these tools have revolutionized the way tasks are done by automating them, streamlining processes, and forecasting trends accurately. They have made workflows more efficient and enhanced product experiences, resulting in more agile and responsive development cycles. As we further adopt and improve these technologies, I look forward to seeing how their developing abilities will continue to change our strategy for creating and delivering outstanding products. The process of incorporating AI and LLMs into agile product development methods is indeed exciting and filled with potential.

Key Takeaways

Adopt AI and LLM tools: Start using AI and LLM tools to automate and improve the generation of user stories and to prioritize backlogs in your development processes.
Utilize predictive analytics: Employ predictive analytics to gain insight into potential project risks and market trends, enabling proactive modifications.
Prioritize user-centric development: Utilize AI-generated insights to enhance product experiences for better user satisfaction and retention.
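The second takeaway can start small: even a baseline forecast fitted to historical sprint data gives a data point for planning conversations. Below is a minimal, hypothetical sketch using scikit-learn with invented numbers; a real team would substitute its own delivery history and richer features.

Python
# Hedged sketch: forecast next-sprint throughput from historical sprint data.
# All numbers are made up for illustration; replace them with real delivery history.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per past sprint: [team size, planned story points]; target: completed story points
X = np.array([
    [5, 40],
    [5, 45],
    [6, 50],
    [6, 55],
    [7, 60],
])
y = np.array([34, 38, 44, 47, 52])

model = LinearRegression().fit(X, y)

# Forecast for an upcoming sprint with 7 engineers and 58 planned points
next_sprint = np.array([[7, 58]])
print(f"Forecast completed points: {model.predict(next_sprint)[0]:.1f}")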
Forget what you think you know about AI. It's not just for tech giants and universities with deep pockets and armies of engineers and grad students. The power to build useful intelligent systems is within your reach. Thanks to incredible advancements in Large Language Models (LLMs) – like the ones powering Gemini and ChatGPT – you can create AI-driven products that used to require a team of engineers. In this series, we'll demystify the process of building LLM-powered applications, starting with a delicious use case: creating a personalized AI meal planner.

Our Use Case

As an example use case for our journey, we're going to be building a meal-planning app. There's no shortage of meal plans available online, including those customized for different needs (varying goals, underlying health conditions, etc.). The problem is that it's often difficult (sometimes impossible) to find guidance tailored specifically for you without hiring a health professional. Let's consider a realistic example: Sarah, a 32-year-old software engineer, is training for her first marathon. She needs a meal plan that not only meets her increased caloric needs but also accounts for her lactose intolerance and preference for plant-based proteins. Traditional meal planning apps struggle with this level of customization, making this a perfect application for an LLM-powered solution that could easily generate a tailored plan, adjusting macronutrients and suggesting specific foods that meet all of Sarah's requirements. In this tutorial, we'll aim to develop a model that can take in a variety of inputs (age, height, weight, activity level, dietary restrictions, personal preferences, etc.) and generate a delicious and nutritious meal plan tailored specifically to the user.

What We'll Cover

In this article, we'll walk step by step through the creation of the application. We'll cover data preparation, the model lifecycle, and finally how to wrap it all together into a usable product.

In "Part 1: The Right Ingredients: Dataset Creation," we'll set the foundation for the quality of our model by constructing a dataset specific to our use case. We'll discuss why data is so important, the various ways of preparing a dataset, and how to avoid common pitfalls by cleaning your data.
In "Part 2: Shake and Bake: Training and Deploying Your LLM," we'll actually go through the process of using our dataset to train a new model that we can interact with. Then, we'll deploy the model on the cloud.
In "Part 3: Taste Testing and Fine-tuning: Evaluating Your Meal Planning Bot," we'll explore the science of evaluating an LLM and determining whether it meets our goals or not. We'll set up a rough evaluation that we'll use to look at our own model.
In "Part 4: Building an Interface: Presenting Your Masterpiece," we'll bring everything together in a working application that we'll deploy to the cloud. We'll also discuss how to think about exposing your model to the world and real users.
We'll close with "Part 5: Beyond the Plate: Conclusion and Next Steps," where we'll reflect on the experience of putting an LLM-powered application together and putting it out in the world. We'll also consider some next-step actions we can take from there.

Part 1, The Right Ingredients: Dataset Creation

Software engineering is an excellent metaphor for modeling. We'll even use it heavily later on in this post. However, when it comes to using data to modify model performance, there's hardly a better analogy than sculpting.
The process of creating a sculpture from solid material is generally rough shaping, followed by successive rounds of refinement, until the material has "converged" to the artist's vision. In this way, modeling involves starting with a featureless blob of 1s and 0s and slowly tuning it until it behaves in the way that the modeler intends. Where the sculptor may pick up various chisels, picks, or hammers, however, the modeler's tool is data.

This particular tool is immensely versatile. It can be used to inject new knowledge and domain understanding into a model by training on subject matter content and examples, or by connecting to external systems as in Retrieval Augmented Generation (RAG). It can also be used to debug a model, teaching it to behave in a particular way in specific edge-case scenarios. Inversely, it can be used to "unlearn" certain behaviors that were introduced in prior rounds of training. Data is also useful as an experimentation tool to learn about model behavior or even user behavior. With these applications and more, it should be clear that data is nearly everything when it comes to modeling. In this section we'll provide a comprehensive overview of how to create a dataset for your use case, including:

Understanding how data is used across the model lifecycle
Defining your requirements
Creating the dataset
Preparing the dataset for model training

Data Across the Model Lifecycle

It's tempting to believe that the power of a Large Language Model (LLM) rests solely on its size – the more parameters, the better. But that's only part of the story. While model size plays a role, it's the quality and strategic use of data that truly unlocks an LLM's potential. Think of it this way: you can give a master chef a mountain of ingredients, but without the right recipe and techniques, the result won't be a culinary masterpiece. Let's explore the key stages where data shapes the mind of an LLM, transforming it from a blank slate into a powerful and versatile AI:

1. Pretraining: Building a Broad Knowledge Base

Pretraining is like sending your LLM to an all-you-can-eat buffet of knowledge. We flood the model with massive datasets of text and code, exposing it to the vastness of the internet and more. This is where the LLM learns fundamental language patterns, absorbs a wide range of concepts, and develops its impressive ability to predict what comes next in a sentence or piece of code.

2. Supervised Fine-Tuning (SFT): Developing Specialized Expertise

Once the LLM has a solid foundation, it's time to hone its skills for specific tasks. In Supervised Fine-Tuning (SFT), we provide the model with carefully curated datasets of prompt-response pairs, guiding it toward the desired behavior. Want your LLM to translate languages? Feed it examples of translated text. Need it to summarize documents? Provide it with well-crafted summaries. SFT is where we mold the LLM from a generalist into a specialist.

3. Reinforcement Learning (RL): Refining Behavior Through Feedback

Reinforcement Learning (RL) is all about feedback and optimization. We present the LLM with choices, observe its decisions, and provide rewards for responses that align with our goals. This iterative process helps the model learn which responses are most favorable, gradually refining its behavior and improving its accuracy.

4. In-Context Learning: Adapting to New Information

Real-world conversations are full of surprises, requiring LLMs to adapt to new information on the fly.
In-context learning allows LLMs to process novel information presented within a conversation, even if it wasn't part of their initial training data. This adaptability makes LLMs more dynamic and better equipped to handle the unexpected. 5. Retrieval Augmented Generation (RAG): Expanding Knowledge Horizons Sometimes, LLMs need access to information that extends beyond their training data. Retrieval Augmented Generation (RAG) bridges this gap by linking the LLM to external databases or knowledge repositories. This enables the model to retrieve up-to-date information, incorporate it into its responses, and provide more comprehensive and insightful answers. Data: The Key To Unlocking LLM Potential From its foundational understanding of language to its ability to adapt, learn, and access external knowledge, data shapes every facet of an LLM's capabilities. By strategically utilizing data throughout the model's lifecycle, we unlock its true potential and power the development of truly transformative AI applications. Defining Your Requirements You’ve read the theory and understand the importance of data and all the ways it can be used. Are we ready to start creating our dataset? Well, not quite so fast. We need to make sure we understand the problem space and use that to figure out what data we even need. User Experience Human-Centered Design is a principle that involves always starting with the user and their need in mind (instead of technology, policy, or other extraneous factors). This can be a very exciting and rewarding activity to better understand target users and how to serve them. Making sure the user experience expectations are clear can also de-risk a modeling project by making sure everyone on the team is attuned to the same definition of success. Some questions to ask while clarifying the UX include: What information do we need from users? Will information be provided open-ended or in some structured format? How should the model respond to prompts with incomplete information? Should our output be structured, or in prose? Should we always generate an output, or sometimes ask the user for clarification or more information? In our case, we’ll stick with open-ended inputs and structured outputs to allow user flexibility while maintaining predictability. We’ll avoid follow-ups to reduce the complexity of our proof of concept. Various techniques and guides exist elsewhere to assist modeling teams in crafting better requirements through a better understanding of their users. Entity Relationship Diagrams ER diagrams show all the entities and relationships involved in a system, and are an extremely powerful tool for understanding systems, use cases, and the like. Painting a picture of our use case, we can use ERDs to hone in on exactly what data we need to capture while making sure we don’t have any blind spots. The process of creating an ER diagram is quite simple: write out all the entities (nouns) you can think of related to your app. Then write out the relationships between them, and that’s it! In reality, this is done over several rounds, but it creates a rich tool useful for both understanding and communicating your system. Below is the ER diagram we crafted for RecipeBuddy: While ours is quite simple, ER Diagrams can get quite complex. Dataset Attributes Wait! There’s still more we need to decide on in terms of our dataset. Below are a few considerations, but you’ll have to think deeply about your use case to make sure you cover all the bases for your dataset. 
Dataset Type

In this series, we're sticking to collecting and training on SFT data, but as we covered earlier, there are many different types of data to train on.

Input and Output Attributes

The number of variables to consider on the input and to generate on the output are important considerations in modeling, and are an indicator of the complexity of your use case. Great care should be taken in deciding this, as it will impact the diversity of scenarios you'll need to cover in your data and the volume of data you need to collect (which will also impact the required compute and thus the cost of training your model). In our case, let's use the following inputs:

Age
Height
Weight
Activity level
Dietary restrictions
Personal preferences/Goals

On the output, let's include a daily meal plan for multiple meals, with specific guidance for each meal:

Breakfast
Lunch
Dinner
Snack 1
Snack 2

For each meal:

Carbs
Chicken/Fish/Meat
Whey Protein
Veggies
Oil/Fat

Distribution

For each attribute that you are exploring, you should consider the natural diversity of that attribute. Highly diverse attributes require a lot more data to adequately cover than bounded ones. For example, consider creating a dataset that allows users to ask about elements in the periodic table. Simple: there are only so many elements in the periodic table. Now consider an LLM that is trained to identify all possible compounds that can be formed from a given list of elements. For any given input, the number of possible outputs is effectively infinite, making this a much more challenging task. Additionally, note that the more diverse your training data, the better the model will be able to generalize concepts even to examples that it hasn't seen in the training corpus. For our proof of concept, we won't exhaust the distribution of each attribute, instead focusing on a finite number of examples.

Edge Cases

As you define your requirements, you may also wish to identify specific edge cases that you wish to avoid. In our case, let's avoid answering any questions when the user is pregnant, and instead direct them to seek help from a professional. We now have a decent spec for our data collection task, except for one thing: how much data do we need? As we described earlier, this is determined by a combination of input/output attributes, the distributions of those, and the number of edge cases we want to handle. One way to quickly get a sense of how many values you need is by considering a simple formula: For each input attribute, assess how many "buckets" the values could fall into. Age, for example, might be 0-18, 18-40, 40-60, or 60+, so 4 buckets. Across all your attributes, multiply the number of buckets together, then add the number of edge cases. For instance, 4 age buckets × 3 activity-level buckets × 5 dietary-restriction buckets would suggest roughly 60 examples, plus one per edge case. This is one way to roughly gauge how much data you need to fully cover your use case and can be a starting point to think about what data you want to exclude or where you don't want to consider the distribution of a particular attribute.

Creating the Dataset

Now we're ready to start collecting data! But we have a few options, and we'll have to decide on a path forward. Essentially, there are two ways we can go about collecting data: using existing data or creating new data.

Using Existing Data

Gather first-party data from relevant communities or internal sources. Surveys, internal data sources, and crowdsourcing can be used to gather first-party data.

Pros: This is likely the closest you can get to "ground truth" data and thus the highest-quality data that you might be able to collect.
Cons: Unless you already have access to a dataset, constructing new datasets in this way can be slow and time-consuming. If there is personally identifiable information in your dataset, you'll also need to build in assurances to ensure your data providers' privacy is not compromised.

Collect third-party data from public datasets, data providers, or web scraping. Existing datasets can be found online, purchased from data brokers, or scraped directly from the web, and this can be a powerful way to leverage data that has already been collected.

Pros: This method can be a great way to collect a large volume and diversity of real-world, human-submitted data.
Cons: It can be difficult to ensure individual privacy when using third-party datasets. Additionally, some data collection methods, like web scraping, can violate some sites' terms of service.

Creating New Data

Human generated: You can obviously write your own prompt/response demonstrations to train the model. To scale, you can even partner with data companies (e.g., Surge, Scale) to create human-generated data at scale.

Pros: Human judgment can be useful to make sure generated data makes sense and is useful.
Cons: Having humans write data can be costly and time-intensive. Add in various levels of quality control, and human data becomes a complex operation.

Synthetically generated: You can also simply ask an LLM to generate the data for you.

Pros: This is a cheap method that can scale to large numbers of examples very quickly.
Cons: Models are not able to outperform themselves, so synthetic data often just causes the model to regress to the mean. While this can be addressed by testing different models for the data generation step, it can also introduce hallucinations and errors into your dataset that would be easy for a human to spot, but hard for the LLM to catch.

Hybrid: A powerful technique is to combine human and synthetic data generation by having humans and models successively rewrite each other's outputs.

Pros: Takes the best of human and LLM generation. Can possibly outperform the model.
Cons: While this is a good compromise, it still involves a fair amount of complexity and effort to get right.

Choosing the Right Method for Your Project

Selecting the best data creation method depends on various factors:

Project scope and timeline
Available resources (budget, manpower, existing data)
Required data quality and specificity
Privacy and legal considerations

For our meal planning bot, we're opting for synthetic data generation. This choice allows us to:

Quickly generate a large, diverse dataset
Maintain control over the data distribution and edge cases
Avoid potential privacy issues associated with real user data

However, keep in mind that in a production environment, a hybrid approach combining synthetic data with carefully vetted real-world examples often yields the best results. In our case, though, we'll create synthetic data. While a hybrid approach would have worked well here, for the purposes of this tutorial we want to keep the process simple and inexpensive so you come away with the knowledge and confidence to build a model.

Generating Synthetic Data

Synthetic data generation has become increasingly important in the field of AI, as it allows developers to create large, diverse datasets tailored to their specific use cases. By generating synthetic examples, we can expand our training data, cover a wider range of scenarios, and ultimately improve the performance of our AI models.
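As a preview of where this section is heading, here is a minimal, hypothetical sketch of a synthetic-data loop for the meal planner. It assumes the openai Python SDK, a placeholder model name, a much shorter prompt template than the one developed below, and illustrative attribute buckets; none of these are the exact artifacts used in this series.

Python
# Hedged sketch: rules-based synthetic data generation for the meal planner.
# Assumes the openai SDK and OPENAI_API_KEY; the model name and the short prompt
# template are placeholders (a fuller template is developed later in this section).
import itertools
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "You are an expert dietician. Create a one-day meal plan (breakfast, lunch, dinner, "
    "2 snacks) for a person with these attributes:\n{attributes}"
)

# A few attribute "buckets" to vary; a real run would cover far more combinations.
ages = [25, 45, 65]
activity_levels = ["sedentary", "moderate", "very active"]
restrictions = ["none", "lactose intolerant", "vegetarian"]

examples = []
for age, activity, restriction in itertools.product(ages, activity_levels, restrictions):
    attributes = f"Age: {age}\nActivity level: {activity}\nDietary restrictions: {restriction}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(attributes=attributes)}],
    )
    examples.append({"prompt": attributes, "response": response.choices[0].message.content})

# Save prompt/response pairs for review and later fine-tuning
with open("synthetic_meal_plans.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")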
The NIH, for example, partnered with industry to create synthetic COVID-19 datasets that were useful in scenario planning and for other purposes. In the context of our AI meal planner, synthetic data generation enables us to create personalized meal plans based on various user attributes and preferences. By constructing a set of rules and templates, we can generate realistic examples that mimic the kind of data our model would encounter in real-world use. One popular approach to synthetic data generation is called "rules-based generation." This method involves creating a structured prompt that outlines the context, input parameters, output format, and examples for the desired data. Let's break down the process of constructing such a prompt:

Context: Begin by providing a clear description of the task and the role the model should assume. In our case, we want the model to act as an expert dietician creating personalized meal plans.
Input parameters: Specify the user attributes and preferences that the model should consider when generating meal plans. This can include age, height, weight, activity level, dietary restrictions, and personal goals.
Output format: Define a consistent structure for the generated meal plans, including the breakdown of macronutrients (carbs, protein, fat), food types, and quantities for each meal and snack.
Examples: Provide a few representative examples of meal plans that adhere to the specified format and constraints. These examples serve as a template for the model to follow when generating new data points.

Here's an example of a prompt that incorporates these elements:

Plain Text
You are an expert dietician. You prepare detailed meal plans for individuals customized based on their age, weight, activity level and a number of other attributes to optimize their health. Below, I've provided you with the attributes of a specific user, please generate a meal plan for Breakfast, Lunch, Dinner, and 2 snacks. For each meal, use the following format:

Carbs (what type and how much)
Chicken/Fish/Meat (what type and how much)
Whey Protein (how much)
Veggies (what type and how much)
Oil/Fat (how much)

Include a calculation of total calories per meal, and for the whole day, and include that in your response. Do not include any additional information.

{User Attributes}

By providing this structured prompt to a language model, we can generate multiple meal plan examples by varying the user attributes. This allows us to create a diverse dataset that covers a wide range of user profiles and preferences. However, it's crucial to validate the generated data to ensure its accuracy and appropriateness. Involving domain experts, such as dieticians or nutritionists, can help identify any inconsistencies, biases, or unrealistic examples. They can provide valuable feedback to refine the prompt and improve the quality of the generated meal plans. Let's test this prompt on our model, using the following user attributes:

Age: 25
Height: 160cm
Weight: 55kg
Activity Level: N/A
Dietary Restrictions: N/A
Personal Preferences/Goals: They want to maintain their weight and keep a body fat percentage of 22%.
We get the following response:

Plain Text
Breakfast (400 calories)
Carbs: 50g (1 cup oatmeal)
Chicken/Fish/Meat: None
Whey Protein: 30g
Veggies: 1 cup mixed berries
Oil/Fat: 10g (2 tsp almond butter)

Lunch (450 calories)
Carbs: 60g (1 cup cooked quinoa)
Chicken/Fish/Meat: 120g grilled chicken breast
Whey Protein: None
Veggies: 2 cups mixed greens salad with cucumber, tomatoes
Oil/Fat: 15g (1 tbsp olive oil dressing)

Dinner (550 calories)
Carbs: 70g (1 medium baked sweet potato)
Chicken/Fish/Meat: 150g baked salmon
Whey Protein: None
Veggies: 2 cups steamed broccoli
Oil/Fat: 20g (1 tbsp olive oil + 1/4 avocado)

Snack 1 (150 calories)
Carbs: 20g (1 medium apple)
Chicken/Fish/Meat: None
Whey Protein: None
Veggies: None
Oil/Fat: 10g (2 tbsp peanut butter)

Snack 2 (200 calories)
Carbs: 25g (1 cup nonfat Greek yogurt)
Chicken/Fish/Meat: None
Whey Protein: None
Veggies: None
Oil/Fat: 5g (a small handful of almonds)

Total Daily Calories: 1750

This looks great! The formatting has been followed, and the response looks reasonable. Spot-checking some of the meals, the calorie counts seem to check out. At this stage, it will be important to double-check that these values are accurate, and perhaps even check with a domain expert (a dietician or nutritionist in our case) to validate these results. Now, you may be wondering, why do we need to generate the data when the model is already quite good at handling this use case? Indeed! This model seems capable of handling this use case. However, let's make it more complex by adding a personal preference of "Don't want boring foods":

Plain Text
Breakfast (350 calories)
Carbs: 1 cup oatmeal (50g)
Chicken/Fish/Meat: 1/2 cup scrambled eggs (2 eggs)
Whey Protein: 1 scoop (25g)
Veggies: 1/2 cup spinach (20g)
Oil/Fat: 1 tsp olive oil (5g)

Lunch (400 calories)
Carbs: 1 cup quinoa (40g)
Chicken/Fish/Meat: 4 oz grilled chicken breast (115g)
Whey Protein: N/A
Veggies: 1 cup mixed greens (30g), 1/2 cup cherry tomatoes (50g), 1/4 cup cucumber (25g)
Oil/Fat: 1 tbsp olive oil (15g)

Snack 1 (150 calories)
Carbs: 1/2 cup fruit salad (100g)
Chicken/Fish/Meat: N/A
Whey Protein: N/A
Veggies: N/A
Oil/Fat: N/A

Snack 2 (200 calories)
Carbs: 1/4 cup trail mix (50g)
Chicken/Fish/Meat: N/A
Whey Protein: N/A
Veggies: N/A
Oil/Fat: N/A

Dinner (500 calories)
Carbs: 1 cup brown rice (50g)
Chicken/Fish/Meat: 4 oz grilled turkey breast (115g)
Whey Protein: N/A
Veggies: 1 cup broccoli (100g), 1/2 cup bell peppers (50g)
Oil/Fat: 1 tbsp olive oil (15g)

Total Daily Calories: 1600 calories

Hmm... this response isn't as great as the first one. We will have eggs for breakfast, chicken breast for lunch, and so on. We can use this example and other "challenging" scenarios to construct our dataset.
For this example, let's spice it up:

Plain Text
Breakfast (350 calories)
Carbs: 1 cup cassava (50g)
Chicken/Fish/Meat: 1/2 cup (80g) scrambled eggs with 1/2 cup chopped spinach
Whey Protein: 1 scoop (25g)
Veggies: 1/2 cup spinach (20g)
Oil/Fat: 1 tsp olive oil (5g)

Lunch (400 calories)
Carbs: 1 cup quinoa (40g)
Chicken/Fish/Meat: 4 oz (115g) ground turkey (can be seasoned and formed into patties)
Whey Protein: N/A
Veggies: 1 cup mixed greens (30g), 1/2 cup cherry tomatoes (50g), 1/4 cup cucumber (25g)
Oil/Fat: 1 tbsp olive oil (15g)

Snack 1 (150 calories)
Carbs: 1/2 cup fruit salad (100g)
Chicken/Fish/Meat: N/A
Whey Protein: N/A
Veggies: N/A
Oil/Fat: N/A

Snack 2 (200 calories)
Carbs: 1/2 cup mango cubes (75g)
Chicken/Fish/Meat: N/A
Whey Protein: N/A
Veggies: N/A
Oil/Fat: N/A

Dinner (500 calories)
Carbs: 1 cup brown rice (50g)
Chicken/Fish/Meat: 4 oz (115g) tofu (can be pan-fried or baked)
Whey Protein: N/A
Veggies: 1 cup broccoli (100g), 1/2 cup bell peppers (50g)
Oil/Fat: 1 tbsp olive oil (15g)

Total Daily Calories: 1600 calories

Great - that's better, a little more interesting! As you construct your dataset, you may find your model similarly capable. But as you keep testing different scenarios, you'll undoubtedly find edge cases where the model struggles, and where you can help it perform better. Now that we have a bunch of examples, we can start to prepare data. Before we do that, however, we need to construct a prompt for each user scenario. Above, we simply injected some user attributes into our prompt, but to make it more realistic, we'll need to construct these as sentences like:

I'm 25, female, about 160cm and 120 pounds. I want to stay the same at around 22% BF.

That's it! Now we have our dataset. Let's move to Google Sheets to start to get it ready for training.

Preparing Data for Training

Exact data preparation steps can depend on a number of factors, but it is generally useful to put the data in a form that can be easily reviewed and manipulated by a broad audience. Spreadsheet software like Google Sheets is a natural choice for this, as most people are familiar with it, and it lends itself well to reviewing individual "records" or "examples" of training data. Setting up the data is quite simple. First, we need two columns: "Prompt" and "Response." Each row should include the respective values in those columns based on the dataset we constructed previously. Now that we have it there, it's a good time to clean the data.

Data Cleaning

Before we get our data ready for training, we need to make sure it's clean of inaccuracies, inconsistencies, errors, and other issues that could get in the way of our end goal. There are a few key things to watch out for:

Missing Values

Is your dataset complete, or are there examples with missing fields? You'll need to decide whether you want to toss out those examples completely, or if you want to try to fill them in (also called imputation).

Formatting Issues

Is text capitalized appropriately? Are values in the right units? Are there any structural issues like mismatched brackets? All of these need to be resolved to ensure consistency.

Outliers, Irrelevant, and Inaccurate Data

Is there any data that is so far outside the norm that it could mislead the model? This data should be removed. Also, watch out for any data that is irrelevant to your use case and remove that as well. Collaborating with a domain expert can be a useful strategy to filter out datasets that don't belong.
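As an illustration of these checks, here is a small, hypothetical sketch using pandas on the Prompt/Response sheet exported as CSV. The file name, column names, and specific rules are examples that follow the layout described above rather than a fixed recipe.

Python
# Hedged sketch: basic cleaning of the Prompt/Response training sheet with pandas.
# The CSV file name is a placeholder for an export of the Google Sheet described above.
import pandas as pd

df = pd.read_csv("meal_plan_dataset.csv")  # columns: "Prompt", "Response"

# Missing values: drop rows where either column is empty (or impute if you prefer)
df = df.dropna(subset=["Prompt", "Response"])

# Formatting issues: trim stray whitespace and normalize line endings
df["Prompt"] = df["Prompt"].str.strip()
df["Response"] = df["Response"].str.strip().str.replace("\r\n", "\n", regex=False)

# Duplicates and obviously irrelevant rows
df = df.drop_duplicates(subset=["Prompt"])
df = df[df["Response"].str.contains("Total Daily Calories", case=False)]

# Outliers: flag suspiciously short responses for manual / expert review
suspicious = df[df["Response"].str.len() < 200]
print(f"{len(suspicious)} rows flagged for review; {len(df)} rows kept overall")

df.to_csv("meal_plan_dataset_clean.csv", index=False)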
By carefully cleaning and preprocessing your data, you're setting yourself up for success in training a high-performing model. It may not be the most glamorous part of the process, but it's absolutely essential. Time investment at this stage is critical for production-grade models and will make later steps much easier. Additional Best Practices for Data Cleaning Automate where possible: Use automated tools and scripts to handle repetitive tasks like format standardization and missing value imputation. Iterate and validate: Data cleaning is not a one-time task. Continuously iterate and validate your cleaning methods to ensure ongoing data quality. Document everything: Maintain detailed documentation of all data cleaning steps, including decisions made and methods used. This will help in debugging and refining your process. Leverage domain knowledge: Collaborate with domain experts to ensure your data cleaning process is aligned with real-world requirements and nuances. Wrapping Up Creating a high-quality dataset is essential for training an effective meal-planning LLM. By understanding the types of data, defining clear requirements, employing appropriate collection strategies, cleaning and preprocessing your data, augmenting your dataset, and iteratively refining it, you can build a model that generates personalized, diverse, and relevant meal plans. Remember, data preparation is an ongoing process. As you deploy your model and gather user feedback, continue to enhance your dataset and retrain your model to unlock new levels of performance. With a well-crafted dataset, you're well on your way to creating an AI meal planner that will delight and assist users in their culinary adventures! Looking Ahead: The Future of LLMs in Personalized Services As LLM technology continues to evolve, we can expect to see increasingly sophisticated and personalized AI services. Future iterations of our meal planning bot might: Integrate with smart home devices to consider available ingredients Adapt recommendations based on real-time health data from wearables Collaborate with other AI systems to provide holistic wellness plans By mastering the fundamentals we've covered in this series, you'll be well-positioned to leverage these exciting developments in your own projects and applications.
Here is how I became a software engineer without a computer science degree. Let me be real with you: coding was hard. I wasted so much time fixing missing semicolons, mismatched brackets, and misspelled variables. Even when the code compiled, it would not work as expected, and I would spend hours staring at the screen and questioning my life choices. But over time, I picked up some strategies that made coding click for me, and I'm going to share these strategies with you today.

Don't Try To Know Everything

The first thing I learned was that as a programmer, you don't need to know everything. When I began my first programming job, I was unfamiliar with Linux commands. When I joined Amazon, I did not fully understand Git. At Amazon, my first project was in Python, and I had never written a single line of code in Python. Later, when I joined Google, I could not program in C++, but most of my work was in C++. The point I'm trying to make is that you don't need to know everything; you just need to know where to find it when you need it. But when I was a beginner, I would try to do these 30-40 hour boot camps to learn a programming language, thinking that I was going to learn everything. In reality, you cannot learn everything there is to learn. So, do not wait until you have the right skills to start your project; your project will teach you the skills. Do not wait until you have the confidence to do what you want; the confidence will come when you start doing it.

Focus and Avoid Overwhelm

A successful warrior is an average man with laser-like focus. The same is true for a programmer, but it's really hard for a beginner to stay focused. That's because a programmer has more choices than a buffet at an Indian wedding. First, they have to choose a programming language. After that, they need to pick a course to learn that language. If they want to learn front-end development, they have all these choices, and for back-end developers, there is a whole other set of options. The more options available to a person, the longer it takes to decide which option is best. This is also called Hick's Law. As a beginner, it can be tempting to learn a little bit about many different technologies. After all, there are so many exciting areas to explore. While broad exposure is good, it's important for beginners to pick one technology stack to focus on initially. Mastery takes time and repetition, so go deep, not wide. Programming concepts take time to fully click. By focusing on one technology, you can iterate on the fundamentals again and again until they become second nature. Usually, you need to know one technology stack really well to get hired. Breadth is great, but you'll be evaluated on how well you know the specific technology that the job requires.

Develop Problem-Solving Skills

The next lesson was not to focus solely on coding ability, but also on developing a problem-solving mindset. You see, coding is ultimately about solving problems, big and small. But the issues we solve as developers don't come prepackaged as coding problems like you see on LeetCode or in coding interviews. They come disguised as open-ended product requirements, refactoring challenges, or performance bottlenecks. Learning to deconstruct these messy real-world issues into solvable chunks is a key skill that you need to build.

Example of Problem Solving

Let's say there are thousands of users on the internet browsing for a dilution ratio calculator for car detailing products.
They are wondering how much water they should add to the car detailing products based on the known ratio of detailing product to water. For this purpose, you can build a dilution ratio calculator in HTML, CSS, and JS, or in whatever language you prefer; a Python script will usually need fewer lines of code (a small Python sketch of such a calculator appears later in this article). This was just one example: there are hundreds of thousands of problems out there that need to be solved.

Techniques for Better Problem Solving

Let me tell you about two techniques that helped me become better at problem-solving. The first is the Five Whys analysis. This technique was created at Toyota as a way to identify the underlying reasons behind manufacturing defects. Here is how it works: when you encounter a problem, you ask "why" five times, one after the other. Each answer forms the basis of the next "why." For example, let's say your code is running slower than expected. Why is that happening? Because it's taking a long time to process a large data set. But why? Because I'm using two nested loops to search for a specific value. And why is that? Because I thought that nested loops were the easiest way to solve the problem. Why is that? Because I do not know more efficient search algorithms. But why? Because I've not taken the time to study data structures and algorithms. By the time you get to the fifth "why," you have reached the core issue.

The second technique I use is separation of concerns. The main idea behind this is to break a complex problem down into smaller, manageable parts. For example, let's say you are building a web app with user authentication. You can break this problem down into multiple tasks, like building a user interface for login and registration, database management for storing user credentials, and authentication logic for verifying user identity. This makes the problem easier to process without getting overwhelmed. Building strong problem-solving skills will also help you in coding interviews. In coding interviews, they will ask you questions based on data structures and algorithms. These questions are designed to test your logical thinking and problem-solving.

Stop Obsessing Over Syntax

The next thing you need to do is to stop obsessing over syntax. As I mentioned at the beginning of this article, I would constantly get frustrated by silly syntax errors. I would spend hours just trying to get my code to run without any errors. But then I realized that obsessing over syntax is pointless. Syntax is just the grammar of the language; it's important, but not the core of coding. As we discussed, the core is problem-solving — breaking down a complex problem into simple steps that even a computer can understand. That's why I started practicing coding with pseudocode first. Pseudocode lets you mock out your solution in plain English before trying to write proper code syntax. It forces you to truly understand the logic of your solution upfront. Once I had that down, the actual coding became much easier because I just had to translate my logic into whatever language I was using.
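Putting those last two ideas together, here is a hypothetical sketch of the dilution ratio calculator mentioned earlier, with the plain-English plan kept as comments above a small piece of working Python. The function name and the 1:10 example are illustrative, not taken from the original example.

Python
# Hedged sketch: a simple dilution ratio calculator.
#
# Plan (pseudocode first):
#   ask for the container size and the dilution ratio (e.g., 1:10)
#   product needed = container size / (water parts + 1)
#   water needed   = container size - product needed
#   print both amounts
def dilution_amounts(container_ml: float, water_parts: float) -> tuple[float, float]:
    """For a 1:N product-to-water ratio, return (product_ml, water_ml) to fill the container."""
    product_ml = container_ml / (water_parts + 1)
    water_ml = container_ml - product_ml
    return product_ml, water_ml

if __name__ == "__main__":
    # Example: a 500 ml spray bottle at a 1:10 product-to-water ratio
    product, water = dilution_amounts(500, 10)
    print(f"Add {product:.0f} ml of product and {water:.0f} ml of water.")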
Here is an example of pseudocode for the FizzBuzz game:

Pascal style:
procedure fizzbuzz;
for i := 1 to 100 do
    print_number := true;
    if i is divisible by 3 then begin
        print "Fizz";
        print_number := false;
    end;
    if i is divisible by 5 then begin
        print "Buzz";
        print_number := false;
    end;
    if print_number, print i;
    print a newline;
end

C style:
fizzbuzz() {
    for (i = 1; i <= 100; i++) {
        print_number = true;
        if (i is divisible by 3) {
            print "Fizz";
            print_number = false;
        }
        if (i is divisible by 5) {
            print "Buzz";
            print_number = false;
        }
        if (print_number) print i;
        print a newline;
    }
}

Python style:
def fizzbuzz():
    for i in range(1, 101):
        print_number = true
        if i is divisible by 3:
            print "Fizz"
            print_number = false
        if i is divisible by 5:
            print "Buzz"
            print_number = false
        if print_number:
            print i
        print a newline

Write Code for Humans, Not Computers

Another thing I wish I had learned early was to write code for humans, not computers. A study from Zhejiang University found that developers spend 58% of their time just trying to understand the code they are working with. As beginners, we tend to write code that works in a way that only makes sense to us. But effective code needs to be understandable by any developer who looks at it, including your future self. I cannot tell you how many times I came back to my code from a week ago and could not understand it myself. So, writing clean, readable code with proper variable naming, code formatting, and comments can massively boost your productivity as a developer. It's a skill I worked very hard to build by studying coding best practices and getting code reviews. Trust me when I say this: your future self will thank you for writing readable code.

Master Debugging Techniques

Speaking of understanding code, the next thing that made my life easier was learning to debug. I can't tell you how many hours I wasted just randomly tweaking code, compiling, running, and tweaking again in the hopes that it would magically work. I thought that was the only way to do it. But then I read an article by Devon H. O'Dell that said developers spend somewhere between 35% and 50% of their time debugging. That's over a third of their time. So, I decided to get serious about learning debugging. I learned how to use a debugger, add log statements systematically, and recreate issues in smaller, isolated cases. Learning and applying a proper debugging process saved me a lot of time and headaches. Most code editors have built-in debugging functionality. In certain cases, you can also use a third-party extension like ReSharper. To learn more about debugging, check out Google's troubleshooting and debugging techniques course on Coursera. Enroll for free!

Focus on the Little Things

And this brings me to the most important lesson. Back in the 1980s, American Airlines was looking for ways to cut costs and improve its profit margin. Bob Crandall, who was the head of American Airlines at the time, decided to take a closer look at the airline's food service. He noticed that the salads being served on flights included a garnish of three olives per salad. Bob did some quick math and figured out that if they removed just one olive from each salad, they could save a substantial amount of money. Keep in mind that American Airlines was serving thousands of meals every day, so even though one olive might not seem like much, the numbers added up fast. American Airlines saved around $40,000 per year in today's dollars by doing this (Ref).
The lesson here is that sometimes the biggest improvements come from paying attention to the little things. Incremental Improvements In his book Atomic Habits, James Clear talks about the power of making small, incremental changes in life (ref). The idea is that tiny habits, when compounded over time, can lead to remarkable results. He shows that improving by just 1% every day for one year makes you nearly 38 times better. The key to making these improvements is starting small. So, if you want to get in shape, start by doing one push-up per day. If you want to read more, start by reading just one page every night. And if you want to learn programming, start by doing just two exercises per day. Over time, these small improvements will compound. Two exercises will become 20, and 20 exercises will become one project, and so on. Whatever you do, remember: nothing worth having comes without struggle. And if you're struggling to learn programming, you are on the right track.
Docker is the obvious choice for building containers, but there is a catch: writing optimized and secure Dockerfiles and managing a library of them at scale can be a real challenge. In this article, I will explain why you may want to use Cloud Native Buildpacks instead of Docker. A Common Issue When Using Docker When a company begins using Docker, it typically starts with a simple Dockerfile. However, as more projects require Dockerfiles, the following problematic situation often comes up: Developers tasked with containerizing their applications lack the knowledge, time, or motivation to write a new Dockerfile. As a result, they copy and paste a Dockerfile from another project. Any bad practices or security vulnerabilities in the copied Dockerfile are propagated. The copied Dockerfile might be optimized for a specific stack, leading to poor performance in the new context. This scenario repeats each time a new application needs to be containerized. While this might be manageable for a few applications, it becomes a significant issue when dealing with hundreds. That can lead to serious technical debt and security vulnerabilities. What Do Buildpacks Have To Offer? Consistent Experience In the era of platform engineering – where process standardization is crucial – microservice-oriented infrastructure must accommodate a wide variety of stacks. Buildpacks provide a standardized approach to creating container images, using one command to containerize any application. That command – pack – is all that's needed (no need to write a Dockerfile or other configuration file). Eliminate Dependency Management Concerns about dependency management are eliminated. Buildpacks automatically detect, download, and install the necessary dependencies, libraries, frameworks, and runtime environments. Reusability A key benefit is the availability of production-ready buildpacks maintained by the community. Experts – from companies like Google, Heroku, Broadcom/VMware, etc. – manage these buildpack images to ensure your container images are optimized and secured for each stack. Misconceptions With Buildpacks When buildpacks were first created about a decade ago, users complained about several limitations. But most of those are long gone, so let me bust those myths. Myth 1: Limited to a Few Programming Languages Buildpacks support all modern programming languages. Providers like Google, Heroku, and the open-source Paketo Buildpacks project support a wide range of languages, including Java, Node.js, .NET Core, Go, Python, PHP, Ruby, and Rust. Myth 2: Limited Customization “There are limitations, but they are by design,” says Cloud Foundry maintainer Tim Downey. Buildpack “limitations” protect you from bad practices such as running your container as root or running arbitrary shell commands. And if you need a workaround, there is always a clean way to do it. For example, if you need to install a package that is not listed in your application dependencies (such as an OS utility), you can do so cleanly using the apt buildpack. Myth 3: Vendor Lock-In Buildpacks output OCI images, just like Docker, ensuring maximum compatibility with any hosting platform or with your own infrastructure. And if a vendor's buildpack-specific optimizations do not work for you, you can easily switch to another provider, such as the open-source Paketo Buildpacks, which are free of vendor dependencies. Myth 4: Dependency Management Issues Dependency management is simplified with Buildpacks.
They automatically track and manage dependencies, offering a robust approach for complex applications with strong security requirements. Buildpacks can even generate a Software Bill of Materials (SBOM) for your container images with the same command used to build an image. Myth 5: Unoptimized Images Buildpacks are designed with best practices in mind to build faster and smaller images. They also offer advanced features, such as rebase functionality, which lets you update an image's base layers without rebuilding the entire image, a capability not available in Docker-based workflows. Don’t Waste Your Time Writing Dockerfiles: Use Buildpacks Writing Dockerfiles is a reasonable approach if you have a small infrastructure or are a solo developer, but for any company with a large enough infrastructure, you should seriously consider trying Cloud Native Buildpacks. The learning curve is gentle, and it will save you a lot of work in the long run.
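To make the "one command" claim concrete, here is a rough sketch of what the workflow might look like with the pack CLI. The image name (my-app) and the Paketo builder are assumptions chosen for illustration; pick the builder that matches your stack and provider.

Shell
# Install the pack CLI (here via Homebrew), then run it from your application's source directory.
brew install buildpacks/tap/pack

# Build an OCI image for the app in the current directory -- no Dockerfile required.
pack build my-app --builder paketobuildpacks/builder-jammy-base

# Run the result like any other container image.
docker run --rm -p 8080:8080 my-app

# Download the software bill of materials generated during the build.
pack sbom download my-app

The same build command works whether the directory contains a Java, Node.js, Go, or Python application, which is what makes the approach attractive at scale.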
We often hear about the importance of developers and the role they play in the success of a business. After all, they are the craftsmen who create the software and apps that make businesses run smoothly. However, there is one key element of development that is still overlooked – backup. Why? DevOps is constantly focused on delivering the best user experience and making sure the apps they build are bug-free. Yet what if something goes wrong one day? Let's take it step by step. The Role of the DevOps Engineer Today Currently, DevOps engineers have to cover a diverse set of responsibilities and skills. Working at the intersection of Development and Operations, they aim to speed up the software development process while keeping up high quality standards. Depending on the size of the organization, a DevOps engineer may need a solid grounding in development, operations, infrastructure, and sometimes even project management and business. Moreover, developers should always stay up to date with the latest trends, technologies, and DevOps principles to make their work more efficient and productive. Let's not forget, the quality of the product is at stake. Writing, reviewing, and validating the source code, building the DevOps toolchain, managing CI/CD tools (a set that expands over time), analysis and automation, troubleshooting and incident management, and creating security measures using risk management techniques and vulnerability assessment are just a small part of what a DevOps engineer may be responsible for. The Main Focus of Developers With the modern world relying on technology more and more, developers feel an ever heavier burden of responsibility on their shoulders. Of course, the main idea is to create software that is user-friendly and effective. So, what core DevOps skills do developers need to have? Good Understanding of the Platform Most IT systems are built on the concept of a stack, which is the assemblage of widely used operating systems, services, and related toolkits for creating, implementing, and maintaining apps. That's why DevOps engineers must be experienced in the stack the organization uses or is about to adopt. There are three main stacks that DevOps engineers should know well: cloud frameworks, Linux server distributions, and Microsoft Windows Server. Scripting and Programming Languages When we speak about DevOps practices, we assume that code moves from development into production quickly. Thus, the main task of the DevOps engineer is to comprehend the source code, write scripts, and handle integrations; for example, getting a given code version to communicate with a MySQL database for operations-side deployments. To perform their daily routine tasks effectively, DevOps engineers should be proficient in PHP, Python, Perl, Ruby, Java, C++, and some other programming languages, which is an extensive knowledge base. Moreover, it's important to have a clear understanding of build and CI/CD tools, like Jenkins and Apache Maven, to optimize the DevOps workflow. Speed and Agility: Understanding of Appropriate Tools DevOps is all about speed and agility. And its success often depends on the toolset DevOps engineers rely on during each stage of implementation. Let's not forget – nothing stands still, not even the code. Developers must continually adjust to the needs and requirements of their company's customers as they improve products and deliver the latest app updates.
For that reason, they first have to understand what Git is and what benefits it offers over other version control systems, which are critical for software development these days. Moreover, DevOps engineers should have a clear view of the Git hosting services, including GitHub, GitLab, Bitbucket, etc., and their differences, so they can opt for the one that best meets the organization's needs. However, that's not everything: the sheer number of tools that DevOps engineers have to deal with is difficult to enumerate. Apart from Git and Git hosting services, DevOps engineers should be comfortable working with: Continuous integration servers Deployment automation Infrastructure orchestration Configuration management Containers Testing and cloud quality tools Network protocols Monitoring and analytics tools Automation Expertise Automation is at the heart of the DevOps process. That's why DevOps engineers should have a good command of it so they can automate tasks across the entire DevOps pipeline. Continuous integration and continuous deployment, security, and configuration management – all of these processes can be automated to make DevOps work more efficient and to streamline the workflow. Security Skills Enterprise security is becoming more and more dependent on DevOps engineers, as risk grows largely in line with the development pace that DevOps enables. Writing secure code with security concerns in mind, testing for vulnerabilities in the CI/CD pipeline, troubleshooting, shifting security left, data encryption, and maintaining intrusion prevention systems and antimalware software may also rest on the DevOps team's shoulders. However, what's missing? Is Backup a Missing Piece? Since we have already started talking about security, it's critical to mention that DevOps practices are incorporating security methodologies more and more, giving rise to DevSecOps. With it, organizations can improve source code protection, quality, visibility, monitoring, and compliance. Unfortunately, while focusing on production, DevOps teams can often forget about backup. Well, they don't disregard backup entirely: they may make manual copies of their data or rely on the Git hosting provider where they push their code. However, that's not enough. A backup script, manual copies of the source code, and snapshots can't be considered a reliable backup plan that can guarantee data recoverability in any event of failure. It's a myth that every backup always comes with disaster recovery. Moreover, developers should always keep in mind that all SaaS providers follow the Shared Responsibility model, and if something happens to their data – accidental data deletion, data lost due to an outage, or a ransomware attack – it is they who will need to deal with the disaster. That's why backup and data protection skills are an important asset for DevOps engineers. Moreover, they shouldn't treat backup as a separate process; it should be regarded as an essential component of the DevOps workflow. What DevOps Engineers Should Know About Backups To make sure that their work is secured, DevOps engineers need to understand how to build a secure backup plan. As we have already mentioned, a copy isn't enough.
To ensure the security and protection of the source code – their most critical data – it's vital to build reliable data protection that covers: Automatic, scheduled backups that save DevOps engineers time and eliminate the need to make manual copies Full data coverage, which means that the backup copy should include not only repositories but also all the related metadata The option to keep data in multiple storage instances, which helps meet the 3-2-1 backup rule: 3 copies of the data, on 2 different types of storage, with 1 copy kept off-site Unlimited retention, which allows restoration of data from any point in time Disaster recovery, which guarantees that developers can restore their data granularly or in full from any point in time to their company's GitHub, GitLab, or Atlassian account (or a new one), to a local device, or cross over to another Git hosting service, e.g., from Bitbucket to GitHub A guarantee of ransomware protection, which is a bundle of security features that ensures backups are encrypted in flight and at rest and stored in WORM-compliant (write once, read many) storage Monitoring capabilities, which help to track backup performance Conclusion With businesses aiming to become leaner and more agile, the skills DevOps engineers need vary from company to company. Organizations are becoming more and more interested in finding methods that will help them streamline their processes while ensuring security. That's why a good command of DevOps tools – build and CI/CD, continuous monitoring and DevSecOps, automation, and collaboration tools – is critically important, as it helps boost productivity and the efficiency of the workflow.
Introduction to Data Joins In the world of data, a "join" is like merging information from different sources into a unified result. To do this, it needs a condition – typically a shared column – to link the sources together. Think of it as finding common ground between different datasets. In SQL, these sources are referred to as "tables," and the result of using a JOIN clause is a new table. Fundamentally, traditional (batch) SQL joins operate on static datasets, where you have prior knowledge of the number of rows and the content within the source tables before executing the Join. These join operations are typically simple to implement and computationally efficient. However, the dynamic and unbounded nature of streaming data presents unique challenges for performing joins in near-real-time scenarios. Streaming Data Joins In streaming data applications, one or more of these sources are continuous, unbounded streams of information. The join needs to happen in (near) real-time. In this scenario, you don't know the number of rows or the exact content beforehand. To design an effective streaming data join solution, we need to dive deeper into the nature of our data and its sources. Questions to consider include: Identifying sources and keys: Which are the primary and secondary data sources? What is the common key that will be used to connect records across these sources? Join type: What kind of Join (Inner Join, Left Join, Right Join, Full Outer Join) is required? Join window: How long should we wait for a matching event from the secondary source to arrive for a given primary event (or vice-versa)? This directly impacts latency and Service Level Agreements (SLAs). Success criteria: What percentage of primary events do we expect to be successfully joined with their corresponding secondary events? By carefully analyzing these aspects, we can tailor a streaming data join solution that meets the specific requirements of our application. The streaming data join landscape is rich with options. Established frameworks like Apache Flink and Apache Spark (also available on cloud platforms like AWS, GCP, and Databricks) provide robust capabilities for handling streaming joins. Additionally, innovative solutions that optimize specific aspects of the infrastructure, such as Meta's streaming join focusing on memory consumption, are continuously emerging. Scope The goal of this article isn't to provide a tutorial on using existing solutions. Instead, we'll delve into the intricacies of a specific streaming data join solution, exploring the tradeoffs and assumptions involved in its design. This approach will illuminate the underlying principles and considerations that drive many of the out-of-the-box streaming join capabilities available in the market. By understanding the mechanics of this particular solution, you'll gain valuable insights into the broader landscape of streaming data joins and be better equipped to choose the right tool for your specific use case. Join Key The key is a shared column or field that exists in both datasets. The specific Join Key you choose depends on the type of data you're working with and the problem you're trying to solve. We use this key to index incoming events so that when new events arrive, we can quickly look up and find any related events that are already stored. Join Window The join window is like a time frame where events from different sources are allowed to "meet and match." It's an interval during which we consider events eligible to be joined together. 
To set the right join window, we need to understand how quickly events arrive from each data source. This ensures that even if an event is a bit late, we still have its related events available and ready to be joined. Architecting Streaming Data Joins Here's a simplified representation of a common streaming data pipeline. The individual components are shown for clarity, but they wouldn't necessarily be separate systems or jobs in a production environment. Description A typical streaming data pipeline processes incoming events from a data source (Source 1), often passing them through a Feature Extraction component. This component can be thought of as a way to refine the data: filtering out irrelevant events, selecting specific features, or transforming raw data into more usable formats. The refined events are then sent to the Business Logic component, where the core processing or analysis happens. This Feature Extraction step is optional; some pipelines may send raw events directly to the Business Logic component. Problem Now, imagine our pipeline needs to combine information from additional sources (Source 2 and Source 3) to enrich the main data stream. However, we need to do this without significantly slowing down the processing pipeline or affecting its performance targets. Solution To address this, we introduce a Join Component just before the Business Logic step. This component will merge events from all the input sources based on a shared unique identifier, let's call it Key X. Events from each source will flow into this Join Component (potentially after undergoing Feature Extraction). The Join Component will utilize state storage (like a database) to keep track of incoming events based on Key X. Think of it as creating separate tables in the database for each input source, with each table indexing events by Key X. As new events arrive, they are added to their corresponding table (events from Source 1 to table 1, events from Source 2 to table 2, etc.) along with some additional metadata. This Join State can be imagined as one table per source, each keyed by Key X. Join Trigger Conditions All Expected Events Arrive This means we've received events from all our data sources (Source 1, Source 2, and Source 3) for a specific Key X. We can check for this whenever we're about to add a new event to our state storage. For example, if the Join Component is currently processing an event with Key X from Source 2, it will quickly check if there are already matching rows in the tables for Source 1 and Source 3 with the same Key X. If so, it's time to join! Join Interval Expires This happens when at least one event with a particular Key X has been waiting too long to be joined. We set a time limit (the join window) for how long an event can wait. To implement this, we can set an expiration time (TTL) on each row in our tables. When the TTL expires, it triggers a notification to the Join Component, letting it know that this event needs to be joined now, even if it's missing some matches. For instance, if our join window is 15 minutes and an event from Source 2 never shows up, the Join Component will get a notification about the events from Source 1 and Source 3 that are waiting to be joined with that missing Source 2 event. Another way to handle this is to have a periodic job that checks the tables for any expired keys and sends notifications to the Join Component. Note: This second scenario is only relevant for certain types of use cases where we want to include events even if they don't have a complete match.
If we only care about complete sets of events (like INNER JOIN), we can ignore this time-out trigger. How the Join Happens When either of our trigger conditions is met — either we have a complete set of events or an event has timed out — the Join Component springs into action. It fetches all the relevant events from the storage tables and performs the join operation. If some required events are missing (and we're doing a type of join that requires complete matches), the incomplete event can be discarded. The final joined event, containing information from all the sources, is then passed on to the Business Logic component for further processing. Visualization Let's make this a bit easier to picture. Imagine that events from all three sources (Source 1, Source 2, and Source 3) happen simultaneously at 12:00:00 PM. Consider the join window as 5 minutes. Optimizations Set Expiration Times (TTLs) By setting a TTL for each row in our join state storage, we enable the database to automatically clean up old events that have passed their join window. Compact Storage Instead of storing entire events, store them in a compressed format (like bytes) to further reduce the amount of storage space needed in our database. Outer Join Optimization If the use case is to perform an OUTER JOIN and one of the event streams (let's say Source 1) is simply too massive to be fully indexed in our storage, we can adjust our approach. Instead of indexing everything from Source 1, we can focus on indexing the events from Source 2 and Source 3. Then, when an event from Source 1 arrives, we can perform targeted lookups into the indexed events from the other sources to complete the join. Limit Failed Joins Joining events can be computationally expensive. By minimizing the number of failed join attempts (where we try to join events that don't have matches), we can reduce memory usage and keep our streaming pipeline running smoothly. We can use the Feature Extraction component before the Join Component to filter out events that are unlikely to have matching events from other sources. Tuning Join Window While understanding the arrival patterns of events from your input sources is crucial, it's not the only factor to consider when fine-tuning your Join Window. Factors such as data source reliability, latency requirements (SLAs), and scalability also play significant roles. Larger join window: Increases the likelihood of successfully joining events, in case of delays in event arrival times; may lead to increased latency as the system waits longer for potential matches Smaller join window: Reduces latency and memory footprint as events are processed and potentially discarded more quickly; join success rate might be low, especially if there are delays in event arrival Finding the optimal Join Window value often requires experimentation and careful consideration of your specific use case and performance requirements. Monitoring Is Key It's always a good practice to set up alerts and monitoring for your join component. This allows you to proactively identify anomalies, such as events from one source consistently arriving much later than others, or a drop in the overall join success rate. By staying on top of these issues, you can take corrective action and ensure your streaming join solution operates smoothly and efficiently. Conclusion Streaming data joins is a critical tool for unlocking the full potential of real-time data processing. 
While they present unique challenges compared to traditional (batch) SQL joins, hopefully this article has given you the foundation you need to design effective solutions. Remember, there is no one-size-fits-all approach. The ideal solution will depend on the specific characteristics of your data, your performance requirements, and your available infrastructure. By carefully considering factors such as join keys, join windows, and optimization techniques, you can build robust and efficient streaming pipelines that deliver timely, actionable insights. As the streaming data landscape continues to evolve, so too will the solutions for handling joins. Keep learning about new technologies and best practices to make sure your pipelines stay ahead of the curve as the world of data keeps changing.
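To ground the join-component mechanics described above, here is a minimal, self-contained Python sketch of keyed join state with the two trigger conditions (all sources arrived, or the join window expired). The class name, source names, and five-minute window are illustrative assumptions, not the article's implementation or any specific framework's API.

Python
import time
from collections import defaultdict

class JoinBuffer:
    """Toy keyed join state: one 'table' per source, indexed by the join key."""

    def __init__(self, sources, window_seconds, on_joined, on_expired):
        self.sources = set(sources)       # expected sources, e.g. {"source1", "source2", "source3"}
        self.window = window_seconds      # join window (TTL) in seconds
        self.state = defaultdict(dict)    # join key -> {source: (event, arrival_time)}
        self.on_joined = on_joined        # callback for complete joins
        self.on_expired = on_expired      # callback for incomplete, timed-out keys

    def add(self, source, key, event):
        """Index an incoming event; emit a joined record once every source has arrived."""
        self.state[key][source] = (event, time.monotonic())
        if self.sources.issubset(self.state[key]):            # trigger 1: all expected events arrived
            joined = {s: ev for s, (ev, _) in self.state.pop(key).items()}
            self.on_joined(key, joined)

    def expire(self):
        """Trigger 2: the join window elapsed for a key that is still incomplete."""
        now = time.monotonic()
        for key in list(self.state):
            oldest = min(t for _, t in self.state[key].values())
            if now - oldest > self.window:
                partial = {s: ev for s, (ev, _) in self.state.pop(key).items()}
                self.on_expired(key, partial)                  # outer-join style: emit what we have

# Example usage mirroring the article's three-source scenario with a 5-minute window.
buffer = JoinBuffer(
    sources={"source1", "source2", "source3"},
    window_seconds=300,
    on_joined=lambda k, ev: print("joined", k, ev),
    on_expired=lambda k, ev: print("expired (partial)", k, ev),
)
buffer.add("source1", "key-x", {"amount": 10})
buffer.add("source2", "key-x", {"country": "US"})
buffer.add("source3", "key-x", {"device": "mobile"})  # completes the join for key-x
buffer.expire()                                        # periodic sweep for timed-out keys

In a real deployment, the in-memory dictionary would be replaced by durable state storage with per-row TTLs, and the expiry sweep by the store's own expiration notifications, as discussed in the optimizations section.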
I enjoy spending time learning new technologies. However, the biggest drawback of working with new technologies is often the inevitable pain points that come with early adoption. I saw this quite a bit when I was getting up to speed with Web3 in “Moving From Full-Stack Developer To Web3 Pioneer.” As software engineers, we’re accustomed to accepting these early-adopter challenges when giving new tech a test drive. What works best for me is to keep a running list of notes and commands I’ve executed, since seemingly illogical steps don’t remain in my memory. Aside from Web3, I also found this challenge in the JavaScript space, with the semi-standard requirements of using Node.js and Webpack. I wanted to identify a solution where I could just use JavaScript as is, without toiling away with Node.js and Webpack. I recently read how the Rails 7 release addressed this very situation. So, that’s the use case I’ll be covering in this article. About Rails To be fully transparent, my experience with Ruby and Ruby on Rails is little to none. I remember watching someone issue some commands to create a fully functional service years ago, and I thought “Wow, that looks awesome.” But I’ve never spent time playing around with this approach to building services and applications. I’m pretty sure I saw that demo in early 2006 because Rails first emerged in late 2005. As I saw in the demonstration, the end result was a service that supported the model-view-controller (MVC) design pattern, a pattern that I was familiar with through my early use of the Spring, Struts, JSF, and Seam frameworks. Rails maintains a promise to keep things straightforward while adhering to DRY (don’t repeat yourself) practices. To help honor this promise, Ruby uses gems to let engineers introduce shared dependencies into their projects. Version 7 Highlights In late 2021, the seventh major version of Rails introduced some exciting features: Asynchronous querying: Getting away from running queries serially Encrypted database layer: Securing data between the service and persistence layers Comparison validator: Allows object validation before persistence Import maps: No longer require Node.js and Webpack for JavaScript libraries That last feature is what drove me to write this article. How Do Import Maps Work? At a high level, the importmap-rails gem allows developers to use import maps in their applications. The use of /bin/importmap allows engineers to update, pin, or unpin dependencies as needed. This is similar to how Maven and Gradle work in Java-based projects. It eliminates the need to deal with the complexities of bundling packages and transpiling ES6 with Babel. Goodbye Webpack! Goodbye Node.js! Let’s Build Something Since I hadn’t even touched Ruby on Rails in almost two decades, the first thing I needed to do was follow this guide to install Ruby 3.3 on my MacBook Pro. Once installed, I just needed to install the Ruby plugin as part of my IntelliJ IDEA IDE. Then, I created a new Ruby on Rails project in IntelliJ called import-map and specified the use of Importmap for the JavaScript framework: With the project created, I first wanted to see how easy it would be to use a local JavaScript library.
So, I created a new JavaScript file called /public/jvc_utilities.js with the following contents:

JavaScript
export default function() {
  console.log('*****************');
  console.log('* jvc-utilities *');
  console.log('* version 0.0.1 *');
  console.log('*****************');
}

The default function simply echoes some commands to the JavaScript console. Next, I created an HTML file (/public/jvc-utilities.html) with the following contents:

HTML
<!DOCTYPE html>
<html>
  <head>
    <title>jvc-utilities</title>
  </head>
  <script type="importmap">
    {
      "imports": { "jvc_utilities": "./jvc_utilities.js" }
    }
  </script>
  <script type="module">
    import JvcUtilities from "jvc_utilities";
    JvcUtilities();
  </script>
  <h3>jvc-utilities.html</h3>
  <p>Open the console to see the output of the <code>JvcUtilities()</code> function.</p>
</html>

This example demonstrates how a local JavaScript file can be used with a public HTML file — without any additional work. Next, I created a new controller called Example:

Shell
bin/rails generate controller Example index

I wanted to use the Lodash library for this example, so I used the following command to add the library to my import-map project:

Shell
bin/importmap pin lodash

To add some JavaScript-based functionality to the controller, I updated javascript/controllers/example_controller.js to look like this:

JavaScript
import { Controller } from "@hotwired/stimulus"
import _ from "lodash"

export default class extends Controller {
  connect() {
    const array = [1, 2, 3]
    const doubled = _.map(array, n => n * 2)
    console.log('array', array)      // Output: [1, 2, 3]
    console.log('doubled', doubled)  // Output: [2, 4, 6]
    this.element.textContent = `array=${array} doubled=${doubled.join(', ')}`
  }
}

This logic establishes an array of three values, and then it doubles the values. I use the Lodash map() method to do this. Finally, I updated views/example/index.html.erb to contain the following:

XML
<h3>Example Controller</h3>
<div data-controller="example"></div>

At this point, the following URIs are now available: /jvc-utilities.html /example/index Let’s Deploy and Validate Rather than run the Rails service locally, I thought I would use Heroku instead. This way, I could make sure my service could be accessible to other consumers. Using my Heroku account, I followed the “Getting Started on Heroku with Ruby” guide. Based on my project, my first step was to add a file named Procfile with the following contents:

Shell
web: bundle exec puma -C config/puma.rb

Next, I used the Heroku CLI to create a new application in Heroku:

Shell
heroku login
heroku create

With the create command, I had the following application up and running:

Shell
Creating app... done, lit-basin-84681
https://lit-basin-84681-3f5a7507b174.herokuapp.com/ | https://git.heroku.com/lit-basin-84681.git

This step also created the Git remote that the Heroku ecosystem uses. Now, all I needed to do was push my latest updates to Heroku and deploy the application:

Shell
git push heroku main

With that, my code was pushed to Heroku, which then compiled and deployed my application. In less than a minute, I saw the following, letting me know that my application was ready for use:

Shell
remote: Verifying deploy... done.
To https://git.heroku.com/lit-basin-84681.git fe0b7ad..1a21bdd main -> main Then, I navigated to the /example/index page using my Heroku URL (which is unique to my application, but I have since taken it down): https://lit-basin-84681-3f5a7507b174.herokuapp.com/example/index This is what I saw: And when I viewed the JavaScript console in my browser, the following logs appeared: Navigating to /jvc-utilities.html, I saw the following information: When I viewed the JavaScript console in my browser, I saw the following logs: Success. I was able to use a self-contained JavaScript library and also the public Lodash JavaScript library in my Rails 7 application—all by using Import Maps and without needing to deal with Webpack or Node.js. Buh-bye, Webpack and Node.js! Conclusion My readers may recall my personal mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” — J. Vester In this article, I dove head-first into Rails 7 and used Import Maps to show how easily you can use JavaScript libraries without the extra effort of needing to use Webpack and Node.js. I was quite impressed by the small amount of time that was required to accomplish my goals, despite it being over two decades since I had last seen Rails in action. From a deployment perspective, the effort to deploy the Rails application onto the Heroku platform consisted of the creation of a Procfile and three CLI commands. In both cases, Rails and Heroku adhere to my mission statement by allowing me to remain laser-focused on delivering value to my customers and not get bogged down by challenges with Webpack, Node.js, or even DevOps tasks. While I am certain we will continue to face not-so-ideal pain points when exploring new technologies, I am also confident that in time we will see similar accomplishments as I demonstrated in this article. As always, my source code can be found on GitLab here. Have a really great day!
One-time passwords are one of the most relied-on forms of multi-factor authentication (MFA). They’re also failing miserably at keeping simple attacks at bay. Any shared secret a user can unknowingly hand over is a target for cybercriminals, even short-lived TOTPs. Consider this: What if the multi-factor authentication your users rely on couldn’t save your organization from a large-scale account takeover? That’s what happened to an organization using SMS one-time passwords to secure customer accounts. We’ll call the affected organization “Example Company,” or EC for short. By deploying a replica of the real EC login page and a “spoofed” URL — a similar-looking (but fake) web address — threat actors intercepted user credentials and OTPs in real-time. This allowed them to authenticate on the legitimate site, granting full account access and potentially persistent tokens or cookies via the “remember me” function. Figure 1: SMS MFA bypass attack using MITM tactics This is not an isolated incident. Numerous high-profile breaches highlight the glaring insufficiency of traditional MFA implementations. Don’t get the wrong idea, though: two factors are still better than one. As Slavik Markovich asserts in SC Magazine, “MFA implementation remains an essential pillar in identity security.” He further points out that “when properly configured, MFA blocks 99% of attacks.” Snowflake, a cloud data provider serving large enterprises like AT&T, is still reeling from a breach involving user credentials — reportedly without MFA in place. AT&T paid a whopping 5.7 Bitcoin ($370,000 USD at the time of payment) ransom to the cybercriminals responsible, a deal struck for deleting the stolen data. Could MFA have saved the telecom company over a quarter million? It would have certainly made it much harder to abscond with 109 million customers’ call and text messaging metadata. Yet, despite the effectiveness of MFA, adoption lags. A recent Wall Street Journal article highlights this gap, quoting Keeper Security CTO Craig Lurey: “MFA isn’t always easy. Older technology might not be able to run the software necessary for MFA.” Users, too, are to blame, Lurey told the Journal, noting they “push back against MFA as cumbersome.” With MFA adoption meeting such resistance, it’s a tough pill to swallow when some implementations are still phishable and vulnerable to attack. To better defend against attacks that can defeat vulnerable MFA implementations, we need to understand how these tactics tick. The Anatomy of an SMS MFA Bypass Attack The threat actor that targeted EC, the company in my initial example, didn’t use sophisticated methods to overwhelm network infrastructure or exploit a backdoor. They went after unsuspecting users, tricking them into handing over credentials on an impostor login page. After plying the real site for an MFA challenge sent to users’ phones, it was a simple matter of collecting SMS OTPs and logging in. This method, known as a man-in-the-middle (MITM) attack, is increasingly common. While some MFA bypass tactics like prompt bombing and basic social engineering rely on the naivety of users, a pixel-perfect MITM attempt can be much more convincing — yet still deviously simple. The attacker doesn’t need to hijack a session, steal cookies, or swap a SIM card. Here’s a breakdown of a typical MITM attack: The threat actor creates (or purchases a kit containing) a convincing imitation of a genuine login page, often using a domain name that looks similar to the real one. 
Users are lured to this site, usually through phishing emails or malicious ads. When a user enters their credentials, the attacker captures them. If MFA is required, the legitimate site sends a one-time code to the user. The user, still connected to the fake site, enters this code, which the cybercriminal then uses to log in on the real site. The genius of MITM attacks, and their danger, is simplicity. The fraudster doesn’t need to hijack a session, steal cookies, or swap a SIM card. It doesn’t require breaking encryption or brute-forcing passwords. Instead, it leverages human behavior and the limitations of certain MFA methods, particularly those relying on one-time passwords with a longer lifespan. But what makes this tactic particularly insidious is that it can bypass MFA in real-time. The user thinks they’re going through a normal, secure login process, complete with the anticipated MFA step. In reality, they’re handing over their account to a cybercriminal. Simple MITM attacks are significantly easier to pull off for novice attackers compared to increasingly popular AITM (adversary-in-the-middle) variants, which typically require an indirect or reverse proxy to collect session tokens. However, with AITM kits readily available from open-source projects like EvilProxy and the PhaaS (phishing-as-a-service) package from Storm-1011, more complex approaches are available to script kiddies willing to learn basic functions. Not All MFA Is Created Equally MFA might have prevented or contained the Snowflake breach, but it also might have been a story like TTS, the travel platform. The harsh reality is that not all MFA is created equally. Some current popular methods, like SMS OTPs, are simply not strong enough to defend against increasingly advanced and persistent threats. The root of the problem lies with the authentication factors themselves. Knowledge-based factors like passwords and OTPs are inherently vulnerable to social engineering. Even inherence factors can be spoofed, hijacked, or bypassed without proper safeguards. Only possession factors, when properly implemented using public key cryptography (as with FIDO2/U2F or passkeys), offer sufficient protection against MFA bypass attacks. Case in point: TTS, our travel platform example, used SMS OTPs. It’s technically MFA, but it’s a weak variant. It’s high time we faced the fact that SMS was never intended to be used as a security mechanism, and text messages are always out-of-band. Apart from the direct threat of SIM swaps, SMS OTPs time out more slowly than their TOTP authenticator app counterparts, which makes them magnets for phishing. The same weaknesses are present in email and authenticator app OTPs. Anything a user can see and share with a cybercriminal, assume it will be a target. Magic links could have helped in both breaches we discussed because they are links that don’t require manual input. An attacker positioned as a man in the middle wouldn’t be able to intercept a magic link. Instead, they’d be forced to breach the target user’s email account. This underscores a painfully obvious issue at the core of our MFA landscape: shared, transferable secrets. Whether it’s an SMS, email, or even time-based OTP from an authenticator app, these methods all rely on a piece of information that can be knowingly (or unknowingly) shared by the user. Same-device authentication is the only way to increase the certainty you’re dealing with the person who initiated the MFA challenge. 
The Key to Secure MFA Is in Your User’s Device Possession-based authentication offers a promising solution to the problems posed by out-of-band MFA. With device-enabled auth methods creating reliable, secure ecosystems, the “what you have” factor is open to anyone with a capable smartphone or browser. In today’s threat landscape, the key to stopping MFA bypass attacks is in your user’s device. Here’s why: No shared, transferable secrets: Unlike OTPs, there’s no code for users to manually enter or click. The authentication process happens through device-bound properties that can’t be intercepted or duplicated. Genuine same-device authentication: Biometrics or a PIN can prove presence, but more significantly, they ensure it’s all happening on the same device. Phishing resistance: Since there’s no secret for unsuspecting users to type into spoofed pages, phishing attempts become largely pointless. A fake login page can’t steal a user’s smartphone. Smoother UX: Users don’t need to wait for (or miss) SMSes or emails, or copy codes from an app. A simple PIN or biometric verification is all it takes. Reduced reliance on out-of-band ecosystems: SMS, email, and authenticator app OTPs may be convenient, but they’re a nightmare when a threat actor gets through. Admittedly, there are some adoption hurdles that we need to face. Transitioning to these newer, more secure MFA methods can pose financial challenges when organizations update their infrastructure. It can cause uncertainty among uninformed users who view biometric authentication with skepticism (which is often misplaced when it comes to FIDO authentication). However, moving to device-based MFA is a necessary step for companies with vulnerable user populations still using OTPs for MFA. For organizations serious about security, it’s not worth waiting for an expensive MFA bypass attack. The cost of a modern solution is fractional when compared to the reputation loss and financial burden of a breach. Despite the minor roadblocks to adoption, it’s up to security leaders to lead the charge toward safer, possession-based MFA — and far, far away from shared secrets.
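To illustrate why possession-based, public-key authentication removes the shared, transferable secret, here is a toy Python sketch of a device-bound challenge-response flow. It is a simplified illustration of the principle behind FIDO2/passkeys, not the WebAuthn protocol itself; the key handling, origin binding, and function names are assumptions made for this example, and it relies on the third-party cryptography package.

Python
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Registration: the private key is generated on, and never leaves, the user's device.
device_private_key = Ed25519PrivateKey.generate()
registered_public_key = device_private_key.public_key()  # only the public key is shared with the server

REAL_ORIGIN = "https://login.example.com"

def device_sign(private_key, challenge, origin):
    # The device signs the server's challenge bound to the origin it is actually talking to.
    return private_key.sign(challenge + origin.encode())

def server_verify(public_key, signature, challenge):
    # The server only accepts signatures over its own challenge and its own origin.
    try:
        public_key.verify(signature, challenge + REAL_ORIGIN.encode())
        return True
    except InvalidSignature:
        return False

challenge = os.urandom(32)

# Legitimate login: the user is on the real site, so the signed origin matches.
good_signature = device_sign(device_private_key, challenge, REAL_ORIGIN)
print(server_verify(registered_public_key, good_signature, challenge))     # True

# MITM phishing attempt: the relayed signature is bound to the spoofed origin,
# and there is no OTP-style secret for the user to hand over.
phished_signature = device_sign(device_private_key, challenge, "https://login.examp1e.com")
print(server_verify(registered_public_key, phished_signature, challenge))  # False

There is nothing here a user could read off a screen and relay to an attacker, which is exactly the property that makes this class of authenticators resistant to the MITM attack described earlier.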
Hello, DZone Community! We have a survey in progress as part of our original research for the upcoming Trend Report. We would love for you to join us by sharing your experiences and insights (anonymously if you choose) — readers just like you drive the content that we cover in our Trend Reports. Check out the details for our current research survey below. Over the coming months, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" section of our Trend Reports. Data Engineering Research As a continuation of our annual data-related research, we're consolidating our database, data pipeline, and data and analytics scopes into a single 12-minute survey that will help guide the narrative of our October Data Engineering Trend Report. Our 2024 Data Engineering survey explores: Vector data and databases + other AI-driven data capabilities Data pipelines, real-time processing, and structured storage Data architecture, observability, security, and governance Distributed database design + architectures Database types, languages, and use cases Join the Data Engineering Research You'll also have the chance to enter the $500 raffle at the end of the survey — five random people will be drawn and will receive $100 each (USD)! Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Publications team