
    Data modeling and the AI lifecycle

    Artificial Intelligence is only as good as the data it understands. A strong data model keeps AI grounded, ensuring accurate outputs, real-time reasoning, and seamless system integration.  Data modeling contributes to better AI by providing a structured framework for organizing, interpreting, and leveraging data effectively. It helps define the relationships between different data elements, ensuring that AI systems can access clean, consistent, and relevant information for training and decision-making.  By establishing clear data schemas and reducing ambiguity, data modeling improves the quality and accuracy of machine learning models, reduces bias, enhances interpretability, and enables scalability.  Ultimately, it lays the foundation for more intelligent, reliable, and efficient AI systems by aligning data with the goals and logic of AI algorithms.

     

     

    Conversely, AI can contribute to better data modeling by enabling automation and intelligence throughout the modeling lifecycle. The current state of GenAI does not replace the synthesized knowledge of a subject-matter expert: writing the prompts would take so much effort that the user might as well specify the details directly in a data model. Nevertheless, we see significant benefits in GenAI-aided data modeling, for example using GenAI to supplement descriptions and comments, or to help identify meaning in existing structures that lack proper descriptions.

     

    In such cases, the application can reverse-engineer Mermaid ERD code, potentially generated by GenAI outside Hackolade Studio. The application can also generate Mermaid diagrams from existing Hackolade data models (albeit with some limitations, to match Mermaid's own restrictions). GenAI can be leveraged for metadata enrichment, by generating meaningful descriptions for entities and attributes that subject-matter experts then edit, and by recommending attributes based on industry standards. Furthermore, AI can propose dimensional models derived from transactional schemas and suggest improvements such as better partition key choices, laying the groundwork for more efficient and standards-aligned data architectures.
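To give a feel for what Mermaid ERD code looks like, here is a minimal Python sketch that renders a small model as Mermaid `erDiagram` text. The `Model` structure, the `to_mermaid_erd` helper, and the sample entities are all hypothetical and invented for this illustration; this is not Hackolade Studio's API or its export format. Note how each attribute carries at most a single-column `PK` or `FK` marker, which reflects the Mermaid restrictions mentioned above.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model: entity name -> list of (type, attribute name, key marker).
# The key marker is "PK", "FK", or "" (no key), matching Mermaid's attribute key syntax.
Model = Dict[str, List[Tuple[str, str, str]]]


def to_mermaid_erd(model: Model, relationships: List[Tuple[str, str, str, str]]) -> str:
    """Render entities and relationships as Mermaid erDiagram text.

    Each relationship is (left entity, cardinality symbol, right entity, label),
    for example ("CUSTOMER", "||--o{", "ORDER", "places").
    """
    lines = ["erDiagram"]
    for entity, attributes in model.items():
        lines.append(f"    {entity} {{")
        for attr_type, attr_name, key in attributes:
            suffix = f" {key}" if key else ""
            lines.append(f"        {attr_type} {attr_name}{suffix}")
        lines.append("    }")
    for left, cardinality, right, label in relationships:
        lines.append(f"    {left} {cardinality} {right} : {label}")
    return "\n".join(lines)


# Sample data, assumed for illustration only.
model = {
    "CUSTOMER": [("int", "customer_id", "PK"), ("string", "name", "")],
    "ORDER": [("int", "order_id", "PK"), ("int", "customer_id", "FK")],
}
relationships = [("CUSTOMER", "||--o{", "ORDER", "places")]

erd_text = to_mermaid_erd(model, relationships)
print(erd_text)
```

The resulting text can be pasted into a GenAI prompt or any Mermaid renderer.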

    Data modeling contributes to better AI

    Developing an AI solution involves several iterative steps. The process begins with understanding the business problem and clearly defining objectives. Next, data is collected and explored to assess its structure, quality, and potential value, including identifying missing values, anomalies, biases, and uncovering patterns. The data is then prepared through cleaning and transformation, addressing issues such as incomplete data and bias. These early stages are supported by traditional data modeling practices: conceptual, logical, and physical data modeling that provide structure and context. 
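As a small illustration of the exploration step, the following Python sketch computes the share of missing values per column in a set of records. The `missing_value_ratio` helper and the sample rows are assumptions made for this example, and treating `None`, an empty string, or an absent key as "missing" is just one possible convention.

```python
from typing import Dict, List, Optional


def missing_value_ratio(rows: List[Dict[str, Optional[object]]]) -> Dict[str, float]:
    """Return, per column, the fraction of rows whose value is absent, None, or an empty string."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    ratios: Dict[str, float] = {}
    for column in columns:
        # row.get() returns None when the key is absent, so absent keys count as missing.
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        ratios[column] = missing / total if total else 0.0
    return ratios


# Hypothetical sample records with gaps in "email" and "country".
rows = [
    {"customer_id": 1, "email": "a@example.com", "country": "BE"},
    {"customer_id": 2, "email": "", "country": None},
    {"customer_id": 3, "email": None},
    {"customer_id": 4, "email": "d@example.com", "country": "US"},
]

ratios = missing_value_ratio(rows)
print(ratios)
```

A profile like this can feed back into the data model, for example when deciding whether an attribute should be declared as required.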

     

     

    Data modeling and the AI lifecycle

     

    Diagram courtesy of Dave Wells dwells@infocentrig.org

     

     

    This foundational work enables the development and training of algorithms using the prepared data. Model performance is then optimized through parameter tuning and evaluated using accuracy, precision, and alignment with business goals. Finally, the AI model is deployed in a real-world environment, where its performance is continuously monitored and refined in response to new data and feedback.
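For reference, accuracy is the share of all predictions that are correct, while precision is the share of positive predictions that are truly positive. The Python sketch below computes both for a binary classifier. The function names and the sample labels are assumptions made for this example, and returning 0.0 when nothing is predicted positive is one convention among several.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Fraction of positive predictions that are truly positive."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    predicted_positives = sum(1 for p in y_pred if p == positive)
    # Convention assumed here: precision is 0.0 when nothing was predicted positive.
    return true_positives / predicted_positives if predicted_positives else 0.0


# Hypothetical labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
print(acc, prec)
```

The two metrics can diverge, which is why neither should be read in isolation from the business goal.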

     

    AI contributes to better data modeling

    AI can contribute to better data modeling by automating and enhancing some aspects of the data modeling process. AI is not yet good enough at understanding the specifics of a domain to create entire data models. For now, the nuances of an organization, as understood by its subject-matter experts and data modelers, remain too complex for AI to handle. But it can assist in many ways to increase productivity.

    It can analyze large and complex datasets to identify hidden patterns, relationships, and anomalies that might be missed by human analysts. AI-driven tools can recommend optimal data structures, detect inconsistencies, and suggest schema improvements based on usage patterns and historical data. Machine learning algorithms also help in predictive modeling, enabling dynamic and adaptive models that evolve with new data. Additionally, AI can streamline tasks like data cleaning, entity recognition, and metadata generation, making the data modeling process faster, more accurate, and more scalable.

     

    The following features are currently on our roadmap:

    • Upcoming: reverse-engineer Mermaid ERD code that could have been produced by GenAI in response to some prompt executed outside of Hackolade
    • Then: generate Mermaid ERD code from a Hackolade Studio model to be used in a GenAI prompt.  Note that Mermaid has some limitations, such as lack of composite PKs/FKs, Not Null constraints, etc.
    • Still to be scheduled: use GenAI to create descriptions for selected entities and attributes of a model
    • Use GenAI to suggest attributes for given entities, according to industry-specific standards
    • Longer term: use GenAI to suggest an optimal dimensional model, given a transactional schema.  This could be done indirectly with the first 2 points in the list above.
    • Use GenAI to suggest more optimal modeling, choice of partition keys, etc.  This has not yet been designed.
    • and more, based on customer feedback and suggestions.
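To make the reverse-engineering idea in the first roadmap item more concrete, here is a minimal Python sketch that reads entity blocks out of Mermaid ERD text, such as a GenAI tool might produce. It is an assumption-laden illustration, not Hackolade Studio's implementation: the `parse_mermaid_entities` helper is hypothetical, and it handles only a small subset of the Mermaid syntax (entity blocks with `type name` attributes and optional `PK`, `FK`, or `UK` markers), ignoring relationship lines and attribute comments.

```python
import re
from typing import Dict, List

# Matches a line that opens an entity block, e.g. "    CUSTOMER {".
ENTITY_BLOCK = re.compile(r"^\s*([A-Za-z_][\w-]*)\s*\{\s*$")
KEY_MARKERS = ("PK", "FK", "UK")


def parse_mermaid_entities(erd_text: str) -> Dict[str, List[dict]]:
    """Extract entities and their attributes from a subset of Mermaid erDiagram text."""
    entities: Dict[str, List[dict]] = {}
    current = None  # name of the entity block being read, if any
    for line in erd_text.splitlines():
        stripped = line.strip()
        match = ENTITY_BLOCK.match(line)
        if match:
            current = match.group(1)
            entities[current] = []
        elif stripped == "}":
            current = None
        elif current is not None and stripped:
            parts = stripped.split()
            if len(parts) >= 2:
                # Keep only recognized key markers; anything else (e.g. comments) is ignored.
                keys = [p.rstrip(",") for p in parts[2:] if p.rstrip(",") in KEY_MARKERS]
                entities[current].append({"type": parts[0], "name": parts[1], "keys": keys})
    return entities


# Sample Mermaid ERD text, assumed for illustration only.
sample = """erDiagram
    CUSTOMER ||--o{ ORDER : places
    CUSTOMER {
        int customer_id PK
        string name
    }
    ORDER {
        int order_id PK
        int customer_id FK
    }
"""

entities = parse_mermaid_entities(sample)
print(entities)
```

Because Mermaid cannot express details such as Not Null constraints, a structure recovered this way would still need review and enrichment by a data modeler.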

     

    Note that any direct AI interaction from Hackolade Studio is planned to be entirely optional: users will be able to disable it, whether by preference or because the policy of their organization mandates it.