Large Language Models in Modern Data Engineering: A Systematic Review of Architectures, Use Cases, and Limitations
Abstract
The rapid advancement of large language models (LLMs) since 2022
has significantly reshaped modern data engineering practices. Originally
developed for natural language processing tasks, LLMs are increasingly
integrated into data engineering workflows, including data ingestion, schema inference, metadata
generation, transformation logic synthesis, data quality monitoring, and natural-language interaction with
analytical systems. This systematic review examines the role of LLMs in contemporary data engineering,
focusing on architectural integration patterns, practical use cases across the data lifecycle, and inherent
limitations affecting reliability and governance. In accordance with PRISMA-informed guidelines, peer-reviewed
articles, preprints, and industry reports published between 2022 and 2025 were analyzed. The review
identifies Retrieval-Augmented Generation (RAG), hybrid vector-database architectures, and agent-based
orchestration frameworks as dominant deployment strategies. Evidence suggests that LLM-assisted
pipelines improve developer productivity, reduce manual coding overhead, and enhance accessibility of
data platforms for non-technical stakeholders. However, persistent challenges remain, including
hallucination, data privacy risks, limited explainability, operational costs, and scalability constraints. The
findings emphasize the need for robust architectural safeguards, evaluation benchmarks, and governance
frameworks to ensure safe and effective production adoption. This review contributes a structured
taxonomy of LLM-centric data engineering architectures and outlines future research directions to support
trustworthy, scalable, and auditable data platforms.
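
For readers unfamiliar with the RAG pattern named above, the following is a minimal, illustrative sketch of retrieval-augmented generation over data-platform metadata. It uses only a toy in-memory vector store, a character-frequency stand-in for an embedding model, and a placeholder model call; all names are hypothetical and are not drawn from the reviewed literature or any specific library.

# Illustrative RAG sketch (hypothetical names; toy embedding and placeholder LLM call).
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized character-frequency vector over a-z.
    # Stands in for a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class InMemoryVectorStore:
    """Tiny stand-in for a vector database holding schema/metadata snippets."""
    def __init__(self) -> None:
        self._docs: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self._docs.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank stored snippets by similarity to the query and return the top k.
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM client call; returns a canned response here.
    return "(model response grounded in retrieved context)"

def answer_with_rag(question: str, store: InMemoryVectorStore) -> str:
    # Retrieve grounding context, then assemble a prompt for the generator.
    context = "\n".join(store.search(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add("orders table: order_id, customer_id, order_date, total_amount")
    store.add("customers table: customer_id, name, signup_date, region")
    print(answer_with_rag("Which table holds customer signup dates?", store))

A production deployment of this pattern would replace the toy store and placeholder call with a managed vector database, a real embedding model, and an LLM client, and would add the retrieval-grounding and auditing safeguards the review discusses.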
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.