Large Language Models in Modern Data Engineering: A Systematic Review of Architectures, Use Cases, and Limitations
Abstract
The rapid advancement of large language models (LLMs) since 2022
has significantly reshaped modern data engineering practices. Originally
developed for natural language processing tasks, LLMs are increasingly
integrated into data engineering workflows, including data ingestion, schema inference, metadata
generation, transformation logic synthesis, data quality monitoring, and natural-language interaction with
analytical systems. This systematic review examines the role of LLMs in contemporary data engineering,
focusing on architectural integration patterns, practical use cases across the data lifecycle, and inherent
limitations affecting reliability and governance. In accordance with PRISMA-informed guidelines, peer-reviewed
articles, preprints, and industry reports published between 2022 and 2025 were analyzed. The review
identifies Retrieval-Augmented Generation (RAG), hybrid vector-database architectures, and agent-based
orchestration frameworks as dominant deployment strategies. Evidence suggests that LLM-assisted
pipelines improve developer productivity, reduce manual coding overhead, and enhance accessibility of
data platforms for non-technical stakeholders. However, persistent challenges remain, including
hallucination, data privacy risks, limited explainability, operational costs, and scalability constraints. The
findings emphasize the need for robust architectural safeguards, evaluation benchmarks, and governance
frameworks to ensure safe and effective production adoption. This review contributes a structured
taxonomy of LLM-centric data engineering architectures and outlines future research directions to support
trustworthy, scalable, and auditable data platforms.
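
For readers unfamiliar with the RAG pattern named above, the following is a minimal, illustrative sketch of retrieval-augmented generation over data-platform metadata. It uses only a toy in-memory vector store, a character-frequency stand-in for an embedding model, and a placeholder model call; all names are hypothetical and are not drawn from the reviewed literature or any specific library.

# Illustrative RAG sketch (hypothetical names; toy embedding and placeholder LLM call).
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized character-frequency vector over a-z.
    # Stands in for a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class InMemoryVectorStore:
    """Tiny stand-in for a vector database holding schema/metadata snippets."""
    def __init__(self) -> None:
        self._docs: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self._docs.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Rank stored snippets by similarity to the query and return the top k.
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM client call; returns a canned response here.
    return "(model response grounded in retrieved context)"

def answer_with_rag(question: str, store: InMemoryVectorStore) -> str:
    # Retrieve grounding context, then assemble a prompt for the generator.
    context = "\n".join(store.search(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add("orders table: order_id, customer_id, order_date, total_amount")
    store.add("customers table: customer_id, name, signup_date, region")
    print(answer_with_rag("Which table holds customer signup dates?", store))

A production deployment of this pattern would replace the toy store and placeholder call with a managed vector database, a real embedding model, and an LLM client, and would add the retrieval-grounding and auditing safeguards the review discusses.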
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.