Secure Code Generation with CodeT5: Leveraging Large Language Models and CVE Dataset

  • Sanjana Ganesh Nayak
  • Anirudha Rao
  • B. L. Siddhartha Bhat
  • Mamatha Balachandra*
  • Om Prakash
  • Varinder Pratap Singh
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In the last couple of years, generative models, especially large language models drawing on recent advances in AI, have emerged as promising tools for numerous applications. This work illustrates how a large language model, CodeT5, can enhance secure text-to-code generation. CodeT5 is a unified pre-trained encoder–decoder transformer model that benefits from the semantic hints carried by developer-assigned identifier names, improving code understanding and promoting trustworthy text-to-code generation. It addresses gaps by incorporating an identifier-aware pre-training task and by connecting natural language to programming language abstractions through user-written code comments. To enhance code security, CodeT5 is trained on a large CVE dataset, leveraging code snippets from before and after security patches. This hybrid paradigm promotes secure coding as well as AI-enhanced software engineering.

Original language: English
Title of host publication: Information Systems for Intelligent Systems - Proceedings of ISBM 2024
Editors: Chakchai So In, Narendra S. Londhe, Nityesh Bhatt, Meelis Kitsing
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 29-39
Number of pages: 11
ISBN (Print): 9789819612055
Publication status: Published - 2025
Event: 3rd World Conference on Information Systems for Business Management, ISBM 2024 - Bangkok, Thailand
Duration: 12-09-2024 to 13-09-2024

Publication series

Name: Smart Innovation, Systems and Technologies
Volume: 430 SIST
ISSN (Print): 2190-3018
ISSN (Electronic): 2190-3026

Conference

Conference: 3rd World Conference on Information Systems for Business Management, ISBM 2024
Country/Territory: Thailand
City: Bangkok
Period: 12-09-24 to 13-09-24

All Science Journal Classification (ASJC) codes

  • General Decision Sciences
  • General Computer Science
