Paper: TelecomGPT: A Framework to Build Telecom-Specific Large Language Models
Authors: Hang Zou, Qiyang Zhao, Yu Tian, Lina Bariah, Faouzi Bader, Thierry Lestable, Merouane Debbah
TL;DR — We built a three-stage pipeline (continual pre-training → instruction tuning → alignment) to adapt open-source LLMs for telecom. Key results:
- 75.3% on 3GPP document classification — nearly 2x GPT-4o’s 38.9%
- Slightly outperforms GPT-4 on telecom math equation reconstruction (49.45 vs 49.38 average MathBERT score), with 2.5x more near-perfect reconstructions
- Nearly 4x improvement on telecom code infilling over base Llama3-8B-Instruct
- All built on 7-8B parameter models — a fraction of GPT-4’s size and cost
Motivation
Large Language Models like GPT-4 and Llama-3 are impressive generalists, but they struggle with the telecom domain. Ask GPT-4 to classify a 3GPP technical specification into the correct working group, and it gets it right less than 40% of the time. Ask it to infill a missing equation in a wireless communications paper, and only about half of its reconstructions reach even 50% semantic similarity to the original.
Why? Telecom is a deeply specialized field with its own vocabulary, mathematical conventions, protocol structures, and standards ecosystem. The 3GPP corpus alone spans tens of thousands of pages of dense, interconnected specifications. General-purpose LLMs simply haven’t seen enough of this material — and even when they have, they haven’t been trained to reason about it effectively.
TelecomGPT addresses this gap head-on. Rather than building a telecom LLM from scratch (prohibitively expensive), we developed a systematic pipeline to adapt existing open-source LLMs — Llama-2-7B, Mistral-7B, and Llama-3-8B — into telecom specialists. Along the way, we also built the evaluation infrastructure that the field was missing.
The Three-Stage Pipeline
Our adaptation pipeline consists of three sequential stages, each building on the previous one (see the cover figure above).
Stage 1: Continual Pre-training
The first challenge is data. We built OpenTelecom, a curated pre-training corpus of 1.68 billion tokens drawn from diverse telecom sources:
| Source | Tokens (M) | Share |
|---|---|---|
| ArXiv papers | 893 | 53.2% |
| Code (GitHub) | 260.1 | 15.5% |
| Patents (C4) | 253.2 | 15.1% |
| 3GPP Standards | 193 | 11.5% |
| StackExchange | 51.9 | 3.1% |
| Wikipedia | 18.9 | 1.1% |
| IEEE Standards | 7.5 | 0.5% |
| Books | 1.9 | 0.1% |
To ensure relevance, we applied a keyword-based filtering pipeline using over 700 domain-specific keywords spanning wireless communications, signal processing, networking protocols, and standards terminology. The dataset was then deduplicated and quality-filtered to avoid contaminating the model with irrelevant content.
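A minimal sketch of what such a keyword filter might look like (the keyword excerpt, threshold, and helper name are illustrative assumptions, not the paper's exact pipeline):

```python
# Hypothetical excerpt of the ~700-keyword telecom vocabulary.
TELECOM_KEYWORDS = {
    "3gpp", "5g nr", "beamforming", "channel estimation", "gnb",
    "handover", "ldpc", "mimo", "ofdm", "rrc", "scheduling", "urllc",
}

def is_telecom_relevant(text: str, min_hits: int = 3) -> bool:
    """Keep a document only if it mentions enough distinct telecom keywords."""
    lowered = text.lower()
    return sum(kw in lowered for kw in TELECOM_KEYWORDS) >= min_hits

docs = [
    "OFDM with MIMO beamforming improves 5G NR cell-edge throughput.",
    "A family recipe for sourdough bread.",
]
telecom_docs = [d for d in docs if is_telecom_relevant(d, min_hits=2)]
print(telecom_docs)  # keeps only the first document
```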
Continual pre-training uses standard causal language modeling — the model learns to predict the next token, but now on telecom-specific text. This stage injects domain vocabulary and factual knowledge into the model’s parameters.
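In Hugging Face terms, a single continual pre-training step is just a forward pass where the inputs serve as their own labels. This sketch is illustrative only; real training iterates over the full corpus with a Trainer, optimizer, and distributed setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B"  # one of the paper's base models
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

batch = tokenizer(
    "The gNB schedules PDSCH resources based on the reported CQI ...",
    return_tensors="pt",
)
# With labels == input_ids, the model internally shifts the targets and
# computes the next-token cross-entropy loss (causal language modeling).
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
```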
Stage 2: Instruction Tuning (SFT)
Raw knowledge isn’t enough — the model needs to know how to use it. We created TelecomInstruct, a diverse instruction dataset covering the tasks below (an illustrative record follows the list):
- Multiple-choice QA — answering questions about telecom standards and research
- Open-ended QA — free-form answers to telecom questions
- Document classification — identifying which 3GPP working group a text belongs to
- Math equation generation — filling in missing equations from wireless communications papers
- Code tasks — summarization, analysis, infilling, and generation of telecom-related code
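For concreteness, a TelecomInstruct-style record might look like the following; the field names and the Alpaca-style prompt template are assumptions, not the paper's exact schema:

```python
# Hypothetical TelecomInstruct-style record (field names are illustrative):
example = {
    "task": "document_classification",
    "instruction": ("Classify the following 3GPP excerpt into its "
                    "working group (e.g., RAN1, RAN2, SA2, CT1)."),
    "input": "The UE shall start the timeAlignmentTimer upon reception ...",
    "output": "RAN2",
}

# One common SFT convention: flatten the record into a single training string.
prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n"
    f"### Response:\n{example['output']}"
)
```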
We used QLoRA (Quantized Low-Rank Adaptation) for parameter-efficient fine-tuning with rank $r = 512$ and scaling factor $\alpha = 256$, training for 3 epochs. This keeps the computational cost manageable — about 1.5 hours on 8 GPUs.
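A hedged sketch of that QLoRA setup with peft and bitsandbytes: the rank and scaling factor come from the paper, while the dropout value and target modules are typical Llama-style choices, assumed here rather than confirmed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb)

# r and lora_alpha as reported in the paper; dropout and target modules
# are common defaults, assumed for this sketch.
lora = LoraConfig(
    r=512, lora_alpha=256, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters are trainable
```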
Stage 3: Alignment Tuning (DPO)
The final stage uses Direct Preference Optimization (DPO) to align the model’s outputs with human preferences. We created TelecomAlign, a preference dataset where each example contains a prompt, a preferred (chosen) response, and a rejected response.
In plain terms, DPO teaches the model to increase the probability of good responses and decrease the probability of bad ones — without needing a separate reward model. It optimizes the policy directly using the objective:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right)\right]$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the reference (SFT) model, $y_w$ and $y_l$ are the preferred and rejected responses, and $\beta = 0.1$ controls the deviation from the reference policy.
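Translated into code, the objective is only a few lines. This sketch assumes you have already computed summed per-token log-probabilities for each response; in practice, libraries such as TRL package this up:

```python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """DPO loss from the equation above.

    Each argument is a tensor of summed per-token log-probabilities
    log pi(y|x) for the chosen (w) and rejected (l) responses.
    """
    # beta * [log-ratio of the chosen response - log-ratio of the rejected one]
    logits = beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))
    return -F.logsigmoid(logits).mean()
```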
Five Novel Evaluation Benchmarks
A recurring problem in telecom AI research is the lack of standardized evaluation. We introduced five benchmark categories specifically designed for the telecom domain.
Model name key: In the tables below, suffixes indicate training stages applied — TP = continual pre-training, TI = instruction tuning, TA = alignment tuning (DPO). For example, Llama3-8B-TI-TA means Llama3-8B after both instruction tuning and alignment.
1. Telecom MCQ (Multiple-Choice Questions)
500 questions across five categories: Lexicon, Research Overview, Research Publications, Standards Overview, and Standards Specifications (the table below shows selected categories plus the overall score). Notice how our instruction-tuned models close the gap to GPT-4, especially on Lexicon, where Llama3-8B-TI scores 96%, ahead of both GPT-4 and GPT-4o:
| Model | Lexicon | Research | Standards Spec. | Overall |
|---|---|---|---|---|
| GPT-4o | 92.0 | 79.5 | 62.8 | 78.0 |
| GPT-4 | 92.0 | 78.0 | 60.0 | 75.0 |
| Llama3-8B-Instruct | 80.0 | 69.8 | 50.0 | 64.8 |
| Llama3-8B-TI | 96.0 | 74.9 | 56.4 | 71.2 |
| Mistral-7B-Instruct | 84.0 | 65.0 | 51.0 | 62.0 |
| Mistral-7B-TI | 84.0 | 70.7 | 51.1 | 65.2 |
Our instruction-tuned Llama3 model achieves 71.2% overall — closing much of the gap to GPT-4 (75%) despite being a fraction of its size.
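For concreteness, an MCQ evaluation might be driven by a prompt like the one below; the template and the answer-extraction heuristic are assumptions, not the paper's exact protocol:

```python
# Hypothetical MCQ prompt; the paper's exact template may differ.
question = "Which 3GPP working group specifies the RRC protocol?"
options = {"A": "RAN1", "B": "RAN2", "C": "SA2", "D": "CT1"}

prompt = (
    "Answer the following telecom multiple-choice question "
    "with a single letter.\n\n"
    f"Question: {question}\n"
    + "\n".join(f"{k}. {v}" for k, v in options.items())
    + "\nAnswer:"
)

def extract_choice(completion: str) -> str | None:
    """Take the first A-D letter in the completion as the model's answer."""
    for ch in completion.strip().upper():
        if ch in options:
            return ch
    return None
```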
2. 3GPP Document Classification
Classifying technical specification text into 16 working groups across three TSG categories (RAN, SA, CT). This is where TelecomGPT truly shines — look at the CT column, where general-purpose models completely fail:
| Model | RAN | SA | CT | Overall |
|---|---|---|---|---|
| GPT-4o | 44.1 | 47.6 | 17.3 | 38.9 |
| GPT-3.5 | 42.9 | 48.6 | 16.3 | 38.5 |
| Llama3-8B-Instruct | 39.1 | 38.7 | 16.2 | 33.4 |
| Llama3-8B-TI | 82.8 | 68.8 | 73.6 | 75.3 |
TelecomGPT nearly doubles GPT-4o’s accuracy on this task. General-purpose models simply lack the fine-grained understanding of 3GPP’s organizational structure that our domain-specific training provides.
3. Telecom Math Modeling
Perhaps our most novel benchmark: given a research paper with an equation removed, can the model reconstruct it? We evaluate ~600 equations from 170 papers using MathBERT embeddings for semantic similarity scoring. Pay attention to the $\geq$90% column — the share of near-perfect reconstructions:
| Model | Avg. Score | $\geq$90% | $\geq$50% |
|---|---|---|---|
| GPT-4 | 49.38 | 3.77% | 50.35% |
| Llama3-8B-Instruct | 40.78 | 2.51% | 34.45% |
| Llama3-8B-TI-TA | 49.45 | 9.52% | 50.73% |
| Mistral-7B-Instruct | 35.54 | 1.53% | 29.43% |
| Mistral-7B-TI-TA | 48.11 | 7.22% | 49.26% |
After both instruction tuning and alignment, our Llama3-based TelecomGPT slightly outperforms GPT-4 on average score (49.45 vs 49.38) and produces substantially more near-perfect reconstructions (9.52% vs 3.77% at $\geq$90% threshold).
Figure 2: Cumulative density function of MathBERT cosine similarity scores. TelecomGPT’s curve (brown) closely tracks or exceeds GPT-4 (purple) across the full score range, while significantly outperforming all other open-source base models.
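Scoring works by embedding the predicted and ground-truth equations and comparing them. Here is a sketch using the public tbs17/MathBERT checkpoint; whether that matches the paper's exact scorer, and the mean-pooling choice, are assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "tbs17/MathBERT"  # a public MathBERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

def mathbert_score(pred_eq: str, gold_eq: str) -> float:
    """Cosine similarity between mean-pooled MathBERT embeddings."""
    def embed(eq: str) -> torch.Tensor:
        inputs = tok(eq, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state    # (1, T, H)
        mask = inputs["attention_mask"].unsqueeze(-1)       # (1, T, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1) # (1, H)
    return F.cosine_similarity(embed(pred_eq), embed(gold_eq)).item()
```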
4. Telecom Code Tasks
Four subtasks: code summarization, code analysis, code infilling, and code generation — all using telecom-specific code (MATLAB signal processing, Python simulations, etc.). Scores are ROUGE-1 (R1). The Infilling column tells the clearest story:
| Model | Summary (R1) | Analysis (R1) | Infilling (R1) | Generation (R1) |
|---|---|---|---|---|
| Llama3-8B-Instruct | 0.320 | 0.334 | 0.113 | 0.191 |
| Llama3-8B-TI-TA | 0.517 | 0.405 | 0.434 | 0.271 |
| Mistral-7B-Instruct | 0.364 | 0.398 | 0.106 | 0.194 |
| Mistral-7B-TI-TA | 0.575 | 0.547 | 0.439 | 0.260 |
The improvements on code infilling are striking: from 0.113 to 0.434 ROUGE-1 for Llama3, almost a 4x improvement. This demonstrates that the model has learned the patterns and conventions of telecom-specific code.
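As a reference point, ROUGE-1 scores like these can be computed with the rouge_score package; the paper's exact configuration (stemming, aggregation) is an assumption here:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

# Illustrative reference/candidate pair for a code infilling task.
reference = "h = channel_gain * np.exp(1j * phase)"
candidate = "h = channel_gain * np.exp(1j * phase_offset)"
r1 = scorer.score(reference, candidate)["rouge1"].fmeasure
print(f"ROUGE-1 F1: {r1:.3f}")
```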
5. Instruction Following
We evaluate on both general telecom instructions and protocol-specific instructions (e.g., “Explain the RRC connection setup procedure”). The Open QA column reveals the starkest gap between base and adapted models:
| Model | Open QA (R1) | General Instr. (R1) | Protocol Instr. (R1) |
|---|---|---|---|
| Llama3-8B-Instruct | 0.055 | 0.188 | 0.204 |
| Llama3-8B-TI-TA | 0.417 | 0.414 | 0.353 |
The open-ended QA improvement — from 0.055 to 0.417 — is particularly noteworthy. The base instruct model essentially could not answer open-ended telecom questions; TelecomGPT can.
What Each Training Stage Contributes
A natural question: is every stage necessary? Our ablation analysis shows each stage serves a distinct purpose:
| Stage | What it adds | Evidence |
|---|---|---|
| Continual Pre-training (TP) | Domain vocabulary and factual knowledge | ~4% MCQ accuracy boost for Llama2 |
| Instruction Tuning (TI) | Task-following ability across telecom tasks | Dramatic improvements across all benchmarks |
| Alignment (TA) | Response quality and conciseness | Consistent gains on math modeling and open QA |
The biggest jump comes from instruction tuning — but alignment provides meaningful refinements, particularly for generation-heavy tasks like math modeling and open-ended QA.
Key Takeaways
Domain adaptation works. A 7-8B parameter model, properly adapted, can match or exceed GPT-4 on specialized telecom tasks — at a fraction of the inference cost.
Data curation matters more than data volume. Our 1.68B token pre-training corpus is tiny by modern standards, but careful filtering with domain keywords ensures high relevance.
Evaluation infrastructure is as important as the model. Before TelecomGPT, there was no standardized way to measure how well an LLM understands telecom. Our five benchmarks cover the breadth of skills that a telecom AI assistant needs.
The 3GPP classification result is a wake-up call. GPT-4o at 38.9% vs TelecomGPT at 75.3% — this isn’t a marginal improvement. It demonstrates that domain-specific training unlocks capabilities that scale alone cannot provide.
Open-source models can be competitive. By building on Llama and Mistral, our entire pipeline is reproducible and extensible by the research community.
What’s Next?
TelecomGPT is a framework, not a final product. Future directions include:
- Scaling up — applying the pipeline to larger base models (70B+)
- Multimodal telecom AI — incorporating network diagrams, signal visualizations, and protocol flow charts
- RAG integration — combining TelecomGPT with retrieval over live 3GPP specifications
- Agent capabilities — enabling TelecomGPT to interact with network simulators and configuration tools
If you’re interested in the details, check out the full paper on arXiv.
This post summarizes our work on TelecomGPT, published in July 2024. Since then, we’ve extended the vision to the physical layer — see RF-GPT: Teaching Language Models to See the Invisible Spectrum, where we give a language model its first glimpse of radio-frequency signals. If you’re working on domain-adapted LLMs for telecom or related fields, feel free to connect with me on LinkedIn.
