Domain-Specific Languages (DSL): "Code That Doesn't Look Like Code" in the Backend World

Introduction

In a real-world case, engineer Armin used AI to build a set of infrastructure services at his new company, totaling about 40,000 lines of code (Go + YAML + Pulumi + SDK glue code), of which over 90% was generated by AI. This case involved many terms unfamiliar to beginners: YAML, Pulumi, HCL, Lua, SDK glue code... They're neither Python nor JavaScript, yet they're ubiquitous in backend projects. This article will systematically introduce these technologies from a unified perspective — Domain-Specific Languages (DSL).

Learning Objectives

In backend development, alongside the business logic written in general-purpose programming languages (Python, Go, Java, etc.), there are many files written for different purposes and in different syntaxes, none of which are general-purpose programming languages. They share a common umbrella concept: DSL (Domain-Specific Language).

After reading this article, you will be able to:

  • Understand the essential difference between DSL and general-purpose programming languages (GPL)
  • Master the DSL classification system: data serialization formats, embedded scripting languages, infrastructure definition languages
  • Distinguish the applicable scenarios of XML, JSON, YAML, TOML, CSV, Protobuf, and other data formats
  • Understand the design purpose of embedded scripting languages like Lua
  • Explain the principles and differences between Terraform (HCL) and Pulumi
  • Understand how the OpenAPI specification and SDK auto-generation work
  • Judge which types of code are suitable for AI generation

| Chapter | Topic | Core Concepts |
| --- | --- | --- |
| Chapter 1 | DSL overview | DSL vs GPL definitions, classification system, and landscape |
| Chapter 2 | Data serialization formats | XML, JSON, YAML, TOML, CSV, Protobuf, etc. |
| Chapter 3 | Embedded scripting languages | Design philosophy and typical applications of Lua and similar languages |
| Chapter 4 | Infrastructure as Code | Principles and comparison of Terraform (HCL) and Pulumi |
| Chapter 5 | Glue code and SDK generation | OpenAPI specification and client code auto-generation |
| Chapter 6 | AI and DSL | Why AI is particularly good at generating DSL code |

1. DSL Overview: Another World Beyond General-Purpose Languages

1.1 What Is a DSL?

A DSL (Domain-Specific Language) is a language designed for a specific domain or task. In contrast, a GPL (General-Purpose Language) such as Python, Java, Go, or C++ is designed to solve arbitrary computational problems.

The core differences:

| Dimension | GPL (General-Purpose Language) | DSL (Domain-Specific Language) |
| --- | --- | --- |
| Design goal | Solve arbitrary computational problems | Solve problems in a specific domain |
| Expressive range | Turing-complete, can theoretically compute anything | Usually intentionally limited in expressive range |
| Learning cost | Higher, requires understanding the full language system | Lower, only need to understand domain concepts |
| Typical examples | Python, Java, Go, C++, JavaScript | SQL, HTML/CSS, Regular expressions, YAML, HCL |

You've actually been using DSLs all along:

  • SQL is a DSL for the database querying domain — you use SELECT * FROM users WHERE age > 18 to query data, rather than manually writing traversal logic in Python
  • HTML/CSS are DSLs for web structure and styling — you use tags and attributes to describe pages, rather than manipulating pixels in C++
  • Regular expressions are a DSL for text pattern matching — you use \d{3}-\d{4} to match phone numbers, rather than writing character comparison loops by hand
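
The contrast in the list above can be made concrete with a short Python sketch. It runs the same filter once as a hand-written loop and once as a SQL query (using the standard-library `sqlite3` module with an in-memory database), then applies the phone-number pattern with `re`. The sample users and the sample sentence are made up for illustration.

```python
import re
import sqlite3

users = [("Alice", 30), ("Bob", 17), ("Carol", 25)]

# General-purpose approach: spell out *how* to traverse and filter.
adults_loop = [name for name, age in users if age > 18]

# DSL approach: state *what* you want and let the SQL engine work out how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
adults_sql = [
    row[0]
    for row in conn.execute("SELECT name FROM users WHERE age > 18 ORDER BY name")
]
conn.close()

# Regular expression DSL: one pattern replaces a hand-written character loop.
phone_pattern = re.compile(r"\d{3}-\d{4}")
phones = phone_pattern.findall("Call 555-1234 or 555-9876 today")

print(adults_loop)
print(adults_sql)
print(phones)
```

In both DSL cases the Python code only hands a string to an engine; the engine decides how to do the work.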

1.2 DSL Classification

DSLs can be divided into two major categories based on whether they have their own independent syntax and parser or are built inside a host language:

External DSL

Has independent syntax and parser, not dependent on any general-purpose programming language. The code written by users is processed by a dedicated interpreter or compiler.

  • Pure data description type: JSON, YAML, XML, TOML, CSV, Protobuf (contains no logic whatsoever)
  • Query/operation type: SQL, GraphQL, regular expressions (limited logic capability)
  • Domain modeling type: HCL (Terraform), Dockerfile, Nginx configuration syntax (declaratively describes the state of a specific domain)

Internal DSL (Embedded DSL)

Parasitically lives inside a general-purpose programming language, using the host language's syntax to build domain-specific expressions. The code itself is valid host language code, but reads like a specialized language.

  • Pulumi (written in TypeScript/Python/Go, but the API design reads like declarative configuration)
  • Ruby on Rails route definitions (get '/users', to: 'users#index' — valid Ruby code, but reads like configuration)
  • Test framework assertion syntax (expect(value).toBe(42) — valid JavaScript, but reads like natural language)
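
To show how an internal DSL is built from ordinary host-language features, here is a minimal Python sketch modeled loosely on the `expect(value).toBe(42)` style. The names `Expectation`, `expect`, `to_be`, and `to_be_greater_than` are hypothetical and do not come from any real test framework.

```python
class Expectation:
    """Wraps a value so checks can be chained as readable method calls."""

    def __init__(self, value):
        self.value = value

    def to_be(self, expected):
        if self.value != expected:
            raise AssertionError(f"expected {expected!r}, got {self.value!r}")
        return self  # returning self allows further chained checks

    def to_be_greater_than(self, limit):
        if not self.value > limit:
            raise AssertionError(f"expected a value > {limit!r}, got {self.value!r}")
        return self


def expect(value):
    """Entry point of the mini-DSL."""
    return Expectation(value)


# Ordinary Python method calls, but they read like a small assertion language.
expect(6 * 7).to_be(42).to_be_greater_than(40)
```

No new parser is involved: the "language" is only a class whose methods return `self`, which is the usual way an internal DSL borrows the host language's syntax.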

1.3 DSL Landscape in Backend Projects

In a typical backend project, you'll encounter the following categories of DSLs:

DSLs in Backend Projects
├── Data Serialization Formats (describe data structures)
│   ├── Text formats: JSON, YAML, XML, TOML, CSV, INI
│   └── Binary formats: Protobuf, MessagePack, Avro, BSON
├── Embedded Scripting Languages (programmable configuration layer)
│   ├── Lua (game engines, Nginx, Redis)
│   ├── GDScript (Godot engine)
│   └── Jsonnet (configuration template generation)
├── Infrastructure and Ops DSLs (declaratively describe system state)
│   ├── HCL (Terraform)
│   ├── Dockerfile / Docker Compose YAML
│   └── Nginx / Apache configuration syntax
└── Interface Description Languages (describe API contracts)
    ├── OpenAPI / Swagger
    ├── Protocol Buffers (.proto files)
    └── GraphQL Schema

With this landscape in mind, the following chapters will unfold each branch one by one.


2. Data Serialization Formats: Describing Structured Data in Text

2.1 What Is Data Serialization?

Serialization is the process of converting in-memory data structures (objects, dictionaries, arrays, etc.) into a storable or transmissible text/byte stream. The reverse process — restoring in-memory data structures from text/byte streams — is called deserialization.
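
A minimal round trip in Python, using the standard-library `json` module, illustrates both directions. The `config` dictionary is an arbitrary example.

```python
import json

config = {"database": {"host": "localhost", "port": 5432}, "debug": True}

# Serialization: in-memory dict -> text that can be stored or sent.
text = json.dumps(config)

# Deserialization: text -> an equivalent in-memory dict.
restored = json.loads(text)

print(type(text).__name__)
print(restored == config)
```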

Data serialization formats are the most fundamental category of DSLs — they are pure data description external DSLs with no logic capability, responsible only for statically describing "what the value is."

2.2 Why Do We Need These Formats?

Suppose you've developed a backend service with a database address of localhost:5432. If you hardcode this address in the source code, local development works fine, but when deploying to production, the database address changes to db.prod.company.com:5432, and you'd need to modify the source code and recompile.

The standard engineering practice is: separate variable parameters from code and store them in independent configuration files. The program reads the configuration file at startup and determines its behavior based on the values within.

Beyond configuration, data serialization formats are widely used for: data exchange between systems (API requests/responses), data persistence storage, cross-language communication, and more.

2.3 Human-Readable Text Formats

The following are the most common text serialization formats in engineering, introduced in chronological order.

INI

One of the earliest configuration formats, originating from Windows systems. Simple structure, composed of sections and key-value pairs:

ini
[database]
host = localhost
port = 5432

[server]
debug = true

The advantage is strong readability. The limitation is no support for nested structures or array types, making it unable to express complex configurations. Currently mainly found in legacy systems and some Linux configurations (like php.ini, my.cnf).
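
As a rough illustration of how a program consumes such a file, the sketch below parses the INI text above with Python's standard-library `configparser`. Note that the parser hands back plain strings; the program has to ask for a type conversion explicitly.

```python
import configparser

ini_text = """
[database]
host = localhost
port = 5432

[server]
debug = true
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

raw_port = parser["database"]["port"]         # plain string "5432"
port = parser.getint("database", "port")      # converted on request
debug = parser.getboolean("server", "debug")  # "true" -> True

print(repr(raw_port), port, debug)
```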

CSV

CSV (Comma-Separated Values) is the simplest tabular data format:

csv
name,age,city
Alice,30,Beijing
Bob,25,Shanghai

Each row is a record, with fields separated by commas. CSV is widely used for data import/export, spreadsheet exchange, and data analysis pipelines. Its limitation is that it can only express flat two-dimensional tables, doesn't support nested structures, and has no type information (all values are strings).
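
The "no type information" limitation is easy to see when reading the sample above with Python's standard-library `csv` module; this is only a small sketch of the idea.

```python
import csv
import io

csv_text = "name,age,city\nAlice,30,Beijing\nBob,25,Shanghai\n"

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every field arrives as a string; the caller converts types explicitly.
ages = [int(row["age"]) for row in rows]

print(rows[0])
print(ages)
```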

XML

XML (eXtensible Markup Language) was born in 1998 and was once the mainstream standard for data exchange:

xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
  <database>
    <host>localhost</host>
    <port>5432</port>
  </database>
  <server>
    <debug>true</debug>
    <allowed_origins>
      <origin>https://example.com</origin>
      <origin>https://app.example.com</origin>
    </allowed_origins>
  </server>
</config>

XML has very strong expressive power, supporting nesting, attributes, namespaces, Schema validation, and other advanced features. But its syntax is verbose — large amounts of opening/closing tags result in a low signal-to-noise ratio, and the experience of writing and reading by hand is poor.

XML is still widely used in:

  • Java ecosystem (Maven's pom.xml, Spring configuration, Android layout files)
  • Enterprise web services (SOAP protocol)
  • Office document formats (.docx, .xlsx are essentially ZIP-compressed collections of XML files)
  • RSS/Atom feeds, SVG vector graphics
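
For comparison with the other formats, here is a sketch of reading a trimmed version of the configuration document above with Python's standard-library `xml.etree.ElementTree`. Nested elements are addressed with simple path expressions.

```python
import xml.etree.ElementTree as ET

xml_text = """<config>
  <database>
    <host>localhost</host>
    <port>5432</port>
  </database>
  <server>
    <allowed_origins>
      <origin>https://example.com</origin>
      <origin>https://app.example.com</origin>
    </allowed_origins>
  </server>
</config>"""

root = ET.fromstring(xml_text)

host = root.findtext("database/host")
port = int(root.findtext("database/port"))  # element text is always a string
origins = [el.text for el in root.findall("server/allowed_origins/origin")]

print(host, port, origins)
```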

JSON

JSON (JavaScript Object Notation) was born in 2001 and rapidly replaced XML as the de facto standard for Web API data exchange due to its simplicity:

json
{
  "database": {
    "host": "localhost",
    "port": 5432
  },
  "server": {
    "debug": true
  }
}

Its advantages are clear structure and native parsing support in virtually all programming languages. The main drawback is no support for comments, and the numerous brackets and quotes are error-prone when writing by hand. JSON is also the standard format for frontend project configuration (package.json, tsconfig.json).
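
The following sketch, using Python's standard-library `json` module, shows both sides of this trade-off: basic types such as numbers survive parsing, but a document containing a `//` comment is rejected. The two sample documents are made up for illustration.

```python
import json

valid = '{"database": {"host": "localhost", "port": 5432}}'
with_comment = '{\n  // database port\n  "port": 5432\n}'

config = json.loads(valid)
port = config["database"]["port"]  # parsed as an int, not a string

try:
    json.loads(with_comment)
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False

print(type(port).__name__, comment_accepted)
```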

YAML

YAML (YAML Ain't Markup Language) was also born in 2001 and is currently the most widely used configuration format in the backend and DevOps world. Docker Compose, Kubernetes, GitHub Actions, and other tools all use YAML:

yaml
# Database configuration
database:
  host: localhost
  port: 5432

# Server configuration
server:
  debug: true
  allowed_origins:
    - https://example.com
    - https://app.example.com

Advantages include comment support, concise syntax, and the ability to express complex nested structures. The disadvantage is that it relies on indentation to represent hierarchy — indentation errors cause parsing failures, which is the most common issue for beginners.

Note: YAML's full name "YAML Ain't Markup Language" is a recursive acronym.

TOML

TOML (Tom's Obvious Minimal Language) was born in 2013 and is adopted by Rust's package manager Cargo and Python's pyproject.toml:

toml
[database]
host = "localhost"
port = 5432

[server]
debug = true
allowed_origins = [
  "https://example.com",
  "https://app.example.com"
]

TOML attempts to combine INI's simplicity with YAML's expressiveness while avoiding indentation-sensitivity issues.

2.4 Binary Serialization Formats

The above formats are all human-readable text. For scenarios with higher performance and size requirements, there are also binary serialization formats — they sacrifice readability for smaller sizes and faster parsing speeds.

| Format | Developer | Characteristics | Typical Use Cases |
| --- | --- | --- | --- |
| Protocol Buffers (Protobuf) | Google | Requires pre-defined .proto Schema files, strongly typed, extremely small size | gRPC communication, Google internal services, high-performance microservices |
| MessagePack | Community | Binary version of JSON-like format, no Schema required | Redis Lua scripting (bundled cmsgpack library), cross-language high-performance communication |
| Avro | Apache | Supports Schema evolution, suitable for big data scenarios | Hadoop/Kafka ecosystem data serialization |
| BSON | MongoDB | Binary extension of JSON, supports more data types | MongoDB database internal storage format |

Taking Protocol Buffers as an example, you need to define the Schema first:

protobuf
// user.proto
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
  string email = 3;
}

Then the compiler (protoc) automatically generates serialization/deserialization code for various languages. This "define Schema first, then generate code" pattern is consistent with the OpenAPI SDK generation approach introduced later.
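
To give a feel for why a schema-based binary encoding is smaller than text, the sketch below hand-encodes one `User` message following the basic Protobuf wire-format rules (a tag byte built from field number and wire type, varint integers, length-prefixed strings) and compares its size with compact JSON. This is a hand-rolled illustration, not the code `protoc` generates, and it only handles non-negative integers and strings; the sample user is made up.

```python
import json


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data


def encode_int_field(field_number: int, value: int) -> bytes:
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)


user = {"name": "Alice", "age": 30, "email": "alice@example.com"}

# Field numbers 1, 2, 3 come from the .proto definition above.
binary = (
    encode_string_field(1, user["name"])
    + encode_int_field(2, user["age"])
    + encode_string_field(3, user["email"])
)
text = json.dumps(user, separators=(",", ":")).encode("utf-8")

print(len(binary), len(text))
```

The binary form carries no field names at all: both sides know from the schema that field 1 is `name`, which is where most of the size saving comes from.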

2.5 Complete Comparison

| Format | Type | Year Born | Readability | Supports Comments | Typical Use Cases |
| --- | --- | --- | --- | --- | --- |
| INI | Text | 1980s | High | Yes | System configuration, legacy projects |
| CSV | Text | 1972 | High | No | Data import/export, spreadsheet exchange |
| XML | Text | 1998 | Medium | Yes | Java ecosystem, enterprise web services, document formats |
| JSON | Text | 2001 | High | No | Web API data exchange, frontend configuration |
| YAML | Text | 2001 | High | Yes | Docker, K8s, CI/CD, backend service configuration |
| TOML | Text | 2013 | High | Yes | Rust/Python project configuration |
| Protobuf | Binary | 2008 | None | N/A | gRPC, high-performance microservice communication |
| MessagePack | Binary | 2008 | None | N/A | High-performance cross-language communication |
| Avro | Binary | 2009 | None | N/A | Hadoop/Kafka big data pipelines |
| BSON | Binary | 2009 | None | N/A | MongoDB internal storage |

Key takeaway: The essential function of all these formats is the same — converting structured data into a storable, transmissible form. Text formats prioritize human readability and ease of editing; binary formats prioritize parsing performance and transmission size. Which format to choose depends on the trade-offs required by the specific scenario.


3. Embedded Scripting Languages: The Programmable Configuration Layer

3.1 Concept Definition

Python, JavaScript, Go, and similar languages are general-purpose programming languages (GPLs) that can run independently and build complete applications.

In contrast, there is another category of languages specifically designed to be embedded within other host programs, providing programmable extension capabilities for the host program. These are called embedded scripting languages.

The core problem they solve: when static configuration files (YAML/JSON) lack sufficient expressiveness and conditional logic, loops, and other logic are needed, how to achieve dynamic behavior without modifying the host program's source code.

3.2 Lua: The Most Representative Embedded Scripting Language

Lua (meaning "moon" in Portuguese) is an extremely lightweight scripting language; the entire interpreter is only a few hundred KB when compiled. Its design goal is not to run independently, but to serve as an embeddable extension layer.

Typical Lua application scenarios:

  • Game engines: World of Warcraft's addon system and Roblox's game scripts both use Lua. Game engines implement core rendering and physics computation in C/C++, while delegating frequently-changing parts like level logic and NPC dialogue to Lua scripts. This way, designers can modify game content without recompiling the engine.

  • Web servers: OpenResty embeds Lua inside Nginx, enabling ops personnel to implement request filtering, rate limiting, authentication, and other logic using Lua scripts without modifying Nginx's C source code.

  • Databases: Redis supports sending Lua scripts to the server for execution, used to implement composite operations requiring atomicity guarantees (such as "read-then-write").

Here's an example of a Lua script embedded in Nginx (OpenResty):

lua
-- Function: Token authentication for /api/secret path
local uri = ngx.var.uri
local token = ngx.req.get_headers()["Authorization"]

if uri == "/api/secret" and token ~= "Bearer my-secret-token" then
    ngx.status = 403
    ngx.say("Access denied")
    return ngx.exit(403)
end

3.3 Other Embedded Scripting Languages

| Language | Host Environment | Typical Use |
| --- | --- | --- |
| Lua | Game engines, Nginx (OpenResty), Redis | Game logic, gateway policies, cache operations |
| VimScript / Lua | Vim / Neovim editor | Editor plugin development |
| Emacs Lisp | Emacs editor | Editor behavior customization |
| GDScript | Godot game engine | Game logic scripts |
| Jsonnet | Kubernetes ecosystem / configuration generation tools | Template-based generation of large numbers of similar JSON/YAML configurations |

Key takeaway: Embedded scripting languages occupy the boundary between internal DSL and external DSL in the DSL classification — they are independent languages (with their own syntax and interpreters), but their design goal is to be embedded in host programs rather than independently build applications. They fill the gap between "static configuration files" (pure data description DSLs) and "general-purpose programming languages" (GPLs): when configuration needs to express logic (conditional branching, loops, function calls), embedding a lightweight scripting language is the standard engineering solution.


4. Infrastructure as Code

4.1 What Is "Infrastructure"

In backend engineering, "infrastructure" refers to the underlying resources that applications depend on to run:

  • Compute resources: Servers (virtual machines or containers)
  • Data storage: Database instances, object storage buckets
  • Networking: Firewall rules, load balancers, DNS configuration
  • Middleware: Message queues, cache clusters

In the cloud computing era, these resources are by default created and managed by hand through the graphical consoles of cloud providers (AWS, Alibaba Cloud, Tencent Cloud).

4.2 Limitations of Manual Management

Manual operations via the console are feasible for small-scale projects, but as project scale grows, the following problems emerge:

  1. Not repeatable: Operations aren't recorded; you can't precisely reproduce the same environment
  2. Not auditable: Can't trace "who changed what configuration, when"
  3. Not collaborative: Operations can't be put under version control or code review
  4. Error-prone: Manual operations in production carry the risk of mistakes

The core idea of Infrastructure as Code (IaC) is to define infrastructure resources declaratively in code, so that they gain version control, automated execution, and repeatable deployment.

4.3 Terraform

Terraform, developed by HashiCorp, is one of the most widely used IaC tools. It uses a dedicated language, HCL (HashiCorp Configuration Language).

Terraform adopts a declarative paradigm: users describe the desired end state, and Terraform automatically calculates the operations needed to transition from the current state to the target state.

hcl
# Define a cloud server
resource "aws_instance" "my_server" {
  ami           = "ami-0c55b159cbfafe1f0"  # OS image
  instance_type = "t3.micro"               # Instance type

  tags = {
    Name = "my-first-server"
  }
}

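# Define a PostgreSQL database instance
resource "aws_db_instance" "my_database" {
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20  # Storage in GiB, needed when creating a new instance
  username          = "dbadmin"
  password          = "please-use-secrets-manager"  # Placeholder: never commit real credentials
}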

Execution flow:

bash
terraform plan    # Preview the changes to be made
terraform apply   # Confirm and execute, automatically creating resources on the cloud platform
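
The declarative model described above (compare current state with desired state, then derive the operations) can be sketched in a few lines of Python. This is a toy model only: the `plan` function and the state dictionaries are hypothetical, and a real tool such as Terraform additionally deals with state files, provider schemas, and dependency ordering.

```python
def plan(current: dict, desired: dict) -> list:
    """Compute the actions needed to move from current state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# What exists right now (hypothetical).
current_state = {
    "my_server": {"instance_type": "t3.micro"},
    "old_bucket": {"acl": "private"},
}
# What the configuration files say should exist (hypothetical).
desired_state = {
    "my_server": {"instance_type": "t3.small"},
    "my_database": {"engine": "postgres"},
}

for action, resource in plan(current_state, desired_state):
    print(action, resource)
```

Running the plan again once the two states match yields an empty list, which is the property that makes declarative deployments repeatable.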

4.4 Pulumi

Pulumi offers a different approach: directly use general-purpose programming languages (TypeScript, Python, Go, etc.) to define infrastructure, rather than learning the dedicated HCL syntax.

The same server definition, expressed with Pulumi + TypeScript:

typescript
import * as aws from "@pulumi/aws";

const server = new aws.ec2.Instance("my-server", {
    ami: "ami-0c55b159cbfafe1f0",
    instanceType: "t3.micro",
    tags: { Name: "my-first-server" },
});

const bucket = new aws.s3.Bucket("my-bucket", {
    acl: "private",
});

export const serverIp = server.publicIp;

Since it uses general-purpose programming languages, developers can leverage language features like loops, conditional branching, and function abstraction to handle complex infrastructure logic.
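
As a sketch of what loops and conditionals buy you, the plain-Python example below builds one bucket definition per environment. It deliberately does not import the Pulumi SDK; in a real Pulumi program each dictionary would instead be passed to a resource constructor. The function name, the project name, and the `versioning` field are hypothetical.

```python
def bucket_definitions(project: str, environments: list) -> dict:
    """Build one bucket definition per environment instead of copy-pasting blocks."""
    definitions = {}
    for env in environments:
        definitions[f"{project}-{env}-assets"] = {
            "acl": "private",
            # Conditional logic: only production keeps old object versions.
            "versioning": env == "prod",
            "tags": {"Environment": env},
        }
    return definitions


buckets = bucket_definitions("shop", ["dev", "staging", "prod"])

for name, spec in buckets.items():
    print(name, spec["versioning"])
```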

4.5 Terraform vs Pulumi Comparison

| Dimension | Terraform | Pulumi |
| --- | --- | --- |
| Language | HCL (dedicated language) | TypeScript / Python / Go and other general-purpose languages |
| Learning cost | Need to learn HCL syntax | Use already-mastered programming languages, lower learning cost |
| Community ecosystem | Very mature, covers nearly all cloud providers | Rapidly growing, but smaller scale than Terraform |
| Use cases | Ops-team-led standardized infrastructure management | Developer-led projects needing complex logic |
| AI code generation fit | High (fixed patterns) | Very high (essentially general-purpose language code) |

Key takeaway: HCL in IaC tools is a typical external DSL — it has independent syntax and a parser, specifically for declaratively describing infrastructure state. Pulumi adopts an internal DSL strategy — using general-purpose programming language syntax to express domain-specific concepts. Both share the same goal (transforming infrastructure management from manual operations to code-driven), but take different paths (dedicated language vs general-purpose language). Code can be put under Git version control, undergo team review, and be automatically executed and rolled back.


5. Glue Code and SDK Auto-Generation

5.1 What Is Glue Code

In software engineering, glue code refers to code that itself contains no business logic but merely connects two systems or modules.

Typical glue code includes:

  • HTTP request code written when the frontend calls backend APIs (URL construction, header settings, response parsing)
  • HTTP client code written when backend service A calls service B's interface
  • Interface adaptation code between different programming languages

The characteristics of such code: highly repetitive, pattern-fixed, but indispensable.

5.2 OpenAPI Specification and Code Auto-Generation

Since glue code has highly pattern-based characteristics, the engineering world's solution is: first describe API interfaces in a standard format, then use tools to automatically generate client code.

The OpenAPI Specification (formerly Swagger) is the industry standard for describing REST APIs. It uses YAML or JSON format to precisely define API paths, parameters, request bodies, and response structures:

yaml
openapi: 3.0.0
info:
  title: Email Service API
  version: 1.0.0

paths:
  /emails:
    post:
      summary: Send email
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                to:
                  type: string
                  example: "user@example.com"
                subject:
                  type: string
                body:
                  type: string
      responses:
        '200':
          description: Sent successfully

Based on this specification file, tools like openapi-generator can automatically generate client SDKs in multiple languages:

  • Python: client.emails.send(to="user@example.com", subject="Hi", body="Hello")
  • TypeScript: client.emails.send({ to: "user@example.com", subject: "Hi", body: "Hello" })
  • Go: client.Emails.Send(ctx, &SendEmailRequest{To: "user@example.com", ...})

The generated SDK encapsulates all HTTP request details; callers don't need to care about URL paths, request methods, serialization formats, or other low-level implementation details.
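
To make "encapsulates all HTTP request details" concrete, here is a hand-written Python stand-in for the kind of client a generator might emit for the `POST /emails` operation above. It is not actual openapi-generator output; the class name, method names, and the base URL are hypothetical, and only the standard library is used.

```python
import json
import urllib.request


class EmailClient:
    """Hand-written stand-in for the kind of class an SDK generator emits."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def build_send_request(self, to: str, subject: str, body: str) -> urllib.request.Request:
        # Path, method, and JSON encoding all follow the spec's POST /emails entry.
        payload = json.dumps({"to": to, "subject": subject, "body": body}).encode("utf-8")
        return urllib.request.Request(
            url=f"{self.base_url}/emails",
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def send(self, to: str, subject: str, body: str) -> int:
        # Performs the real HTTP call and returns the status code.
        with urllib.request.urlopen(self.build_send_request(to, subject, body)) as resp:
            return resp.status


client = EmailClient("https://api.example.com")  # hypothetical base URL
request = client.build_send_request("user@example.com", "Hi", "Hello")
print(request.get_method(), request.full_url)
```

Everything in `build_send_request` is mechanical translation of the specification, which is why this layer can be generated rather than written by hand.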

5.3 Revisiting Armin's Case

Returning to the case from the beginning of this article, we can now accurately understand each component:

| Component | Nature | Description |
| --- | --- | --- |
| Go | Business logic code | Core functionality implementation of the email service |
| YAML | Configuration files | Service configuration, CI/CD pipeline definitions, OpenAPI specification files |
| Pulumi | Infrastructure code | Define cloud resources (servers, databases, networking) using Go/TypeScript |
| SDK glue code | Auto-generated client libraries | Python and TypeScript SDKs auto-generated from the OpenAPI specification |

The YAML configuration, Pulumi resource definitions, and SDK glue code are all highly pattern-based code with clear specification constraints — precisely the areas where AI code generation is most capable. Therefore, "90% of 40,000 lines generated by AI" is entirely reasonable.


6. AI and DSL

6.1 AI Code Generation Applicability Analysis

| Characteristic Dimension | Suitable for AI Generation | Not Suitable for AI Generation |
| --- | --- | --- |
| Pattern level | Highly repetitive, has fixed templates | Requires creative design, no precedent to follow |
| Specification constraints | Has clear schema or syntax specification | Vague requirements, unclear boundaries |
| Context dependency | Locally self-consistent, individual definitions don't depend on global understanding | Requires understanding the entire system's architectural intent |
| Verifiability | Can be automatically validated by tools (e.g., terraform validate) | Can only rely on human judgment of design reasonableness |

The four categories of technologies introduced in this article — configuration files, embedded scripts, IaC code, and SDK glue code — all share the characteristics in the left column. This explains why AI's code generation effectiveness in these areas is significantly better than for business logic code.

6.2 Evaluation Framework

When judging whether a piece of code is suitable for AI generation, you can reference these three criteria:

  1. Is there an existing specification or schema? — If yes, AI-friendly
  2. Is it a large number of repeated patterns? — If yes, AI-friendly
  3. Can the generated result be automatically verified by tools? — If yes, AI-friendly

Code that satisfies all three criteria (like generating SDKs from OpenAPI specifications, or batch-defining homogeneous resources with Terraform) can heavily rely on AI generation. Code that satisfies none of the criteria (like designing a new distributed consistency protocol) still requires engineers to complete themselves.
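
The third criterion, automatic verification, can be illustrated with a small Python sketch: a checker that compares a generated configuration against an expected shape and reports every mismatch. The `validate` function and the `SCHEMA` mapping are hypothetical and far simpler than real validators such as `terraform validate` or a JSON Schema library.

```python
def validate(config: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    for key, expected_type in schema.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")
    for key in config:
        if key not in schema:
            problems.append(f"unknown key: {key}")
    return problems


# Expected shape of a service configuration (hypothetical).
SCHEMA = {"host": str, "port": int, "debug": bool}

good = {"host": "localhost", "port": 5432, "debug": True}
bad = {"host": "localhost", "port": "5432", "verbose": True}

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

Because a check like this is cheap to run on every generated file, mistakes in AI-generated configuration can be caught mechanically rather than by manual review. One known gap in this sketch: `bool` is a subclass of `int` in Python, so `True` would pass as a port value.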


7. Glossary

| Term | Full Name | Definition |
| --- | --- | --- |
| DSL | Domain-Specific Language | A language designed for a specific domain, contrasted with general-purpose programming languages |
| GPL | General-Purpose Language | A programming language that can solve arbitrary computational problems, e.g., Python, Java, Go |
| External DSL | External DSL | A domain-specific language with independent syntax and parser, e.g., SQL, HCL, YAML |
| Internal DSL | Internal DSL / Embedded DSL | A domain-specific expression built using host language syntax, parasitic within a GPL, e.g., Pulumi |
| Data Serialization | Data Serialization | The process of converting in-memory data structures into a storable or transmissible format |
| INI | Initialization | One of the earliest key-value configuration formats, originating from Windows systems |
| CSV | Comma-Separated Values | A plain text tabular format with comma-separated fields |
| XML | eXtensible Markup Language | A tag-based text data format with strong expressiveness but verbose syntax |
| JSON | JavaScript Object Notation | A lightweight key-value-based data exchange format, the de facto standard for Web APIs |
| YAML | YAML Ain't Markup Language | An indentation-based configuration file format, widely used in backend and DevOps |
| TOML | Tom's Obvious Minimal Language | An explicit-syntax configuration format, commonly used in Rust and Python ecosystems |
| Protobuf | Protocol Buffers | A binary serialization format developed by Google, requires pre-defined Schema, small size and fast |
| MessagePack | - | A JSON-like binary serialization format, no Schema required |
| Lua | - | A lightweight embedded scripting language, commonly used for game engine, web server, and database extensions |
| IaC | Infrastructure as Code | The engineering practice of defining and managing cloud computing resources with code |
| Terraform | - | An IaC tool developed by HashiCorp, using the HCL declarative language |
| HCL | HashiCorp Configuration Language | The dedicated configuration language used by Terraform |
| Pulumi | - | An IaC tool supporting general-purpose programming languages |
| OpenAPI | - | The industry standard specification for describing REST API interfaces (formerly Swagger) |
| SDK | Software Development Kit | A client library that encapsulates API calling details |
| Glue Code | Glue Code | Adapter code without business logic, used only to connect two systems |

Summary

Backend engineering involves a large amount of code that is not business logic. Much of it falls under a common umbrella concept: DSL (Domain-Specific Language), meaning languages designed for specific domains, as opposed to general-purpose programming languages.

The DSLs introduced in this article can be categorized into four groups:

  1. Data serialization formats (XML / JSON / YAML / TOML / CSV / Protobuf, etc.) — Pure data description external DSLs, converting structured data into storable, transmissible forms
  2. Embedded scripting languages (Lua, etc.) — Between configuration and general-purpose languages, providing programmable extension capabilities for host programs
  3. Infrastructure definition languages (HCL / Dockerfile, etc.) — Declarative external DSLs describing the desired system state; Pulumi achieves the same goal as an internal DSL
  4. Interface description languages and glue code generation (OpenAPI / .proto) — Automatically generating inter-system connection code through specification descriptions

Understanding the DSL classification framework enables you to quickly identify the nature of "code that doesn't look like code" in backend projects: which category of DSL it belongs to, what domain problem it solves, and why it isn't written in a general-purpose programming language.

At the same time, because DSL code has the characteristics of being highly pattern-based, specification-driven, and automatically verifiable, it is also the most effective application area for current AI code generation technology.