SAS has been the backbone of enterprise analytics for decades. From financial institutions running credit-risk models to pharmaceutical companies processing clinical trial data, SAS programs power mission-critical workloads across every regulated industry. But the economics of SAS licensing — per-core pricing that escalates with server capacity — combined with a shrinking talent pool and the industry's shift toward open-source analytics, have pushed organizations to seek alternatives. Databricks, built on Apache Spark and anchored by Delta Lake, has emerged as the leading destination for SAS migration.
The appeal is structural, not cosmetic. PySpark provides a native fit for SAS DATA step logic at distributed scale. MLflow replaces SAS analytics procedures with a modern experiment-tracking and model-deployment framework. The Lakehouse architecture unifies data warehousing and data science into a single platform, eliminating the need for separate SAS servers, SAS Grid clusters, and standalone databases. And the cost model shifts from perpetual per-core licensing to elastic, pay-per-use compute that scales to zero when idle.
This article provides a comprehensive technical guide to migrating SAS programs to Databricks, covering every major SAS construct and its Databricks equivalent, with detailed code examples and architectural guidance.
Why Migrate from SAS to Databricks?
Before diving into the technical mapping, it is worth understanding the strategic forces driving SAS-to-Databricks migration. These are not abstract trends — they are concrete business pressures that affect budget, hiring, and competitive positioning.
Licensing Cost Reduction
SAS licensing operates on a per-core, annual-renewal model that penalizes organizations for scaling infrastructure. A typical SAS deployment with Base SAS, SAS/STAT, SAS/ETS, SAS Enterprise Miner, and SAS Grid Manager can cost millions per year for a mid-sized analytics team. Databricks pricing is consumption-based: you pay for the compute hours you actually use, and clusters auto-terminate when jobs complete. Organizations routinely report 40-60% cost reductions after migrating from SAS to Databricks, even accounting for the migration investment.
PySpark as a Native Fit for SAS Logic
SAS DATA steps process data row by row with an implicit loop, applying conditional logic, creating variables, and writing output datasets. PySpark DataFrames express the same transformations using column-level operations that Spark distributes across a cluster. The conceptual mapping is direct: a SAS DATA step that reads, transforms, and outputs data becomes a PySpark chain of spark.read, .withColumn(), .filter(), and .write. The programming model is different — columnar rather than row-based — but the intent and output are identical.
MLflow Replaces SAS Analytics
SAS procedures like PROC LOGISTIC, PROC REG, PROC MIXED, and SAS Enterprise Miner provide statistical modeling and machine learning in a proprietary, closed environment. MLflow on Databricks offers experiment tracking, model versioning, model registry, and deployment — all integrated with PySpark, scikit-learn, TensorFlow, and PyTorch. The shift from SAS analytics to MLflow is not just a tool change; it unlocks the entire open-source ML ecosystem while maintaining the governance and reproducibility that enterprise analytics demands.
Lakehouse Architecture
SAS environments typically consist of SAS servers connected to relational databases, with SAS datasets stored on network file systems. Data moves between systems through PROC EXPORT, PROC IMPORT, and ODBC/JDBC connections. The Databricks Lakehouse eliminates this architecture by storing all data in Delta Lake tables on cloud object storage (S3, ADLS, GCS), governed by Unity Catalog, and accessible from notebooks, SQL endpoints, and BI tools. There is no separate data warehouse, no SAS library file system, and no batch export/import cycle.
Talent and Community
The SAS programmer workforce is aging. Universities have shifted curricula from SAS to Python and R. Hiring a SAS developer in 2026 is significantly more expensive than hiring a Python/PySpark developer with equivalent experience. Migrating to Databricks positions organizations to recruit from a much larger, younger, and more cost-effective talent pool.
SAS to Databricks migration — automated end-to-end by MigryX
Architecture Comparison: SAS Server vs. Databricks Clusters
Understanding the architectural differences between SAS and Databricks is essential for planning a migration. SAS operates on a shared-server model, while Databricks operates on an elastic-cluster model. The implications touch compute, storage, orchestration, and security.
In a traditional SAS deployment, a SAS Application Server runs on one or more physical or virtual machines. SAS programs execute sequentially or in parallel (via SAS Grid) on these servers, reading data from SAS datasets (on shared file systems) or databases (via SAS/ACCESS engines). SAS Metadata Server manages security, library definitions, and program metadata. SAS programs are submitted interactively through SAS Enterprise Guide, SAS Studio, or batch submission.
In Databricks, computation happens on ephemeral Apache Spark clusters that spin up when a job starts and terminate when it completes. Notebooks provide an interactive development environment similar to SAS Studio but with multi-language support (Python, SQL, Scala, R). Data is stored in Delta Lake tables on cloud object storage, governed by Unity Catalog. Databricks Workflows orchestrate job execution with scheduling, retries, and alerting.
| SAS Component | Databricks Equivalent | Key Difference |
|---|---|---|
| SAS Application Server | Databricks Cluster | Ephemeral, auto-scaling vs. persistent, fixed-capacity |
| SAS Grid Manager | Databricks Job Clusters | Automatic cluster provisioning per job vs. manual grid configuration |
| SAS Metadata Server | Unity Catalog | SQL-based governance with column-level lineage |
| SAS Enterprise Guide / SAS Studio | Databricks Notebooks | Multi-language, collaborative, version-controlled |
| SAS Library (LIBNAME) | Unity Catalog Schema | Three-level namespace: catalog.schema.table |
| SAS Dataset (.sas7bdat) | Delta Lake Table | ACID transactions, time travel, schema evolution |
| SAS/ACCESS Engines | Spark Connectors / JDBC | Native connectors for hundreds of data sources |
| SAS Stored Processes | Databricks Jobs / SQL Endpoints | REST API access, parameterized execution |
| SAS Batch Scheduling | Databricks Workflows | DAG-based orchestration with dependency management |
MigryX: Idiomatic Code, Not Line-by-Line Translation
The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.
Comprehensive SAS-to-Databricks Mapping Table
The following table maps every major SAS construct to its Databricks equivalent. This serves as a reference for migration planning and is the foundation for MigryX's automated conversion engine.
| SAS Construct | Databricks / PySpark Equivalent | Notes |
|---|---|---|
| DATA step | PySpark DataFrame transformations | Row-by-row logic becomes columnar operations with withColumn(), filter(), when() |
| PROC SQL | Databricks SQL / spark.sql() | ANSI SQL with Delta Lake extensions; nearly 1:1 syntax migration |
| SAS Macros (%macro/%mend) | Python functions | Macro variables become function parameters; %IF/%THEN becomes if/else |
| PROC SORT | .orderBy() / .sort() | Distributed sorting; also supports .sortWithinPartitions() |
| PROC MEANS / PROC FREQ | .groupBy().agg() | F.mean(), F.count(), F.stddev(), F.sum() with groupBy() |
| SAS Datasets (.sas7bdat) | Delta Lake Tables | ACID transactions, schema enforcement, time travel |
| SAS Libraries (LIBNAME) | Unity Catalog Schemas | catalog.schema.table namespace replaces libref.dataset |
| PROC MODEL / Enterprise Miner | MLflow + Feature Store | Experiment tracking, model registry, feature engineering |
| FORMAT / INFORMAT | PySpark Types + UDFs | cast(), to_date(), date_format() for type conversions |
| PROC EXPORT | Delta table write / .write.csv() | Write to Delta, Parquet, CSV, JSON, or external systems |
| MERGE statement | Delta MERGE INTO | Atomic upsert with WHEN MATCHED / WHEN NOT MATCHED |
| BY-group processing | .groupBy() | FIRST. and LAST. logic via Window functions with row_number() |
| PROC TRANSPOSE | pivot() / unpivot() | Native PySpark pivot operations |
| PROC LOGISTIC / PROC REG | scikit-learn / SparkML + MLflow | Full ML ecosystem with experiment tracking |
| PROC REPORT / PROC PRINT | display() / Databricks SQL Dashboards | Interactive visualization in notebooks or dashboards |
| SAS/CONNECT (RSUBMIT) | Databricks Workflows (multi-task) | Remote execution replaced by distributed job orchestration |
| Retained variables | Window functions / lag() | F.lag() and F.lead() replace DATA step RETAIN |
| Arrays | List comprehensions + select() | Python loops over column lists replace SAS arrays |
| Hash objects | broadcast join / map lookup | Spark broadcast joins replace SAS hash table lookups |
Code Examples: SAS DATA Step to PySpark
The most common migration task is converting SAS DATA steps to PySpark DataFrame transformations. The following examples demonstrate the pattern for increasingly complex scenarios.
Basic DATA Step: Read, Transform, Output
# SAS Original:
# data work.customers_clean;
# set raw.customers;
# where age >= 18 and status = 'ACTIVE';
# full_name = catx(' ', first_name, last_name);
# age_group = ifc(age >= 65, 'Senior',
# ifc(age >= 30, 'Adult', 'Young Adult'));
# signup_year = year(signup_date);
# run;
# PySpark Equivalent on Databricks:
from pyspark.sql import functions as F
customers = spark.table("catalog.raw.customers")
customers_clean = (
customers
.filter(
(F.col("age") >= 18) &
(F.col("status") == "ACTIVE")
)
.withColumn("full_name",
F.concat_ws(" ", F.col("first_name"), F.col("last_name"))
)
.withColumn("age_group",
F.when(F.col("age") >= 65, "Senior")
.when(F.col("age") >= 30, "Adult")
.otherwise("Young Adult")
)
.withColumn("signup_year", F.year(F.col("signup_date")))
)
customers_clean.write.mode("overwrite").saveAsTable("catalog.work.customers_clean")
BY-Group Processing with FIRST. and LAST.
SAS DATA step BY-group processing uses automatic variables FIRST.variable and LAST.variable to detect group boundaries. In PySpark, this translates to Window functions with row_number() and boundary detection logic.
# SAS Original:
# proc sort data=transactions; by customer_id transaction_date; run;
# data work.first_last_txn;
# set transactions;
# by customer_id;
# if first.customer_id then first_txn_date = transaction_date;
# if last.customer_id then last_txn_date = transaction_date;
# retain first_txn_date;
# if last.customer_id then output;
# run;
# PySpark Equivalent:
from pyspark.sql.window import Window
w_asc = Window.partitionBy("customer_id").orderBy("transaction_date")
w_desc = Window.partitionBy("customer_id").orderBy(F.desc("transaction_date"))
# Row-level FIRST./LAST. flags, for when per-row boundary logic is needed:
txns = (
    spark.table("catalog.bronze.transactions")
    .withColumn("is_first", F.row_number().over(w_asc) == 1)   # FIRST.customer_id
    .withColumn("is_last", F.row_number().over(w_desc) == 1)   # LAST.customer_id
)
# This DATA step outputs one summary row per customer, so it reduces to an aggregation:
first_last_txn = (
    txns
    .groupBy("customer_id")
    .agg(
        F.min("transaction_date").alias("first_txn_date"),
        F.max("transaction_date").alias("last_txn_date"),
        F.count("*").alias("txn_count"),
        F.sum("amount").alias("total_amount")
    )
)
first_last_txn.write.mode("overwrite").saveAsTable("catalog.silver.customer_txn_summary")
RETAIN Statement and Running Totals
The SAS RETAIN statement carries variable values across iterations of the DATA step loop. In PySpark, this maps to Window functions with cumulative aggregation.
# SAS Original:
# data work.running_totals;
# set transactions;
# by customer_id;
# retain running_balance 0;
# if first.customer_id then running_balance = 0;
# running_balance = running_balance + amount;
# run;
# PySpark Equivalent:
w_running = (
Window
.partitionBy("customer_id")
.orderBy("transaction_date")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
running_totals = (
spark.table("catalog.bronze.transactions")
.withColumn("running_balance", F.sum("amount").over(w_running))
)
running_totals.write.mode("overwrite").saveAsTable("catalog.silver.running_totals")
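During migration validation it helps to have a plain-Python reference implementation of the RETAIN logic to spot-check the Window-function output on a small sample. A minimal sketch — the function name and tuple layout are illustrative, not part of the migrated pipeline:

```python
def running_balances(rows):
    """Reference implementation of the SAS RETAIN running-balance logic.

    rows: iterable of (customer_id, transaction_date, amount) tuples.
    Returns rows sorted by customer and date, each extended with a
    running balance that resets at every new customer_id.
    """
    out = []
    balances = {}
    for cust, date, amt in sorted(rows):  # sort by customer_id, then date
        balances[cust] = balances.get(cust, 0) + amt
        out.append((cust, date, amt, balances[cust]))
    return out
```

Comparing this against the running_balance column produced by the Window-function version on a sampled subset is a quick sanity check before promoting the migrated job.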
Code Examples: PROC SQL to Databricks SQL
SAS PROC SQL is remarkably close to standard SQL, making it one of the easiest constructs to migrate. Databricks SQL supports ANSI SQL with extensions for Delta Lake operations, window functions, and semi-structured data.
-- SAS Original:
-- proc sql;
-- create table work.revenue_summary as
-- select
-- region,
-- product_line,
-- count(distinct customer_id) as unique_customers,
-- sum(revenue) as total_revenue,
-- mean(revenue) as avg_revenue,
-- calculated total_revenue / sum(sum(revenue)) as pct_total format=percent8.2
-- from sales.transactions
-- where transaction_date between '01JAN2025'd and '31DEC2025'd
-- group by region, product_line
-- having calculated total_revenue > 100000
-- order by total_revenue desc;
-- quit;
-- Databricks SQL Equivalent:
CREATE OR REPLACE TABLE catalog.gold.revenue_summary AS
SELECT
region,
product_line,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(revenue) AS total_revenue,
AVG(revenue) AS avg_revenue,
SUM(revenue) / SUM(SUM(revenue)) OVER () AS pct_total
FROM catalog.silver.transactions
WHERE transaction_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY region, product_line
HAVING SUM(revenue) > 100000
ORDER BY total_revenue DESC;
Key differences: SAS uses calculated to reference derived columns within the same query; Databricks SQL uses window functions or subqueries. SAS date literals ('01JAN2025'd) become ISO format strings. SAS formats (format=percent8.2) become formatting in the presentation layer.
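A recurring mechanical step in PROC SQL migration is rewriting SAS date literals into ISO strings. A small helper along these lines handles the common 'ddMONyyyy'd form (the function name is illustrative, not from any migration toolkit):

```python
from datetime import datetime

def sas_date_literal_to_iso(literal: str) -> str:
    """Convert a SAS date literal like "'01JAN2025'd" to '2025-01-01'."""
    # Strip the trailing d/D suffix and surrounding quotes, then parse.
    raw = literal.strip().rstrip("dD").strip("'\"")
    return datetime.strptime(raw, "%d%b%Y").date().isoformat()
```

For example, sas_date_literal_to_iso("'01JAN2025'd") returns "2025-01-01", ready to drop into a Databricks SQL BETWEEN clause.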
SAS PROC SQL Subqueries and Correlated Queries
-- SAS Original:
-- proc sql;
-- select a.customer_id, a.customer_name, b.total_orders
-- from customers a
-- inner join (
-- select customer_id, count(*) as total_orders
-- from orders
-- group by customer_id
-- ) b on a.customer_id = b.customer_id
-- where b.total_orders > (select avg(order_count) from order_summary);
-- quit;
-- Databricks SQL Equivalent:
SELECT a.customer_id, a.customer_name, b.total_orders
FROM catalog.silver.customers a
INNER JOIN (
SELECT customer_id, COUNT(*) AS total_orders
FROM catalog.silver.orders
GROUP BY customer_id
) b ON a.customer_id = b.customer_id
WHERE b.total_orders > (SELECT AVG(order_count) FROM catalog.gold.order_summary);
SAS Macros to Python Functions
SAS macros are a text-substitution system that generates SAS code at compile time. They are powerful but notoriously difficult to debug and maintain. Python functions provide a cleaner, more testable replacement with full access to the language's control flow, exception handling, and package ecosystem.
# SAS Original:
# %macro process_monthly(dsn=, start_date=, end_date=);
# %let month_list = ;
# data &dsn._processed;
# set &dsn.;
# where date between "&start_date."d and "&end_date."d;
# month_key = put(date, yymmn6.);
# run;
# proc means data=&dsn._processed noprint;
# by month_key;
# var revenue units;
# output out=&dsn._monthly sum= mean= / autoname;
# run;
# %mend;
# %process_monthly(dsn=sales, start_date=01JAN2025, end_date=31DEC2025);
# Python/PySpark Equivalent on Databricks:
def process_monthly(table_name: str, start_date: str, end_date: str) -> None:
"""Process a table by month, computing revenue and unit aggregates."""
df = spark.table(f"catalog.bronze.{table_name}")
processed = (
df
.filter(
(F.col("date") >= start_date) &
(F.col("date") <= end_date)
)
.withColumn("month_key", F.date_format(F.col("date"), "yyyyMM"))
)
monthly = (
processed
.groupBy("month_key")
.agg(
F.sum("revenue").alias("revenue_sum"),
F.mean("revenue").alias("revenue_mean"),
F.sum("units").alias("units_sum"),
F.mean("units").alias("units_mean")
)
.orderBy("month_key")
)
monthly.write.mode("overwrite").saveAsTable(
f"catalog.silver.{table_name}_monthly"
)
# Call the function
process_monthly("sales", "2025-01-01", "2025-12-31")
The Python version is easier to read, test, and maintain. Parameters are typed. The function can be unit-tested without a SAS runtime. Error handling uses standard Python try/except rather than SAS macro error trapping with %sysfunc and &syscc.
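As a sketch of what replaces &syscc-style status checking, a migrated job can wrap each step and surface failures explicitly. The names here are illustrative, not a prescribed framework:

```python
def run_step(step_name, fn, *args, **kwargs):
    """Run one migrated pipeline step, capturing success or failure
    the way SAS batch jobs inspect &syscc after each step."""
    try:
        fn(*args, **kwargs)
        return {"step": step_name, "status": "OK", "error": None}
    except Exception as exc:
        return {"step": step_name, "status": "FAILED", "error": str(exc)}

# Example call, assuming the process_monthly function defined above:
# run_step("monthly_sales", process_monthly, "sales", "2025-01-01", "2025-12-31")
```

The returned status dictionaries can feed a run log or alerting task in a Databricks Workflow, rather than being scattered across SAS log files.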
SAS MERGE to Delta MERGE INTO
The SAS MERGE statement combines datasets by a common key, handling matched and unmatched records. Delta Lake's MERGE INTO provides atomic upsert semantics that go beyond what SAS MERGE offers, including conditional updates, deletes, and schema evolution.
# SAS Original:
# proc sort data=existing; by customer_id; run;
# proc sort data=updates; by customer_id; run;
# data work.merged;
# merge existing(in=a) updates(in=b);
# by customer_id;
# if a and b then update_flag = 'U';
# else if b then update_flag = 'I';
# run;
# Databricks Delta MERGE Equivalent:
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "catalog.silver.customers")
source = spark.table("catalog.bronze.customer_updates")
target.alias("t").merge(
source.alias("s"),
"t.customer_id = s.customer_id"
).whenMatchedUpdate(set={
"name": "s.name",
"email": "s.email",
"phone": "s.phone",
"address": "s.address",
"updated_at": "current_timestamp()",
"update_flag": "'U'"
}).whenNotMatchedInsert(values={
"customer_id": "s.customer_id",
"name": "s.name",
"email": "s.email",
"phone": "s.phone",
"address": "s.address",
"created_at": "current_timestamp()",
"updated_at": "current_timestamp()",
"update_flag": "'I'"
}).execute()
Delta MERGE INTO is superior to SAS MERGE in every measurable dimension: it is atomic (ACID-compliant), handles concurrent writes safely, supports conditional logic within MATCHED and NOT MATCHED clauses, and automatically maintains audit columns. SAS MERGE, by contrast, requires pre-sorted input and provides no transactional guarantees.
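One caveat on schema evolution during MERGE: on current Delta Lake releases it is opt-in rather than automatic. A minimal configuration sketch — this is a session-level setting, so confirm the behavior against your Delta Lake version:

```python
# Allow MERGE INTO to add new source columns to the target table's schema.
# Assumption: Delta Lake session setting; semantics may differ across versions.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```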
SAS Procedures to PySpark Equivalents
PROC SORT to .orderBy()
# SAS: proc sort data=sales; by region descending revenue; run;
# PySpark:
sales_sorted = spark.table("catalog.silver.sales").orderBy("region", F.desc("revenue"))
# SAS: proc sort data=sales nodupkey; by customer_id; run;
# PySpark (remove duplicates):
sales_deduped = (
spark.table("catalog.silver.sales")
.dropDuplicates(["customer_id"])
)
PROC MEANS / PROC FREQ to .groupBy().agg()
# SAS: proc means data=sales n mean std min max sum;
# var revenue units;
# class region;
# run;
# PySpark:
summary = (
spark.table("catalog.silver.sales")
.groupBy("region")
.agg(
F.count("revenue").alias("n"),
F.mean("revenue").alias("revenue_mean"),
F.stddev("revenue").alias("revenue_std"),
F.min("revenue").alias("revenue_min"),
F.max("revenue").alias("revenue_max"),
F.sum("revenue").alias("revenue_sum"),
F.count("units").alias("units_n"),
F.mean("units").alias("units_mean"),
F.sum("units").alias("units_sum")
)
)
# SAS: proc freq data=sales; tables region * product_line / chisq; run;
# PySpark (cross-tabulation):
crosstab = (
spark.table("catalog.silver.sales")
.groupBy("region", "product_line")
.agg(F.count("*").alias("frequency"))
.orderBy("region", "product_line")
)
display(crosstab)
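The / chisq option has no direct equivalent in the crosstab itself; in practice you would collect the contingency table and hand it to scipy.stats.chi2_contingency. To make the underlying computation concrete, here is the Pearson statistic sketched in plain Python (an illustration, not a replacement for a proper stats library):

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat
```

A perfectly independent table yields a statistic of zero; the p-value step still requires a chi-square distribution, which is where scipy comes in.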
PROC TRANSPOSE to pivot()
# SAS: proc transpose data=monthly_sales out=pivoted prefix=month_;
# by region;
# id month;
# var revenue;
# run;
# PySpark:
pivoted = (
spark.table("catalog.silver.monthly_sales")
.groupBy("region")
.pivot("month")
.agg(F.sum("revenue"))
)
pivoted.write.mode("overwrite").saveAsTable("catalog.gold.pivoted_sales")
MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins
Platform-Specific Optimization by MigryX
MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.
SAS Libraries and Datasets to Unity Catalog and Delta Lake
In SAS, libraries are defined with LIBNAME statements that point to directories or database connections. Datasets are physical files within those library paths. Unity Catalog replaces this model with a three-level namespace: catalog.schema.table.
# SAS Library definitions:
# libname raw '/data/raw';
# libname silver oracle user=myuser password=mypwd path=mydb;
# libname gold '/data/gold';

# Databricks Unity Catalog equivalent:
# No LIBNAME needed — tables are referenced directly:
#   catalog.raw.customers     (replaces raw.customers)
#   catalog.silver.customers  (replaces silver.customers)
#   catalog.gold.customers    (replaces gold.customers)

# Access control via SQL GRANT:
# GRANT SELECT ON SCHEMA catalog.gold TO `analysts`;
# GRANT ALL PRIVILEGES ON SCHEMA catalog.silver TO `data_engineers`;
Every SAS dataset maps to a Delta Lake table. SAS dataset options (KEEP=, DROP=, WHERE=, RENAME=) map to PySpark operations: .select() for KEEP/DROP, .filter() for WHERE, .withColumnRenamed() for RENAME. SAS formats and informats map to PySpark type casts and format functions.
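To make the option mapping concrete, here is an illustrative sketch of how a converter might emit the PySpark chain for a set of SAS dataset options. This is a simplified illustration, not MigryX's actual conversion engine:

```python
def dataset_options_to_pyspark(options: dict) -> str:
    """Emit a PySpark method chain (as source text) for SAS dataset options.
    Handles WHERE=, RENAME=, and KEEP=, applied in that order — a
    simplification of SAS's actual dataset-option ordering rules."""
    chain = []
    if "where" in options:
        chain.append(f'.filter("{options["where"]}")')
    for old, new in options.get("rename", {}).items():
        chain.append(f'.withColumnRenamed("{old}", "{new}")')
    if "keep" in options:
        cols = ", ".join(f'"{c}"' for c in options["keep"])
        chain.append(f".select({cols})")
    return "df" + "".join(chain)
```

For example, the options KEEP=(name age) WHERE=(age >= 18) would emit df.filter("age >= 18").select("name", "age").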
Medallion Architecture: Organizing the Migration
The Databricks Medallion Architecture (Bronze, Silver, Gold) provides a natural framework for organizing migrated SAS programs. This layered approach maps cleanly to the typical SAS data flow: raw ingestion, cleaning and transformation, and reporting-ready output.
Medallion Architecture Mapping
- Bronze Layer — Raw data ingestion. Replaces SAS PROC IMPORT, INFILE statements, and initial data reads. Data is stored as-is in Delta tables with Auto Loader handling incremental file ingestion.
- Silver Layer — Cleaned and conformed data. Replaces SAS DATA step cleaning, standardization, and joining logic. Schema enforcement via Delta Lake ensures data quality without SAS PROC CONTENTS checks.
- Gold Layer — Business-ready aggregates and analytical outputs. Replaces SAS PROC MEANS summaries, PROC REPORT outputs, and analytical datasets prepared for SAS Enterprise Miner or SAS Visual Analytics.
The Medallion Architecture also provides a migration sequencing strategy: migrate Bronze programs first (data ingestion is lowest risk), then Silver (transformation logic is the core of most SAS programs), and finally Gold (reporting and analytics). This incremental approach allows validation at each layer before proceeding to the next.
SAS Analytics to MLflow on Databricks
SAS Enterprise Miner and SAS/STAT procedures (PROC LOGISTIC, PROC REG, PROC MIXED, PROC GENMOD) provide statistical modeling in a proprietary environment. MLflow on Databricks replaces this with an open-source experiment-tracking and model-management platform.
# SAS Original:
# proc logistic data=model_data;
# model churn(event='1') = tenure monthly_charges total_charges
# contract_type payment_method;
# output out=scored predicted=pred_prob;
# run;
# Databricks MLflow Equivalent:
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
model_data = spark.table("catalog.gold.model_data").toPandas()
X = model_data[["tenure", "monthly_charges", "total_charges",
"contract_type_encoded", "payment_method_encoded"]]
y = model_data["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
with mlflow.start_run(run_name="churn_logistic_v1"):
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, pred_prob)
mlflow.log_metric("auc_roc", auc)
mlflow.log_param("model_type", "logistic_regression")
mlflow.sklearn.log_model(model, "churn_model",
registered_model_name="churn_prediction")
print(f"AUC-ROC: {auc:.4f}")
MLflow provides capabilities that SAS Enterprise Miner cannot match: automatic experiment tracking, model versioning with lineage, A/B testing of model versions, and deployment to REST endpoints, batch scoring jobs, or streaming inference — all without additional licensing.
FORMAT and INFORMAT to PySpark Types
SAS formats and informats control how data is displayed and read. In PySpark, type conversions use cast(), to_date(), to_timestamp(), and date_format().
# SAS formats and their PySpark equivalents:
# FORMAT date_col DATE9. → F.date_format(F.col("date_col"), "ddMMMyyyy")
# FORMAT amount DOLLAR12.2 → F.format_number(F.col("amount"), 2)
# FORMAT pct PERCENT8.2 → F.concat(F.round(F.col("pct") * 100, 2).cast("string"), F.lit("%"))
# INFORMAT char_date ANYDTDTE. → F.to_date(F.col("char_date"), "yyyy-MM-dd")
# INPUT(char_num, 8.) → F.col("char_num").cast("double")
df = (
spark.table("catalog.bronze.raw_data")
.withColumn("transaction_date", F.to_date(F.col("date_str"), "MM/dd/yyyy"))
.withColumn("amount", F.col("amount_str").cast("decimal(12,2)"))
.withColumn("quantity", F.col("qty_str").cast("integer"))
.withColumn("created_ts", F.to_timestamp(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
)
PROC EXPORT to Delta Table Writes
SAS PROC EXPORT writes datasets to external formats (CSV, Excel, databases). In Databricks, the primary target is Delta Lake tables, but PySpark supports writing to any format including CSV, Parquet, JSON, and external databases.
# SAS: proc export data=gold.report outfile='/output/report.csv' dbms=csv replace; run;
# PySpark (note: Spark writes a directory of part files; coalesce(1) yields a single file):
spark.table("catalog.gold.report").coalesce(1).write.mode("overwrite").csv("/mnt/output/report.csv", header=True)
# SAS: proc export data=gold.report dbms=xlsx outfile='/output/report.xlsx' replace; run;
# PySpark (using pandas for Excel):
spark.table("catalog.gold.report").toPandas().to_excel("/tmp/report.xlsx", index=False)
# Primary pattern: write to Delta Lake (no export needed)
result_df.write.mode("overwrite").saveAsTable("catalog.gold.report")
How MigryX Automates SAS-to-Databricks Migration
MigryX provides automated conversion of SAS programs to Databricks-ready PySpark notebooks. The platform parses SAS source code — DATA steps, PROC SQL, macro definitions, PROC SORT, PROC MEANS, MERGE statements, and all supporting constructs — and generates equivalent PySpark code with Delta Lake integration.
The MigryX engine handles the most challenging aspects of SAS migration:
- DATA step conversion — Row-by-row SAS logic is analyzed and translated to columnar PySpark operations, including RETAIN-to-Window-function mapping, FIRST./LAST. processing, and array handling.
- Macro expansion — SAS macros are parsed, expanded, and converted to parameterized Python functions with proper scope management and error handling.
- PROC-to-PySpark mapping — PROC SORT, PROC MEANS, PROC FREQ, PROC TRANSPOSE, and other procedures are converted to their PySpark equivalents using the mapping table documented in this article.
- Library and dataset resolution — SAS LIBNAME references are mapped to Unity Catalog schemas, and dataset references are converted to Delta table paths.
- Format and informat handling — SAS formats and informats are translated to PySpark type conversions and formatting functions.
- MERGE-to-Delta MERGE — SAS MERGE statements with IN= conditions are converted to Delta MERGE INTO operations with proper MATCHED and NOT MATCHED clauses.
- Lineage and dependency mapping — MigryX traces data lineage across SAS programs, identifying input/output relationships and generating a migration dependency graph.
For organizations with thousands of SAS programs, manual migration is impractical. A typical enterprise SAS estate contains 5,000-20,000 programs with complex interdependencies, macro libraries, and custom formats. MigryX processes the entire estate, generates converted code, and produces validation reports that compare source and target output row counts, column profiles, and statistical summaries.
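The validation step can be sketched as a comparison of basic profiles between source and target outputs. A minimal illustration for numeric columns — the column layout and checks here are assumptions for the sketch, not MigryX's actual report format:

```python
def compare_profiles(source: dict, target: dict, tolerance: float = 1e-6) -> list:
    """Compare two outputs given as {column_name: list_of_numeric_values}.
    Returns a list of discrepancy messages; an empty list means all
    row-count and column-sum checks passed."""
    issues = []
    src_rows = {len(v) for v in source.values()}
    tgt_rows = {len(v) for v in target.values()}
    if src_rows != tgt_rows:
        issues.append(f"row count mismatch: {src_rows} vs {tgt_rows}")
    for col, values in source.items():
        if col not in target:
            issues.append(f"column missing from target: {col}")
        elif abs(sum(values) - sum(target[col])) > tolerance:
            issues.append(f"sum mismatch in column: {col}")
    return issues
```

In a real migration the same idea extends to distinct counts, null rates, and quantile profiles, computed in Spark rather than in-memory lists.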
Key Takeaways
- Every SAS construct has a direct Databricks equivalent — DATA steps to PySpark DataFrames, PROC SQL to Databricks SQL, SAS macros to Python functions, SAS datasets to Delta Lake tables, SAS libraries to Unity Catalog schemas.
- The Medallion Architecture (Bronze/Silver/Gold) provides a natural migration framework that maps to typical SAS data flows from raw ingestion through analytical output.
- MLflow on Databricks replaces SAS Enterprise Miner and SAS/STAT procedures with open-source experiment tracking, model versioning, and deployment — without per-core licensing fees.
- Delta Lake MERGE INTO is superior to SAS MERGE: atomic, ACID-compliant, and safe for concurrent writes.
- SAS BY-group processing with FIRST. and LAST. translates to PySpark Window functions with row_number(), providing the same logic at distributed scale.
- MigryX automates the conversion of SAS DATA steps, PROC SQL, macros, and procedures to production-ready PySpark notebooks with Delta Lake integration, handling the most complex migration patterns at enterprise scale.
Migrating from SAS to Databricks is not just a technology change — it is a fundamental shift in how organizations approach analytics. The move from proprietary, per-core licensing to open-source, elastic compute unlocks both cost savings and architectural flexibility. PySpark provides a natural home for SAS DATA step logic. Delta Lake surpasses SAS datasets in every dimension: ACID transactions, time travel, schema evolution, and concurrent access. And the Databricks ecosystem — MLflow, Unity Catalog, Databricks SQL — delivers capabilities that would require multiple SAS products and additional licenses to replicate.
The technical mapping is clear. The economic case is compelling. The question is not whether to migrate, but how to execute efficiently. For organizations with large SAS estates, automated conversion with MigryX reduces migration timelines from years to months, preserving institutional logic while unlocking the full potential of the Databricks Lakehouse.
Why MigryX Delivers Superior Migration Results
The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:
- Production-ready output: MigryX generates code that passes code review and runs in production — not prototype-quality output that needs weeks of cleanup.
- Platform optimization: Converted code leverages target platform-specific features for maximum performance and cost efficiency.
- 25+ source technologies: Whether migrating from SAS, Informatica, DataStage, SSIS, or any of 25+ legacy technologies, MigryX handles it.
- Automated documentation: Every conversion decision is documented with before/after code mappings and transformation rationale.
MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.
Ready to migrate from SAS to Databricks?
See how MigryX converts SAS DATA steps, PROC SQL, macros, and procedures to production-ready PySpark notebooks with Delta Lake and MLflow integration.
Explore SAS Migration Schedule a Demo