Tanveer Salim

Functional Programming's Influence on Cybersecurity



Outline

  1. The Problem with Imperative Languages

  2. History and Rise of Functional Programming Languages

    1. The Invention of LISP and Its Influence

      1. Garbage Collection

        1. Call Stack and Heap

      2. Time-Sharing

      3. REPL

  3. Functional Programming's Influence on Concurrent and Parallel Computing

    1. Recursion

    2. Immutability

    3. Persistent Data Structures

    4. Higher-Order Functions

    5. Benefits of Functional Programming in Security

    6. Case Study: Erlang

  4. Dependent Type Systems Eliminate Security Bugs

    1. Case Study: Rust was Developed in OCaml

    2. Formal Verification of Software Security: Coq


The Problem with Imperative Languages


A common problem with imperative languages is that it is very easy to mistranslate design specifications into working code. The fact that the same types of security bugs have spread across C-based codebases for decades is a testament to that. Some developers with first-hand experience of these pitfalls grew tired of them and switched to functional programming languages--languages inspired by a branch of mathematics called the lambda calculus. The use of such languages has prevented many of the security bugs that plague imperative languages.


In this video I will review how functional programming has influenced secure coding practices in the industry.


The History and Rise of Functional Programming


The history of functional programming begins with the history of modern computing. In the 1930s mathematicians debated whether a machine could decide, for any given program, whether that program would eventually terminate. This question is now known as the Halting Problem. Alan Turing and Alonzo Church independently proved that no such general procedure can exist. Turing introduced the Turing machine as a model of computation; Church introduced the lambda calculus, a formal language for expressing the computations such a machine can execute.


Inspired by Church's lambda calculus, John McCarthy programmed the first version of LISP as a more human-readable language for expressing the computations a Turing machine can perform. McCarthy introduced several programming concepts that have made our programs safer today, including garbage collection (automatic memory management), time-sharing, and symbolic expressions.


Garbage Collection: An Early Technique in Automatic Memory Management


Let's take a look at garbage collection--a pioneering technique in automatic memory management. Coders can often just specify fixed amounts of memory to be reserved for a program's use while the program is compiled--known as static memory management. Statically managed memory for function calls is assigned from a region of memory in your computer known as the call stack.


But sometimes it is necessary to allow the machine to assign memory on demand during program execution. Many of the data structures we rely on--such as linked lists, trees, and hash tables--require on-demand allocation to work properly, because they can grow or shrink in size while the program runs. This is known as dynamic memory management. Languages such as C allow coders to request memory on demand so that data structures can grow and shrink as needed. Whenever the machine must find memory during execution to meet the changing needs of these data structures, it assigns that memory from a region of your machine called the heap.


Let me explain the difference between the call stack and the heap further.


The Stack vs The Heap


The call stack is a contiguous region of memory used to execute the function calls in your code. The coder's declarations determine how much stack memory each function call requires, and the compiler emits code that reserves it automatically when the call takes place--pushing a stack frame onto the call stack. When the function call ends, its frame is popped off the stack and that memory is reclaimed automatically.


The call stack has far less memory available than the heap. A major difference between static and dynamic memory allocation is that stack memory is reclaimed automatically when each function call ends, while in languages such as C the coder is responsible for writing the instructions that manually deallocate heap memory once it is no longer needed. Throughout history coders have made mistakes managing heap memory: forgetting to deallocate it causes memory leaks, and deallocating it too early allows live data to be overwritten. When this happens code can malfunction and the computer may run out of memory--especially when dynamic memory allocation is used in programs that feature concurrency and parallelism. Such bugs in concurrent code are notoriously difficult to debug.


To avoid such memory leaks John McCarthy invented garbage collection: a technique that automatically deallocates memory from the heap when it is no longer needed. In LISP, whenever an object becomes inaccessible to the program, the garbage collector automatically deallocates it from dynamic memory. One technique for doing this is reference counting: the runtime keeps a count of how many references to each object the program still holds, and when an object's count drops to zero the object is deallocated from the heap. A second technique--the one McCarthy's LISP actually used--is the Mark-and-Sweep algorithm. In the mark stage, the collector starts from the references the program can still reach and marks every object reachable from them. In the sweep stage, it scans the heap: each marked object survives and has its mark cleared for the next collection cycle, while each unmarked object--no longer accessible to the program--is deallocated. You can learn more about garbage collection from The Garbage Collection Handbook.
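
To make the algorithm concrete, here is a toy sketch of mark-and-sweep in Common LISP. It models the heap as a list of objects with a mark bit; the obj structure and the mark and sweep functions are names made up for this sketch--a real collector works on raw memory, not LISP lists.

;; Toy model of mark-and-sweep: a "heap" of objects, each with a
;; name, a mark bit, and references to child objects.
(defstruct obj name marked children)

(defun mark (o)
  "Mark stage: recursively mark every object reachable from O."
  (when (and o (not (obj-marked o)))
    (setf (obj-marked o) t)
    (mapc #'mark (obj-children o))))

(defun sweep (heap)
  "Sweep stage: reclaim unmarked objects, clearing the marks of
survivors for the next collection cycle."
  (remove-if-not (lambda (o)
                   (prog1 (obj-marked o)
                     (setf (obj-marked o) nil)))
                 heap))

;; Usage: A references B, while C has become unreachable garbage.
(let* ((b (make-obj :name 'b))
       (a (make-obj :name 'a :children (list b)))
       (c (make-obj :name 'c))
       (heap (list a b c)))
  (mark a)                          ; A is the root the program can reach
  (mapcar #'obj-name (sweep heap))) ; => (A B) -- C was collected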


Time Sharing


John McCarthy is also credited with inventing the concept of the time-sharing computer system. In time-sharing, each user or application is assigned intervals of time during which it is allowed to use the computer's resources to compute a task. Since a task can take longer than its allotted interval, the computation pauses when the current interval expires--at which point another program or user is allowed to start or resume work on the machine. This made it possible for several users to benefit from the same computer at a lower cost. It also made it possible for the same person to perform multiple tasks on the same machine at once. The concept later enabled important innovations in computer science such as the time-sharing operating system (e.g., UNIX). Without time sharing, websites and cloud computing would not exist!


Read-Eval-Print-Loop

John McCarthy also helped invent the Read-Eval-Print Loop (REPL)--an important concept in modern cybersecurity scripting and debugging. Today, cybersecurity personnel--both blue team and red team--rely on command-line shells to command machines to perform numerous security tasks. But did you know that the history of command-line shells goes all the way back to LISP? The scripting languages you are familiar with (Python, Perl, Bash, etc.) can all execute within a REPL. A REPL accepts an expression from the user, evaluates it, and prints the result on the next line, helping the user decide which expression to enter next. REPLs proved to be an excellent environment for quickly prototyping solutions to everyday tasks--which is why most scripting languages offer one.
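
The core of a REPL is small enough to sketch in a few lines of Common LISP. The name toy-repl is made up for this illustration; real REPLs add error handling, history, and a way to exit.

;; A minimal REPL: Read an expression, Eval it, Print the result, Loop.
(defun toy-repl ()
  (loop
    (format t "~&> ")          ; show a prompt on a fresh line
    (print (eval (read)))))    ; read one expression, run it, print it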


The next innovation in the history of functional programming to have an impact on security is the invention of persistent and immutable data structures--also known as purely functional data structures.


Persistent and Immutable Data Structures


Persistent and immutable data structures have proven to be excellent tools for storing and retrieving data when handling concurrency and parallelism. In concurrency, the computer task-switches between different activities, where each activity is executed as a sequence of CPU instructions known as a thread. In parallelism, tasks are divided among several CPU cores, which can speed up program execution. A common problem with concurrency and parallelism is that it is easy for data shared between threads to be corrupted. Persistent and immutable data structures avoid common issues such as deadlocks and race conditions.


The history of persistent and immutable data structures starts with the creation of LISP by John McCarthy in 1958. LISP is considered the first functional programming language since it implements a subset of the lambda calculus. LISP was also the first programming language to support recursive function calls. A recursive function is a function that invokes itself. A classic example is the famous recursive algorithm for calculating the nth number in the Fibonacci sequence. Below is the mathematical definition followed by Common LISP code:

fib(0) = 0
fib(1) = 1
fib(n) = fib(n - 1) + fib(n - 2), for n >= 2
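
A direct Common LISP translation of this definition:

(defun fib (n)
  "Return the Nth Fibonacci number via the recursive definition."
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2)))))

(fib 10) ; => 55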

The concept of recursion is a crucial prerequisite for performing calculations with persistent and immutable data structures. These are data structures whose contents cannot be modified in place after the data structure is first initialized with values. Since the contents cannot be modified in place, a brand-new copy of the data structure with the updated contents has to be made instead.


A persistent data structure preserves the previous state of its data contents even after a new copy with updated information is made. Rather than modifying the original data structure's contents in place, it simply produces a new version of the structure containing the updated information.


These two properties--persistence and immutability--allow for the safe update of information in concurrent and parallel systems. To update information, the coder makes recursive calls that rebuild the part of the persistent data structure being changed, as in the sketch below.
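
Here is a minimal sketch of such a recursive update for a singly-linked list in Common LISP. The function name replace-nth is made up for this illustration: it copies the cells before the update point and shares everything after it with the original list.

(defun replace-nth (n new lst)
  "Return a version of LST with element N replaced by NEW.
The cells after position N are shared with the original list."
  (if (zerop n)
      (cons new (cdr lst))
      (cons (car lst) (replace-nth (1- n) new (cdr lst)))))

(defvar *xs* (list 'a 'b 'c 'd))
(replace-nth 2 'z *xs*) ; => (A B Z D) -- a new version
*xs*                    ; => (A B C D) -- the original is untouched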


The classic demonstration of immutability and persistence--described in Wikipedia's entry on persistent data structures--uses singly-linked lists:


Immutability and Persistence for Singly-Linked Lists
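
A minimal sketch of the idea in Common LISP, whose lists are chains of cons cells: prepending an element builds one new cell that points at the old list, so the old and new versions share structure instead of copying it.

(let* ((xs (list 1 2 3))     ; the original list
       (ys (cons 0 xs)))     ; a new version: (0 1 2 3)
  (list xs                   ; => (1 2 3) -- the old version survives
        ys                   ; => (0 1 2 3)
        (eq (cdr ys) xs)))   ; => T -- the tail is shared, not copied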

To make updated copies of persistent and immutable data structures the computer must often make recursive calls. The cost is that more memory is used than if we simply updated the contents of the original data structure in place. But although persistence and immutability have the downside of using more memory, they have the upside of preventing common bugs in concurrent and parallel code.


MetaLanguage, Standard ML, and Haskell


LISP introduced recursion and made it a major concept in computer science. But LISP alone did not make persistent and immutable data structures a major practice. Two later languages inspired by LISP did: Meta Language (ML) and Haskell.


History of ML


ML was developed in the 1970s at the University of Edinburgh by Robin Milner's group. It was originally designed as a language to help mathematicians prove theorems. ML was the first major language to make data structures immutable by default. ML also introduced several programming concepts that influenced future functional programming languages.


One major innovation of ML was the Hindley-Milner Type System.


The Hindley-Milner Type System


ML introduced a powerful type system known as the Hindley-Milner type system, which infers the type of every expression and ensures programs are type-safe. The Hindley-Milner type system thus prevents coders from making type errors such as assigning a string to a variable of integer type or vice versa. Major type-safe languages feature Hindley-Milner style type systems, including Standard ML, OCaml, Haskell, Clean, and Rust.


More Functional Programming Concepts


ML also introduced several concepts used in theorem proving, such as higher-order functions--functions that take other functions as arguments or return them as results--and first-class functions, which can be stored in variables and passed around like any other value. The French research institute INRIA later developed an object-oriented language built on top of ML called OCaml.
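
Higher-order functions are easy to see in the LISP notation used throughout this post: built-ins such as MAPCAR and REDUCE each take another function as an argument.

;; MAPCAR applies a function to every element of a list;
;; REDUCE folds a list down to one value using a binary function.
(mapcar (lambda (x) (* x x)) '(1 2 3 4 5)) ; => (1 4 9 16 25)
(reduce #'+ '(1 2 3 4 5))                  ; => 15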


OCaml is now the successor to ML and Standard ML. Throughout OCaml's history, mathematicians and coders have conducted plenty of research in automated theorem proving: a field of computer science in which computers assist humans in proving mathematical theorems.


Researchers have extended this concept to proving that computer programs are correct--a concept known as program correctness. Several programming languages have been invented to help coders prove their programs satisfy expectations, including assurance that a program is free of common security bugs. Such languages are called proof-assistant languages. In program correctness, the program is written in a proof-assistant language, which is designed to guide the coder in proving desired properties about the code. Coders have proven properties such as resistance to buffer overflows for programs as complex as entire cryptographic libraries, such as HACL*.


The Coq (Soon to Be Renamed "Rocq") Language


The most influential and practical proof-assistant language is Coq. Coq is written in OCaml and allows the coder to automatically extract Coq programs to OCaml or Haskell programs. It has a wealth of documentation you can start learning from, such as Software Foundations or Certified Programming with Dependent Types. Although program correctness is a niche skill in the software security industry, it is well known for providing the highest assurance that code is free from security bugs and will likely continue to be used in the industry for this reason.




