Outline
Why This Tutorial?
Prerequisites
Other Great References
Downloading gcc
Choose a Great Editor: Vim or Emacs
Our first program: "Hello World!"
Return Status: Why It's Important
Preprocessor Directives
The main() function: Its Importance
The printf() function: Use it to format what you print
Primitive Data Types
Why Learn C for Cybersecurity?
C remains a heavily used language in system programming everywhere. Our programming languages, operating systems, computer networks, web servers, content delivery networks, cryptographic libraries, aircraft, vehicles, and digital military defense continue to be developed in C.
Since C is the language of operating system development, it remains the lingua franca amongst developers.
Despite its omnipresence, C codebases suffer from more security vulnerabilities than those written in any other language used in production.
For this reason, cybersecurity professionals everywhere are expected to be able to identify such vulnerabilities in C codebases and recommend the steps needed to avoid them during development.
Why This Tutorial?
With so many books and YouTube channels dedicated to teaching C, why this one?
Some books, like C Primer Plus (buy it--it's worth it!!!), focus mainly on syntax.
Those make you good at grammar, not problem solving.
Fewer do a great job of teaching pointers, as Pointers on C does (buy it--it's worth it!!!).
Even fewer discuss how to avoid common security bugs, as Seacord's "Secure Coding in C/C++", Second Edition does (buy it--it's worth it!!!).
You will not be trusted with deploying software in C or other C-based languages until you master pointer manipulation.
College courses and industries have fallen short of their responsibility of teaching their students and employees secure coding practices to avoid common security bugs in C.
If you are reading this you are probably a college student disillusioned and frustrated with the lack of care your university put into teaching the C language.
Yet the other YouTube channels that cover these topics in depth are hard to find, do not offer exercises to build your skills, or offer courses that are outrageously overpriced.
I made up my mind to offer this comprehensive tutorial series covering the entirety of the C programming language, coding Data Structures & Algorithms in C to help you pass your job interviews, and secure coding techniques used in the industry.
By the end of this tutorial you should know C well enough to begin research on how to use C for any security-focused project you wish.
You should also be able to identify security flaws in your own and your colleagues' codebases and fix them.
Target Audience
Anyone who is interested in developing tools for software security assessment will find this tutorial helpful in their coding journey.
If you are a security engineer who is interested in developing static analysis tools, formal verification tools, or compiler-based security defenses, this tutorial series should serve as an outstanding introduction to the C language.
Each blog post will come complete with programming exercises and a solutions manual to help you check your accuracy.
Prerequisites
I assume you already have experience in an imperative programming language such as Java, C++, or Go, preferably C++. If you do not, you will struggle quite a bit to understand this tutorial. Fair warning.
I also assume you have experience using the GNU/Linux command line interface.
If you do not, I strongly recommend you stop reading this and learn how to do that NOW.
In this tutorial I used Debian GNU/Linux.
If you need to learn BASH here is a great tutorial to get started:
Another great site to learn the BASH command line is ss64.com.
Beyond just watching lectures, you actually need to practice using BASH. Install Debian GNU/Linux on your computer now and make yourself do as many tasks as possible through the command line. This will force you to become proficient at using it.
Other Great References
Great books on C that I would trust my life with include "C Primer Plus", "Pointers on C" by Kenneth A. Reek, "The C Programming Language, Second Edition", and "Fluent C".
Downloading gcc
You must first install an industry-standard C compiler. GCC is a great choice: it is the GNU project's compiler and the one traditionally used to build the Linux kernel. So I recommend you install that.
Installing on Debian
Installing GCC on Debian is straightforward. The following link explains how in the command line.
Choose a Great Editor: Vim or Emacs
I could make a separate YouTube video on the futile debates coders have had over which text editor is better for software development: Vim or Emacs. Most people start out with those glossy GUI text editors.
And then at some point the person gets ticked off that the editor slows down their development.
Vim and Emacs are both famous for helping developers finish their code faster--despite their lackluster appearance.
I strongly recommend you learn Vim if you have not already. It will help you develop faster.
Emacs is even more powerful for productivity, but it does have a steeper learning curve. The reason I recommend Vim first is that you can use Vim keystrokes in Emacs. Please do NOT try to use the Emacs built-in keystrokes. Unless you want to break your fingers.
If you are serious about using Vim, install it in your GNU/Linux environment and then see this tutorial to get started with Vim.
Write Our First Program: "Hello World!"
Let's make our first C program. Don't worry about understanding what it means. The whole point is to teach you how to write, compile, and execute C programs.
Writing a C program
In BASH enter a directory you are comfortable working in. Type in the following command to create a brand new file named "hello_world.c" and press Enter:
Now type the following in the file:
Hang on. You may be shocked by how different this is from the language you are used to. I know. The point is you need to get used to compiling and executing a program.
Now type in ":w" in Vim to save the file.
Next type in ":q" to exit Vim.
Compiling a C program
To compile our C program in BASH we will apply the following command.
The above BASH command will compile hello_world.c into a binary executable containing machine code. This is a program your machine's CPU can directly execute. The form of compilation done above is known as compilation to native machine code.
In the real world developers compile several C source code files at once to build the final executable. But for this tutorial I will keep it simple and let us just compile one file per tutorial for now.
Executing a C Program
Now that we have compiled our C program to native machine code we can execute it by doing the following in BASH:
Simply doing the above will cause "Hello World!" to be printed. If you see that in BASH then you wrote, compiled, and executed your first C program.
Now let's break down how this C program works.
Preprocessor Directives
The first line of the program is the following:
The "#include" is a preprocessor directive. Before compilation proper begins, the preprocessor replaces this line with the contents of the file "stdio.h". In the C programming language, files that declare the functions of a library's API are called header files. A header file usually contains only each function's declaration, which gives the function's name, return type, and parameter list. It does NOT contain the function's definition, which is the code the computer executes when the function is called.
For example, for the "printf" function in our program the following is the "printf" function declaration in "stdio.h" in my machine:
To help you understand "#include" better try the following in BASH:
In the photo above I copied and pasted the entire contents of "stdio.h" into a new C source code file. I then copied the contents of the original "hello_world.c" into the new file and deleted the "#include <stdio.h>" line. I then compiled and executed this new file. It ran as normal.
"stdio.h" is a header from the C standard library, meaning you can use it in your C source code immediately after installing GCC. Throughout your days coding in C you will often include standard headers such as "stdio.h". A great website for learning the C standard library is en.cppreference.com. Please pause and visit this website now. It is a lifelong reference.
Here is a link to the page on en.cppreference.com for stdio.h.
Here is a link to the page on en.cppreference.com for "printf()".
The "main()" function:
The next line we will take a look at is the "main()" function:
The line above is the start of the definition of the C function "main()". The "int" specifies the data type of the value that "main()" must return. On most current platforms "int" is 32 bits wide and holds integers from -2,147,483,648 to 2,147,483,647; the C standard itself only guarantees a range of at least -32,767 to 32,767.
The "int" in C is used the same way as in Java or C++ for function declarations. Since the "main()" function begins with data type declaration "int" the compiler will expect the coder to return a value of data type "int" at the end of "main()" execution.
The Importance of the "main()" function in C:
In C the "main()" function is where program execution begins. A normal (hosted) program cannot even be built without it.
To demonstrate this take a look at the following:
In the above photo I first display the contents of the "/tmp/test.c" C source code file. Note it does not have a definition for the "main()" function. The build therefore fails at the link step, with gcc reporting "undefined reference to 'main'".
The last part of the function declaration is the "void" within the parentheses for "main()". This tells the compiler that "main()" accepts no arguments, so nothing can be passed to it when it is invoked.
Let's take a look at those curly braces:
Just as in C++ and Java, those curly braces define the boundary of the function definition.
Any variable defined inside the function body with automatic storage duration (the default for local variables) is deallocated automatically and becomes inaccessible once the function call returns.
The "printf()" function: Use It to Format What You Print
We now turn our attention to "main()"'s function definition. Take a look at printf below:
The "printf()" function is the standard printing function C coders use. It has its pros and cons. The biggest advantage of "printf()" is that it makes it very easy to format the final string that is printed, especially when you have to print a message involving values of several different data types. I will discuss the downside to "printf()" in a future programming exercise.
Let's take a look at the first argument in "printf()":
The "%s" in the first argument tells "printf()" we wish to print a string. In C a string is an array of bytes ("char" values) terminated by a zero byte, conventionally used to store human-readable text. The second argument in "printf()" is the "Hello World!\n" string.
We can ask printf to print more strings by playing with the first argument--which allows us to tell "printf()" how we wish our strings to be printed. Below is an example:
In the modified code above we see a second "%s" string in the first argument for "printf()". This tells "printf()" to expect a third argument that is also a string. That third argument is the string "This is the second string in the second line\n".
We don't have to always specify extra "%s" strings in the first argument for "printf()". Another way of doing the above is shown below:
In the above highlighted line in Figure 15 I replaced the second "%s" with the actual string it stood for. I admit this code looks uglier, but the point is that the first argument in "printf()" is just a string. Any time you see a substring in the first argument to "printf()" that begins with a '%', it is a format specifier for a value of a certain data type. "%s" specifies yet another string that is expected in the next argument in "printf()".
Return Status: Why It's Important
The final line of code in "main()" is shown below:
This statement tells the machine to end the "main()" function call with exit status number 0. This will help coders tell if program execution completed without error. With the above statement you can easily check if your program completed without error by doing the following in BASH:
In the photo above, I check whether program execution completed successfully by printing the shell's exit-status variable. In Bash that is "echo $?"; "echo $status" is the equivalent in shells such as fish and csh. Since 0 was returned, we know our program ran without any reported errors.
Primitive Data Types
Now that we have discussed how "printf()" works let's discuss primitive data types more. We have already seen our first primitive data type "int" and we know the range of integers we are allowed to represent of that data type. There are several more. If you are experienced in Java or C++ all of the below should be familiar:
The diagram above lists most of the primitive data types (pointers excepted), along with the format specifier that printf-style functions use to print each one, its size in memory, and its range of valid values.
Now let's take a step back--why do half of these primitive data types start with the reserved keyword "signed"?
To understand why, let's compare "signed char" vs "unsigned char". Notice the only difference between the two is the valid range of values for each. In fact "unsigned char" can represent roughly twice as many positive values as "signed char" despite taking up the same amount of memory.
In "signed" binary representation the most significant bit, the leftmost bit in a bit vector, indicates the sign. A most significant bit of 0 means the number is zero or positive. A most significant bit of 1 means the number is negative. Because that bit no longer contributes to the magnitude, a "signed" type gives up half of its positive range in exchange for negative values.
To make this more clear please see the below photo that compares the binary and numerical representations of an "unsigned char" vs a "signed char":
Now let's take a look at the following code sample where we practice initializing and printing each data type:
Below is the result of compiling and executing the program:
Take a moment and take a look at the results in the command line shown above.
Take note of the "?" symbol after "unsigned_ch:". The number stored in "unsigned_ch", 255, lies outside the ASCII range (0 to 127) and is not a printable character in the terminal's encoding. That's why we see the "?".
Also take a look at the decimal "large_double" printed. Notice it is a decimal greater than 1.1e+300. Most decimal values cannot be stored exactly in binary floating point, so a loss of precision is normal, both when storing values and when doing floating-point math. This is why applications where exact arithmetic is crucial should avoid floating-point types. Banking and financial applications typically store amounts as integers (for example, a count of cents) or in dedicated decimal types instead. This prevents unnecessary loss of value.
This concludes Part 1 of this C Tutorial. To reinforce your understanding please complete the following exercises below.
Exercises
Exercise 1-1
Write a C program to print the maximum and minimum possible floating-point values for the "float" data type. Research which C standard library header offers constants for floating-point types similar to the integral constants that "limits.h" offers for the primitive integer data types discussed in this chapter. This exercise is meant to help you practice researching the C standard library and using its API, as well as practice using "printf()".
Exercise 1-2
Write a C program to print the value of sin(35.9676432), where the argument is in radians. When using "printf()" be sure to use the correct format specifier and round the result to two decimal places.
Exercise 1-3
Write a program that declares a variable of type "double" named "input" and stores the value 23998299.45989898 in it. Print the value of the variable in scientific notation with the significand rounded to three decimal places.
To round a value of type "double" to a given number of decimal places, use "%.xe" in the first argument to "printf()", where 'x' is the number of decimal places. The 'e' in "%.xe" is the conversion specifier for scientific notation.
Exercise 1-4
The following exercise will teach you the importance of being consistent with data types as you copy values from one variable to another. Compile and execute the following code. What do you notice is wrong with the output?
Exercise 1-5
When you work on projects in C you will have to include the same header in multiple C source code files. If we are not careful we can end up pasting the entire contents of a header file into the same C source file twice, which wastes compile time and can cause redefinition errors. To prevent this we use the "#ifndef" preprocessor directive to guard against double inclusion. Research how to use the "#ifndef", "#define", and "#endif" preprocessor directives and edit the following C source code file to prevent the double inclusion of the headers it includes:
Exercise 1-6
Read the following C source code file below:
Identify which variables can have their values swapped without loss of precision.
To test your conclusions switch the value of the variables that can be interchanged and print their values using "printf()".
Be sure to use the correct format specifier for each variable.
Exercise 1-7
If you are an experienced C++ developer you have used the "const" qualifier before (Java's closest equivalent is "final"). In C++ the "const" qualifier can mark constant expressions that cannot be edited after initialization. In C, however, the "const" qualifier does NOT make the variable a constant expression. It simply means the variable is read-only. C coders instead use the "#define" preprocessor directive to define a macro that stands for the value it represents. Macros defined with "#define" are substituted by the preprocessor before compilation, so there is no variable left to modify at run time.
In a C source code file initialize "#define" macros for each of the following values:
Name a macro "true" to represent 1 and another macro "false" to represent 0.
345998798002
400.001
Software Security Assessment
The following exercises are designed to test your ability to identify and correct security bugs in C source code.
Exercise 1-8
Identify what is wrong with the following C source code file. Then correct it.
Exercise 1-9
Identify what is wrong with the following C source code file. Correct it.
Further Reading
An excellent companion text for the absolute C beginner is "C Primer Plus". It goes into more depth on the syntax and semantics of "printf()".
An excellent reference to the C programming language is "C: A Reference Manual: Fifth Edition" by Harbison & Steele.
Solutions Manual
Exercise 1-1
Here is the terminal output:
Exercise 1-2
Here is the terminal output:
Exercise 1-3
Here is the terminal output:
Exercise 1-4
Here is the terminal output:
Note that "little_decimal", which has type "float", does not have the same digits after the decimal point as "big_decimal", which has type "double".
Also variable "c" that has type "char" has a different value than that of "i".
Finally "sc" that has type "char" has a different value than that of "uc".
These are common errors in C code.
Whenever copying values it is crucial to ensure both the source and destination variables have the same data type to prevent data loss.
Exercise 1-5
Exercise 1-6
All variables of the same data type can always be interchanged safely.
The variables "pi", "size_count", and "atoms_in_universe" can be interchanged without data loss.
The same is true for "size" and "llu_max_value".
The same is true for "x" and "y".
The variable "test" of type float can be assigned to any of the double variables in the source code file without data loss.
"lld_min_value" can be assigned to any of the variables of type "unsigned long long int" in the source code file.
"temperature" can be assigned to "x" or "y".
"count" can be assigned to any variable of data types "unsigned long long int" in the source code file.
Unsafe Copying Explained
The following assignments happen to be safe in this source code file, but in general they should be avoided because copying between different data types can lose precision:
"size_count" can be assigned to "test". Beware that as a "double" "size_count" takes up more memory than "test".
"size" can be assigned to "lld_min_value", "x", "y", "temperature", and "count". Beware that as an "unsigned long long int" "size" takes up more memory than "x", "y", "temperature", and "count". Beware that "size" is an "unsigned" type whereas "lld_min_value" is a "signed" type.
Below is a modified version of the file "which_variables_are_interchangeable.c" that demonstrates the explanation in "Unsafe Copying Explained":
Exercise 1-7
Here is the terminal output:
The "true" and "false" macros are how Bjarne Stroustrup originally defined them for C++. Bjarne originally used macros to modify C into the first prototype of C++. Of course Bjarne eventually had to stop relying on macros and write a complete compiler for C++.
Exercise 1-8
Below is the original source code file:
"printf()" in line 5 has what is called a format string vulnerability: the format string contains a conversion specifier, so "printf()" expects a second argument that is never supplied.
Format string vulnerabilities are common security bugs in C/C++ codebases.
If the coder fails to match the format string with the arguments, "printf()" reads whatever happens to be where the missing argument would have been, so the computer can print unpredictable data.
Below is a sample of what "what_is_wrong.c" prints in the terminal:
Note that the result can differ for each run of the program. This is called undefined behavior, since the C standard does not define what happens when a conversion specifier has no matching argument.
Below is the correct version:
Whenever we want "printf()" to print the '%' character in the first format string argument we must specify it as "%%".
Exercise 1-9
Below is the original source code for this problem:
"main()" in line 3 specifies that it returns a value of data type "float". It is standard for "main()" to return a value of data type "int". In fact when we execute this file we get the following:
Notice the exit status code is 20 instead of 0.
The correct code replaces "float" in the function declaration with "int".