CHARACTER FUNCTIONS

If you’ve ever learned a programming language other than structured query language (SQL) or managed data within Microsoft Excel, then it’s highly likely you’ve used functions or methods to easily manipulate and transform character data (i.e., strings) into a more desirable state. In my experience, these functions are rarely needed for analytical purposes. Instead, they are largely used to remedy data quality issues resulting from poor application and/or database controls. In other instances, transformations may be required due to requests from the consumers of your data within reports and/or dashboards.

Let’s look at a handful of character functions, starting off those that tackle the ubiquitous requirement of altering character case type. The schemas, corresponding tables, and data used in the examples that follow can be found at livesql.oracle.com.

Altering Character Casing

-- Convert all characters to lower-case or upper-case.
SELECT
    oe.customers.cust_first_name,
    UPPER(oe.customers.cust_first_name) AS first_name_upper,
    LOWER(oe.customers.cust_first_name) AS first_name_lower
FROM
    oe.customers;

cust_first_name	first_name_upper	first_name_lower
Sivaji	SIVAJI	sivaji
Mammutti	MAMMUTTI	mammutti
Elia	ELIA	elia
Ishwarya	ISHWARYA	ishwarya
Gustav	GUSTAV	gustav

-- Capitalize the first letter in each word/sequence of characters.
SELECT
    oe.product_information.product_description,
    INITCAP(oe.product_information.product_description) AS altered_product_description
FROM
    oe.product_information;

product_description	altered_product_description
Inkjet printer, color, 8 pages per minute, high resolution (photo quality). Memory: 16MB. Dimensions (HxWxD): 7.3 x 17.5 x 14 inch. Paper size: A4, US Letter, envelopes. Interface: Centronics parallel, IEEE 1284 compliant.	Inkjet Printer, Color, 8 Pages Per Minute, High Resolution (Photo Quality). Memory: 16mb. Dimensions (Hxwxd): 7.3 X 17.5 X 14 Inch. Paper Size: A4, Us Letter, Envelopes. Interface: Centronics Parallel, Ieee 1284 Compliant.

The two examples above are relatively straightforward. That said, an explanation follows for clarity.

In the first example, the pre-existing attribute, cust_first_name is passed as the argument to the UPPER and LOWER functions. Consequently, the result-set contains the customer’s first name as it exists in the table and two new columns containing the same values but with modified character casing. In the second example, product descriptions (i.e., product_description) undergo a transformation made possible by the INITCAP function, resulting in first letter of each word or series of non-space characters being capitalized.

String Concatenation
Like the altering of character casing in strings, concatenation of strings is both, easy and common. Often, the way data is stored and the way you wish to present it are at odds with one another. For example, in the customers table, there exists two columns dedicated to storing the customer’s first and last name. From a database designer’s perspective, it’s sensible to avoid composite attributes; however, for a business user and/or consumer of your reports and dashboards, not so much. Often, it is desirable to display first and last name values as a single string as if it was stored in a single column.

The example below returns a result-set containing the pre-existing customer first and last name attributes along with two columns containing each customer’s first and last name, separated by a space and in one column. As a bonus, there is also a column that combines the concatenation operator with the UPPER function to convert the customer’s full name into all uppercase characters. I’ve done this to show that the result of a function can be passed as an argument into another function. In the example, you’ll notice that two different methods were used to achieve the same thing. The third column in our SELECT clause utilizes the CONCAT function while the fourth column uses Oracle’s concatenate operator, the pipe (i.e., ||). While my initial intentions were to not go in-depth regarding function documentation, I feel that this example requires it due to the nested CONCAT function used to derive the concatenation of the customer’s first and last name in the third column. Comments on the documentation and how it relates to the example follow.

The CONCAT functions accepts two arguments with both typically being strings or numeric values converted to a character data type. Passing two values into this function will result in the concatenation of those values. Great and simple enough – right? In some instances, yes; however, in many instances we’ll want to concatenate two values but have those values separated by a space. Technically, that means we need to concatenate three values, but the CONCAT function only accepts two. A solution to this problem is to pass the concatenation of two strings as the first or second argument, enabling the concatenation of three values.

In my example below, I’ve chosen to pass the concatenation of each customer’s first name and a space character into the first argument. The customer’s last name is passed as the second argument. This results in the concatenation of each customer’s first and last name, separated by a space.

-- Concatenate multiple column values.
SELECT
    oe.customers.cust_first_name,
    oe.customers.cust_last_name,
    CONCAT(
        CONCAT(
            oe.customers.cust_first_name,
            ' '
        ),
        oe.customers.cust_last_name
    ) AS cust_full_name,
    (oe.customers.cust_first_name
        || ' '
        || oe.customers.cust_last_name
    ) AS cust_full_name,
    UPPER(
        oe.customers.cust_first_name
        || ' '
        || oe.customers.cust_last_name
    ) AS cust_full_name_all_caps
FROM
	oe.customers;

cust_first_name	cust_last_name	cust_full_name	cust_full_name	cust_full_name_all_caps
Sivaji	Landis	Sivaji Landis	Sivaji Landis	SIVAJI LANDIS
Mammutti	Pacino	Mammutti Pacino	Mammutti Pacino	MAMMUTTI PACINO
Elia	Fawcett	Elia Fawcett	Elia Fawcett	ELIA FAWCETT
Ishwarya	Roberts	Ishwarya Roberts	Ishwarya Roberts	ISHWARYA ROBERTS
Gustav	Steenburgen	Gustav Steenburgen	Gustav Steenburgen	GUSTAV STEENBURGEN

Removing Leading and Trailing Characters
Often, applications permit users to submit data in free-form text, meaning there is not a finite number of values to select from a drop-down list. Instead, the user is permitted to simply type in their response. A common consequence of free-form text is that the data must be cleaned up prior to it becoming useful in analyses. An example of this is the removal of leading and trailing spaces to ensure only distinct responses are captured. See below for the code corresponding to this type of task.

-- Remove unwanted characters from the beginning and/or end of a string.
SELECT
    TRIM('    The spaces from both sides of this column''s values have been removed.    ') AS trim_both_sides,
    LTRIM('    Only the leading spaces have been removed from this column''s values.    ') AS left_trim,
    RTRIM('    Only the trailing spaces have ben removed from this column''s values.    ') AS right_trim
FROM
    dual;

trim_both_sides	left_trim	right_trim
The spaces from both sides of this column’s values have been removed.	Only the leading spaces have been removed from this column’s values.	Only the trailing spaces have ben removed from this column’s values.

Padding Values
Seldom in my career have I had to concern myself with padding values; however, in these rare instances, it was incredibly important to be able to do so. In a very memorable case, hundreds of legacy systems of record were retired and consolidated into one. Unfortunately, after this took place, the new key values generated by the new system of record were four characters longer than the legacy ones. As a result, transactions created in the new system that were related to those in the old system based on the key value were technically different. For example, 099 is distinct from 99 with equality comparisons yielding false when character data type and no transformations are performed such as converting these two numbers to integers. This is quite problematic for tying historic transactions and newer transactions to one another based on the key value and is evident when attempting to JOIN two tables together based on these values.

In the following example, I address padding order ID values and sales representative ID values with zeros. This results in all values of the same attribute being of identical length.

-- Pad order ID and sales representative ID values with 0s.
SELECT
    oe.orders.order_id,
    LPAD(oe.orders.order_id, 10, 0) AS order_id_conversion,
    RPAD(oe.orders.sales_rep_id, 5, 0) AS sales_rep_conversion
FROM
    oe.orders;

order_id	order_id_conversion	sales_rep_conversion
2458	2458	15300
2397	2397	15400
2454	2454	15400
2354	2354	15500
2358	2358	15500

String Extraction
Extraction of specific portions of a string is supported by every programming language I’ve used throughout my career. It’s also supported in the most popular spreadsheet software, Microsoft Excel. In my opinion, the hard part when it comes to string extraction is developing the strategy.

Okay, so how do we do it in SQL? A widely available function, SUBSTR (i.e., substring), can be used by itself when the extraction starting position is static. Simply provide the string to extract from, the starting position represented by an integer, and the number of characters you would like to extract. If the number of characters to extract is not provided, then this argument defaults to the number of characters from the starting position to the end of the string. In the first example below, three digits are extracted from each employee’s phone number, starting at the fifth number in the phone number and ending with the third digit following the starting digit. Keep in mind that while I say digit, the phone numbers are stored as character data type in the employees table. This is important as the function accepts expressions of character data type, only. If the phone number was stored as a numeric, you could simply convert the value and pass that as the argument to SUBSTR. Easy enough – right? Yes, when the starting position is static.

-- Extract three digits from each employee's phone number.
SELECT
    hr.employees.phone_number,
    SUBSTR(hr.employees.phone_number, 5, 3) AS digit_extraction
FROM
    hr.employees;

phone_number	digit_extraction
650.509.4876	509
650.507.9879	507
650.505.4876	505
590.423.5567	423
011.44.1345.729268	44.

In the subsequent examples, you’ll notice that extracting portions of a string where the starting position of the extraction varies from value to value introduces a fair level of complexity. In these instances, the function INSTR is used to locate the value representing the desired starting position within a given string and return the position of that value within the string as an integer.

-- Extract the email domain from each customer email string.
SELECT
    oe.customers.cust_email,
    SUBSTR(oe.customers.cust_email,
    	INSTR(oe.customers.cust_email, '@', 1, 1) + 1
    ) AS email_domain
FROM
    oe.customers;

cust_email	email_domain
Sivaji.Landis@GOLDENEYE.EXAMPLE.COM	GOLDENEYE.EXAMPLE.COM
Mammutti.Pacino@GREBE.EXAMPLE.COM	GREBE.EXAMPLE.COM
Elia.Fawcett@JACANA.EXAMPLE.COM	JACANA.EXAMPLE.COM
Ishwarya.Roberts@LAPWING.EXAMPLE.COM	LAPWING.EXAMPLE.COM
Gustav.Steenburgen@PINTAIL.EXAMPLE.COM	PINTAIL.EXAMPLE.COM

In the example above, SUBSTR and INSTR are used together to extract the email domain from each customer’s email address. The starting position for the extraction is determined by the integer position of the “@” character within each email address. The “+ 1” included in SUBSTR’s second argument ensures that the extraction occurs at the character immediately following the “@” character.

In the next example, additional complexity is introduced when extracting each customer’s last name from the email address while extracting the customer’s first name is similar to extracting the email domain. What makes pulling out each customer’s last name so difficult? The customer’s last names are of variable length and do not start at the beginning nor end with the last character of each email address. In other words, both the starting position and the number of characters to extract are variable. Luckily, each email address has attributes that can assist us in this task. Namely, the “.” character that separates the first and last name and the “@” character that follows the last character of each last name within the email addresses.

-- Extract each customer's first and last name from each email.
SELECT
    oe.customers.cust_email,
    oe.customers.cust_first_name,
    oe.customers.cust_last_name,
    INSTR(oe.customers.cust_email, '.', 1, 1) AS first_dot,
    SUBSTR(oe.customers.cust_email, 1,
       INSTR(oe.customers.cust_email, '.', 1, 1) - 1
    ) AS first_name,
    SUBSTR(oe.customers.cust_email,
       INSTR(oe.customers.cust_email, '.', 1, 1) + 1,
       INSTR(oe.customers.cust_email, '@', 1, 1) - 1
           - INSTR(oe.customers.cust_email, '.', 1, 1)
    ) AS last_name
FROM
    oe.customers;

cust_email	cust_first_name	cust_last_name	first_dot	first_name	last_name
Sivaji.Landis@GOLDENEYE.EXAMPLE.COM	Sivaji	Landis	7	Sivaji	Landis
Mammutti.Pacino@GREBE.EXAMPLE.COM	Mammutti	Pacino	9	Mammutti	Pacino
Elia.Fawcett@JACANA.EXAMPLE.COM	Elia	Fawcett	5	Elia	Fawcett
Ishwarya.Roberts@LAPWING.EXAMPLE.COM	Ishwarya	Roberts	9	Ishwarya	Roberts
Gustav.Steenburgen@PINTAIL.EXAMPLE.COM	Gustav	Steenburgen	7	Gustav	Steenburgen

In the final example below, the goal is to extract the second set of digits within each employee phone number. That is, extract the sequence of digits that appear between the first and second dot or period in the phone number value. Like the example above, calculating the position of each dot will allow us to determine where to start and end the extraction task for each phone number. While the 1^st dot always appears in the fourth position within each phone number, the 2^nd dot position varies. This is a fact that must be accounted for in the solution. Arguably, it would be wise to assume that the first dot could also be found at varying positions within each phone number to develop a more robust solution. In my solution, I’ve made this assumption. In the output is the original phone number value, the sequence of digits extracted when not accounting for varying positions of the first dot and the number of digits present between the two dots, and a final column that accounts for the varying positions of the two dots and the number of digits present between them.

-- Extract the second set of digits within a telephone number.
SELECT
    hr.employees.phone_number,
    SUBSTR(hr.employees.phone_number, 5, 3) AS digit_extraction,
    SUBSTR(hr.employees.phone_number,
        INSTR(hr.employees.phone_number, '.', 1, 1) + 1,
        INSTR(hr.employees.phone_number, '.', 1, 2) - 1
            - INSTR(hr.employees.phone_number, '.', 1, 1)
    ) AS variable_digit_extraction
FROM
    hr.employees;

phone_number	digit_extraction	variable_digit_extraction
515.123.4567	123	123
590.423.4567	423	423
515.124.4569	124	124
011.44.1344.429268	44.	44
650.507.9876	507	507