Common Data Structures: Strings

The Second World War spurred the development of modern electronic computers. The world's first general-purpose electronic computer was ENIAC, the Electronic Numerical Integrator and Computer. Built at the University of Pennsylvania in the United States, it covered 167 square meters, weighed about 27 tons, and could perform about 5,000 additions per second. ENIAC was used to calculate ballistic trajectories, and numerical computation became one of the most important functions of modern electronic computers.

Numerical computation is still an important part of what computers do every day, but today's computers also process a large amount of information in text form. To handle this text in Python programs, we first need to understand the string type and its related operations and methods.

Defining Strings

A string is a finite sequence made up of zero or more characters:

$$ s = a_1a_2 \cdots a_n \,\,\,\,\, (0 \le n \lt \infty) $$

In a Python program, we represent a string by wrapping zero or more characters in single quotes or double quotes. A string wrapped in three single quotes or three double quotes can span several lines. The characters in a string can be special symbols, English letters, Chinese characters, Japanese hiragana or katakana, Greek letters, Emoji characters, and so on.

s1 = 'hello, world!'
s2 = "你好,世界!❤️"
s3 = '''hello,
wonderful
world!'''
print(s1)
print(s2)
print(s3)

Escape Characters

We can use \ (backslash) in a string to mean escaping. This means the character after \ no longer keeps its original meaning. For example, \n does not mean the character \ and the character n; it means a newline. \t does not mean the character \ and the character t; it means a tab. So if the string itself contains a backslash, or the same kind of quote that delimits it, we must escape that character with \. For example, to output a string that contains single quotes or backslashes, we can write it like this.

s1 = '\'hello, world!\''
s2 = '\\hello, world!\\'
print(s1)
print(s2)
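As a small sketch of the alternative, a single quote does not need escaping when the string is delimited by double quotes. The variable names here are only for illustration.

```python
# A single quote needs no escaping inside a double-quoted string.
s1 = "'hello, world!'"
# The escaped form used above produces exactly the same string.
s2 = '\'hello, world!\''
print(s1 == s2)   # True
# A backslash still has to be doubled in an ordinary string.
s3 = '\\hello, world!\\'
print(len(s3))    # 15: each \\ in the source is one character
```

Choosing the quote style that avoids escapes usually makes the literal easier to read.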

Raw Strings

Python also has a kind of string that starts with r or R. This kind of string is called a raw string. It means every character in the string keeps its original meaning, and there are no escape characters. For example, in the string 'hello\n', \n means a newline; but in r'hello\n', \n is only the character \ and the character n. Run the code below and see what it prints.

s1 = '\it \is \time \to \read \now'
s2 = r'\it \is \time \to \read \now'
print(s1)
print(s2)

Note: In the variable s1 above, \t, \r, and \n are escape characters: \t is a tab, \n is a newline, and \r is a carriage return, which moves the output position back to the start of the line. \i is not a recognized escape sequence, so the backslash is kept as written, although recent Python versions warn about such sequences. Compare the outputs of the two print calls to see the difference.
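One way to see the difference between an ordinary string and a raw string is to compare their lengths. The sketch below does that, and also shows a common use of raw strings, writing a Windows-style path; the path itself is a made-up example.

```python
plain = 'hello\n'    # \n is a single newline character
raw = r'hello\n'     # \ and n are two separate characters
print(len(plain))    # 6
print(len(raw))      # 7

# Hypothetical path, used only to show that no doubling is needed.
path = r'C:\temp\new'
print(path == 'C:\\temp\\new')  # True
```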

Special Character Notation

In Python, after \ we can also put an octal number or a hexadecimal number to represent a character. For example, \141 and \x61 both stand for the lowercase letter a. The former is octal notation, and the latter is hexadecimal notation. Another way to represent a character is to put a Unicode character code after \u. For example, \u9a86\u660a stands for the Chinese name "骆昊". Run the code below and see what it prints.

s1 = '\141\142\143\x61\x62\x63'
s2 = '\u9a86\u660a'
print(s1)
print(s2)
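The notations above are just different spellings of the same code values. A minimal sketch using the built-in functions ord, chr, oct, and hex makes the connection visible:

```python
# Octal 141 and hexadecimal 61 are both the decimal value 97.
print('\141' == '\x61' == 'a')                   # True
print(ord('a'), oct(ord('a')), hex(ord('a')))    # 97 0o141 0x61

# chr turns a code value back into a character.
name = chr(0x9a86) + chr(0x660a)
print(name == '\u9a86\u660a')                    # True
```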

String Operations

Python provides many operators for strings, and many of them work like the operators for lists. For example, we can use + to join strings, * to repeat the content of a string, in and not in to determine whether one string contains another string, and [] and [:] to take out one character or several characters from a string.

Concatenation and Repetition

The example below shows how to use + and * to join and repeat strings.

s1 = 'hello' + ', ' + 'world'
print(s1)    # hello, world
s2 = '!' * 3
print(s2)    # !!!
s1 += s2
print(s1)    # hello, world!!!
s1 *= 2
print(s1)    # hello, world!!!hello, world!!!

Using * to repeat a string is a very handy feature. In some programming languages, a string of 10 a characters has to be written out as 'aaaaaaaaaa' or built in a loop; in Python, you can write 'a' * 10. Writing 'aaaaaaaaaa' may not seem inconvenient, but think about what happens if the character a needs to be repeated 100 times or 1000 times.

Comparison

For two strings, we can use comparison operators directly to check whether they are equal or to compare their order. Strings are compared character by character, using the code value of each character. For example, the code of A is 65, while the code of a is 97, so the result of 'A' < 'a' is the result of 65 < 97, which is True. For 'boy' < 'bad', the first characters are both 'b', so the order is decided by the second characters. The result of 'o' < 'a' is False, so 'boy' < 'bad' is also False. If all compared characters are equal and one string is shorter, the shorter string is the smaller one. If you do not know the code value of a character, you can use the ord function to get it. We mentioned this function before. For example, the value of ord('A') is 65, and the value of ord('昊') is 26122. Look at the code below carefully.

s1 = 'a whole new world'
s2 = 'hello world'
print(s1 == s2)             # False
print(s1 < s2)              # True
print(s1 == 'hello world')  # False
print(s2 == 'hello world')  # True
print(s2 != 'Hello world')  # True
s3 = '骆昊'
print(ord('骆'))            # 39558
print(ord('昊'))            # 26122
s4 = '王大锤'
print(ord('王'))            # 29579
print(ord('大'))            # 22823
print(ord('锤'))            # 38180
print(s3 >= s4)             # True
print(s3 != s4)             # True
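To make the character-by-character rule concrete, here is a rough sketch of how < could be written by hand. The function less_than is a hypothetical helper for illustration only; in real code, use the < operator.

```python
def less_than(a, b):
    """Hypothetical helper that mimics a < b for two strings."""
    for x, y in zip(a, b):
        if ord(x) != ord(y):
            # The first differing character decides the result.
            return ord(x) < ord(y)
    # All shared positions are equal, so the shorter string is smaller.
    return len(a) < len(b)

print(less_than('A', 'a'))       # True
print(less_than('boy', 'bad'))   # False
print(less_than('he', 'hello'))  # True
```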

Membership

In Python, we can use in and not in to check whether a string contains a given character or substring. Just like with lists, in and not in are called membership operators, and they produce the Boolean value True or False, as shown below.

s1 = 'hello, world'
s2 = 'goodbye, world'
print('wo' in s1)      # True
print('wo' not in s2)  # False
print(s2 in s1)        # False

Length

Getting the length of a string is the same as getting the number of elements in a list: use the built-in function len, as shown below.

s = 'hello, world'
print(len(s))                 # 12
print(len('goodbye, world'))  # 14
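It may help to know that len counts characters (Unicode code points), not bytes and not what appears on screen as one symbol. The sketch below uses the encode method, which is covered later in this chapter, and writes the heart Emoji with \u escapes so that its two code points are visible.

```python
s = '你好'
print(len(s))                  # 2 characters
print(len(s.encode('utf-8')))  # 6 bytes in UTF-8

# A heart symbol followed by a variation selector: one visible
# symbol on screen, but two code points in the string.
heart = '\u2764\ufe0f'
print(len(heart))              # 2
```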

Indexing and Slicing

Strings support indexing and slicing just like lists and tuples, because strings are also ordered sequences. But there is one thing to note: because strings are immutable, you cannot use indexing to modify characters in a string.

s = 'abc123456'
n = len(s)
print(s[0], s[-n])    # a a
print(s[n-1], s[-1])  # 6 6
print(s[2], s[-7])    # c c
print(s[5], s[-4])    # 3 3
print(s[2:5])         # c12
print(s[-7:-4])       # c12
print(s[2:])          # c123456
print(s[:2])          # ab
print(s[::2])         # ac246
print(s[::-1])        # 654321cba

One more reminder: when doing indexing, if the index goes out of range, Python raises IndexError, and the error message is string index out of range.
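The following sketch illustrates two points from this section under simple assumptions: assigning to an index fails because strings are immutable, and slicing, unlike indexing, tolerates out-of-range positions.

```python
s = 'abc123456'
try:
    s[0] = 'A'          # strings do not support item assignment
    modified = True
except TypeError:
    modified = False
print(modified)         # False

# To "change" a character, build a new string instead.
t = 'A' + s[1:]
print(t)                # Abc123456

# Slices are clipped to the string's bounds instead of raising IndexError.
print(s[5:100])         # 3456
print(s[100:] == '')    # True
```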

Iterating Over Characters

If we want to go through each character in a string, we can use a for-in loop. There are two common ways.

Method 1:

s = 'hello'
for i in range(len(s)):
    print(s[i])

Method 2:

s = 'hello'
for elem in s:
    print(elem)
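When both the index and the character are needed, the built-in enumerate function combines the two methods above. This is a small sketch; the name pairs is only for illustration.

```python
s = 'hello'
for i, ch in enumerate(s):
    print(i, ch)

# enumerate yields (index, character) tuples.
pairs = list(enumerate(s))
print(pairs[0])   # (0, 'h')
```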

String Methods

In Python, we can work with strings through methods that belong to the string type itself. Suppose we have a string named foo, and the string has a method named bar, then the syntax for using the string method is foo.bar(). This is the same syntax we used before for list methods.

Case Conversion

The code below shows methods related to string case conversion.

s1 = 'hello, world!'
# Capitalize the first letter of the string.
print(s1.capitalize())  # Hello, world!
# Capitalize the first letter of every word.
print(s1.title())       # Hello, World!
# Convert the string to uppercase.
print(s1.upper())       # HELLO, WORLD!
s2 = 'GOODBYE'
# Convert the string to lowercase.
print(s2.lower())       # goodbye
# Check the values of s1 and s2.
print(s1)
print(s2)

Note: Because strings are immutable, these methods return new strings rather than changing the original one. So when we check the values of s1 and s2 at the end, their values have not changed.

Searching

If we want to search for one string inside another string from left to right, we can use the find or index method. When using find and index, we can also specify the search range through the method arguments, which means the search does not have to start at index 0.

s = 'hello, world!'
print(s.find('or'))      # 8
print(s.find('or', 9))   # -1
print(s.find('of'))      # -1
print(s.index('or'))     # 8
print(s.index('or', 9))  # ValueError: substring not found

Note: If find cannot find the specified string, it returns -1; if index cannot find the specified string, it raises ValueError.
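Because find accepts a start position, it can be called repeatedly to collect every match. The function find_all below is a hypothetical helper written for this illustration, not a built-in string method. The second part shows one way to handle the ValueError raised by index.

```python
def find_all(text, sub):
    """Hypothetical helper: return every index where sub starts in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Continue searching just after the previous match.
        start = text.find(sub, start + 1)
    return positions

print(find_all('hello, world!', 'o'))   # [4, 8]
print(find_all('hello, world!', 'of'))  # []

# index raises ValueError, so wrap it when a miss is possible.
try:
    pos = 'hello, world!'.index('of')
except ValueError:
    pos = None
print(pos)  # None
```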

find and index also have reverse-search versions, which search from right to left. They are rfind and rindex, as shown below.

s = 'hello world!'
print(s.find('o'))       # 4
print(s.rfind('o'))      # 7
print(s.rindex('o'))     # 7

Property Checks

We can use startswith and endswith to check whether a string starts or ends with another string. The methods whose names start with is check other properties of a string. All of these methods return Boolean values, as shown below.

s1 = 'hello, world!'
print(s1.startswith('He'))   # False
print(s1.startswith('hel'))  # True
print(s1.endswith('!'))      # True
s2 = 'abc123456'
print(s2.isdigit())  # False
print(s2.isalpha())  # False
print(s2.isalnum())  # True

Note: isdigit checks whether a string is made up only of digits. isalpha checks whether a string is made up only of letters; letters here means Unicode letters, which does not include Emoji characters. isalnum checks whether a string is made up only of letters and digits.
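A few edge cases are easy to get wrong with these checks. The sketch below lists some of them; the sample strings are arbitrary.

```python
# An empty string fails every is... check.
print(''.isdigit())         # False
# A space is neither a letter nor a digit.
print('abc 123'.isalnum())  # False
# Chinese characters count as Unicode letters.
print('骆昊'.isalpha())      # True
# A decimal point or a minus sign is not a digit.
print('12.5'.isdigit())     # False
print('-3'.isdigit())       # False
```

This means isdigit alone is not enough to validate signed or decimal numbers typed by a user.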

Alignment and Padding

In Python, strings can be centered, left-aligned, or right-aligned through the center, ljust, and rjust methods. If we want to pad the left side of a string with zeros, we can also use the zfill method.

s = 'hello, world'
print(s.center(20, '*'))  # ****hello, world****
print(s.rjust(20))        #         hello, world
print(s.ljust(20, '~'))   # hello, world~~~~~~~~
print('33'.zfill(5))      # 00033
print('-33'.zfill(5))     # -0033
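A typical use of these methods is lining up columns in plain-text output. The sketch below assumes some made-up sample data and a column layout chosen only for illustration.

```python
# Made-up sample data: (name, count) pairs.
rows = [('apple', 3), ('kiwi', 12), ('banana', 125)]

# Left-align the name in 10 columns, right-align the count in 5.
lines = [name.ljust(10, '.') + str(count).rjust(5) for name, count in rows]
for line in lines:
    print(line)
# apple.....    3
# kiwi......   12
# banana....  125
```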

Formatting

We have already seen the following formatting style when using the print function.

a = 321
b = 123
print('%d * %d = %d' % (a, b, a * b))

Of course, we can also use the string format method to format a string, as shown below.

a = 321
b = 123
print('{0} * {1} = {2}'.format(a, b, a * b))

Starting from Python 3.6, there is a simpler way to write formatted strings: add f before the string. In an f-string, {expression} is a placeholder that is replaced by the value of the expression, which can be a variable name or a calculation such as a * b, as shown below.

a = 321
b = 123
print(f'{a} * {b} = {a * b}')

If we need more control over how variable values are shown, we can use formatting patterns like those in the table below.

| Value | Placeholder | Result | Meaning |
| --- | --- | --- | --- |
| `3.1415926` | `{:.2f}` | `'3.14'` | keep 2 decimal places |
| `3.1415926` | `{:+.2f}` | `'+3.14'` | show sign and keep 2 decimal places |
| `-1` | `{:+.2f}` | `'-1.00'` | show sign and keep 2 decimal places |
| `3.1415926` | `{:.0f}` | `'3'` | no decimals |
| `123` | `{:0>10d}` | `'0000000123'` | pad on the left with 0 to 10 characters |
| `123` | `{:x<10d}` | `'123xxxxxxx'` | pad on the right with x to 10 characters |
| `123` | `{:>10d}` | `'       123'` | pad on the left with spaces to 10 characters |
| `123` | `{:<10d}` | `'123       '` | pad on the right with spaces to 10 characters |
| `123456789` | `{:,}` | `'123,456,789'` | comma-separated format |
| `0.123` | `{:.2%}` | `'12.30%'` | percentage format |
| `123456789` | `{:.2e}` | `'1.23e+08'` | scientific notation |
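The same patterns work with the format method, inside an f-string after a colon, and with the built-in format function. A short sketch, with variable names chosen only for illustration:

```python
pi = 3.1415926
two_places = '{:.2f}'.format(pi)
print(two_places)               # 3.14
print(f'{pi:+.2f}')             # +3.14
print(f'{123:0>10d}')           # 0000000123
print(f'{123456789:,}')         # 123,456,789
print(format(0.123, '.2%'))     # 12.30%
print(repr(f'{123:>10d}'))      # '       123'
```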

Trimming

The strip method returns a new string with the specified characters removed from both ends of the original string. By default, it removes whitespace such as spaces, tabs, and newlines. This method is very practical: it can remove leading and trailing spaces that a user typed by mistake. The strip method also has lstrip and rstrip versions, which remove characters only from the left end or only from the right end.

s1 = '   jackfrued@126.com  '
print(s1.strip())      # jackfrued@126.com
s2 = '~你好,世界~'
print(s2.lstrip('~'))  # 你好,世界~
print(s2.rstrip('~'))  # ~你好,世界
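One detail worth showing is that the argument to strip is treated as a set of characters to remove, not as a prefix or suffix to match exactly. A minimal sketch with arbitrary sample strings:

```python
# Every leading and trailing character found in 'cmowz.' is removed.
domain = 'www.example.com'.strip('cmowz.')
print(domain)                            # example

# With no argument, tabs and newlines are removed as well as spaces.
print('\t hello \n'.strip() == 'hello')  # True

# lstrip only touches the left end.
print('xxhixx'.lstrip('x'))              # hixx
```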

Replacing

If we want to replace specified content in a string with new content, we can use the replace method, as shown below. The first parameter of replace is the content to be replaced, the second parameter is the new content, and the third parameter can be used to specify the number of replacements.

s = 'hello, good world'
print(s.replace('o', '@'))     # hell@, g@@d w@rld
print(s.replace('o', '@', 1))  # hell@, good world

Splitting and Joining

We can use the string method split to split one string into many strings and put them into a list. We can also use the string method join to join many strings in a list into one string, as shown below.

s = 'I love you'
words = s.split()
print(words)            # ['I', 'love', 'you']
print('~'.join(words))  # I~love~you
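Note that join only accepts strings. If the list holds other types such as numbers, one common approach is to convert each element with str first, as in this sketch with made-up data:

```python
nums = [1, 2, 3]   # made-up sample data
try:
    ', '.join(nums)          # fails: the elements are int, not str
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)           # True

# Convert each element to a string before joining.
joined = ', '.join(str(n) for n in nums)
print(joined)                # 1, 2, 3
```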

Note that split with no argument splits on runs of whitespace. We can also pass a different separator, and we can set the maximum number of splits to control the result, as shown below.

s = 'I#love#you#so#much'
words = s.split('#')
print(words)    # ['I', 'love', 'you', 'so', 'much']
words = s.split('#', 2)
print(words)    # ['I', 'love', 'you#so#much']
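Calling split() with no argument and calling split(' ') behave differently when the text has irregular spacing. The sketch below shows the difference, along with rsplit, which counts splits from the right; the sample strings are arbitrary.

```python
s = 'I  love   you'      # two spaces, then three spaces

# No argument: runs of whitespace count as one separator.
print(s.split())         # ['I', 'love', 'you']

# Explicit ' ': every single space is a separator, so empty strings appear.
print(s.split(' '))      # ['I', '', 'love', '', '', 'you']

# rsplit applies the split limit starting from the right.
print('a,b,c'.rsplit(',', 1))  # ['a,b', 'c']
```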

Encoding and Decoding

Besides the string type str, Python also has a byte-string type for binary data, which is bytes. A byte string is a finite sequence made up of zero or more bytes. Through the encode method of a string, we can encode a string into a byte string in some encoding. We can also use the decode method of a byte string to decode a byte string into a string, as shown below.

a = '骆昊'
b = a.encode('utf-8')
c = a.encode('gbk')
print(b)                  # b'\xe9\xaa\x86\xe6\x98\x8a'
print(c)                  # b'\xc2\xe6\xea\xbb'
print(b.decode('utf-8'))  # 骆昊
print(c.decode('gbk'))    # 骆昊

Please note that if the encoding used for decoding differs from the one used for encoding, the result is either garbled text, meaning the original content cannot be restored, or a UnicodeDecodeError that crashes the program if it is not handled.
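Both failure modes can be reproduced in a few lines. The sketch below uses latin-1 as the mismatched encoding for the garbled case, because it maps every byte to some character and so never raises; that choice is an assumption made for the demonstration. The errors='replace' argument of decode is one way to avoid the exception.

```python
text = '骆昊'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Case 1: decoding succeeds but the text is garbled.
garbled = utf8_bytes.decode('latin-1')
print(garbled == text)   # False
print(len(garbled))      # 6, one character per byte

# Case 2: these GBK bytes are not valid UTF-8, so decoding raises.
try:
    gbk_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)     # True

# errors='replace' substitutes U+FFFD for bytes that cannot be decoded.
recovered = gbk_bytes.decode('utf-8', errors='replace')
print('\ufffd' in recovered)  # True
```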

Other Methods

For the string type, there is another common operation: checking whether a string matches some specific pattern. For example, when a website checks the username and email in user registration information, that is a kind of pattern-matching check. The tool used for pattern matching is called a regular expression. Python supports regular expressions through the re module in the standard library. We will talk about this later in the course.

Summary

Knowing how to represent and work with strings is very important for programmers, because we often need to process text information. In Python, we can work with strings through operators such as concatenation, indexing, and slicing, and we can also use the many methods provided by the string type itself.