Stringy Pythons

Jed Rembold & Fred Agbo

February 5, 2025

Announcements

  • Problem Set #3 is posted and due on Monday 12 at 10 pm
  • This week’s lecture introduces concepts you will need for the PS and for the first project coming up next week
  • I hope you are reading along in the chapter for this week
  • Link to Polling https://www.polleverywhere.com/agbofred203

Review!

How would you represent the number \(28_{16}\) in binary?

  1. \(01101101_2\)
  2. \(10101010_2\)
  3. \(00111010_2\)
  4. \(00101000_2\)

Other Data Types (STRING)

  • Numbers are great, but what about other types of data?
  • We briefly touched on the string data type in an earlier class, so let us revisit it in more detail
  • Other data types you will see later include:
    • lists!

Strings

  • A string in Python represents textual data, in the form of a sequence of individual characters
    • Domain: all possible sequences of characters
    • Operations: Many! We’ll see some of them soon
  • Denoted by placing the desired sequence of characters between two quotation marks
    • 'I am a string'
    • In Python, either single or double quotes can be used, but the ends must match
      • "I am also a string!"
      • "I'm sad you've gone"

Lists

  • A list in Python represents a sequence of any type of data
  • Denoted by bordering with square brackets ([, ]) with commas separating each element of the sequence
    • Each element could be any data type (even mixing from element to element!)
    • ['This', 'is', 'a', 'list']
    • ['Great', 4, 'storing', 5 * 10]
  • There are many operations possible on lists, but we will start with only the basics

Sequences

  • Both strings and lists are examples of a more general type called a sequence
    • Strings are sequences of characters
    • Lists are sequences of anything
  • Sequences are ordered, so we can number off their elements, which we call their index
    • Counting in Python always starts with 0, so the first element of the sequence has index 0
  • Python defines operations that work on all sequences
    • Selecting an individual element out of a sequence
    • Concatenating two sequences together
    • Determining the number of elements in a sequence

Selection

  • You can select or “pluck out” just a single element from a sequence using square brackets [ ]
    • There are no commas between these square brackets, so they can’t be confused with a list
    • The square brackets come after the sequence (or variable name representing a sequence)
    • Inside the square brackets, you place the index number of the element you want to select
>>> A = [2, 4, 6, 8]
>>> print(A[1])
4
>>> B = "Spaghetti"
>>> print(B[6])
t

Concatenation

  • Concatenation is the act of taking two separate objects and bringing them together to create a single object
  • For sequences, concatenation takes the contents of one sequence and adds them to the end of another sequence
  • The + operator concatenates sequences
    • This is why it is important to keep track of your variable types! + will add two integers, but will concatenate two strings
    >>> 'fish' + 'sticks'
    'fishsticks'
    >>> A = [1, 'fish']
    >>> B = [2, 'fish']
    >>> print(A + B)
    [1, 'fish', 2, 'fish']
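To see why tracking types matters, here is a small sketch comparing + on integers and on strings. The variable names are chosen only for illustration.

```python
# + adds numbers but concatenates strings
num_sum = 3 + 4        # integer addition
str_sum = "3" + "4"    # string concatenation
print(num_sum)         # 7
print(str_sum)         # 34

# Mixing a string and an integer is an error, so convert explicitly first
try:
    mixed = "3" + 4
except TypeError:
    mixed = "3" + str(4)
print(mixed)           # 34
```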

Lengths

  • The number of elements in a sequence is commonly called its length, and can be given by the len( ) function

  • Simply place the sequence you desire to know the length of between the parentheses:

    >>> len("spaghetti")
    9
  • You can have sequences of 0 length as well!

    >>> A = ""
    >>> B = [ ]
    >>> print( len(A) + len(B) )
    0

Representing Characters

  • We use numeric encodings to represent character data inside the machine, where each character is assigned an integer value.
  • Character codes are not very useful unless standardized though!
    • Competing encodings in the early years made it difficult to share data across machines
  • The first widely adopted character encoding was ASCII (American Standard Code for Information Interchange)
  • ASCII originally had just 128 possible characters. Even after extended versions grew to 256, it proved inadequate internationally and has therefore been superseded by Unicode.

ASCII

[Figure: chart of the ASCII characters laid out by hexadecimal code, with rows 0x through 7x and columns 0 through F]
The ASCII subset of Unicode

Meeting chr and ord

  • Python includes two built-in functions to simplify conversion between an integer and the corresponding Unicode character
  • chr takes a base-10 integer and returns the corresponding Unicode character as a string
    • chr(65) gives "A" (capital A)
    • chr(960) gives "π" (Greek letter pi)
  • ord goes the other direction, taking a single character string and returning the corresponding base-10 integer of that character in Unicode
    • ord("B") gives 66
    • ord(" ") gives 32
    • ord("π") gives 960
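As a quick illustration of how the two functions fit together, the sketch below steps from one letter to the next by way of its Unicode value. The variable names are illustrative only.

```python
# Move one letter forward in the alphabet using its Unicode value
letter = "A"
code = ord(letter)            # 65
next_letter = chr(code + 1)   # "B"
print(code, next_letter)

# ord and chr undo each other
round_trip = chr(ord("π"))
print(round_trip == "π")      # True
```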

Abstract Strings

  • Characters (and their Unicode representation) are most often used in programming when combined to make collections of consecutive characters called strings.
  • Internally, strings are stored as a sequence of characters in a sequential chunk of memory.
  • You don’t have to (and generally don’t want to) think of the internal representation.
    • Better to think of the string as a single abstract unit
  • Python emphasizes this abstract view by defining a built-in string class that already defines a selection of higher-level operations on string objects

Character Picking Recap

  • A string is an ordered collection of characters
    • Character positions in the string are identified by an index, which starts at 0

  • You can select individual characters from the string using the syntax

    string[k]

    where string is the variable assigned to the desired string and k is the index integer of the character you want

    >>> print("spaghetti sauce"[5])
    e

Back it Up

  • Sometimes it is more useful to count from the end of the string, not the beginning
  • Python gives you a convenient way to do this, using negative indexes


  • A common use case is to grab the last character of the string, using

    s[-1]

    which is shorthand for

    s[len(s)-1]
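A short sketch of negative indexing on an example string (the string itself is just a sample):

```python
s = "spaghetti"

last = s[-1]           # same as s[len(s) - 1]
second_last = s[-2]    # same as s[len(s) - 2]
first = s[-len(s)]     # the most negative valid index is -len(s)

print(last, second_last, first)   # i t s
```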

Slicing

  • Often, you may want more than a single character

  • Python allows you to specify a starting and an ending index through an operation known as slicing

  • The syntax looks like:

    string_variable[start : limit]

    where start is the first index to be included and everything up to but not including the limit is included

  • start and limit are actually optional (but the : is not)

    • If start is omitted, the slice will begin at the start of the string
    • If limit is omitted, the slice will proceed to the end of the string

and Dicing

  • Can add a third component to the slice syntax, called a stride

    string_variable[start : limit : stride]
  • Specifies how large the steps are between each included index

  • Can also make the stride negative to proceed backwards through a string

    >>> s = "spaghetti sauce"
    >>> s[4:8]
    'hett'
    >>> s[10:]
    'sauce'
    >>> s[:10:2]
    'sahti'
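The negative stride mentioned above can be sketched on the same sample string. With a negative stride, start should be the larger index, since the slice walks backwards toward limit.

```python
s = "spaghetti sauce"

backwards = s[::-1]    # whole string, last character first
middle = s[8:3:-1]     # indexes 8, 7, 6, 5, 4 (limit 3 is excluded)
tail = s[-5:]          # negative indexes work in slices too

print(backwards)   # ecuas ittehgaps
print(middle)      # itteh
print(tail)        # sauce
```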

Repeat again?

  • We’ve already seen how we can use addition (+) in Python to concatenate strings
  • In math, adding something many times is the same as multiplying

\[5+5+5+5+5+5 = 6 \times 5\]

  • The same logic holds true for Python strings!
    • You multiply by an integer: the number of times you want the concatenation repeated
    • You cannot multiply two strings together; Python will not understand what you are trying to do
    print("Betelgeuse, " * 3)
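A minimal sketch of string repetition, including what happens when both operands are strings. The names here are illustrative only.

```python
laugh = "ha" * 3    # the string repeated three times
same = 3 * "ha"     # the integer can come first as well
print(laugh)        # hahaha

# Multiplying a string by a string raises a TypeError
try:
    "ha" * "ha"
    multiplied = True
except TypeError:
    multiplied = False
print(multiplied)   # False
```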

Comparing Strings

  • Python lets you use normal comparison operators to compare strings

    string1 == string2

    is true if string1 and string2 contain the same characters in the same order

  • Comparisons involving greater than or less than are done similarly to alphabetical ordering

    • Start at the beginning and compare the first characters. If they are the same, compare the next characters, and so on
  • All comparisons are done according to their Unicode values.

    • Called lexicographic ordering
    • "cat" > "CAT"
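A few comparisons, sketched with sample words, show how the Unicode values drive the result:

```python
same = "cat" == "cat"              # identical characters in the same order
alphabetical = "cat" < "dog"       # "c" comes before "d"
prefix_first = "cat" < "catalog"   # a prefix sorts before the longer string
case_matters = "cat" > "CAT"       # ord("c") is 99, ord("C") is 67
upper_first = "Zebra" < "apple"    # ord("Z") is 90, ord("a") is 97

print(same, alphabetical, prefix_first, case_matters, upper_first)

# One common way to compare while ignoring case: lowercase both sides first
ignore_case = "cat".lower() == "CAT".lower()
print(ignore_case)   # True
```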

Can’t change a string’s colors

  • Strings are what we call immutable: they cannot be modified in place by clients.

  • You can “look” at different parts of the string, but you cannot “change” those parts without making a whole new string

    s = "Cats!"
    s[0] = "R"   # THIS WILL ERROR!!
  • You can of course create a new string object with the desired traits:

    s = "R" + s[1:]
  • This applies to all methods that act on strings as well: they return a new string and do not modify the original
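The sketch below puts these points together: the failed assignment, a method that leaves the original untouched, and the rebuild into a new string.

```python
s = "Cats!"

# Assigning to an index raises a TypeError
try:
    s[0] = "R"
except TypeError as err:
    print("Cannot assign:", err)

# Methods return a new string and leave the original alone
shouted = s.upper()
print(s)         # Cats!
print(shouted)   # CATS!

# Build a new string and reassign the variable instead
s = "R" + s[1:]
print(s)         # Rats!
```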

Methods to find string patterns

Method                     Description
string.find(pattern)       Returns the first index of pattern in string, or -1 if it does not appear
string.find(pattern, k)    Same as the one-argument version, but starts searching at index k
string.rfind(pattern)      Returns the last index of pattern in string, or -1 if missing
string.rfind(pattern, k)   Same as the one-argument version, but only searches from index k to the end of string
string.startswith(prefix)  Returns True if the string starts with prefix
string.endswith(suffix)    Returns True if the string ends with suffix
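A brief sketch of the searching methods on a sample string:

```python
s = "spaghetti sauce"

first_a = s.find("a")        # 2
next_a = s.find("a", 3)      # 11: search starts at index 3
last_a = s.rfind("a")        # 11
missing = s.find("z")        # -1: not found
where_ett = s.find("ett")    # 5: patterns can be longer than one character

print(first_a, next_a, last_a, missing, where_ett)
print(s.startswith("spag"))  # True
print(s.endswith("sauce"))   # True
```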

Transforming Methods

Method                    Description
string.lower()            Returns a copy of string with all letters converted to lowercase
string.upper()            Returns a copy of string with all letters converted to uppercase
string.capitalize()       Returns a copy of string with the first character capitalized and the rest lowercase
string.strip()            Returns a copy of string with whitespace removed from both ends
string.replace(old, new)  Returns a copy of string with all instances of old replaced by new
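Because each of these methods returns a new string, calls can be chained. A small sketch with a sample string:

```python
raw = "  hello World  "

trimmed = raw.strip()                 # "hello World"
capped = trimmed.capitalize()         # "Hello world"
swapped = trimmed.replace("l", "L")   # "heLLo WorLd"

# Each call works on the result of the one before it
tidy = raw.strip().lower().replace(" ", "_")

print(trimmed, capped, swapped, tidy)
print(repr(raw))   # the original is unchanged
```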

Classifying Character Methods

Method               Description
char.isalpha()       Returns True if char is a letter
char.isdigit()       Returns True if char is a digit
char.isalnum()       Returns True if char is a letter or a digit
char.islower()       Returns True if char is a lowercase letter
char.isupper()       Returns True if char is an uppercase letter
char.isspace()       Returns True if char is a whitespace character (space, tab, or newline)
char.isidentifier()  Returns True if char is a legal Python identifier
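One possible use of the classifying methods is tallying the kinds of characters in a string. The sample text and counter names below are illustrative only.

```python
text = "Room 203 B"

letters = 0
digits = 0
spaces = 0
for ch in text:
    if ch.isalpha():
        letters += 1
    elif ch.isdigit():
        digits += 1
    elif ch.isspace():
        spaces += 1

print(letters, digits, spaces)   # 5 3 2

# isidentifier is usually applied to a whole name, not a single character
print("x1".isidentifier())       # True
print("1x".isidentifier())       # False: names cannot start with a digit
```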

Igpay Atinlay

  • Suppose we wanted to write a script that converted English to Pig Latin
  • Rules of Pig Latin:
    • If the word begins with a consonant, move everything up to the first vowel to the end and append “ay”
      fleet → eetflay
    • If the word starts with a vowel, just append “way” to the end
      orange → orangeway
    • If the word has no vowels, do nothing
  • Our decomposition:
    • Find first vowel
    • Convert a single word

Indingfay Owelsvay

def find_first_vowel_index(word):
    """
    Find the first vowel in a word and return its index,
    or return None if no vowels found.
    """
    for i in range(len(word)):
        index = "aeiou".find(word[i].lower())
        if index != -1:
            return i
    return None

Onvertcay Oneway Ordway

def word_2_pig_latin(word):
    """
    Convert a single word with no special characters from
    English to Pig Latin.
    """
    vowel = find_first_vowel_index(word)
    if vowel is None:
        return word
    elif vowel == 0:
        return word + "way"
    else:
        return word[vowel:] + word[:vowel] + "ay"
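To check the two functions against the three rules, a short test run might look like the sketch below. The functions are restated so the block runs on its own; the vowel test uses the in operator in place of find, and the word "hmm" is just a sample with no vowels.

```python
def find_first_vowel_index(word):
    """Return the index of the first vowel in word, or None if there is none."""
    for i in range(len(word)):
        if word[i].lower() in "aeiou":
            return i
    return None


def word_2_pig_latin(word):
    """Convert a single word from English to Pig Latin."""
    vowel = find_first_vowel_index(word)
    if vowel is None:
        return word
    elif vowel == 0:
        return word + "way"
    else:
        return word[vowel:] + word[:vowel] + "ay"


# One sample word for each rule
for word in ["fleet", "orange", "hmm"]:
    print(word, "->", word_2_pig_latin(word))
```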