The Secret Life of Python: The String Intern Pool - When Two Strings Are One Object

Timothy was debugging a performance bottleneck when he noticed something strange in his profiler output. "Margaret, look at this," he said, pointing at his screen. "I'm comparing thousands of string values, and sometimes the comparisons are incredibly fast - but other times they're slower. Both use ==, so why the difference?"

Margaret smiled knowingly. "Let me guess - the fast comparisons are with short strings or identifiers, and the slow ones are with longer, dynamically created strings?"

"Exactly! How did you know?"

"You've discovered Python's string interning," Margaret said. "It's like the integer cache, but more selective and sophisticated. Let's explore when Python decides two strings should be the same object in memory."

The Puzzle: Inconsistent String Identity

Timothy showed Margaret his confusing test results:

def demonstrate_string_identity_puzzle():
    """Sometimes strings share identity, sometimes they don't"""

    # Short strings that look identical
    a = "hello"
    b = "hello"
    print(f"'hello' is 'hello': {a is b}")  # True - same object

    # Longer strings that look identical
    x = "hello world from python"
    y = "hello world from python"
    print(f"Long string is check: {x is y}")  # True - compiled together

    # But dynamically created strings...
    s1 = "hello " + "world"
    s2 = "hello " + "world"
    print(f"Concatenated is check: {s1 is s2}")  # False - different objects!

    # Even though they're equal in value
    print(f"Concatenated == check: {s1 == s2}")  # True - same value

    # Check memory addresses
    print(f"\nMemory addresses:")
    print(f"s1: {id(s1)}")
    print(f"s2: {id(s2)}")  # Different addresses!

demonstrate_string_identity_puzzle()

Output:

'hello' is 'hello': True
Long string is check: True
Concatenated is check: False
Concatenated == check: True

Memory addresses:
s1: 140234567890123
s2: 140234567890456

"See the inconsistency?" Timothy pointed. "Sometimes identical strings share identity, sometimes they don't."

String Interning: Python's Selective Optimization

Margaret sketched out the concept:

"""
String Interning: Python's strategy for reusing string objects

Unlike the integer cache (which caches ALL integers from -5 to 256),
string interning is SELECTIVE and follows specific rules:

AUTOMATICALLY INTERNED:
1. String literals that look like identifiers (letters, digits, underscores)
2. String literals seen at compile time
3. Very short strings (often single characters)

NOT AUTOMATICALLY INTERNED:
1. Strings created at runtime (concatenation, user input, file reading)
2. Strings with special characters (spaces, punctuation in some cases)
3. Very long strings

WHY SELECTIVE?
- Strings can be arbitrary length (unlike integers: fixed size)
- Not all strings are frequently reused
- Interning has overhead (hash table lookup, storage)
- Better to intern selectively based on likely reuse
"""

def explore_interning_rules():
    """Discover Python's string interning rules"""

    # Rule 1: Identifier-like strings ARE interned
    var1 = "hello"
    var2 = "hello"
    print(f"Identifier-like: {var1 is var2}")  # True

    # Rule 2: Strings with spaces - depends on context
    str1 = "hello world"
    str2 = "hello world"
    print(f"With space (literals): {str1 is str2}")  # True (compile-time)

    # Rule 3: Runtime-created strings are NOT interned
    str3 = "hello" + " " + "world"
    str4 = "hello" + " " + "world"
    print(f"Runtime created: {str3 is str4}")  # False

    # Rule 4: Strings from input/files are NOT interned
    user_input = input("Type 'hello': ")  # User types: hello
    literal = "hello"
    print(f"User input vs literal: {user_input is literal}")  # False

# Note: Don't run explore_interning_rules() directly due to input()

The Implementation: How Interning Works

"Let me show you what's happening under the hood," Margaret said.

"""
Conceptual implementation of string interning:

# Python maintains a global intern dictionary
_INTERNED_STRINGS = {}

def intern_string(string_value):
    '''Check if string should be interned'''

    # Check if already interned
    if string_value in _INTERNED_STRINGS:
        return _INTERNED_STRINGS[string_value]

    # Decide if this string qualifies for interning
    if should_intern(string_value):
        _INTERNED_STRINGS[string_value] = string_value
        return _INTERNED_STRINGS[string_value]

    # Don't intern - return new object
    return create_new_string_object(string_value)

def should_intern(s):
    '''Rules for automatic interning'''
    # Compile-time string literals
    # Identifier-like strings (alphanumeric + underscore)
    # Short strings
    return is_identifier_like(s) or is_compile_time_constant(s)
"""

def demonstrate_intern_table():
    """Show how the intern table works conceptually"""
    import sys

    # These strings are interned automatically
    s1 = "python"
    s2 = "python"

    print(f"Automatic interning:")
    print(f"  s1 is s2: {s1 is s2}")
    print(f"  id(s1): {id(s1)}")
    print(f"  id(s2): {id(s2)}")

    # These are NOT interned automatically
    s3 = "hello " + "world"
    s4 = "hello " + "world"

    print(f"\nNo automatic interning:")
    print(f"  s3 is s4: {s3 is s4}")
    print(f"  id(s3): {id(s3)}")
    print(f"  id(s4): {id(s4)}")

demonstrate_intern_table()

Output:

Automatic interning:
  s1 is s2: True
  id(s1): 140234567890100
  id(s2): 140234567890100

No automatic interning:
  s3 is s4: False
  id(s3): 140234567890200
  id(s4): 140234567890300

Manual Interning with sys.intern()

"What if I want to intern a string that Python wouldn't automatically intern?" Timothy asked.

Margaret showed him the tool:

import sys

def demonstrate_manual_interning():
    """Use sys.intern() to force string interning"""

    # Create strings that won't be automatically interned
    str1 = "hello" + " " + "world"
    str2 = "hello" + " " + "world"

    print("Before manual interning:")
    print(f"  str1 is str2: {str1 is str2}")  # False
    print(f"  str1 == str2: {str1 == str2}")  # True

    # Manually intern them
    str1 = sys.intern(str1)
    str2 = sys.intern(str2)

    print("\nAfter manual interning:")
    print(f"  str1 is str2: {str1 is str2}")  # True!
    print(f"  Same object now: {id(str1) == id(str2)}")  # True

def when_to_use_manual_interning():
    """Real-world use case: comparing many strings repeatedly"""
    import sys
    import time

    # Scenario: Web framework comparing route paths
    routes = ["/" + str(i) for i in range(1000)]

    # Without interning - slower comparisons
    start = time.perf_counter()
    for _ in range(10000):
        for route in routes:
            if route == "/500":  # String comparison
                pass
    time_without = time.perf_counter() - start

    # With interning - faster comparisons
    interned_routes = [sys.intern(r) for r in routes]
    search_target = sys.intern("/500")

    start = time.perf_counter()
    for _ in range(10000):
        for route in interned_routes:
            if route is search_target:  # Identity check (fast!)
                pass
    time_with = time.perf_counter() - start

    print(f"Comparison performance:")
    print(f"  Without interning: {time_without:.4f} seconds")
    print(f"  With interning:    {time_with:.4f} seconds")
    print(f"  Speedup: {time_without / time_with:.2f}x")

demonstrate_manual_interning()
print("\n" + "="*50 + "\n")
when_to_use_manual_interning()

Output (approximate):

Before manual interning:
  str1 is str2: False
  str1 == str2: True

After manual interning:
  str1 is str2: True!
  Same object now: True

==================================================

Comparison performance:
  Without interning: 0.2341 seconds
  With interning:    0.0823 seconds
  Speedup: 2.84x

Why String Comparison Matters

Timothy was intrigued. "So interned strings can use identity checks instead of value comparison?"

def explain_comparison_performance():
    """Why identity checks are faster than value comparison"""

    # Identity check (is): O(1) - just compare memory addresses
    a = "hello"
    b = "hello"
    result = a is b  # Two pointer comparisons - instant

    # Value comparison (==): O(n) - must compare each character
    x = "hello" + " " + "world"
    y = "hello" + " " + "world"
    result = x == y  # Compare: h==h, e==e, l==l, l==l, o==o, etc.

    """
    For a string of length n:
    - Identity check: O(1) - constant time
    - Value comparison: O(n) - linear time

    For short strings, the difference is tiny.
    For long strings or millions of comparisons, it adds up!
    """

def demonstrate_performance_difference():
    """Show the performance difference"""
    import time

    # Create long strings
    long_str1 = "x" * 10000
    long_str2 = "x" * 10000

    # Identity check (fast)
    start = time.perf_counter()
    for _ in range(1000000):
        result = long_str1 is long_str2
    identity_time = time.perf_counter() - start

    # Value comparison (slower for long strings)
    start = time.perf_counter()
    for _ in range(1000000):
        result = long_str1 == long_str2
    equality_time = time.perf_counter() - start

    print(f"1 million comparisons of 10,000-char strings:")
    print(f"  Identity (is): {identity_time:.4f} seconds")
    print(f"  Equality (==): {equality_time:.4f} seconds")
    print(f"  Speedup: {equality_time / identity_time:.2f}x")

demonstrate_performance_difference()

Compile-Time vs Runtime String Creation

Margaret explained a subtle but important distinction:

def demonstrate_compile_time_vs_runtime():
    """Compile-time strings vs runtime strings"""

    # Compile-time: Python sees these when compiling the .py file
    compile_time_1 = "hello world"
    compile_time_2 = "hello world"
    print(f"Compile-time literals: {compile_time_1 is compile_time_2}")  # True

    # Also compile-time: constant folding during compilation
    compile_time_3 = "hello" + " " + "world"  # Folded to "hello world" at compile time
    print(f"Compile-time folding: {compile_time_1 is compile_time_3}")  # Often True

    # Runtime: created during program execution
    def create_at_runtime():
        return "hello" + " " + "world"

    runtime_1 = create_at_runtime()
    runtime_2 = create_at_runtime()
    print(f"Runtime creation: {runtime_1 is runtime_2}")  # False

    # User input is always runtime
    # file_content = open('file.txt').read()  # Runtime
    # api_response = requests.get(url).text    # Runtime

demonstrate_compile_time_vs_runtime()

Output:

Compile-time literals: True
Compile-time folding: True
Runtime creation: False

The Identifier Rule

"Why does Python intern identifier-like strings?" Timothy asked.

def explain_identifier_interning():
    """Why Python interns identifier-like strings"""

    """
    Identifiers = variable names, function names, class names, etc.

    In Python code, these appear CONSTANTLY:
    - Variable names in your code
    - Dictionary keys in JSON/configs
    - Attribute lookups (obj.attribute_name)
    - Module/function names in imports

    Example: In a typical Python program, you might have:
    - Thousands of dictionary lookups
    - Hundreds of attribute accesses per second
    - Many string comparisons in routing, parsing, etc.

    If these strings are interned, lookups become faster:
    - Dictionary keys can use identity for fast comparison
    - Attribute lookups benefit from interning
    - Module name checks are faster
    """

    # Simulate dictionary with many lookups
    import sys

    # Create dictionary with interned keys
    config = {
        sys.intern("database_host"): "localhost",
        sys.intern("database_port"): 5432,
        sys.intern("api_key"): "secret",
        sys.intern("max_connections"): 100
    }

    # Lookup with interned key (fast path)
    key = sys.intern("database_host")
    value = config[key]  # Identity check during lookup

    print("Dictionary lookups benefit from key interning")
    print(f"  Key 'database_host' is interned: {key is list(config.keys())[0]}")

explain_identifier_interning()

Real-World Use Cases

Margaret showed Timothy where interning matters in production:

import sys

def use_case_1_web_routing():
    """Use case: Web framework route matching"""

    class Router:
        def __init__(self):
            # Intern route patterns at startup
            self.routes = {
                sys.intern("/api/users"): "get_users",
                sys.intern("/api/posts"): "get_posts",
                sys.intern("/api/comments"): "get_comments",
            }

        def match_route(self, path):
            # Intern incoming path for fast comparison
            interned_path = sys.intern(path)
            # Identity check is faster than string comparison
            return self.routes.get(interned_path)

    router = Router()
    # In production, this happens thousands of times per second
    handler = router.match_route("/api/users")
    print(f"Route matched: {handler}")

def use_case_2_json_parsing():
    """Use case: Parsing JSON with repeated keys"""
    import json

    # JSON often has repeated key names across objects
    json_data = """
    [
        {"name": "Alice", "age": 30, "city": "NYC"},
        {"name": "Bob", "age": 25, "city": "LA"},
        {"name": "Charlie", "age": 35, "city": "NYC"}
    ]
    """

    # Parse JSON
    data = json.loads(json_data)

    # Keys "name", "age", "city" appear multiple times
    # If interned, dictionary lookups are faster
    for person in data:
        # Each lookup benefits if keys are interned
        name = person["name"]
        age = person["age"]
        city = person["city"]

def use_case_3_compiler_symbol_tables():
    """Use case: Compilers and interpreters"""

    class SymbolTable:
        """Compilers use symbol tables with many string lookups"""

        def __init__(self):
            # Variable names in source code are interned
            self.symbols = {}

        def define(self, name, value):
            # Intern symbol names for fast lookup
            interned_name = sys.intern(name)
            self.symbols[interned_name] = value

        def lookup(self, name):
            # Fast identity-based lookup
            interned_name = sys.intern(name)
            return self.symbols.get(interned_name)

    # In a compiler, you might see the same variable names
    # hundreds of times throughout a program
    symbols = SymbolTable()
    symbols.define("counter", 0)
    symbols.define("result", None)

    # These lookups happen frequently during compilation
    val = symbols.lookup("counter")

use_case_1_web_routing()
print()
use_case_2_json_parsing()
print("JSON keys benefit from interning")
print()
use_case_3_compiler_symbol_tables()
print("Compiler symbol tables use interning")

The Danger: Misusing Identity Checks

Timothy remembered the lesson from integer caching. "So I should still use == for string comparison, right?"

def demonstrate_identity_pitfalls():
    """Why you should NOT use 'is' for string comparison"""

    def buggy_string_check(user_input):
        # ❌ WRONG - Don't use 'is' for string comparison!
        if user_input is "admin":
            return "Welcome, admin!"
        return "Access denied"

    # Test with literal (might work by accident)
    result1 = buggy_string_check("admin")
    print(f"Literal input: {result1}")  # Might work

    # Test with runtime-created string (will fail!)
    username = input("Enter username: ")  # User types: admin
    result2 = buggy_string_check(username)
    print(f"User input: {result2}")  # FAILS! "Access denied"

    # Correct implementation
    def correct_string_check(user_input):
        # ✓ CORRECT - Always use == for value comparison
        if user_input == "admin":
            return "Welcome, admin!"
        return "Access denied"

def real_world_bug():
    """Real bug from production code"""

    # Configuration loading
    def load_config_buggy(config_string):
        # ❌ BUG: Using 'is' for comparison
        if config_string is "production":
            return {"debug": False, "logging": "error"}
        return {"debug": True, "logging": "debug"}

    # This works (compile-time constant)
    config1 = load_config_buggy("production")
    print(f"Literal config: {config1}")

    # This fails (runtime string from file/env)
    env = "prod" + "uction"  # Simulating runtime creation
    config2 = load_config_buggy(env)
    print(f"Runtime config: {config2}")  # Wrong config loaded!

    # Correct version
    def load_config_correct(config_string):
        # ✓ CORRECT
        if config_string == "production":
            return {"debug": False, "logging": "error"}
        return {"debug": True, "logging": "debug"}

# Note: Don't run demonstrate_identity_pitfalls() directly due to input()
real_world_bug()

Memory Implications

Margaret showed Timothy the memory benefits:

import sys

def demonstrate_memory_savings():
    """Show memory savings from string interning"""

    # Without interning: many duplicate strings
    words = ["hello"] * 1000000

    # Check if they're the same object
    all_same = all(word is words[0] for word in words)
    print(f"Million copies of 'hello':")
    print(f"  All point to same object: {all_same}")
    print(f"  Memory for one string: {sys.getsizeof('hello')} bytes")
    print(f"  Total memory if separate: {sys.getsizeof('hello') * 1000000:,} bytes")
    print(f"  Actual memory (interned): ~{sys.getsizeof('hello'):,} bytes")
    print(f"  Memory saved: ~{(sys.getsizeof('hello') * 1000000) - sys.getsizeof('hello'):,} bytes")

    # Runtime-created strings don't get this benefit
    runtime_words = ["hel" + "lo" for _ in range(1000)]
    all_same_runtime = all(word is runtime_words[0] for word in runtime_words)
    print(f"\nThousand runtime-created 'hello' strings:")
    print(f"  All point to same object: {all_same_runtime}")

demonstrate_memory_savings()

Output:

Million copies of 'hello':
  All point to same object: True
  Memory for one string: 54 bytes
  Total memory if separate: 54,000,000 bytes
  Actual memory (interned): ~54 bytes
  Memory saved: ~53,999,946 bytes

Thousand runtime-created 'hello' strings:
  All point to same object: False

When NOT to Intern

"Should I intern all my strings?" Timothy asked.

def when_not_to_intern():
    """
    DON'T INTERN:

    1. Large strings (waste of memory in intern table)
    2. Rarely-used strings (no performance benefit)
    3. Temporary strings (short-lived, not reused)
    4. User-generated content (unpredictable, potentially huge)
    5. Strings that will be modified (can't modify, so why intern?)

    Interning has overhead:
    - Hash table lookup cost
    - Memory in the intern table
    - The string lives forever (can't be garbage collected)
    """

    import sys

    # ❌ DON'T: Intern large strings
    large_text = "x" * 1000000
    # sys.intern(large_text)  # Bad idea - wastes memory

    # ❌ DON'T: Intern unique/temporary strings
    for i in range(10000):
        # Don't intern - each is unique and temporary
        temp = f"temp_string_{i}"
        # sys.intern(temp)  # Bad idea - no reuse benefit

    # ✓ DO: Intern small, frequently-reused strings
    status_codes = ["OK", "ERROR", "WARNING", "INFO"]
    interned_codes = [sys.intern(code) for code in status_codes]

    # ✓ DO: Intern dictionary keys that appear repeatedly
    config_keys = ["host", "port", "username", "password"]
    interned_keys = [sys.intern(key) for key in config_keys]

when_not_to_intern()

Testing String Interning

Margaret showed Timothy how to test interning behavior:

import sys
import pytest

def test_automatic_interning():
    """Test that identifier-like strings are interned"""
    # Identifier-like strings should be interned
    s1 = "python"
    s2 = "python"
    assert s1 is s2  # Same object

    # Single characters usually interned
    c1 = "a"
    c2 = "a"
    assert c1 is c2

def test_no_automatic_interning_for_runtime():
    """Test that runtime strings are NOT interned"""
    # Runtime concatenation
    s1 = "hel" + "lo"
    s2 = "hel" + "lo"
    assert s1 == s2  # Same value
    assert s1 is not s2  # Different objects

def test_manual_interning():
    """Test sys.intern() forces interning"""
    # Create non-interned strings
    s1 = "hello " + "world"
    s2 = "hello " + "world"
    assert s1 is not s2

    # Manually intern
    s1 = sys.intern(s1)
    s2 = sys.intern(s2)
    assert s1 is s2  # Now same object!

def test_intern_is_idempotent():
    """Test that interning same string multiple times is safe"""
    s = "test"
    s1 = sys.intern(s)
    s2 = sys.intern(s)
    s3 = sys.intern(s1)

    assert s1 is s2 is s3  # All the same object

def test_intern_with_equality():
    """Test that interned strings still compare correctly"""
    s1 = sys.intern("hello")
    s2 = sys.intern("hello")

    assert s1 == s2  # Value equality
    assert s1 is s2  # Identity

# Run with: pytest test_string_interning.py -v

The Library Metaphor

Margaret brought it back to the library:

"Think of string interning like the library's reference section policy," she said.

"For commonly-referenced books - dictionaries, encyclopedias, style guides - we keep permanent copies that everyone shares. When someone needs 'The Python Style Guide,' they don't get their own copy. They get access to the shared reference copy that lives permanently in the reference section.

"Python does the same with identifier-like strings and compile-time constants. Strings like 'name', 'value', 'error' appear constantly in programs, so Python keeps one shared copy rather than creating duplicates.

"But when someone requests an obscure book title - especially one we've never seen before, like a user-generated title from a form submission - we create a temporary copy just for them. If someone else requests that exact obscure title later, we create another separate copy, because the cost of tracking and checking every possible string ever created would be too expensive.

"You can manually 'add to the reference section' using sys.intern(), telling Python 'this string will be used repeatedly, please keep one shared copy.' But use this wisely - you don't want to fill your reference section with strings nobody will ever look up again!"

Common Misconceptions

Timothy compiled a list:

"""
STRING INTERNING MYTHS vs REALITY:

MYTH: "All identical strings are the same object"
REALITY: Only identifier-like strings and compile-time constants

MYTH: "String interning works like integer caching"
REALITY: Much more selective - depends on context and content

MYTH: "I should use 'is' to compare interned strings"
REALITY: Always use '==' for value comparison

MYTH: "sys.intern() makes string comparison faster"
REALITY: Only if you're doing MANY comparisons of the SAME strings

MYTH: "All strings under 20 characters are interned"
REALITY: Length doesn't determine interning - content and context do

MYTH: "Interning is free and always beneficial"
REALITY: Has overhead - only beneficial for frequently-reused strings

MYTH: "Python 2 and Python 3 intern the same way"
REALITY: Python 3 is more conservative about automatic interning
"""

Performance Optimization Pattern

Margaret showed a real optimization:

import sys
import time

class OptimizedRouter:
    """Web router using string interning for fast path matching"""

    def __init__(self, routes):
        # Intern all route patterns at initialization
        self.routes = {
            sys.intern(path): handler
            for path, handler in routes.items()
        }
        print(f"Initialized with {len(self.routes)} interned routes")

    def match(self, path):
        """Match incoming path to handler"""
        # Intern the incoming path
        interned_path = sys.intern(path)

        # Dictionary lookup with interned key is fast
        # (identity check first, then hash comparison if needed)
        return self.routes.get(interned_path, "not_found")

def benchmark_routing():
    """Benchmark interned vs non-interned routing"""

    # Create routes
    routes = {f"/api/endpoint{i}": f"handler{i}" for i in range(100)}

    # Test with non-interned router
    simple_router = dict(routes)

    start = time.perf_counter()
    for _ in range(100000):
        path = "/api/endpoint50"
        handler = simple_router.get(path)
    time_simple = time.perf_counter() - start

    # Test with interned router
    optimized_router = OptimizedRouter(routes)

    start = time.perf_counter()
    for _ in range(100000):
        path = "/api/endpoint50"
        handler = optimized_router.match(path)
    time_optimized = time.perf_counter() - start

    print(f"\n100,000 route lookups:")
    print(f"  Without interning: {time_simple:.4f} seconds")
    print(f"  With interning:    {time_optimized:.4f} seconds")
    print(f"  Speedup: {time_simple / time_optimized:.2f}x")

benchmark_routing()

Key Takeaways

Margaret summarized the lesson:

"""
STRING INTERNING KEY TAKEAWAYS:

1. String interning is selective, not universal
   - Identifier-like strings (alphanumeric + underscore)
   - Compile-time string literals
   - NOT runtime-created strings

2. Use sys.intern() for manual interning
   - When you have frequently-reused strings
   - Dictionary keys used repeatedly
   - String comparisons in hot paths
   - But only for small, frequently-used strings

3. Identity vs Equality (again!)
   - ALWAYS use '==' for string value comparison
   - NEVER use 'is' for string comparison
   - Interning is an optimization, not a guarantee

4. Performance benefits:
   - Identity checks (O(1)) vs value comparison (O(n))
   - Faster dictionary lookups with interned keys
   - Memory savings when many references to same string

5. When to intern manually:
   - Web routing with repeated paths
   - JSON parsing with repeated keys
   - Compiler symbol tables
   - Configuration keys used throughout app

6. When NOT to intern:
   - Large strings (memory waste)
   - Unique/temporary strings (no reuse)
   - User input (unpredictable)
   - Strings that live forever (can't be GC'd once interned)

7. Interning has overhead:
   - Hash table lookup
   - Memory in intern table
   - Object lives forever in memory
   - Only beneficial when strings are reused many times

8. Real-world impact:
   - Modest performance gains (2-3x for hot paths)
   - Significant memory savings for repeated strings
   - Critical for performance in parsers, compilers, web frameworks
"""

Timothy nodded, understanding. "So string interning is like integer caching, but smarter and more selective. Python interns what's likely to be reused - identifiers and compile-time strings - but leaves runtime strings alone unless I explicitly intern them with sys.intern()."

"Perfect," Margaret said. "And the golden rule remains: use == for value comparison, never is. Interning is an optimization detail, not something to rely on in your comparison logic."

With that, Timothy understood how Python optimizes string memory and when to take control with manual interning for performance-critical code.

Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.

Search This Blog

Tech-Reader.blog