The Secret Life of Python: The String Intern Pool - When Two Strings Are One Object
Timothy was debugging a performance bottleneck when he noticed something strange in his profiler output. "Margaret, look at this," he said, pointing at his screen. "I'm comparing thousands of string values, and sometimes the comparisons are incredibly fast - but other times they're slower. Both use ==, so why the difference?"
Margaret smiled knowingly. "Let me guess - the fast comparisons are with short strings or identifiers, and the slow ones are with longer, dynamically created strings?"
"Exactly! How did you know?"
"You've discovered Python's string interning," Margaret said. "It's like the integer cache, but more selective and sophisticated. Let's explore when Python decides two strings should be the same object in memory."
The Puzzle: Inconsistent String Identity
Timothy showed Margaret his confusing test results:
def demonstrate_string_identity_puzzle():
"""Sometimes strings share identity, sometimes they don't"""
# Short strings that look identical
a = "hello"
b = "hello"
print(f"'hello' is 'hello': {a is b}") # True - same object
# Longer strings that look identical
x = "hello world from python"
y = "hello world from python"
print(f"Long string is check: {x is y}") # True - compiled together
# But dynamically created strings...
s1 = "hello " + "world"
s2 = "hello " + "world"
print(f"Concatenated is check: {s1 is s2}") # False - different objects!
# Even though they're equal in value
print(f"Concatenated == check: {s1 == s2}") # True - same value
# Check memory addresses
print(f"\nMemory addresses:")
print(f"s1: {id(s1)}")
print(f"s2: {id(s2)}") # Different addresses!
demonstrate_string_identity_puzzle()
Output:
'hello' is 'hello': True
Long string is check: True
Concatenated is check: False
Concatenated == check: True
Memory addresses:
s1: 140234567890123
s2: 140234567890456
"See the inconsistency?" Timothy pointed. "Sometimes identical strings share identity, sometimes they don't."
String Interning: Python's Selective Optimization
Margaret sketched out the concept:
"""
String Interning: Python's strategy for reusing string objects
Unlike the integer cache (which caches ALL integers from -5 to 256),
string interning is SELECTIVE and follows specific rules:
AUTOMATICALLY INTERNED:
1. String literals that look like identifiers (letters, digits, underscores)
2. String literals seen at compile time
3. Very short strings (often single characters)
NOT AUTOMATICALLY INTERNED:
1. Strings created at runtime (concatenation, user input, file reading)
2. Strings with special characters (spaces, punctuation in some cases)
3. Very long strings
WHY SELECTIVE?
- Strings can be arbitrary length (unlike integers: fixed size)
- Not all strings are frequently reused
- Interning has overhead (hash table lookup, storage)
- Better to intern selectively based on likely reuse
"""
def explore_interning_rules():
"""Discover Python's string interning rules"""
# Rule 1: Identifier-like strings ARE interned
var1 = "hello"
var2 = "hello"
print(f"Identifier-like: {var1 is var2}") # True
# Rule 2: Strings with spaces - depends on context
str1 = "hello world"
str2 = "hello world"
print(f"With space (literals): {str1 is str2}") # True (compile-time)
# Rule 3: Runtime-created strings are NOT interned
str3 = "hello" + " " + "world"
str4 = "hello" + " " + "world"
print(f"Runtime created: {str3 is str4}") # False
# Rule 4: Strings from input/files are NOT interned
user_input = input("Type 'hello': ") # User types: hello
literal = "hello"
print(f"User input vs literal: {user_input is literal}") # False
# Note: Don't run explore_interning_rules() directly due to input()
The Implementation: How Interning Works
"Let me show you what's happening under the hood," Margaret said.
"""
Conceptual implementation of string interning:
# Python maintains a global intern dictionary
_INTERNED_STRINGS = {}
def intern_string(string_value):
'''Check if string should be interned'''
# Check if already interned
if string_value in _INTERNED_STRINGS:
return _INTERNED_STRINGS[string_value]
# Decide if this string qualifies for interning
if should_intern(string_value):
_INTERNED_STRINGS[string_value] = string_value
return _INTERNED_STRINGS[string_value]
# Don't intern - return new object
return create_new_string_object(string_value)
def should_intern(s):
'''Rules for automatic interning'''
# Compile-time string literals
# Identifier-like strings (alphanumeric + underscore)
# Short strings
return is_identifier_like(s) or is_compile_time_constant(s)
"""
def demonstrate_intern_table():
"""Show how the intern table works conceptually"""
import sys
# These strings are interned automatically
s1 = "python"
s2 = "python"
print(f"Automatic interning:")
print(f" s1 is s2: {s1 is s2}")
print(f" id(s1): {id(s1)}")
print(f" id(s2): {id(s2)}")
# These are NOT interned automatically
s3 = "hello " + "world"
s4 = "hello " + "world"
print(f"\nNo automatic interning:")
print(f" s3 is s4: {s3 is s4}")
print(f" id(s3): {id(s3)}")
print(f" id(s4): {id(s4)}")
demonstrate_intern_table()
Output:
Automatic interning:
s1 is s2: True
id(s1): 140234567890100
id(s2): 140234567890100
No automatic interning:
s3 is s4: False
id(s3): 140234567890200
id(s4): 140234567890300
Manual Interning with sys.intern()
"What if I want to intern a string that Python wouldn't automatically intern?" Timothy asked.
Margaret showed him the tool:
import sys
def demonstrate_manual_interning():
"""Use sys.intern() to force string interning"""
# Create strings that won't be automatically interned
str1 = "hello" + " " + "world"
str2 = "hello" + " " + "world"
print("Before manual interning:")
print(f" str1 is str2: {str1 is str2}") # False
print(f" str1 == str2: {str1 == str2}") # True
# Manually intern them
str1 = sys.intern(str1)
str2 = sys.intern(str2)
print("\nAfter manual interning:")
print(f" str1 is str2: {str1 is str2}") # True!
print(f" Same object now: {id(str1) == id(str2)}") # True
def when_to_use_manual_interning():
"""Real-world use case: comparing many strings repeatedly"""
import sys
import time
# Scenario: Web framework comparing route paths
routes = ["/" + str(i) for i in range(1000)]
# Without interning - slower comparisons
start = time.perf_counter()
for _ in range(10000):
for route in routes:
if route == "/500": # String comparison
pass
time_without = time.perf_counter() - start
# With interning - faster comparisons
interned_routes = [sys.intern(r) for r in routes]
search_target = sys.intern("/500")
start = time.perf_counter()
for _ in range(10000):
for route in interned_routes:
if route is search_target: # Identity check (fast!)
pass
time_with = time.perf_counter() - start
print(f"Comparison performance:")
print(f" Without interning: {time_without:.4f} seconds")
print(f" With interning: {time_with:.4f} seconds")
print(f" Speedup: {time_without / time_with:.2f}x")
demonstrate_manual_interning()
print("\n" + "="*50 + "\n")
when_to_use_manual_interning()
Output (approximate):
Before manual interning:
str1 is str2: False
str1 == str2: True
After manual interning:
str1 is str2: True!
Same object now: True
==================================================
Comparison performance:
Without interning: 0.2341 seconds
With interning: 0.0823 seconds
Speedup: 2.84x
Why String Comparison Matters
Timothy was intrigued. "So interned strings can use identity checks instead of value comparison?"
def explain_comparison_performance():
"""Why identity checks are faster than value comparison"""
# Identity check (is): O(1) - just compare memory addresses
a = "hello"
b = "hello"
result = a is b # Two pointer comparisons - instant
# Value comparison (==): O(n) - must compare each character
x = "hello" + " " + "world"
y = "hello" + " " + "world"
result = x == y # Compare: h==h, e==e, l==l, l==l, o==o, etc.
"""
For a string of length n:
- Identity check: O(1) - constant time
- Value comparison: O(n) - linear time
For short strings, the difference is tiny.
For long strings or millions of comparisons, it adds up!
"""
def demonstrate_performance_difference():
"""Show the performance difference"""
import time
# Create long strings
long_str1 = "x" * 10000
long_str2 = "x" * 10000
# Identity check (fast)
start = time.perf_counter()
for _ in range(1000000):
result = long_str1 is long_str2
identity_time = time.perf_counter() - start
# Value comparison (slower for long strings)
start = time.perf_counter()
for _ in range(1000000):
result = long_str1 == long_str2
equality_time = time.perf_counter() - start
print(f"1 million comparisons of 10,000-char strings:")
print(f" Identity (is): {identity_time:.4f} seconds")
print(f" Equality (==): {equality_time:.4f} seconds")
print(f" Speedup: {equality_time / identity_time:.2f}x")
demonstrate_performance_difference()
Compile-Time vs Runtime String Creation
Margaret explained a subtle but important distinction:
def demonstrate_compile_time_vs_runtime():
"""Compile-time strings vs runtime strings"""
# Compile-time: Python sees these when compiling the .py file
compile_time_1 = "hello world"
compile_time_2 = "hello world"
print(f"Compile-time literals: {compile_time_1 is compile_time_2}") # True
# Also compile-time: constant folding during compilation
compile_time_3 = "hello" + " " + "world" # Folded to "hello world" at compile time
print(f"Compile-time folding: {compile_time_1 is compile_time_3}") # Often True
# Runtime: created during program execution
def create_at_runtime():
return "hello" + " " + "world"
runtime_1 = create_at_runtime()
runtime_2 = create_at_runtime()
print(f"Runtime creation: {runtime_1 is runtime_2}") # False
# User input is always runtime
# file_content = open('file.txt').read() # Runtime
# api_response = requests.get(url).text # Runtime
demonstrate_compile_time_vs_runtime()
Output:
Compile-time literals: True
Compile-time folding: True
Runtime creation: False
The Identifier Rule
"Why does Python intern identifier-like strings?" Timothy asked.
def explain_identifier_interning():
"""Why Python interns identifier-like strings"""
"""
Identifiers = variable names, function names, class names, etc.
In Python code, these appear CONSTANTLY:
- Variable names in your code
- Dictionary keys in JSON/configs
- Attribute lookups (obj.attribute_name)
- Module/function names in imports
Example: In a typical Python program, you might have:
- Thousands of dictionary lookups
- Hundreds of attribute accesses per second
- Many string comparisons in routing, parsing, etc.
If these strings are interned, lookups become faster:
- Dictionary keys can use identity for fast comparison
- Attribute lookups benefit from interning
- Module name checks are faster
"""
# Simulate dictionary with many lookups
import sys
# Create dictionary with interned keys
config = {
sys.intern("database_host"): "localhost",
sys.intern("database_port"): 5432,
sys.intern("api_key"): "secret",
sys.intern("max_connections"): 100
}
# Lookup with interned key (fast path)
key = sys.intern("database_host")
value = config[key] # Identity check during lookup
print("Dictionary lookups benefit from key interning")
print(f" Key 'database_host' is interned: {key is list(config.keys())[0]}")
explain_identifier_interning()
Real-World Use Cases
Margaret showed Timothy where interning matters in production:
import sys
def use_case_1_web_routing():
"""Use case: Web framework route matching"""
class Router:
def __init__(self):
# Intern route patterns at startup
self.routes = {
sys.intern("/api/users"): "get_users",
sys.intern("/api/posts"): "get_posts",
sys.intern("/api/comments"): "get_comments",
}
def match_route(self, path):
# Intern incoming path for fast comparison
interned_path = sys.intern(path)
# Identity check is faster than string comparison
return self.routes.get(interned_path)
router = Router()
# In production, this happens thousands of times per second
handler = router.match_route("/api/users")
print(f"Route matched: {handler}")
def use_case_2_json_parsing():
"""Use case: Parsing JSON with repeated keys"""
import json
# JSON often has repeated key names across objects
json_data = """
[
{"name": "Alice", "age": 30, "city": "NYC"},
{"name": "Bob", "age": 25, "city": "LA"},
{"name": "Charlie", "age": 35, "city": "NYC"}
]
"""
# Parse JSON
data = json.loads(json_data)
# Keys "name", "age", "city" appear multiple times
# If interned, dictionary lookups are faster
for person in data:
# Each lookup benefits if keys are interned
name = person["name"]
age = person["age"]
city = person["city"]
def use_case_3_compiler_symbol_tables():
"""Use case: Compilers and interpreters"""
class SymbolTable:
"""Compilers use symbol tables with many string lookups"""
def __init__(self):
# Variable names in source code are interned
self.symbols = {}
def define(self, name, value):
# Intern symbol names for fast lookup
interned_name = sys.intern(name)
self.symbols[interned_name] = value
def lookup(self, name):
# Fast identity-based lookup
interned_name = sys.intern(name)
return self.symbols.get(interned_name)
# In a compiler, you might see the same variable names
# hundreds of times throughout a program
symbols = SymbolTable()
symbols.define("counter", 0)
symbols.define("result", None)
# These lookups happen frequently during compilation
val = symbols.lookup("counter")
use_case_1_web_routing()
print()
use_case_2_json_parsing()
print("JSON keys benefit from interning")
print()
use_case_3_compiler_symbol_tables()
print("Compiler symbol tables use interning")
The Danger: Misusing Identity Checks
Timothy remembered the lesson from integer caching. "So I should still use == for string comparison, right?"
def demonstrate_identity_pitfalls():
"""Why you should NOT use 'is' for string comparison"""
def buggy_string_check(user_input):
# ❌ WRONG - Don't use 'is' for string comparison!
if user_input is "admin":
return "Welcome, admin!"
return "Access denied"
# Test with literal (might work by accident)
result1 = buggy_string_check("admin")
print(f"Literal input: {result1}") # Might work
# Test with runtime-created string (will fail!)
username = input("Enter username: ") # User types: admin
result2 = buggy_string_check(username)
print(f"User input: {result2}") # FAILS! "Access denied"
# Correct implementation
def correct_string_check(user_input):
# ✓ CORRECT - Always use == for value comparison
if user_input == "admin":
return "Welcome, admin!"
return "Access denied"
def real_world_bug():
"""Real bug from production code"""
# Configuration loading
def load_config_buggy(config_string):
# ❌ BUG: Using 'is' for comparison
if config_string is "production":
return {"debug": False, "logging": "error"}
return {"debug": True, "logging": "debug"}
# This works (compile-time constant)
config1 = load_config_buggy("production")
print(f"Literal config: {config1}")
# This fails (runtime string from file/env)
env = "prod" + "uction" # Simulating runtime creation
config2 = load_config_buggy(env)
print(f"Runtime config: {config2}") # Wrong config loaded!
# Correct version
def load_config_correct(config_string):
# ✓ CORRECT
if config_string == "production":
return {"debug": False, "logging": "error"}
return {"debug": True, "logging": "debug"}
# Note: Don't run demonstrate_identity_pitfalls() directly due to input()
real_world_bug()
Memory Implications
Margaret showed Timothy the memory benefits:
import sys
def demonstrate_memory_savings():
"""Show memory savings from string interning"""
# Without interning: many duplicate strings
words = ["hello"] * 1000000
# Check if they're the same object
all_same = all(word is words[0] for word in words)
print(f"Million copies of 'hello':")
print(f" All point to same object: {all_same}")
print(f" Memory for one string: {sys.getsizeof('hello')} bytes")
print(f" Total memory if separate: {sys.getsizeof('hello') * 1000000:,} bytes")
print(f" Actual memory (interned): ~{sys.getsizeof('hello'):,} bytes")
print(f" Memory saved: ~{(sys.getsizeof('hello') * 1000000) - sys.getsizeof('hello'):,} bytes")
# Runtime-created strings don't get this benefit
runtime_words = ["hel" + "lo" for _ in range(1000)]
all_same_runtime = all(word is runtime_words[0] for word in runtime_words)
print(f"\nThousand runtime-created 'hello' strings:")
print(f" All point to same object: {all_same_runtime}")
demonstrate_memory_savings()
Output:
Million copies of 'hello':
All point to same object: True
Memory for one string: 54 bytes
Total memory if separate: 54,000,000 bytes
Actual memory (interned): ~54 bytes
Memory saved: ~53,999,946 bytes
Thousand runtime-created 'hello' strings:
All point to same object: False
When NOT to Intern
"Should I intern all my strings?" Timothy asked.
def when_not_to_intern():
"""
DON'T INTERN:
1. Large strings (waste of memory in intern table)
2. Rarely-used strings (no performance benefit)
3. Temporary strings (short-lived, not reused)
4. User-generated content (unpredictable, potentially huge)
5. Strings that will be modified (can't modify, so why intern?)
Interning has overhead:
- Hash table lookup cost
- Memory in the intern table
- The string lives forever (can't be garbage collected)
"""
import sys
# ❌ DON'T: Intern large strings
large_text = "x" * 1000000
# sys.intern(large_text) # Bad idea - wastes memory
# ❌ DON'T: Intern unique/temporary strings
for i in range(10000):
# Don't intern - each is unique and temporary
temp = f"temp_string_{i}"
# sys.intern(temp) # Bad idea - no reuse benefit
# ✓ DO: Intern small, frequently-reused strings
status_codes = ["OK", "ERROR", "WARNING", "INFO"]
interned_codes = [sys.intern(code) for code in status_codes]
# ✓ DO: Intern dictionary keys that appear repeatedly
config_keys = ["host", "port", "username", "password"]
interned_keys = [sys.intern(key) for key in config_keys]
when_not_to_intern()
Testing String Interning
Margaret showed Timothy how to test interning behavior:
import sys
import pytest
def test_automatic_interning():
"""Test that identifier-like strings are interned"""
# Identifier-like strings should be interned
s1 = "python"
s2 = "python"
assert s1 is s2 # Same object
# Single characters usually interned
c1 = "a"
c2 = "a"
assert c1 is c2
def test_no_automatic_interning_for_runtime():
"""Test that runtime strings are NOT interned"""
# Runtime concatenation
s1 = "hel" + "lo"
s2 = "hel" + "lo"
assert s1 == s2 # Same value
assert s1 is not s2 # Different objects
def test_manual_interning():
"""Test sys.intern() forces interning"""
# Create non-interned strings
s1 = "hello " + "world"
s2 = "hello " + "world"
assert s1 is not s2
# Manually intern
s1 = sys.intern(s1)
s2 = sys.intern(s2)
assert s1 is s2 # Now same object!
def test_intern_is_idempotent():
"""Test that interning same string multiple times is safe"""
s = "test"
s1 = sys.intern(s)
s2 = sys.intern(s)
s3 = sys.intern(s1)
assert s1 is s2 is s3 # All the same object
def test_intern_with_equality():
"""Test that interned strings still compare correctly"""
s1 = sys.intern("hello")
s2 = sys.intern("hello")
assert s1 == s2 # Value equality
assert s1 is s2 # Identity
# Run with: pytest test_string_interning.py -v
The Library Metaphor
Margaret brought it back to the library:
"Think of string interning like the library's reference section policy," she said.
"For commonly-referenced books - dictionaries, encyclopedias, style guides - we keep permanent copies that everyone shares. When someone needs 'The Python Style Guide,' they don't get their own copy. They get access to the shared reference copy that lives permanently in the reference section.
"Python does the same with identifier-like strings and compile-time constants. Strings like 'name', 'value', 'error' appear constantly in programs, so Python keeps one shared copy rather than creating duplicates.
"But when someone requests an obscure book title - especially one we've never seen before, like a user-generated title from a form submission - we create a temporary copy just for them. If someone else requests that exact obscure title later, we create another separate copy, because the cost of tracking and checking every possible string ever created would be too expensive.
"You can manually 'add to the reference section' using sys.intern(), telling Python 'this string will be used repeatedly, please keep one shared copy.' But use this wisely - you don't want to fill your reference section with strings nobody will ever look up again!"
Common Misconceptions
Timothy compiled a list:
"""
STRING INTERNING MYTHS vs REALITY:
MYTH: "All identical strings are the same object"
REALITY: Only identifier-like strings and compile-time constants
MYTH: "String interning works like integer caching"
REALITY: Much more selective - depends on context and content
MYTH: "I should use 'is' to compare interned strings"
REALITY: Always use '==' for value comparison
MYTH: "sys.intern() makes string comparison faster"
REALITY: Only if you're doing MANY comparisons of the SAME strings
MYTH: "All strings under 20 characters are interned"
REALITY: Length doesn't determine interning - content and context do
MYTH: "Interning is free and always beneficial"
REALITY: Has overhead - only beneficial for frequently-reused strings
MYTH: "Python 2 and Python 3 intern the same way"
REALITY: Python 3 is more conservative about automatic interning
"""
Performance Optimization Pattern
Margaret showed a real optimization:
import sys
import time
class OptimizedRouter:
"""Web router using string interning for fast path matching"""
def __init__(self, routes):
# Intern all route patterns at initialization
self.routes = {
sys.intern(path): handler
for path, handler in routes.items()
}
print(f"Initialized with {len(self.routes)} interned routes")
def match(self, path):
"""Match incoming path to handler"""
# Intern the incoming path
interned_path = sys.intern(path)
# Dictionary lookup with interned key is fast
# (identity check first, then hash comparison if needed)
return self.routes.get(interned_path, "not_found")
def benchmark_routing():
"""Benchmark interned vs non-interned routing"""
# Create routes
routes = {f"/api/endpoint{i}": f"handler{i}" for i in range(100)}
# Test with non-interned router
simple_router = dict(routes)
start = time.perf_counter()
for _ in range(100000):
path = "/api/endpoint50"
handler = simple_router.get(path)
time_simple = time.perf_counter() - start
# Test with interned router
optimized_router = OptimizedRouter(routes)
start = time.perf_counter()
for _ in range(100000):
path = "/api/endpoint50"
handler = optimized_router.match(path)
time_optimized = time.perf_counter() - start
print(f"\n100,000 route lookups:")
print(f" Without interning: {time_simple:.4f} seconds")
print(f" With interning: {time_optimized:.4f} seconds")
print(f" Speedup: {time_simple / time_optimized:.2f}x")
benchmark_routing()
Key Takeaways
Margaret summarized the lesson:
"""
STRING INTERNING KEY TAKEAWAYS:
1. String interning is selective, not universal
- Identifier-like strings (alphanumeric + underscore)
- Compile-time string literals
- NOT runtime-created strings
2. Use sys.intern() for manual interning
- When you have frequently-reused strings
- Dictionary keys used repeatedly
- String comparisons in hot paths
- But only for small, frequently-used strings
3. Identity vs Equality (again!)
- ALWAYS use '==' for string value comparison
- NEVER use 'is' for string comparison
- Interning is an optimization, not a guarantee
4. Performance benefits:
- Identity checks (O(1)) vs value comparison (O(n))
- Faster dictionary lookups with interned keys
- Memory savings when many references to same string
5. When to intern manually:
- Web routing with repeated paths
- JSON parsing with repeated keys
- Compiler symbol tables
- Configuration keys used throughout app
6. When NOT to intern:
- Large strings (memory waste)
- Unique/temporary strings (no reuse)
- User input (unpredictable)
- Strings that live forever (can't be GC'd once interned)
7. Interning has overhead:
- Hash table lookup
- Memory in intern table
- Object lives forever in memory
- Only beneficial when strings are reused many times
8. Real-world impact:
- Modest performance gains (2-3x for hot paths)
- Significant memory savings for repeated strings
- Critical for performance in parsers, compilers, web frameworks
"""
Timothy nodded, understanding. "So string interning is like integer caching, but smarter and more selective. Python interns what's likely to be reused - identifiers and compile-time strings - but leaves runtime strings alone unless I explicitly intern them with sys.intern()."
"Perfect," Margaret said. "And the golden rule remains: use == for value comparison, never is. Interning is an optimization detail, not something to rely on in your comparison logic."
With that, Timothy understood how Python optimizes string memory and when to take control with manual interning for performance-critical code.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
.jpeg)

Comments
Post a Comment