(With thanks to Seth Larson for taking me down this rabbit hole.)
I always assumed that Python's str.splitlines() split strings by "universal newlines", i.e., \n, \r, and \r\n.
But it turns out it does a lot more than that. From the docs:
This method splits on the following line boundaries. In particular, the boundaries are a superset of universal newlines.

Representation    Description
\n                Line Feed
\r                Carriage Return
\r\n              Carriage Return + Line Feed
\v or \x0b        Line Tabulation
\f or \x0c        Form Feed
\x1c              File Separator
\x1d              Group Separator
\x1e              Record Separator
\x85              Next Line (C1 Control Code)
\u2028            Line Separator
\u2029            Paragraph Separator
This results in some surprising (to me) splitting behavior:
>>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
>>> s.splitlines()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
Whereas I would have expected:
["line1", "line2", "line3", "line4\vline5\x1dhello"]
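For anyone who wants the narrower behavior, here is a minimal sketch that splits only on \n, \r, and \r\n. The helper name split_universal_newlines is hypothetical (it is not part of the standard library), and the handling of a trailing newline is an assumption chosen to mirror what str.splitlines() does.

```python
import re

# Match \r\n first so a CRLF pair counts as a single boundary.
_UNIVERSAL_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_universal_newlines(text: str) -> list[str]:
    """Split only on \\n, \\r, and \\r\\n (hypothetical helper, not a stdlib API)."""
    lines = _UNIVERSAL_NEWLINE.split(text)
    # Assumption: mirror str.splitlines(), which does not emit a final
    # empty string when the text is empty or ends with a line boundary.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
result = split_universal_newlines(s)
```

With this sketch, \v and \x1d are left alone, so the final element of result keeps "line4", "line5", and "hello" together in one string.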
This was a good periodic reminder that Unicode does not mean "printable," and that there are still plenty of ecosystems that assign semantics to C0 and C1 control codes.