How to develop libraries and programs with a human-friendly API
At a prompt one writes commands; they execute and are forgotten. This is the most common way of using bash, but it can be done in Python as well, once one is sufficiently well versed in it, using smart shells like IPython.
A script can be seen as a way of repeating a series of commands. It has a very well defined scope and does not need configuration, option management, and so on. Scripts are mostly run just once to perform a specific job, and that's it.
A library is an organized collection of functions, routines and objects designed to be used by someone other than the original writer. This is where good programming practices start to matter seriously. A library needs an API (Application Programming Interface) that defines what one can do with the library and how.
Library design is one of the main reasons for object oriented programming: it allows one to design consistent, well structured interfaces that help users obtain what they want and guide them down the right path, away from bad practices.
A program is a higher level interaction: the user does not need to write code, aside from perhaps some configuration files. It accomplishes a specific goal, and might be run very often, or even continuously, by the user.
Programs are often classified based on their interface (even if these are often just different ways of interacting with the same underlying commands): for example Command Line Interfaces (CLI), Text User Interfaces (TUI), Graphical User Interfaces (GUI) and web interfaces.
There are more exotic ones, such as Voice User Interfaces (Siri), Tangible User Interfaces (joysticks and buttons) and so on, but they are out of the scope of this discussion.
A framework is basically a program (or a set of libraries that can be compiled as such) that can be configured by writing code, typically in the form of classes.
Frameworks are basically the opposite of a library: your program calls the library, while a framework calls the code you wrote.
A very well known framework in Python is Django, a web framework, where the user writes the code for database access and display, and the framework runs by calling that user code.
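To make the inversion of control concrete, here is a minimal sketch (EchoHandler and run_framework are invented for illustration) contrasting the two directions of calling:

# with a library, OUR code drives the execution and calls into it:
import math
print(math.sqrt(2))

# with a framework, the framework drives the execution and calls OUR code:
class EchoHandler:
    # class we write to configure a hypothetical framework
    def handle(self, request):
        return f"you asked for {request}"

def run_framework(handler):
    # stand-in for the framework's event loop: it calls our code
    for request in ["/home", "/about"]:
        print(handler.handle(request))

run_framework(EchoHandler())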
In Python you've been using objects already, without knowing what they are.
Literally everything is an object in python.
And I mean LITERALLY
Today we will talk about Object Oriented Programming (OOP), with particular focus on Python's approach to it.
The advantage of OOP is that it allows us to write beautiful and robust libraries and programs.
If done well.
It requires a specific way of thinking about your program, and it needs some time to be internalized.
OOP is a great programming paradigm for implementing libraries that are easy to explore and to program with. This (when done properly) allows us to write simpler and better programs.
It allows us to write nice, readable code.
This is true both when creating a new library and when wrapping: taking another library and creating a nice interface to it.
Starting from a very abstract perspective, an object is a black box with which we can communicate, giving it commands to perform and asking it questions (think of the routines and functions we talked about previously).
How the object works internally should not interest us (in an ideal world); we just care about how to communicate with it.
Take for example Python's list.
We know it stores things inside it, and we can ask to retrieve them, add and remove them, put them in order and so forth.
How does a list work under the hood? I don't have to care
We might choose to use a list or an array based on their properties, but we (almost) never need to know how they function in order to use them.
A good object should be self-sufficient (have high cohesion) and not depend on specific other objects (have low coupling).
In OOP one of the most common concepts is the idea of a specific object being referred to as an instance of a class, which implements several interfaces.
Rather than trying to describe them in abstract terms, I'll give you an example.
Consider the ideas of:
|            | producer       | vehicle |
|---|---|---|
| animal     | cow            | horse   |
| mechanical | coffee machine | car     |
|             | indexable | non-indexable |
|---|---|---|
| mutable     | list      | set           |
| non-mutable | string    | int           |
To interact with objects, we communicate with methods.
Methods follow the same logic as normal functions, and they can behave like routines (change the state of the object and return nothing) or like functions (return something without changing the state of the object).
The syntax for calling a method is:
<instance>.<method>(<parameters>)
for example, for lists:
a = [1, 2, 3]
a.append(4) # add a new element to the list
a.index(3) # return the index of the element, without changing the object
As users we are interested in an object only for its behavior, so ideally we only approach it through its methods; but to do something useful, an object needs a way to store some internal state.
This internal state is represented by the object's attributes.
Attribute management is one point where Python tends to differ from most other programming languages (that I know of), but we'll talk about that later.
When we pass objects around, what we care about is some idea of how we can interact with them.
the function:
def add(a, b):
    return a + b
assumes that a and b know how to be added together, but does not make assumptions on what they are.
It can work with integers, floating point numbers, strings, numpy arrays, etc.
As long as they implement the + interface, they are fine to work with.

def head(sequence):
    return sequence[0]

In this case we only care that our object can be indexed with numbers: lists, numpy arrays, strings, and even dictionaries (with the right keys) are all fine.
The idea is that when I define a function, I'm defining the interface I expect from my objects.
In some languages I do that explicitly (C++, Java, etc.), in others I do it implicitly (Python, R, etc.), but the idea is the same. Often this is done by fixing the class that the function accepts, but this is generally incorrect: we should be fixing the interface we want our objects to provide.
We should try to be as general as possible in the definition of the interface we require: if we only need to iterate over an object, we should just state that it should be an Iterable, not a list; otherwise we would be limiting ourselves for no reason.
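For example, using a type hint (purely informative here, not enforced at runtime) we can state the interface rather than a concrete class; a small sketch:

from typing import Iterable

def total(numbers: Iterable[float]) -> float:
    # works with any iterable of numbers, not just lists
    result = 0.0
    for n in numbers:
        result += n
    return result

print(total([1.0, 2.0]))               # a list works...
print(total((1.0, 2.0)))               # ...so does a tuple...
print(total(x / 2 for x in range(4)))  # ...and even a generator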
The substitution principle states that once I've defined the interface, any object that follows that interface should be accepted (the original formulation refers to inheritance, but don't worry about that for now).
rule number 1:
don't start writing the code to implement the features, write the code you want to be able to write!
Let's say that we want to implement a class to represent and manage some kind of object.
To do this, the best starting point is to try and write the code that we would like to use with that class, and then proceed to implement it.
This has two advantages:
When designing classes, think of the unix principles:
Previous years' lectures used seaborn's distplot for this demonstration, but that function has been deprecated and is not suitable for use anymore; don't use that code if you find it!
For our simulations we might want to have a timer/logger of events, but without the overhead of printing to screen during execution, to avoid slowing down the computation!
>>> timer = Timer()
>>> timer.tick("event")
>>> timer.log()
'event' at time xx:xx
import time

class Timer:
    def __init__(self):
        self._ticks: dict[str, float] = {}

    def tick(self, event_name: str):
        self._ticks[event_name] = time.time()

    def log(self):
        for name, epoch in self._ticks.items():
            print(repr(name), "happened at", epoch)
timer = Timer()
timer.tick("hello, world!")
timer.log()
'hello, world!' happened at 1615649618.2291648
how do we deal with duplicated events?
>>> timer = Timer()
>>> timer.tick("event")
>>> timer.tick("event")
>>> timer.log()
'event' at time xx:xx and time xx:xx
import time
from collections import defaultdict

class Timer:
    def __init__(self):
        self._ticks: dict[str, list[float]] = defaultdict(list)

    def tick(self, event_name: str):
        self._ticks[event_name].append(time.time())

    def log(self):
        for name, epochs in self._ticks.items():
            joined_str = " and ".join(str(e) for e in epochs)
            print(repr(name), "happened at", joined_str)
timer = Timer()
timer.tick("hello, world!")
timer.tick("hello, world!")
timer.log()
'hello, world!' happened at 1615649833.8234465 and 1615649833.8235137
We could decide to filter the events we want to see:
>>> timer = Timer()
>>> timer.tick("fun1/event")
>>> timer.tick("fun2/event")
>>> timer.log("fun1/")
'fun1/event' at time xx:xx
import time
from collections import defaultdict

class Timer:
    def __init__(self):
        self._ticks: dict[str, list[float]] = defaultdict(list)

    def tick(self, event_name: str):
        self._ticks[event_name].append(time.time())

    def log(self, filter=""):
        for name, epochs in self._ticks.items():
            if filter not in name:
                continue
            joined_str = " and ".join(str(e) for e in epochs)
            print(repr(name), "happened at", joined_str)
timer = Timer()
timer.tick("fun1/event")
timer.tick("fun2/event")
timer.log()
print("-"*60)
timer.log("fun1/")
'fun1/event' happened at 1615651954.5938675
'fun2/event' happened at 1615651954.5939393
------------------------------------------------------------
'fun1/event' happened at 1615651954.5938675
scikit-learn is the main machine learning library in Python. For our current discussion, we can say that sklearn is based on two pillars:
- objects that learn from and act on data (Transformers, Classifiers, Regressors)
- objects that combine them (Pipelines and Unions)

Sklearn employs a very simple API for the first group, and leverages the second group to hierarchically build the whole analysis pipeline.
This allowed anyone to write classes compatible with all the other sklearn-compatible classes, driving an explosion of methods and libraries that all work together nicely.
These good API choices almost single-handedly brought Python to the attention of the machine learning community.
The basic idea of Classifiers, Regressors and Transformers is that all of them implement a compatible interface:
- Transformers implement a method fit and a method transform
- Classifiers and Regressors implement fit and predict
The idea is that with the fit method these objects "learn" from the data; then they are applied to (ideally) new data, to predict the expected value or to transform it.
The real magic bits are pipelines and unions.
The fact that both unions and pipelines expose the same interface as the objects they wrap means that they can be used inside other unions and pipelines, creating a full data-flow structure that behaves correctly (for example with respect to data leakage).
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import numpy as np
import numpy.random as rn
rn.seed(42) # fix the random seed to have replicability
data = rn.randn(200, 2) # 200 bidimensional data points
y = (
+ 1.0 * data[:, 0] # first component, weight 1
+ 0.5 * data[:, 1] # second component, weight 0.5
+ rn.randn(len(data)) # noise component
)
# this is the data that we want to do the prediction about
new_data = rn.randn(10, 2)
sc = StandardScaler()
sc.fit(data)  # estimate the parameters for the standardization
print(sc.transform(data)[:5, :]) # apply the standardization to the OLD data
# apply the standardization to the NEW data with the estimates of the old
print(sc.transform(new_data)[:5, :])
[[ 0.51326213 -0.18391283]
 [ 0.67128389  1.542076  ]
 [-0.25172177 -0.28351899]
 [ 1.64629106  0.75705623]
 [-0.4980274   0.5234244 ]]
[[ 0.78568602 -0.99834079]
 [ 0.90356022  1.36816536]
 [ 0.42609546  1.90961842]
 [-0.81654741 -1.33338866]
 [-1.86838793  1.51403957]]
lr = LinearRegression()
lr.fit(data, y) # fit the regressor on the OLD data
lr.predict(new_data)[:5] # use the learned parameter to predict the NEW ones
array([ 0.21489663, 1.58054749, 1.37919105, -1.59348275, -1.16617208])
lr.coef_
array([1.06614468, 0.54683588])
pipe = make_pipeline(
StandardScaler(), # scale the data
LinearRegression(), # perform the linear regression
)
pipe.fit(data, y) # fit the linear regressor
pipe.predict(new_data) # predict for the new values
array([ 0.21489663,  1.58054749,  1.37919105, -1.59348275, -1.16617208,
        0.57936311, -0.4048604 ,  2.59029857,  0.42561625,  0.54736359])
Can we make a class that can blend in with the native sklearn ones?
Quite easily! (at least a basic version, we'll see a more advanced one later)
We just need to implement the same interface: fit and predict (for a predictor; transform otherwise).
class AvgPredictor:
    def fit(self, X, y):
        self.avg = np.mean(y)
        return self  # fluent interface, debatable

    def predict(self, X):
        return np.ones(len(X)) * self.avg
pipe = make_pipeline(
StandardScaler(),
AvgPredictor(),
)
pipe.fit(data, y)
pipe.predict(new_data)
array([-0.05993531, -0.05993531, -0.05993531, -0.05993531, -0.05993531,
       -0.05993531, -0.05993531, -0.05993531, -0.05993531, -0.05993531])
To fully understand how to implement good quality OOP in Python, we have to discuss what an object is and how it works internally.
This is contrary to how we usually approach topics, but the fact is that in Python one does not really need to write objects unless they are trying to write a nice library; and to do that effectively, one needs a basic understanding of what is possible with them.
Objects are usually used to represent two things:
- things where the important information is given by the type of the object (for example, exceptions)
- interfaces for the user to interact with something, where it doesn't really matter what the object is, but only how the object responds to messages
In Python, the concept of OOP follows some basic principles:
OOP in Python is quite different from OOP in languages like C++ or Java.
On the surface they look similar, and all the traditional OOP patterns and constructs can be applied in Python.
Once you get confident with the Python approach to OOP, it will become clear that the two are radically different in the relationship between classes and instances, and in how that affects the best approach to solving problems.
To make it more explicit:
For this lesson we will use a library called pdir, which replaces the traditional dir function with one that allows more control over what gets shown.
This will allow us to explore in more detail what is going on under the hood of an object. This library is not necessary in any way to develop objects; we'll just use it to explore them.
conda install pdir2
or
pip install -U pdir2
import pdir as dir
For space constraints, I will skip writing docstrings for all the classes, methods and functions that we will discuss in this lesson.
This is of course completely against good practices, but it is necessary to keep the discussion short.
Always write docstrings in real code!
When we want to do object oriented programming, we usually start by creating a new class.
This is done using the class reserved keyword, in a similar fashion to how we define a function using def.
class MyClassName:
    <class body>
Once we have our class, we can instantiate this class in a specific object.
In their simplest interpretation, classes represent the platonic idea of how an object should behave, while instances are the actual objects that we can interact with.
For example, a class could be the concept of Dog
, describing how a dog would move, bark and so on.
Then we can instantiate this class in a specific dog, for example spotty
, by calling the name of the class as if it was a function.
spotty = Dog()
The minimal class is one that does nothing and has nothing to show for it.

class Empty:
    """this class does not do anything.

    it's just a stub for explanations, and this is its docstring"""
some_object = Empty()
Once we instantiate it, we can determine that this object belongs to that class.
type(some_object)
__main__.Empty
isinstance(some_object, Empty)
True
help(some_object)
Help on Empty in module __main__ object:

class Empty(builtins.object)
 |  this class does not do anything.
 |
 |  it's just a stub for explanations, and this is its docstring
 |
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
That was a quite boring class and object, but it's our starting point.
Once we have an instance of an object, we can store and retrieve attributes in that object.
These represent the data stored inside the object. Having them inside the object allows us to keep them all together and give them some form of identity.
by default, an object does not have any attributes
dir(some_object).public
Or better: there already are a lot, but they are the underlying machinery that makes classes work, and we will see what (some) of these mean.
dir(some_object)
special attribute: __class__, __dict__, __doc__, __module__, __weakref__
abstract class: __subclasshook__
object customization: __format__, __hash__, __init__, __new__, __repr__, __sizeof__, __str__, __str__
rich comparison: __eq__, __ge__, __gt__, __le__, __lt__, __ne__
attribute access: __delattr__, __dir__, __getattribute__, __setattr__
class customization: __init_subclass__
pickle: __reduce__, __reduce_ex__
class Empty:
    pass
namespace_0 = Empty()
namespace_1 = Empty()
namespace_0.a = 2
namespace_0.a
2
namespace_1.a
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-134-e8dada81b736> in <module>
----> 1 namespace_1.a

AttributeError: 'Empty' object has no attribute 'a'
dir(namespace_0).public.properties
property: a
The public attributes are stored internally as a dictionary that we can poke and probe.
It is literally a normal dictionary.
namespace_0.__dict__
{'a': 2}
namespace_1.__dict__
{}
Attribute management is normally done using the form object.attribute, but it can also be done programmatically using 4 different functions.
They all treat the attribute name as a string:
- getattr: retrieve the chosen attribute (given as a string)
- setattr: set the given attribute
- delattr: delete an attribute
- hasattr: check if the object has the given attribute (employs getattr)

hasattr(namespace_0, 'a')
True
hasattr(namespace_1, 'a')
False
setattr(namespace_1, 'a', 4)
getattr(namespace_1, 'a')
4
getattr also supports a default parameter, which is returned if the object does not have the given attribute (instead of raising an AttributeError).
getattr(namespace_1, 'b', 'default_value')
'default_value'
The __class__ attribute

One of the special attributes of an instance is the __class__ attribute: it stores a reference to the class that the instance belongs to.
namespace_0.__class__
__main__.Empty
At this point the class has been pretty much useless: all the work has been done by the instances.
The advantage of a class is the ability to store shared attributes across all instances.
This ensures that all the instances have access to the same information, and this is basically what establishes their behavior.
class Something:
    b = 3
namespace_0 = Something()
namespace_1 = Something()
If we check the object's __dict__, it is empty, but we can access the attribute anyway!
namespace_0.__dict__
{}
namespace_0.b
3
How does the attribute b get exposed to the instance?
This is done using the so-called Method Resolution Order (MRO).
When we ask an instance for an attribute, the following happens:
1. the instance's own __dict__ is searched first
2. then the classes in the MRO are searched, in order
3. if nothing is found, an AttributeError is raised
Something.mro()
[__main__.Something, object]
This means that we can set an attribute on an instance, and this will shadow the same attribute derived from the class.
We will see this behavior again when we will discuss class inheritance.
dir(Something).public
property: b
namespace_0.b = 5
namespace_0.b
5
namespace_1.b
3
del namespace_0.b
namespace_0.b
3
A useful piece of information that you can store in your class is its version.
A canonical way to represent changes in your code is to use semantic versioning.
One can use the same process to check the class version of an object. For libraries that create and destroy objects inside a single process it might not be relevant, but if you want to store objects long term, keeping track of the version of the class might be useful!
The common approach is to add a __version__ attribute to the class, filled with a (major, minor, fix) tuple, where all of those numbers are stored as integers.
class MyClass:
    __version__ = (0, 1, 0)
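As a sketch of why this can be useful: if we copy the class version into the instance at creation time, an object stored long term (e.g. pickled) remembers the version it was created with, and we can compare it with the current class version when loading it back (the "same major version" rule below is just one possible convention):

class MyClass:
    __version__ = (0, 1, 0)

    def __init__(self):
        # copy the class version into the instance, so a stored
        # object remembers the version it was created with
        self.created_version = self.__version__

def is_compatible(obj):
    # one possible convention: same major version means compatible
    return obj.created_version[0] == type(obj).__version__[0]

item = MyClass()
assert is_compatible(item)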
Methods are functions that are defined inside the class body.
They are (almost) the same as any other attribute.
The common trait of all methods is that the first argument, usually called self, is used to refer to the instance calling the method, allowing access to its attributes from inside the function.
class Something:
    a = 2

    def multiply(self, b=2):
        return self.a * b
namespace_0 = Something()
namespace_0.multiply
<bound method Something.multiply of <__main__.Something object at 0x7fbf46e70ba8>>
Note that I don't need to specify the instance as the parameter self.
This is because there is a bit of magic when methods are accessed as attributes: they are bound to the instance.
This means that we do not need to specify the instance: it gets automatically provided to the function.
namespace_0.multiply(b=2)
4
If we want, we can access the unbound method directly on the class: in this case we have to explicitly provide the instance on which the function is supposed to act.
Something.multiply(namespace_0, b=3)
6
Note: we are not bound to apply this function to an instance of the same class, but only to instances that behave similarly
class Empty:
    a = 3
weird = Empty()
Something.multiply(weird, b=3)
9
Note that, due to the substitution principle, calling obj.method(*args) and Class.method(obj, *args) is not the same!
In a function that takes a certain Class, we might receive a subclass that implements a variant of that method, and if we use Class.method(obj) we are forcing it to use the parent class' version.
If we want to retain the "functional" look while ensuring we call the right method, we can use the methodcaller factory from the operator module.
from operator import methodcaller
obj_set = set([1, 2])
updater = methodcaller("update", [3])
updater(obj_set)
print(obj_set)
{1, 2, 3}
obj_dict = dict(a=1, b=2)
updater = methodcaller("update", [('c', 3)])
updater(obj_dict)
print(obj_dict)
{'a': 1, 'b': 2, 'c': 3}
If you don't like the two steps of creating it and executing it, you can define an apply function:
# wrapping
def apply(fname, obj, *args, **kwargs):
    return methodcaller(fname, *args, **kwargs)(obj)
obj_dict = dict(a=1, b=2)
apply("update", obj_dict, [('c', 3)])
print(obj_dict)
{'a': 1, 'b': 2, 'c': 3}
An even more advanced option is the following... I know it looks weird, but by the end of the day you will be able to understand what is going on, don't worry!
from functools import partial

class DeferredMethodExecution:
    def __getattr__(self, fname):
        return partial(apply, fname)
do = DeferredMethodExecution()
# actual use
obj_dict = dict(a=1, b=2)
do.update(obj_dict, [('c', 3)])
print(obj_dict)
{'a': 1, 'b': 2, 'c': 3}
Functions can be added as methods to a class at any moment.
This is often referred to as monkey patching, and it is a powerful yet dangerous technique.
One can do it in their own script (maybe to allow a library to act on a class it was not designed for), and that is perfectly ok as long as you know what you're doing.
There are libraries that work by patching other libraries to extend their capabilities: several do this on pandas objects, extending them with new functionality.
Doing this in libraries that will be used by other libraries is not good practice, as it can lead to undocumented behavior for third-party users.

def triple_function(self):
    return self.a * 3
Something.triple = triple_function
namespace_0 = Something()
namespace_0.triple()
6
One could technically patch a single instance with a method, but it's not usually a good or useful practice.
To do this we need to explicitly bind the instance to the method.

from types import MethodType

def external_function(self):
    return self.a * 2
namespace = Something()
namespace.func = MethodType(external_function, namespace)
namespace.func()
4
The next step that we might take is to allow the class to "personalize" the instantiation process.
This is done using the magic method __init__.
__init__ gets automatically called after the instance is created, and is typically used to assign values to some instance attributes, based on arguments given at creation time.
Aside from this automatic call, it is a completely normal method.

class Something:
    def __init__(self, a):
        self.a = a
namespace = Something(a=3)
namespace.a
3
In recent versions of Python (3.7 natively, 3.6 with a backport), there is a simplified way to define instantiation methods, using the dataclass decorator.
It takes variable annotations and automatically generates the corresponding __init__ method, dramatically reducing boilerplate code.
This is a simplified version of a more general library, called attrs.

from dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand
item = InventoryItem('hammers', 10.0, 12)
print(item.total_cost())
120.0
That code will generate the following constructor under the hood:
def __init__(self, name: str, unit_price: float, quantity_on_hand: int = 0):
    self.name = name
    self.unit_price = unit_price
    self.quantity_on_hand = quantity_on_hand
we would have repeated each variable name 3 times without dataclasses!
that's a good reduction in boilerplate!
Python relies heavily on magic methods: methods that get called when we try to perform some operation on an object.
All these functions are characterized by the presence of dunder names: __magic_func__
The __init__ method is one of these magic methods.
Every mathematical operator is linked to a special method, and so are several special functions such as len.
These methods are the Python equivalent of operator overloading, and they are what allows more advanced forms of polymorphism.
For example, the addition operator + is linked to the magic method __add__, so if we want our class to be able to perform addition, it needs to implement it.
my_object + something
becomes:
my_object.__add__(something)
@dataclass
class Something:
    a: int

    def __add__(self, other):
        return self.__class__(self.a + other)
namespace_0 = Something(a=3)
print(namespace_0.a)
namespace_1 = namespace_0 + 2
print(namespace_1.a)
3
5
Note that if we write the sum in the other order, and the other object does not know how to handle it, it will crash with a TypeError.
namespace_2 = 2 + namespace_0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-193-6ca245b48fe2> in <module>
----> 1 namespace_2 = 2 + namespace_0

TypeError: unsupported operand type(s) for +: 'int' and 'Something'
The solution is to add a second magic method, __radd__, which is called if the first term does not know how to handle the given class.
To signal that it does not know how to handle the sum, the first addend should return NotImplemented (a singleton) instead of an actual result.
If the addition is symmetrical for the object (it doesn't have to be!), one can avoid reimplementing __radd__ by simply writing:
__radd__ = __add__
@dataclass
class Something:
    a: int

    def __add__(self, other):
        if not isinstance(other, int):
            return NotImplemented
        return self.__class__(self.a + other)

    __radd__ = __add__
namespace_0 = Something(a=3)
namespace_2 = 2 + namespace_0
namespace_2.a
5
namespace_2 + "a"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-240-52e8ac33a07f> in <module>
----> 1 namespace_2 + "a"

TypeError: unsupported operand type(s) for +: 'Something' and 'str'
When we try to obtain something out of an object, we (almost) always use the dot operator: <object>.<attribute>
This is a request to the object to try and give us the object with the name we asked for.
There is no actual need for that attribute to exist already: we could even generate it on the fly.
Even when we try to obtain a method, we are asking for a function, and then we call it.
This is what makes it possible to create very clever APIs with Python, but it is also why it's so slow.
When we execute code such as:
a = [1, 2]
a.append(3)
python has to do the following:
1. ask the instance a if it has the append attribute (it doesn't)
2. ask the class of a (the list class) if it has it

Once the attribute has been obtained, we try to perform the call of the function.
Calling a function is exactly the same as calling its __call__ method.
So we have to start all over: does append have a __call__ method?
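We can see that calling goes through __call__ by defining it ourselves; a minimal sketch with a made-up Greeter class:

class Greeter:
    def __call__(self, name):
        return f"hello {name}!"

greet = Greeter()
print(greet("world"))           # instances become callable
print(greet.__call__("world"))  # exactly equivalent, done by hand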
This is true for all operators as well, for example addition (the + operator).
When we run code such as:
a = np.array([1, 2])
b = 1 + a
python has to:
1. go to 1 and ask if it has an __add__ method (and its class, and so on, you know the drill)
2. if that returns NotImplemented, go to a and ask if it has __radd__ (the right-side add), and its class, and so on

We can leverage this dynamic attribute access using descriptors, or the syntactic sugar of the @property decorator.
A descriptor hides the following operations behind a normal attribute access: getting, setting and (optionally) deleting the value.

class Something:
    @property
    def myattr(self):
        return 1
namespace = Something()
namespace.myattr
1
The basic definition of a property does not include information on how to set the attribute, so if one tries to modify it, it crashes.
namespace.myattr = 2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-206-41b7a728e224> in <module>
----> 1 namespace.myattr = 2

AttributeError: can't set attribute
To allow this behavior I have to explicitly define the setter function, using the decorator @<propertyname>.setter on a function with the same name as the initial property.
This also allows us to do attribute checking and modification on the fly when setting.
class Something:
    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value.title()
namespace = Something()
namespace.name = "eNRICO"
namespace.name
'Enrico'
dir(namespace).own.public
descriptor: name: @property with getter, setter
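Under the hood, a property is just a descriptor: an object stored on the class that implements __get__ and (optionally) __set__. As a sketch (assuming Python 3.6+ for __set_name__), the same title-casing behavior written as an explicit descriptor:

class TitledName:
    # descriptor: title-cases on assignment, like the property above
    def __set_name__(self, owner, attr_name):
        self.storage = "_" + attr_name

    def __get__(self, instance, owner=None):
        return getattr(instance, self.storage)

    def __set__(self, instance, value):
        setattr(instance, self.storage, value.title())

class Something:
    name = TitledName()

namespace = Something()
namespace.name = "eNRICO"
print(namespace.name)  # 'Enrico'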
In python there is no such thing as a private attribute. The sentence that is commonly used in the community is:
we're all consenting adults
But, on the other hand, for some attributes it is common to ask the user not to modify them directly (like the _name attribute in the previous code).
By default, attributes and methods starting with a single underscore represent things that users of the class should not touch. They are not guaranteed to be a stable interface, and one risks breaking things by using them.
if you expose an attribute, expect the users to change it, plan accordingly!
One way of creating a more explicit private attribute (not really private, but more explicitly saying "don't use me") is to use getattr and setattr with strings that are invalid identifiers and thus cannot be reached with the dot operator.
This is a bit exotic and I don't suggest using it, but if you're curious there are some indications on how to do it in the notebook that underpins these slides.
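Just to give a taste of the trick (the details are in the notebook): a name containing a space is a perfectly valid dictionary key but an invalid identifier, so it cannot be reached with the dot operator:

class Empty:
    pass

obj = Empty()
setattr(obj, "do not touch", 42)     # invalid identifier: no dot access
# obj.do not touch                   # would be a SyntaxError
print(getattr(obj, "do not touch"))  # 42, only reachable this way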
A common mistake is to think that attributes whose names start with a double underscore (but don't end with one) are a way to make attributes private.
Names that look like __my_attribute, to be clear.
That is a completely different feature, called name mangling, and it is used to solve some nasty inheritance problems, not to make attributes private.
One might actually make a mess of the inheritance chain by using them without understanding them, so refrain unless you know what you're doing!
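To see what name mangling actually does: inside the class body, a name like __secret is silently rewritten to _ClassName__secret, so it is renamed, not hidden:

class Mangled:
    def __init__(self):
        self.__secret = 1  # stored as _Mangled__secret

m = Mangled()
print(m.__dict__)          # {'_Mangled__secret': 1}
print(m._Mangled__secret)  # 1: not private at all, just renamed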
Properties can also be used to control the behavior of an attribute, such as the types and values it can accept.
class Something:
    @property
    def name(self):
        if not hasattr(self, "_name"):
            raise ValueError("the `name` attribute has not been set")
        return self._name

    @name.setter
    def name(self, value):
        if not isinstance(value, str):
            s = ("name need to be a string, "
                 "a value of <{}> was provided")
            raise TypeError(s.format(repr(value)))
        self._name = value.title()
namespace = Something()
namespace.name
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-213-36921b093d67> in <module>
      1 namespace = Something()
----> 2 namespace.name

<ipython-input-212-e3b8550f5f83> in name(self)
      3     def name(self):
      4         if not hasattr(self, "_name"):
----> 5             raise ValueError("the `name` attribute has not been set")
      6         return self._name

ValueError: the `name` attribute has not been set
namespace.name = [1, 2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-215-2fb97a563220> in <module>
----> 1 namespace.name = [1, 2]

<ipython-input-212-e3b8550f5f83> in name(self, value)
      9     def name(self, value):
     10         if not isinstance(value, str):
---> 11             raise TypeError("name need to be a string, a value of <{}> was provided".format(repr(value)))
     12         self._name = value.title()

TypeError: name need to be a string, a value of <[1, 2]> was provided
namespace.name = "eNRICO"
namespace.name
'Enrico'
What happens when the dot operator fails to find the requested attribute in the instance and the class?
It falls back to a magic method, of course: __getattr__.
If that is not defined, the lookup gives up and fails; but if we implement it, we can dynamically return attributes that were never defined in the class or the instance!

class Something:
    def __getattr__(self, name):
        return "{} has been requested".format(repr(name))
namespace = Something()
namespace.b
"'b' has been requested"
Inheritance is a way of constructing classes on the basis of other classes.
This can be used for both goals of object oriented programming:
- reusing code (building on an existing class)
- defining interfaces (declaring that a class behaves like another)

The need for the latter has been reduced since Python 3.4, with the introduction of the abstract base classes' subclass hook.
The syntax to inherit from other classes is the following:
class ClassName(SuperClass_0, SuperClass_1):
    <class body>
This is very common in exception management: when one is writing their own library, it is sometimes more convenient to define specialized exceptions that can be caught in a more focused way.
These can be (and usually should be) specialized versions of the most common ones, such as ValueError, TypeError, IndexError, etc.
This allows the user to distinguish between exceptions that are expected (the ones raised by the library) and ones that are not (exceptions caused by some other code).
It can also be a way to limit the things a user has to worry about in error management, by catching and re-raising possible exceptions as different ones.
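For example (a sketch with a made-up LibraryError), one can catch a low-level exception and re-raise it as a library-specific one, keeping the original as context with raise ... from:

class LibraryError(Exception):
    # hypothetical base exception for all the errors of our library
    pass

def load_config(text):
    try:
        key, value = text.split("=")
    except ValueError as exc:
        # re-raise as our own exception, keeping the original as context
        raise LibraryError(f"malformed config line: {text!r}") from exc
    return {key.strip(): value.strip()}

print(load_config("answer = 42"))  # {'answer': '42'}
# load_config("no equal sign")     # would raise LibraryError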
class ZeroError(ValueError): pass
class NegativeError(ValueError): pass

def myfunction(a):
    if a < 0:
        raise NegativeError("the value should not be negative")
    if a == 0:
        raise ZeroError("the value should not be zero")
    return a * 2
If I use a ValueError I can catch both kinds of errors:

try:
    myfunction(0)
except ValueError:
    pass

try:
    myfunction(-1)
except ValueError:
    pass
And I can selectively catch only the kind of error I'm interested in
try:
    myfunction(0)
except ZeroError:
    pass

try:
    myfunction(-1)
except ZeroError:
    pass
---------------------------------------------------------------------------
NegativeError                             Traceback (most recent call last)
<ipython-input-380-07678d184437> in <module>
      1 try:
----> 2     myfunction(-1)
      3 except ZeroError:
      4     pass

<ipython-input-377-674f95bb9925> in myfunction(a)
      4 def myfunction(a):
      5     if a < 0:
----> 6         raise NegativeError("the value should not be negative")
      7     if a == 0:
      8         raise ZeroError("the value should not be zero")

NegativeError: the value should not be negative
A similar approach is to employ multiple inheritance to represent an error that fits in various categories.
If you have a dynamic dictionary that generates results on the fly, one error might at the same time be an IndexError (because it is a nonsensical index) and a ValueError (because, if you look at the dictionary as a function, the user just passed a wrong value).
It makes sense for the user to be able to catch them in both ways, so one can follow the principle of least surprise and raise an exception that can be caught both ways.
From a formal point of view this is not an ontology anymore, but practicality beats purity.
class ValueOrIndexError(ValueError, IndexError): pass

import string

class MyDict(dict):
    def __getitem__(self, char):
        if char not in string.ascii_letters:
            raise ValueOrIndexError(f"'{char}' is not a valid character!")
        else:
            return ord(char)
temp = MyDict()
temp['a']
97
temp['2']
---------------------------------------------------------------------------
ValueOrIndexError                         Traceback (most recent call last)
<ipython-input-407-c40d2261fa0f> in <module>
----> 1 temp['2']

<ipython-input-399-1cfb1db28486> in __getitem__(self, char)
      5     def __getitem__(self, char):
      6         if char not in string.ascii_letters:
----> 7             raise ValueOrIndexError(f"'{char}' is not a valid character!")
      8         else:
      9             return ord(char)

ValueOrIndexError: '2' is not a valid character!
try:
    temp['2']
except ValueError:
    pass

try:
    temp['2']
except IndexError:
    pass
Traditional inheritance is mostly about code reuse: we have a base class that implements some basic functionality, and we don't want to rewrite all that code!
A child class can: add completely new methods, override existing ones, or extend them.
an example, taken from Raymond Hettinger's lectures:
@dataclass
class Animal:
    name: str

    def walk(self):
        print(f"{self.name} is walking")
e.g.: Dog adds bark, Snake overrides walk with a slithering behavior, Cat extends walk with tail wiggles.
class Dog(Animal):
    def bark(self):
        print(f"{self.name} is barking!")
Dog adds a completely new method to the superclass Animal
class Snake(Animal):
    def walk(self):
        print(f"{self.name} is slithering!")
Snake completely overrides the superclass' original method, and the new one has no relationship with the old one aside from the intended use.
class Cat(Animal):
    def walk(self):
        super().walk()
        print(f"{self.name} is wagging its tail!")
Cat is extending the old function by adding new behavior on top of it.
The super function is a special function that allows us to reference the method implemented by the superclass instead of the current one; it is commonly used during inheritance, especially while overriding.
A common use of inheritance is to use a class that derives many methods from a few basic ones.
For example, the collections.abc module provides a MutableMapping abstract base class that builds the whole mutable-mapping interface from a base set of methods:
- __getitem__
- __setitem__
- __delitem__
- __iter__
- __len__
import collections.abc

class Empty(collections.abc.MutableMapping):
    def __getitem__(self, key): pass
    def __setitem__(self, key, value): pass
    def __delitem__(self, key): pass
    def __iter__(self): pass
    def __len__(self): pass
dir(Empty).public
function:
    clear: D.clear() -> None.  Remove all items from D.
    get: D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
    items: D.items() -> a set-like object providing a view on D's items
    keys: D.keys() -> a set-like object providing a view on D's keys
    pop: D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
    popitem: D.popitem() -> (k, v), remove and return some (key, value) pair
    setdefault: D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
    update: D.update([E, ]**F) -> None.  Update D from mapping/iterable E and F.
    values: D.values() -> an object providing a view on D's values
A subclass can also replace some of the methods derived by the parent class, for example with a more performant version when some assumptions are valid.
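For example, MutableMapping derives __contains__ from __getitem__ (a try/except around the lookup); if our storage has a faster membership test, we can override the derived method. A sketch:

from collections import abc

class DictBacked(abc.MutableMapping):
    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        # replace the generic try/except version with the dict's fast lookup
        return key in self._data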
One can also define a class that checks for an interface, so that all the classes following that interface appear as its subclasses when checked with issubclass or isinstance.
This is a way of describing interfaces without changing the class definitions and without forcing inheritance.
The typing library (since Python 3.8; earlier via typing_extensions) provides a Protocol object, which allows us to declare interfaces that a class should respect without actually having to inherit from it.
This is very useful both for run-time checks and for mypy typing.
from typing import Protocol, runtime_checkable

# this protocol checks whether the object has an integer attribute called handles
@runtime_checkable
class Portable(Protocol):
    handles: int

# this protocol checks whether the class provides a method with the given signature
@runtime_checkable
class Ducky(Protocol):
    def quack(self) -> None:
        pass
class Empty:
    pass
obj = Empty()
assert not isinstance(obj, Portable)
obj.handles = 3
assert isinstance(obj, Portable)
class Duck:
    def quack(self):
        pass
assert issubclass(Duck, Ducky)
duck_instance = Duck()
assert isinstance(duck_instance, Ducky)
this allows me to write code such as:
def talk_with(pet: Ducky):
    pet.quack()
and have mypy accept my signature and verify code against it. I can also use it for saner subclass and instance checks, if one wants to program defensively!
Beware that the runtime check only verifies the presence of the method, not its signature; that is a job for mypy!
%%file temp_mypy.py
from typing import Protocol, runtime_checkable

@runtime_checkable
class TalkativeDuck(Protocol):
    def quack(self) -> str:
        return "quack"

class MechanicalDuck:
    def quack(self) -> int:
        return 0

assert issubclass(MechanicalDuck, TalkativeDuck)

def talk(pet: TalkativeDuck):
    print(pet.quack())

talk(MechanicalDuck())
Overwriting temp_mypy.py
This code runs, even if it is technically incorrect; but mypy warns us that there is an issue with the definition!
!python temp_mypy.py
0
!mypy temp_mypy.py
temp_mypy.py:17: error: Argument 1 to "talk" has incompatible type "MechanicalDuck"; expected "TalkativeDuck"
temp_mypy.py:17: note: Following member(s) of "MechanicalDuck" have conflicts:
temp_mypy.py:17: note:     Expected:
temp_mypy.py:17: note:         def quack(self) -> str
temp_mypy.py:17: note:     Got:
temp_mypy.py:17: note:         def quack(self) -> int
Found 1 error in 1 file (checked 1 source file)
Often inheritance is used to handle some very specific corner case of a class.
In many cases this can be replaced with composition.
Composition is a way to replace the framework approach: instead of subclassing a class to specialize its behavior, one can set an attribute to a completely different object to configure the behavior.
Let's consider a very simple class that prints a message.
This class prints to the terminal, but we might implement a subclass that replaces the print with something else.
class BasicPrinter:
    def printer(self, message):
        print(message)

    def say_hello(self, name):
        s = f"hello {name}!\n"
        self.printer(s)

class LoggingPrinter(BasicPrinter):
    def printer(self, message):
        import logging
        logging.warning(message)
BasicPrinter().say_hello("everybody")
LoggingPrinter().say_hello("everybody")
WARNING:root:hello everybody!
hello everybody!
This approach works, and it's quite common in more structured languages, but you can see that it still feels somewhat verbose.
If we have several behaviors, we need a different subclass for each behavior.
If we have several kinds of behaviors, we might need to write a class for each combination in the cartesian product of the options!
Composition presents a better option, allowing us to simply replace a component to change the behavior; no subclassing needed.
@dataclass
class Printer:
    printer: callable = print

    def say_hello(self, name):
        s = f"hello {name}!\n"
        self.printer(s)
myprinter = Printer()
myprinter.say_hello("everybody")
hello everybody!
from io import StringIO
s = StringIO()
myprinter.printer = s.write
myprinter.say_hello("everybody")
myprinter.say_hello("nobody")
print(s.getvalue())
hello everybody!
hello nobody!
This is another very common pattern that can be replaced by composition.
In the template method pattern, the parent class controls the behavior, while the child class implements the actual methods needed to act (the basic idea behind abstract classes).
An animal could delegate how to sense food and how to move from one place to another, but control how the decision to move is made once the food is sensed.
On the other hand, it's a very common pattern in Python: all the magic methods are nothing more than framework logic to perform various operations.
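A sketch of the same idea with composition: the controlling logic stays in one class, and the variable steps are plugged in as callables (the names sense_food, move and forage are invented for illustration):

class Animal:
    def __init__(self, sense_food=lambda: True, move=lambda: print("walking")):
        # the variable steps are injected as plain callables
        self.sense_food = sense_food
        self.move = move

    def forage(self):
        # the control flow (the "template") stays in this class
        if self.sense_food():
            self.move()

snake = Animal(move=lambda: print("slithering"))
snake.forage()  # slithering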
Any class that implements these features can inherit from BaseEstimator, and then from sklearn.base.ClassifierMixin, sklearn.base.RegressorMixin or sklearn.base.TransformerMixin, to have its interface completed.
Due to the requirements of this extension, I strongly suggest creating these classes using the dataclass decorator.
It is also a good idea to just store the __init__ parameters without modifying them.
If you have to perform some processing and transformations, it's usually better to do them inside the fit method.
from dataclasses import dataclass
from typing import List
from sklearn.base import BaseEstimator, TransformerMixin

@dataclass
class ColumnSelector(BaseEstimator, TransformerMixin):
    columns: List[str]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]
cs = ColumnSelector(columns=["a", "e"])
print(cs.get_params())
{'columns': ['a', 'e']}
import pandas as pd
df = pd.DataFrame(rn.randn(5, 5), columns=list("abcde"))
cs.fit_transform(df)
|   | a | e |
|---|---|---|
| 0 | -1.606346 | -0.300311 |
| 1 | 0.020452 | -0.678270 |
| 2 | 1.406233 | 0.581932 |
| 3 | -2.292684 | 0.420558 |
| 4 | 0.959172 | 1.054905 |
Polymorphism is the ability of a function, routine or method to behave differently depending on the class of the objects it receives as parameters.
In general OOP, each type signature can activate a different implementation.
For example, a function defined as

def times(a, b):
    return a * b

could have different implementations specifying the behaviour in different cases:
- a and b are numbers
- a is a string, b is a number
- a is a list, b is a number

In Python's OOP, only the type of the first object can be used for dispatch with the standard library; this is called single dispatch.
Libraries that allow multiple dispatch exist, but I wouldn't recommend them.
One can always define multiple dispatch in terms of several levels of single dispatch.
Considering how easy it is to use duck typing in Python, it is actually rarely necessary to rely on this kind of polymorphism at all.
Single dispatch is a way of writing functions that recognize the type of their first argument, allowing for a low-level form of object oriented code.
This allows writing generic functions that can be easily combined with iteration functions such as map, as they can receive iterables containing a mix of different objects and use the best implementation for each specific object received.
This could be managed with a combination of if-else and isinstance calls, but single dispatch allows a more readable approach.
Let's say that I want to write a function that calculates the mean of an iterable, but uses more efficient functions when available, such as for numpy arrays
from functools import singledispatch
from statistics import mean
import numpy as np

def average(iterable):
    if isinstance(iterable, np.ndarray):
        print("using the specific (and fast) numpy mean")
        return iterable.mean()
    else:
        print("using the generic (and slow) python mean")
        return mean(iterable)
average([1, 2, 3])
using the generic (and slow) python mean
2
data = np.array([1, 2, 3])
average(data)
using the specific (and fast) numpy mean
2.0
Single dispatch allows us to avoid writing all those checks explicitly, doing it under the hood for us.
@singledispatch
def average(iterable):
    print("using the generic (and slow) python mean")
    return mean(iterable)

@average.register
def _(np_array: np.ndarray):
    print("using the specific (and very fast) numpy mean")
    return np_array.mean()
average([1, 2, 3])
using the generic (and slow) python mean
2
data = np.array([1, 2, 3])
average(data)
using the specific (and very fast) numpy mean
2.0
It also has the advantage that, if we want to write a specific version of the average function for a class of our own, we don't need to tamper with the original one, but can extend it in a (relatively) simple way.
We have to register it after the class has been defined, outside the class body, as the class does not exist before the execution of the class body.
class AverageAware:
    def __init__(self, value):
        self.value = value

    def _my_avg(self):
        return self.value

average.register(AverageAware)(AverageAware._my_avg)
a = AverageAware(5)
average(a)
5
A concept similar to single dispatch is function hooking:
we can implement this quite easily in python using the Protocol definition we discussed in the previous section on inheritance
This is the basic mechanism underneath the len function:
- when the object's class provides a __len__ method, len defers to it
- otherwise it raises a TypeError

from typing import Protocol, runtime_checkable
@singledispatch
def average(iterable):
    "when not defined, try to use python"
    print("python mean used")
    return mean(iterable)

@runtime_checkable
class Provide_mean(Protocol):
    "this is the protocol to identify classes that have a `mean` function"
    def mean(self):
        pass

@average.register
def _(instance: Provide_mean):
    "if the class has a `mean` function, call it"
    print("object's own mean function used")
    return instance.mean()
print(average([1, 2, 3]))
python mean used
2
data = np.array([1, 2, 3])
average(data)
object's own mean function used
2.0
class MyClass:
    def mean(self):
        return 0.0
pippo = MyClass()
print(average(pippo))
object's own mean function used
0.0
In all fairness, we could have defined the function hook without single dispatch in this simple case.
Single dispatch would still be useful if we needed to combine the hook with dispatch on standard types.
def average2(iterable):
    avg_fun = getattr(iterable, "mean", None)
    if callable(avg_fun):
        return avg_fun()
    return mean(iterable)
print(average2([1, 2, 3]))
print(average2(np.array([4, 5, 6])))
pippo = MyClass()
print(average2(pippo))
2
5.0
0.0
Python 3.8 also introduced singledispatchmethod, which allows performing single dispatch in methods.
A dedicated decorator is required to avoid weird interactions with the binding process of method calling.
from functools import singledispatchmethod
from dataclasses import dataclass
from numbers import Number

@dataclass
class Container:
    value: Number

    @singledispatchmethod
    def __add__(self, other):
        return NotImplemented

    @__add__.register
    def _(self, other: Number):
        return self.__class__(self.value + other)
cont = Container(3)
print(cont)
print(cont+"1")
Container(value=3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [28], in <cell line: 3>()
      1 cont = Container(3)
      2 print(cont)
----> 3 print(cont+"1")

TypeError: unsupported operand type(s) for +: 'Container' and 'str'
cont = Container(3)
print(cont)
print(cont+1)
Container(value=3)
Container(value=4)
Note that by default we cannot reference the class itself in a type annotation, as the class name is defined only after the class body has been executed.

class Broken:
    def test(self, a: Broken):
        pass
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 class Broken:
      2     def test(self, a: Broken):
      3         pass

Cell In[3], line 2, in Broken()
      1 class Broken:
----> 2     def test(self, a: Broken):
      3         pass

NameError: name 'Broken' is not defined
A way to circumvent this forward-referencing problem is to define the methods that refer to the class itself outside of the class (all the other definitions can still be inside the class body).

from functools import singledispatchmethod

class SelfReferent:
    @singledispatchmethod
    def __add__(self, other):
        """generic version of the function"""
        return NotImplemented

@SelfReferent.__add__.register
def _(self, other: SelfReferent):
    return 2
p = SelfReferent()
q = SelfReferent()
p+q
2
p+1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 p+1

TypeError: unsupported operand type(s) for +: 'SelfReferent' and 'int'
Our goal is to implement a predictor transformer: use the prediction of one or more Regressors/Classifiers as the input for another one.
I want to use it like:
predictor_transformer = PT(LinearRegression)
predictor_transformer.fit(X_train, y_train)
predictor_transformer.transform(X_test) # should return a 2d array Nx1 with the predicted y_test
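One possible solution sketch (peek only after trying!); it assumes the inner model is passed as an instance, e.g. PT(LinearRegression()):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PT(BaseEstimator, TransformerMixin):
    # use the predictions of an inner model as a single (N, 1) feature
    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        return np.asarray(self.model.predict(X)).reshape(-1, 1)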
Our goal might be to implement a groupby standardizer.
A standardizer subtracts the average of one or more columns, and divides the data by their standard deviation.
The standard standardizer uses the whole-data average, but we might want to do this operation separately on different groups, for example by country.
Assuming that all the groupby categories are present in the training data, try to implement it:
group_standardizer = GS(groupby_column='country')
group_standardizer.fit(X_train, y_train)
group_standardizer.transform(X_train)
This is not part of the lesson material, but it is interesting to have here as a reference. You can explore these topics in the notebook corresponding to these slides.
topics:
black magic topics