Decentralized Machine Learning
In the era of machine learning, characterised by powerful hardware, accurate models, and ever larger quantities of data, both organisations and individuals feel the urge to treat their data as they never have before.
With few exceptions, data analysis extracts value from what would otherwise be little more than raw data. For businesses, the added value lies in converting data into knowledge.
Data-driven decisions are essential to any modern organisation with complex business dynamics, in which human intervention is usually less effective, often prone to error, and generally much slower than the pace of data collection itself.
The sheer value of data is finally being acknowledged by those who produce them daily with their devices and connections.
IT giants like Google, Facebook and Twitter were among the few corporations that appreciated the value of data years ago, and have built empires at times markedly more influential than governments and public institutions. It should come as little surprise that the figure of the data scientist has acquired a far more central role within data-driven organisations.
Democratizing Machine Learning
In recent years, the advent of data science and data-driven decisions has been accompanied by a number of side effects.
On the one hand, it has segregated communities of data scientists around rich datasets; on the other, it has led to the centralization of existing hubs of data.
In this rich-get-richer model, companies at the center of the data collection process have trained increasingly accurate machine learning models, practically controlling the analytics market and reaping all the benefits of machine learning. It is broadly accepted that machine learning models can only be trained on data that the data scientist can access. Unfortunately, data scientists rarely have access to the most valuable data in the market, and when they do, they are constrained by Non-Disclosure Agreements and other forms of contract.
To democratize machine learning, data scientists should be able to train their models on data they neither own nor see.
Reputation and Reproducibility
In machine learning, model accuracy and reliability depend on the data the models have been trained on. The principle of "garbage in, garbage out" (GIGO), popular in computer science, also applies to analysis, logic and of course machine learning.
As a matter of fact, simple models trained on high-quality datasets can produce more accurate results than sophisticated models trained on low-quality data.
As a consequence, the reputation of a model is tied to its training data rather than to the presence or absence of a GPU during the training process.
Reproducibility is one of the most requested features of a data science pipeline, particularly in research institutions. From academic papers to industrial technical reports, it is important to reproduce the claims about a machine learning model before it is adopted in production environments. Training models on private data clearly adds complexity and uncertainty to reproducible research. Any claim about a specific data science pipeline executed in a sealed and unreachable environment would be impossible to verify.
A model that can be verified and uniquely identified can also be tracked across its entire life cycle, from its random initialization to the setting of the optimal values of its parameters. This is essential to measure how valuable the model is (by tracking the resources exposed by the data owners) and how reliable it is expected to be on the testing datasets (by tracking the validation process of verifiers).
Keeping track of these two metrics makes it possible to build a reputation score for every machine learning model. More specifically, the structure of a neural network, together with its parameters and reputation score, makes it possible to assign a monetary value to the model. The model can then be traded like any other asset and generate revenue by performing predictions, much like a data scientist.
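One way to picture a model that is uniquely identified and tracked across its life cycle is a content hash of its structure and parameters, recorded in an append-only log where each entry commits to the previous one. The sketch below is illustrative only: `fingerprint`, `LifecycleLog`, the event names and the JSON encoding are all assumptions made for this example, not a description of how any particular platform stores its records.

```python
import hashlib
import json


def fingerprint(architecture, parameters):
    """Return a SHA-256 hex digest identifying one snapshot of a model.

    Assumption: architecture and parameters are JSON-serializable. Real
    parameter tensors would need their own canonical serialization.
    """
    # Canonical JSON (sorted keys, fixed separators) so equal models hash equally.
    payload = json.dumps(
        {"architecture": architecture, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def _hash_body(body):
    """Hash the fields of a log entry in a deterministic way."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class LifecycleLog:
    """Append-only log in which every entry commits to the previous entry."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def append(self, event, model_id):
        """Record an event (hypothetical names, e.g. 'initialized', 'trained')."""
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        body = {"event": event, "model_id": model_id, "prev": prev}
        self.entries.append({**body, "entry_hash": _hash_body(body)})

    def verify(self):
        """Return True if no entry has been altered, removed or reordered."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {key: entry[key] for key in ("event", "model_id", "prev")}
            if entry["prev"] != prev or entry["entry_hash"] != _hash_body(body):
                return False
            prev = entry["entry_hash"]
        return True
```

Because each entry includes the hash of the one before it, changing any past record invalidates every later entry, which is the same property that makes a blockchain a natural place to anchor such a log.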
The Marketplace
Machine learning performed on private data on a global scale requires a marketplace in order to exchange data and models. In such a data-model marketplace, data owners and data scientists collaborate in training their machine learning models.
Such cooperation lets them reap the benefits of data science while preserving data confidentiality and building model reputation.
The flexibility of a marketplace platform allows all parties to agree (or disagree) about sharing the parameters of the trained models and/or splitting the revenues generated by such models, whenever other actors purchase them.
Data owners, data scientists and regular users are all entities of the same marketplace, a requirement that allows them to perform machine learning in a secure and confidential way.
Introducing Fitchain
At fitchain we have developed a technology that allows data scientists to operate on private data. Moreover, the fitchain platform allows data owners to keep their data private, securely stored in the cloud or on any storage infrastructure they control.
The platform performs two essential tasks:
- During training, fitchain builds the reputation score of the machine learning model, which can be traded afterwards. This score is calculated from a combination of metrics stored on the Ethereum blockchain via transactions and code executed by smart contracts. We refer to this combination of metrics as the proof-of-train.
- After the model is trained, fitchain allows data scientists and data owners to share the parameters of the machine learning model. This will allow them to split the revenue that is eventually generated by the new model-entity, much like in the Decentralized Autonomous Organization (DAO) approach.
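The revenue split mentioned above can be sketched as a proportional division over integer amounts. Integer arithmetic is used because on-chain currencies such as Ether are accounted in indivisible base units (wei), so no payout can be fractional. The function name `split_revenue`, the party names and the share values are hypothetical; this is a minimal sketch under those assumptions, not fitchain's contract logic.

```python
def split_revenue(amount, shares):
    """Split an integer amount among parties in proportion to integer shares.

    amount: revenue in the smallest currency unit (for example wei).
    shares: mapping of party name to a non-negative integer weight.
    The payouts always add up to exactly `amount`.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    total = sum(shares.values())
    if total <= 0 or any(share < 0 for share in shares.values()):
        raise ValueError("shares must be non-negative and sum to a positive value")

    # Floor division never overpays; it can leave a few units undistributed.
    payouts = {party: amount * share // total for party, share in shares.items()}
    remainder = amount - sum(payouts.values())

    # Hand out the leftover units one at a time, largest share first,
    # breaking ties by party name so the result is deterministic.
    ranked = sorted(shares, key=lambda party: (-shares[party], party))
    for party in ranked[:remainder]:
        payouts[party] += 1
    return payouts


# Hypothetical agreement: the data owner gets 60%, the data scientist 40%.
example = split_revenue(1_000_000, {"data_owner": 60, "data_scientist": 40})
```

Fixing the rule for leftover units in advance matters in practice: all parties must be able to recompute the same payouts independently from the agreed shares.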
We are developing the decentralized machine learning marketplace.
Join us at fitchain.io