Interpretable Model-Based Hierarchical RL Using Inductive Logic Programming

albertzeyer · on Sept 12, 2021

So, what does interpretable really mean?

When you train a neural network, it is also not really a black box. You can exactly see what it does, and inspect all weights and activations. You have a very clear logical reasoning why you end up with some decision. Only that this is maybe a huge decision rule when written out.

If you now take such a symbolic or logic-based approach and apply it to some real-world task, then, depending on how much freedom you leave to the training, the learned logic rules can look exactly like the rules you get when you write down the decision rule of the neural network.

Maybe the logic rules derived this way will look shorter, simpler, more understandable. But this is not really studied here. And this is questionable. And maybe even not so easy to measure anyway.

More interesting is the increased data efficiency by the proposed method. This is something you can measure. And any improvement here is good and speaks for the new method by itself.

But this does not imply better interpretability.

aabaker99 · on Sept 12, 2021

Can you inspect the weights in GPT-3 and tell me what it does?

The mathematics of neural networks leads to huge numbers of parameters, each with a small contribution to the overall result. There's tons of work in ML to try and go in the opposite direction, LASSO being probably the most famous.

Yes, you can inspect the weights. But compared to other models, neural networks are still a black box.

Interpretable means that a human being is able to explain what the model does. LASSO gets you this by narrowing down on a small number of important parameters. Decision trees are interpretable.
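To make the LASSO point concrete, here is a minimal sketch, assuming a hand-rolled cyclic coordinate descent rather than any particular library, and a made-up toy data set (`X`, `y`, and `alpha=0.1` are all hypothetical choices). It only illustrates how the L1 penalty pushes weak coefficients to exactly zero, which is what leaves a small set of parameters to explain.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=50):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Toy data (hypothetical): three orthogonal +/-1 features.
# Only the first one matters much; the second has a tiny effect, the third none.
X = [[a, b, c] for a in (1, -1) for b in (1, -1) for c in (1, -1)]
y = [3.0 * a + 0.05 * b for a, b, c in X]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
print(weights)
```

On this toy data the first weight ends up close to 2.9 (the true 3.0 shrunk by `alpha`), and the other two are exactly zero, so the fitted model can be read off as "only feature 0 matters".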

Even allowing for some groupings of neural network parameters, like convolutional filters or certain network architecture choices that can characterize an entire layer, there are still too many parameters to explain.

teruakohatu · on Sept 12, 2021

> When you train a neural network, it is also not really a black box. You can exactly see what it does, and inspect all weights and activations. You have a very clear logical reasoning why you end up with some decision. Only that this is maybe a huge decision rule when written out.

In some networks, e.g. CNNs, you may be able to visualize which of the inputs have the greatest impact on the output(s), but that is a far cry from understanding how the network got there in the first place.

When a network gets something consistently wrong the answer is usually "more/better training data" not a deep understanding of why the network is wrong.

This is in contrast to a simple decision tree, whose decisions can be understood by even a layperson with no statistics background. It is simple to go back and understand why the wrong decision was made.
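As a rough illustration of why a wrong decision in a small tree is easy to trace back, here is a sketch with a hand-written tree. The features (`income`, `debt`), the thresholds, and the `explain` helper are all hypothetical, not taken from any real model or library.

```python
# Hypothetical hand-written tree: each internal node tests one feature against a threshold.
TREE = {
    "feature": "income", "threshold": 50000,
    "below": {"leaf": "reject"},
    "at_or_above": {
        "feature": "debt", "threshold": 20000,
        "below": {"leaf": "approve"},
        "at_or_above": {"leaf": "reject"},
    },
}


def explain(tree, sample):
    """Return the decision together with the list of tests that led to it."""
    path = []
    node = tree
    while "leaf" not in node:
        name, threshold = node["feature"], node["threshold"]
        value = sample[name]
        if value < threshold:
            path.append(f"{name}={value} < {threshold}")
            node = node["below"]
        else:
            path.append(f"{name}={value} >= {threshold}")
            node = node["at_or_above"]
    return node["leaf"], path


decision, reasons = explain(TREE, {"income": 60000, "debt": 25000})
print(decision, reasons)
```

The returned path is the whole explanation: two comparisons, each of which a layperson can check and dispute.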

gavinray · on Sept 11, 2021

I don't work in the field, but I sort of passively follow it.

A year ago I made this comment, in another ML thread:

https://news.ycombinator.com/item?id=23315739

  "I often wonder about whether neural networks might need to meet at a crossroads with other techniques."

  "Inductive Logic/Answer Set Programming or Constraints Programming seems like it could be a good match for this field. Because from my ignorant understanding, you have a more "concrete" representation of a model/problem in the form of symbolic logic or constraints and an entirely abstract "black box" solver with neural networks. I have no real clue, but it seems like they could be synergistic?"

I can't interpret the paper -- is this roughly in this vein?

infogulch · on Sept 11, 2021

I've been thinking along the same lines, it seems like logic + ML would complement each other well. Acquiring trustworthy labeled data is "THE" problem in ML, and figuring out which predicates to string together is "THE" problem in logic programming, seems like a perfect match.

A logic program can produce a practically infinite number of perfectly consistent test cases for the ML model to learn from, and the ML model can predict which problem should be solved. I'd like to see a conversational interface that combines these two systems, ML generates logic statements and observes the results, repeat. That might help to keep it from going off the rails like a long GPT-3 session tends to do.

nextos · on Sept 11, 2021

This is already starting to happen, albeit quite slowly. I think it will gain a lot of momentum and it will lead to very interesting progress in AI.

For example, deep functions + probabilistic models yield things such as deep Markov models, which are interpretable and can represent really complex distributions such as music.

Deep functions can also be used during sampling to generate sophisticated proposals in problems where standard algorithms struggle to navigate the posterior.

There are also equivalent ideas being explored in RL, such as the OP.

medo-bear · on Sept 12, 2021

can you name some references please? also, what do you mean by deep functions?

nextos · on Sept 14, 2021

Apologies for the late reply.

By deep functions I simply meant small deep networks embedded in probabilistic models as a drop-in replacement for probability distributions, e.g. a neural HMM: https://pyro.ai/examples/hmm_funsor.html

Aside from looking into Pyro docs and all references, another good entry point is this publication: https://arxiv.org/pdf/1610.05735.pdf. Here the authors show how to use deep networks to aid sampling of complex probabilistic models.

medo-bear · on Sept 18, 2021

Thank you. I didn't notice that you replied. These links are quite helpful. I think that eventually AI will start to incorporate all the new ML stuff into the old symbolic stuff, and I think probabilistic models are a good way to link the two.

amelius · on Sept 11, 2021

> Acquiring trustworthy labeled data is "THE" problem in ML, and figuring out which predicates to string together is "THE" problem in logic programming, seems like a perfect match.

Can't this be generalized into using the ML to prune a search tree, and using the logic to generate the search tree? And didn't we already successfully try this, see e.g. AlphaGo?
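A toy sketch of that split, under loud assumptions: `successors` plays the logic side and generates every legal move, while `score` is a hand-written stand-in for what would be a learned model. The problem itself (reach a target number from 1 using +1 and *2) and all the names are hypothetical, and this is a plain beam search, not what AlphaGo actually does.

```python
def successors(value):
    """Symbolic part: the rules of the toy problem generate every legal move."""
    return [("+1", value + 1), ("*2", value * 2)]


def score(value, target):
    """Stand-in for a learned model: higher means more promising."""
    return -abs(target - value)


def pruned_search(start, target, beam_width=2, max_depth=20):
    """Beam search: expand with the rules, prune with the scorer."""
    if start == target:
        return []
    beam = [(start, [])]
    seen = {start}
    for _ in range(max_depth):
        candidates = []
        for value, path in beam:
            for move, nxt in successors(value):
                if nxt == target:
                    return path + [move]
                if nxt not in seen:
                    seen.add(nxt)
                    candidates.append((nxt, path + [move]))
        # Pruning step: keep only the branches the scorer likes best.
        candidates.sort(key=lambda item: score(item[0], target), reverse=True)
        beam = candidates[:beam_width]
        if not beam:
            return None
    return None


plan = pruned_search(1, 10)
print(plan)
```

The pruning makes the search cheap but gives up guarantees: the plan found is valid because every step came from the rules, but it need not be the shortest one.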

westurner · on Sept 11, 2021

AutoML is RL? The entire exercise of publishing and peer review is an exercise in cybernetics?

https://en.wikipedia.org/wiki/Probabilistic_logic_network :

> The basic goal of PLN is to provide reasonably accurate probabilistic inference in a way that is compatible with both term logic and predicate logic, and scales up to operate in real time on large dynamic knowledge bases.

> The goal underlying the theoretical development of PLN has been the creation of practical software systems carrying out complex, useful inferences based on uncertain knowledge and drawing uncertain conclusions. PLN has been designed to allow basic probabilistic inference to interact with other kinds of inference such as intensional inference, fuzzy inference, and higher-order inference using quantifiers, variables, and combinators, and be a more convenient approach than Bayesian networks (or other conventional approaches) for the purpose of interfacing basic probabilistic inference with these other sorts of inference. In addition, the inference rules are formulated in such a way as to avoid the paradoxes of Dempster–Shafer theory.

Has anybody already taught / reinforced an OpenCog [PLN, MOSES] AtomSpace hypergraph agent to do Linked Data prep and also convex optimization with AutoML and better than grid search so gradients?

Perhaps teaching users to bias analyses with e.g. Yellowbrick and the sklearn APIs would be a good curriculum traversal?

openai/baselines "Logging and vizualizing learning curves and other training metrics" https://github.com/openai/baselines#logging-and-vizualizing-...

https://en.wikipedia.org/wiki/AlphaZero

There's probably an awesome-automl by now? Again, the sklearn interfaces.

TIL that SymPy supports NumPy, PyTorch, and TensorFlow [Quantum; TFQ?]; and with a Computer Algebra System something for mutating the AST may not be necessary for symbolic expression trees without human-readable comments or symbol names? Lean mathlib: https://github.com/leanprover-community/mathlib , and then reasoning about concurrent / distributed systems (with side channels in actual physical component space) with e.g. TLA+.

There are new UUID formats that are timestamp-sortable; for when blockchain cryptographic hashes aren't enough entropy. "New UUID Formats – IETF Draft" https://news.ycombinator.com/item?id=28088213

... You can host online ML algos through SingularityNet, which also does PayPal now for the RL.

westurner · on Sept 12, 2021

Our visual / auditory biological neural networks do appear to be hierarchical and relatively highly plastic as well.

If you're planning to mutate, crossover, and select expression trees, you'll need a survival function (~cost function) in order to reinforce; RL.

Blockchains cost immutable data storage with data integrity protections by the byte.

Smart contracts cost CPU usage with costed opcodes. eWASM (Ethereum WebAssembly) has costed opcodes for redundantly-executed smart contracts (that execute on n nodes of a shard) https://ewasm.readthedocs.io/en/mkdocs/determining_wasm_gas_...

AFAIU, while there are DLTs that cost CPU, RAM, and Data storage between points in spacetime, none yet incentivize energy efficiency by varying costs depending upon whether the instructions execute on a FPGA, ASIC, CPU, GPU, TPU, or QPU?

To be 200% green - to put a 200% green footer with search-discoverable RDFa on your site - I think you need PPAs and all directly sourced clean energy.

(Energy efficiency is very relevant to ML/AI/AGI, because while it may be the case that the dumb universal function approximator will eventually find a better solution, "just leave it on all night/month/K12+postdoc" in parallel is a very expensive proposition with no apparent oracle; and then to ethically filter solutions still costs at least one human)

westurner · on Sept 14, 2021

> Perhaps teaching users to bias analyses with e.g. Yellowbrick and the sklearn APIs would be a good curriculum traversal?

Yellowbrick > Third Party Estimators (yellowbrick.contrib.wrapper): https://www.scikit-yb.org/en/latest/api/contrib/wrapper.html

From https://www.scikit-yb.org/en/latest/quickstart.html#using-ye... :

> The Yellowbrick API is specifically designed to play nicely with scikit-learn. The primary interface is therefore a Visualizer – an object that learns from data to produce a visualization. Visualizers are scikit-learn Estimator objects and have a similar interface along with methods for drawing. In order to use visualizers, you simply use the same workflow as with a scikit-learn model, import the visualizer, instantiate it, call the visualizer’s fit() method, then in order to render the visualization, call the visualizer’s show() method.

> For example, there are several visualizers that act as transformers, used to perform feature analysis prior to fitting a model. The following example visualizes a high-dimensional data set with parallel coordinates:

  from yellowbrick.features import ParallelCoordinates
  
  visualizer = ParallelCoordinates()
  visualizer.fit_transform(X, y)
  visualizer.show()

> As you can see, the workflow is very similar to using a scikit-learn transformer, and visualizers are intended to be integrated along with scikit-learn utilities. Arguments that change how the visualization is drawn can be passed into the visualizer upon instantiation, similarly to how hyperparameters are included with scikit-learn models.

IIRC, some automl tools - which test various combinations of, stacks of, ensembles of e.g. Estimators - do test hierarchical ensembles? Are those 'piecewise' and ultimately not the unified theory we were looking for here either (but often a good enough, fast enough, sufficient approximate solution with a sufficiently low error term)?

/? hierarchical automl "sklearn" site:github.com : https://www.google.com/search?q=hierarchical+automl+%22sklea...

YeGoblynQueenne · on Sept 12, 2021

You're not too far off. The big advantage of ILP with respect to statistical techniques is the really extraordinary sample efficiency. Just a couple of examples are enough to learn many programs [1]. Moreover, ILP is pretty much the only machine learning approach that can make use of background knowledge [2] and in fact is characterised by this ability, and it is the ability to use background knowledge that is responsible for the excellent sample complexity, because if you don't have any background knowledge you are forced to learn everything from scratch - as in the "end-to-end" approach that is popular today.

So the work described in the paper basically benefits from ILP's ability to use background knowledge to reduce sample complexity. The terminology they use in the paper to describe this is a little clunky and I was a bit confused by it, so I imagine it must be even more confusing to someone who doesn't even have an ILP background.

To be honest I was most interested in the paper because it is based on ∂ILP [3], which I found a bit surprising, for various reasons. ∂ILP was described in a paper published by DeepMind a while ago and I had thought the line of work had been abandoned, but now it seems others are building on it, which is interesting to see.

_____________

[1] ILP learns logic programs and since these can be recursive, a very small program may be enough to represent even an infinite concept. For instance, think of counting over the natural numbers. Pretty much the only way to do that with a finite program is to use recursion, such as:

  n(0) ∧ n(s(x)) ← n(x)

Here s(x) is the successor of x. You can find this as "the axiom of induction" in various different notations.
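The same two-clause definition can be sketched in Python, reading the inner term as a successor s(x) (an assumption about the intended notation). The term encoding, with `"0"` for zero and `("s", x)` for a successor, is a hypothetical choice made only for this illustration.

```python
ZERO = "0"


def s(x):
    """Build the successor term s(x)."""
    return ("s", x)


def n(term):
    """Finite, recursive definition of 'is a natural number'."""
    if term == ZERO:
        return True                      # base clause: n(0)
    if isinstance(term, tuple) and len(term) == 2 and term[0] == "s":
        return n(term[1])                # recursive clause: n(s(x)) <- n(x)
    return False


three = s(s(s(ZERO)))
print(n(three), n(("s", "banana")))
```

Two clauses cover every natural number, which is the sense in which a very small recursive program represents an infinite concept.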

[2] Not just priors as in Bayesian learners, but entire programs from which a target program is built. A typical example: say you have definitions of "father" and "parent" as background knowledge and you want to learn a definition of "grandfather". A natural definition is "grandfather is the father-of-the-parent".
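A very reduced sketch of that grandfather example, assuming a hypothesis space of just two-step chains over the background predicates and a made-up family. The names, the facts, and `learn_chain_rules` are all hypothetical; a real ILP system searches a much richer space, so this only shows how background knowledge lets two positive examples pin down the rule.

```python
from itertools import product

# Background knowledge (hypothetical family), given as sets of (X, Y) facts.
father = {("bob", "carol"), ("dave", "erin"), ("frank", "dave")}
parent = father | {("alice", "carol"), ("carol", "erin"), ("gina", "dave")}
background = {"father": father, "parent": parent}

# A handful of labelled examples of the target predicate grandfather(X, Y).
positives = {("bob", "erin"), ("frank", "erin")}
negatives = {("alice", "erin"), ("bob", "carol")}


def compose(p, q):
    """All (X, Y) such that p(X, Z) and q(Z, Y) hold for some Z."""
    return {(x, y) for (x, z1) in p for (z2, y) in q if z1 == z2}


def learn_chain_rules(background, positives, negatives):
    """Keep every chain rule that covers all positives and no negatives."""
    consistent = []
    for first, second in product(sorted(background), repeat=2):
        covered = compose(background[first], background[second])
        if positives <= covered and not (negatives & covered):
            consistent.append(
                f"grandfather(X, Y) :- {first}(X, Z), {second}(Z, Y)."
            )
    return consistent


rules = learn_chain_rules(background, positives, negatives)
print(rules)
```

Of the four candidate chains, only father-then-parent survives: father-then-father and parent-then-father each miss a positive example, and parent-then-parent also covers the grandmother given as a negative.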

[3] https://www.ijcai.org/Proceedings/2018/0792.pdf

And a longer version on arxiv:

https://arxiv.org/abs/1711.04574v1