Thursday, April 18, 2024

How one can Encode Constraints to the Output of Neural Networks | by Runzhong Wang | Apr, 2024

Must read


A abstract of accessible approaches

Towards Data Science
Picture generated by ChatGPT based mostly on this text’s content material.

Neural networks are certainly highly effective. Nevertheless, as the applying scope of neural networks strikes from “customary” classification and regression duties to extra advanced decision-making and AI for Science, one downside is changing into more and more obvious: the output of neural networks is normally unconstrained, or extra exactly, constrained solely by easy 0–1 bounds (Sigmoid activation operate), non-negative constraints (ReLU activation operate), or constraints that sum to 1 (Softmax activation operate). These “customary” activation layers have been used to deal with classification and regression issues and have witnessed the vigorous improvement of deep studying. Nevertheless, as neural networks began to be broadly used for decision-making, optimization fixing, and different advanced scientific issues, these “customary” activation layers are clearly now not ample. This text will briefly talk about the present methodologies out there that may add constraints to the output of neural networks, with some private insights included. Be happy to critique and talk about any associated matters.

[中文版本(知乎)]

In case you are aware of reinforcement studying, you might already know what I’m speaking about. Making use of constraints to an n-dimensional vector appears tough, however you may break an n-dimensional vector into n outputs. Every time an output is generated, you may manually write the code to limit the motion area for the following variable to make sure its worth stays inside a possible area. This so-called “autoregressive” technique has apparent benefits: it’s easy and may deal with a wealthy number of constraints (so long as you may write the code). Nevertheless, its disadvantages are additionally clear: an n-dimensional vector requires n calls to the community’s ahead computation, which is inefficient; furthermore, this technique normally must be modeled as a Markov Choice Course of (MDP) and skilled by means of reinforcement studying, so widespread challenges in reinforcement studying similar to massive motion areas, sparse reward features, and lengthy coaching occasions are additionally unavoidable.

Within the area of fixing combinatorial optimization issues with neural networks, the autoregressive technique coupled with reinforcement studying was as soon as mainstream, however it’s presently being changed by extra environment friendly strategies.

Throughout coaching, a penalty time period may be added to the target operate, representing the diploma to which the present neural community output violates constraints. Within the conventional optimization subject, the Lagrangian twin technique additionally gives an analogous trick. Sadly, when utilized to neural networks, these strategies have to date solely been confirmed on some easy constraints, and it’s nonetheless unclear whether or not they’re relevant to extra advanced constraints. One shortcoming is that inevitably a few of the mannequin’s capability is used to learn to meet corresponding constraints, thereby limiting the mannequin’s means in different instructions (similar to optimization fixing).

For instance, Karalias and Loukas, NeurIPS’21 “Erdo˝s Goes Neural: an Unsupervised Studying Framework for Combinatorial Optimization on Graphs” demonstrated that the so-called “field constraints”, the place variable values lie between [a, b], may be discovered by means of a penalty time period, and the community can remedy some comparatively easy combinatorial optimization issues. Nevertheless, our additional research discovered that this technique lacks generalization means. Within the coaching set, the neural community can keep constraints nicely; however within the testing set, the constraints are nearly fully misplaced. Furthermore, though including a penalty time period in precept can apply to any constraint, it can not deal with tougher constraints. Our paper Wang et al, ICLR’23 “In direction of One-Shot Neural Combinatorial Optimization Solvers: Theoretical and Empirical Notes on the Cardinality-Constrained Case” discusses the above phenomena and presents the theoretical evaluation.

Alternatively, the design philosophy of generative fashions, the place outputs want to adapt to a particular distribution, appears extra suited to the “studying constraints” method. Solar and Yang, NeurIPS’23 “DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization” confirmed that Diffusion fashions can output options that meet the constraints of the Touring Salesman Drawback (i.e., can output an entire circuit). We additional introduced Li et al, NeurIPS’23 “T2T: From Distribution Studying in Coaching to Gradient Search in Testing for Combinatorial Optimization”, the place the generative mannequin (Diffusion) is chargeable for assembly constraints, with one other optimizer offering optimization steerage throughout the gradual denoising technique of Diffusion. This technique carried out fairly nicely in experiments, surpassing all earlier neural community solvers.

Possibly you’re involved that autoregressive is simply too inefficient, and generative fashions could not remedy your drawback. You is likely to be fascinated by a neural community that does just one ahead move, and the output wants to satisfy the given constraints — is that attainable?

The reply is sure. We are able to remedy a convex optimization drawback to venture the neural community’s output right into a possible area bounded by convex constraints. This system makes use of the property {that a} convex optimization drawback is differentiable at its KKT circumstances in order that this projection step may be thought to be an activation layer, embeddable in an end-to-end neural community. This system was proposed and promoted by Zico Kolter’s group at CMU, and so they presently supply the cvxpylayers package deal to ease the implementation steps. The corresponding convex optimization drawback is

the place y is the unconstrained neural community output, x is the constrained neural community output. As a result of the aim of this step is only a projection, a linear goal operate can obtain this (including an entropy regularizer can also be affordable). Axb are the linear constraints you’ll want to apply, which may also be quadratic or different convex constraints.

It’s a private word: there appear to be some identified points, and plainly this repository has not been up to date/maintained for a very long time (04/2024). I would really recognize it if anybody is prepared to analyze what’s going on.

Deriving gradients utilizing KKT circumstances is theoretically sound, however it can not sort out non-convex or non-continuous issues. In actual fact, for non-continuous issues, when adjustments in drawback parameters trigger resolution jumps, the true gradient turns into a delta operate (i.e., infinite on the bounce), which clearly can’t be utilized in coaching neural networks. Happily, there are some gradient approximation strategies that may sort out this drawback.

The Georg Martius group at Max Planck Institute launched a black-box approximation technique Vlastelica et al, ICLR’2020 “Differentiation of Blackbox Combinatorial Solvers”, which views the solver as a black field. It first calls the solver as soon as, then perturbs the issue parameters in a particular course, after which calls the solver once more. The residual between the outputs of the 2 solver calls serves because the approximate gradient. If this technique is utilized to the output of neural networks to implement constraints, we will outline an optimization drawback with a linear goal operate:

the place y is the unconstrained neural community output, and x is the constrained neural community output. The next step is to implement an algorithm to resolve the above drawback (not essentially to be optimum), after which it may be built-in into the black-box approximation framework. A downside of the black-box approximation technique is that it could actually solely deal with linear goal features, however a linear goal operate simply occurs to work if you’re in search of some strategies to implement constraints; furthermore, since it’s only a gradient approximation technique if the hyperparameters will not be well-tuned, it’d encounter sparse gradients and convergence points.

One other technique for approximating gradients includes utilizing a considerable amount of random noise perturbation, repeatedly calling the solver to estimate a gradient, as mentioned in Berthet et al, NeurIPS’2020 “Studying with Differentiable Perturbed Optimizers”. Theoretically, the gradient obtained this fashion must be much like the gradient obtained by means of the LinSAT technique (which will probably be mentioned within the subsequent part), being the gradient of an entropy-regularized linear goal operate; nevertheless, in observe, this technique requires numerous random samples, which is sort of impractical (at the very least on my use instances).

Whether or not it’s deriving gradients from KKT circumstances for convex issues or approximating gradients for non-convex strategies, each require calling/writing a solver, whereby the CPU-GPU communication may very well be a bottleneck as a result of most solvers are normally designed and applied for CPUs. Is there a strategy to venture particular constraints instantly on the GPU like an activation layer, with out fixing optimization issues explicitly?

The reply is sure, and our Wang et al, ICML’2023 “LinSATNet: The Optimistic Linear Satisfiability Neural Networks” presents a viable path and derives the convergence property of the algorithm. LinSAT stands for Linear SATisfiability Community.

LinSAT may be seen as an activation layer, permitting you to use normal constructive linear constraints to the output of a neural community.

Picture by writer

The LinSAT layer is absolutely differentiable, and the true gradients are computed by autograd, identical to different activation layers. Our implementation now helps PyTorch.

You’ll be able to set up it by

pip set up linsatnet

And get began with

from LinSATNet import linsat_layer

For those who obtain and run the supply code, you’ll discover a easy instance. On this instance, we apply doubly stochastic constraints to a 3×3 matrix.

To run the instance, first clone the repo:

git clone https://github.com/Thinklab-SJTU/LinSATNet.git

Go into the repo, and run the instance code:

cd LinSATNet
python LinSATNet/linsat.py

On this instance, we attempt to implement doubly-stochastic constraints to a 3×3 matrix. The doubly stochastic constraint implies that all rows and columns of the matrix ought to sum to 1.

The 3×3 matrix is flattened right into a vector, and the next constructive linear constraints are thought-about (for Ex=f):

E = torch.tensor(
[[1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 1, 1],
[1, 0, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 1]], dtype=torch.float32
)
f = torch.tensor([1, 1, 1, 1, 1, 1], dtype=torch.float32)

We randomly init w and regard it because the output of some neural networks:

w = torch.rand(9) # w may very well be the output of neural community
w = w.requires_grad_(True)

We even have a “ground-truth goal” for the output of linsat_layer, which is a diagonal matrix on this instance:

x_gt = torch.tensor(
[1, 0, 0,
0, 1, 0,
0, 0, 1], dtype=torch.float32
)

The ahead/backward passes of LinSAT observe the usual PyTorch type and are readily built-in into current deep studying pipelines.

The ahead move:

linsat_outp = linsat_layer(w, E=E, f=f, tau=0.1, max_iter=10, dummy_val=0)

The backward move:

loss = ((linsat_outp — x_gt) ** 2).sum()
loss.backward()

You may as well set E as a sparse matrix to enhance the time & reminiscence effectivity (particularly for large-sized enter). Here’s a dumb instance (think about to assemble E in sparse for one of the best effectivity):

linsat_outp = linsat_layer(w, E=E.to_sparse(), f=f, tau=0.1, max_iter=10, dummy_val=0)

We are able to additionally do gradient-based optimization over w to make the output of linsat_layer nearer to x_gt. That is what occurs whenever you practice a
neural community.

niters = 10
decide = torch.optim.SGD([w], lr=0.1, momentum=0.9)
for i in vary(niters):
x = linsat_layer(w, E=E, f=f, tau=0.1, max_iter=10, dummy_val=0)
cv = torch.matmul(E, x.t()).t() — f.unsqueeze(0)
loss = ((x — x_gt) ** 2).sum()
loss.backward()
decide.step()
decide.zero_grad()
print(f’{i}/{niters}n’
f’ underlying obj={torch.sum(w * x)},n’
f’ loss={loss},n’
f’ sum(constraint violation)={torch.sum(cv[cv > 0])},n’
f’ x={x},n’
f’ constraint violation={cv}’)

And you’re prone to see the loss lowering throughout the coaching steps.

For full API references, please take a look at the GitHub repository.

Warning, tons of math forward! You’ll be able to safely skip this half if you’re simply utilizing LinSAT.

If you wish to study extra particulars and proofs, please consult with the principle paper.

Right here we introduce the mechanism inside LinSAT. It really works by extending the Sinkhorn algorithm to a number of units of marginals (to our greatest data, we’re the primary to check Sinkhorn with multi-sets of marginals). The constructive linear constraints are then enforced by reworking the constraints into marginals.

Basic Sinkhorn with single-set marginals

Let’s begin with the basic Sinkhorn algorithm. Given non-negative rating matrix S with dimension m×n, and a set of marginal distributions on rows (non-negative vector v with dimension m) and columns (non-negative vector u with dimension n), the place

the Sinkhorn algorithm outputs a normalized matrix Γ with dimension m×n and values in [0,1] in order that

Conceptually, Γᵢ ⱼ means the proportion of uⱼ moved to vᵢ.

The algorithm steps are:

Notice that the above formulation is modified from the traditional Sinkhorn formulation. Γᵢ ⱼ uⱼ is equal to the weather within the “transport” matrix in papers similar to (Cuturi 2013). We favor this new formulation because it generalizes easily to Sinkhorn with multi-set marginals within the following.

To make a clearer comparability, the transportation matrix in (Cuturi 2013) is P with dimension m×n, and the constraints are

Pᵢ ⱼ means the actual mass moved from uⱼ to vᵢ.

The algorithm steps are:

Prolonged Sinkhorn with multi-set marginals

We uncover that the Sinkhorn algorithm can generalize to a number of units of marginals. Recall that Γᵢ ⱼ ∈ [0,1] means the proportion of uⱼ moved to vᵢ. Curiously, it yields the identical formulation if we merely change u, v with one other set of marginal distributions, suggesting the potential of extending the Sinkhorn algorithm to a number of units of marginal distributions. Denote that there are okay units of marginal distributions which are collectively enforced to suit extra difficult real-world eventualities. The units of marginal distributions are

and we have now:

It assumes the existence of a normalized Z ∈ [0,1] with dimension m×n, s.t.

i.e., the a number of units of marginal distributions have a non-empty possible area (you might perceive the that means of “non-empty possible area” after studying the following part about the best way to deal with constructive linear constraints). A number of units of marginal distributions may very well be collectively enforced by traversing the Sinkhorn iterations over okay units of marginal distributions. The algorithm steps are:

In our paper, we show that the Sinkhorn algorithm for multi-set marginals shares the identical convergence sample with the basic Sinkhorn, and its underlying formulation can also be much like the basic Sinkhorn.

Reworking constructive linear constraints into marginals

Then we present the best way to rework the constructive linear constraints into marginals, that are dealt with by our proposed multi-set Sinkhorn.

Encoding neural community’s output
For an l-length vector denoted as y (which may be the output of a neural community, additionally it’s the enter to linsat_layer), the next matrix is constructed

the place W is of dimension 2 × (l + 1), and β is the dummy variable, the default is β = 0. y is put on the upper-left area of W. The entropic regularizer is then enforced to manage discreteness and deal with potential destructive inputs:

The rating matrix S is taken because the enter of Sinkhorn for multi-set marginals.

From linear constraints to marginals

1) Packing constraint Axb. Assuming that there’s just one constraint, we rewrite the constraint as

Following the “transportation” view of Sinkhorn, the output x strikes at most b unit of mass from a, a, …, aₗ, and the dummy dimension permits the inequality by shifting mass from the dummy dimension. It is usually ensured that the sum of u equals the sum of v. The marginal distributions are outlined as

2 ) Protecting constraint Cx d. Assuming that there’s just one constraint, we rewrite the constraint as

We introduce the multiplier

as a result of we at all times have

(else the constraint is infeasible), and we can not attain a possible resolution the place all components in x are 1s with out this multiplier. Our formulation ensures that at the very least d unit of mass is moved from c₁, c, …, cₗ by x, thus representing the overlaying constraint of “higher than”. It is usually ensured that the sum of u_c equals the sum of v_c. The marginal distributions are outlined as

3) Equality constraint Ex = f. Representing the equality constraint is extra simple. Assuming that there’s just one constraint, we rewrite the constraint as

The output x strikes e₁, e, …, eₗ to f, and we want no dummy ingredient in uₑ as a result of it’s an equality constraint. It is usually ensured that the sum of uₑ equals the sum of vₑ. The marginal distributions are outlined as

After encoding all constraints and stacking them as a number of units of marginals, we will name the Sinkhorn algorithm for multi-set marginals to encode the constraints.

Experimental Validation of LinSAT

In our ICML paper, we validated the LinSATNet technique for routing constraints past the final case (used for fixing variants of the Touring Salesman Drawback), partial graph matching constraints (utilized in graph matching the place solely subsets of graphs match one another), and normal linear constraints (utilized in particular desire with portfolio optimization). All these issues may be represented with constructive linear constraints and dealt with utilizing the LinSATNet technique. In experiments, neural networks are able to studying the best way to remedy all three issues.

It must be famous that the LinSATNet technique can solely deal with constructive linear constraints, that means that it’s unable to deal with constraints like x₁ — x₂ ≤ 0 which comprise destructive phrases. Nevertheless, constructive linear constraints already cowl an enormous array of eventualities. For every particular drawback, the mathematical modeling is usually not distinctive, and in lots of instances, an inexpensive constructive linear formulation may very well be discovered. Along with the examples talked about above, let the community output natural molecules (represented as graphs, ignoring hydrogen atoms, contemplating solely the skeleton) can think about constraints similar to C atoms having not more than 4 bonds, O atoms having not more than 2 bonds.

Including constraints to neural networks has a variety of utility eventualities, and to date, a number of strategies can be found. It’s essential to notice that there isn’t a golden customary to evaluate their superiority over one another — one of the best technique is normally related to a sure situation.

After all, I like to recommend attempting out LinSATNet! Anyway, it is so simple as an activation layer in your community.

For those who discovered this text useful, please be at liberty to quote:

@inproceedings{WangICML23,
title={LinSATNet: The Optimistic Linear Satisfiability Neural Networks},
writer={Wang, Runzhong and Zhang, Yunhao and Guo, Ziao and Chen, Tianyi and Yang, Xiaokang and Yan, Junchi},
booktitle={Worldwide Convention on Machine Studying (ICML)},
yr={2023}
}

All aforementioned content material has been mentioned on this paper.



Supply hyperlink

More articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest article