1 Introduction
[Table 1: human-designed KG embedding models and the searched S2E model, categorized as semantic or structural: TransE, ComplEx, PTransE (add / multiply composition), RNN, ChainR, RSN, and S2E (NAS), a recurrent network searched by natural gradient descent. The expression column was lost in extraction; the table footnote states that the remaining symbol is a nonlinear activation function.]
Knowledge Graph (KG) [3, 30, 36], as a special kind of graph with many relational facts, is important to data mining and machine learning. In KGs, each relational fact is represented as a triplet in the form of (subject entity, relation, object entity), abbreviated as (s, r, o). KGs have gradually been employed in many knowledge-driven tasks, e.g., structured search [13], question answering [25] and recommendation [40]. However, the discrete representation of triplets cannot support these applications well [36]. KG embedding methods, which map the discrete entities and relations into continuous vector spaces, have recently emerged as a promising direction serving this purpose [7, 16, 31, 36]. The vector representations, i.e., embeddings, preserve the information in the original triplets and can be used together with other machine learning techniques [25, 40].

As a relational graph, the main focus in a KG is the semantic information, which models the interactions between entities and relations. When learning KG embeddings, the most fundamental issue is how to preserve the information in the relational triplets. TransE [7], a representative embedding model, interprets a triplet as a translation from subject to object, i.e., the embeddings should satisfy s + r ≈ o. Subsequent works, such as TransD [19], ComplEx [34], ConvE [12], etc., also learn the embeddings based on the single triplet (s, r, o). PTransE [23] extends TransE with multi-hop relation translation. By using manually selected paths, PTransE can learn compositional relation patterns, but fails to explore the structural information. These models mainly focus on the semantic information.
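For illustration, the translation assumption of TransE can be sketched as follows; the toy vectors and dimension are made up, not taken from any trained model:

```python
import numpy as np

def transe_score(s, r, o):
    """Negative L2 distance: a high score means s + r lands close to o."""
    return -np.linalg.norm(s + r - o)

# Toy 4-dimensional embeddings; for a true triplet of a well-trained
# TransE model, s + r should be close to o.
s = np.array([0.1, 0.2, 0.0, 0.3])
r = np.array([0.4, -0.1, 0.2, 0.0])
o_true = s + r + 0.01              # almost a perfect translation
o_false = np.array([1.0, 1.0, 1.0, 1.0])

assert transe_score(s, r, o_true) > transe_score(s, r, o_false)
```

The same interface accommodates the other single-triplet models mentioned above by swapping the scoring function.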
As KGs are graph-structured data, the structural information, which encodes the local topology among entities, has gradually attracted attention. GCN-Align [37] learns cross-lingual alignment by using a graph convolutional network to leverage the structural information. However, ignoring the semantic information prevents this model from fully capturing the properties of a KG. More recently, [16] proposed the recurrent skipping network (RSN) to process relational paths with RNNs. By sampling paths along the KG and incorporating both entity and relation embeddings, RSN is a promising model for exploiting structural information along with semantic information. However, it focuses more on learning the structural information along paths and does not adapt well to tasks where semantic information is more important.
Given a KG task or dataset, the compositional relation patterns it contains and the structure it forms are diverse [36]. How to exploit the structural information and combine it with semantic information for a specific KG task is non-trivial. Inspired by the success of neural architecture search (NAS) [14], we propose S2E in this paper, which can automatically combine Structural and Semantic information for KG Embeddings. Similar to [16], we generate paths to extract both structural and semantic information in KGs. Based on the paths, we define the path distiller, i.e., a recurrent network, to distill the structural information in the paths and combine it with the semantic information. By reviewing many human-designed KG embedding models, we set up the search space for the recurrent network, which is quite different from that of the general NAS-for-RNN problem. In order to search efficiently, we propose a natural-gradient-based search algorithm for the network architecture. Extensive experiments on entity alignment and link prediction tasks show the effectiveness of the designed search space and the efficiency of the search algorithm. The searched recurrent networks are novel and adapt well to different KG tasks by more effectively combining the structural and semantic information. The contributions of our work are summarized as follows:


We analyze the difficulty and importance of distilling structural information and combining it with semantic information for KGs. Based on this analysis, we formulate the problem as an NAS problem that targets searching recurrent networks specific to the KG domain.

We propose a domain-specific search space for the recurrent network. Different from searching RNN cells, the recurrent network in our space is specifically designed for KG tasks and covers many human-designed embedding models. This also sheds light on the design of future embedding models.

We identify the problems of adopting one-shot architecture search in our search space. These problems are rarely discussed in the NAS literature, and motivate us to design a natural-gradient-based search algorithm that provides pseudo gradients for architecture search, which is much more efficient than existing derivative-free NAS methods.

We conduct experiments on entity alignment and link prediction tasks. The empirical results demonstrate that the models searched by S2E outperform human-designed ones, and that the search algorithm is more efficient than other NAS baselines.
2 Related Work
Before presenting the related work, we first give the notations used in this paper. We denote vectors by lowercase boldface and matrices by uppercase boldface. A Knowledge Graph is defined by its set of entities, set of relations, and set of triplets. A single triplet (s, r, o) represents a relation r that links from the subject entity s to the object entity o. The embeddings in this paper are denoted by boldface letters of the corresponding indices, e.g., s, r, o are the embeddings of s, r, o respectively.
2.1 Knowledge Graph Embedding
Knowledge graph (KG) embedding aims to embed entities and relations into a low-dimensional vector space while preserving the relevant properties of the original KG. Given a single triplet (s, r, o), TransE [7] and its successors TransD [19], ComplEx [34], ConvE [12], etc., interpret the interactions among the embeddings s, r and o in different ways. PTransE [23] manually selects some composite relations with semantic reliability in the KG and models the sequence of relations instead of the single triplet. However, these works focus on learning semantic information, i.e., the interactions of entities with single or compositional relations.
Definition 1 (Relational Path [16]).
A path (of length L) is formed by a sequence of triplets (s_1, r_1, o_1), ..., (s_L, r_L, o_L) where o_t = s_{t+1} for all t < L.
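The chaining condition of Definition 1 can be checked in a few lines; the entity and relation names here are hypothetical:

```python
def is_relational_path(triplets):
    """Check Definition 1: the object of each triplet must be the
    subject of the next triplet in the sequence."""
    return all(o == s_next
               for (_, _, o), (s_next, _, _) in zip(triplets, triplets[1:]))

path = [("Paris", "capital_of", "France"),
        ("France", "part_of", "Europe")]
assert is_relational_path(path)                               # properly chained
assert not is_relational_path([("a", "r1", "b"),
                               ("c", "r2", "d")])             # broken chain
```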
Structural information, i.e., the topology among entities, has gradually attracted attention in learning KG embeddings. GCN-Align [37] incorporates a graph convolutional network (GCN) [21] to aggregate the neighbors of each entity to capture structural information. However, ignoring the semantic information among the triplets leads to inferior performance. IPTransE [41] iteratively aligns entities in two KGs based on PTransE to implicitly leverage the internal structural information. More recently, the recurrent skipping network (RSN) [16] was proposed to exploit long-term relational dependencies in KGs. The idea is to model relational paths (Definition 1), which contain both structural and semantic information, instead of single triplets. By processing paths with a basic RNN and a skip connection, RSN incorporates the structural information well, but is still limited in learning semantic information. The expressions of several human-designed models are given in Tab. 1, and a graphical illustration of the TransE, PTransE and RSN models is shown in Fig. 1. They have distinct connection patterns and composition operators, leading to different ways of dealing with structural and semantic information.
2.2 Neural Architecture Search (NAS)
NAS is a promising way to design reasonable neural network architectures for different tasks. It has two important perspectives [42, 14, 18]: (i) Search space: it defines which architectures can be represented in principle, e.g., CNNs or RNNs. The search space should be carefully designed for the specific scenario and cover human wisdom as special cases. (ii) Search algorithm: universal optimization tools are generally inefficient at finding good architectures. To make the optimization fast, algorithms should be designed to utilize properties of the search space and domain information of the NAS problem.

The search space for CNNs has developed from searching basic convolution units for the whole architecture [42] to searching repeated cells within a fixed macro architecture based on human-designed networks [24]. Many searched architectures have been reported to outperform human-designed CNNs in the literature. However, designing the search space for RNNs has attracted little attention, and the searched architectures do not outperform human-designed ones [24].
Recently, one-shot architecture search methods, e.g., DARTS [24], ASNG [1] and SNAS [39], have become the most popular NAS methods, as they can efficiently find good architectures. These methods directly construct a supernet, which contains all possible architectures spanned by the selected operations, and jointly optimize the network weights and the architecture parameters by stochastic gradient descent. However, their success rests on two conditions. First, the network weights can be shared, which means that directly training the supernet performs reasonably well [39] and that removing a good operation deteriorates the supernet's performance more than removing a bad one [5]. Second, the validation measurement needs to be differentiable; otherwise, even with a differentiable relaxation of the architectures, we cannot get gradient information to update them [1].

3 Search Problem
As discussed in Sec. 2.1, structural and semantic information are both important in KGs, and both can be captured by relational paths (Definition 1). However, given a specific KG dataset or task, the interactions between entities and relations and the degree distribution of entities are complex [29, 26, 36]. Designing a model that adapts well to different tasks and datasets is non-trivial. Single-triplet models, e.g., TransE and ComplEx, process each triplet in a path individually, so no structural information is utilized. PTransE aggregates relations rather than entities, and thus mainly focuses on semantics. ChainR and RSN recurrently process the entities and relations along paths. However, they place too much emphasis on the structural information and fail to learn the semantics well. Therefore, we are motivated to search the model architectures to adaptively distill structural information and combine it with semantic information.
3.1 Search Objective
The key problem now becomes how to leverage relational paths (Definition 1) to distill the structural information and combine it with semantic information. Based on existing embedding models, inspired by the success of RSN [16], and noting that a path is a sequence of triplets just as a sentence is a sequence of words [32], we model the paths with a recurrent function in Definition 2.
Definition 2 (Path Distiller).
A path distiller processes the embeddings of (s_1, r_1, o_1), ..., (s_L, r_L, o_L) recurrently. In each recurrent step t, the distiller combines the embeddings s_t, r_t and a distillation h_{t-1} of the preceding information to get an output y_t. The distiller is formulated as a recurrent function f:

(1)    (h_t, y_t) = f(h_{t-1}, s_t, r_t),

where the h_t's are hidden states of the recurrent steps and h_0 = 0. The output y_t should approach the object entity embedding o_t.
Specifically, at each step t, we focus on one triplet (s_t, r_t, o_t) to preserve the semantic information. The subject entity embedding s_t and relation embedding r_t are the inputs to the model. We use the hidden state h_t to combine structural and semantic information along the path. Besides, we collect an output y_t at each step and use it to predict the object entity of the current step. Therefore, how to form the recurrent function f for a given KG task is the key issue. As it is difficult to hand-design an f that adapts well to various KG tasks and datasets, and motivated by the recent success of NAS [14] in searching neural network architectures, we propose to search for the recurrent function f as an RNN. The network here is not a general RNN but one specific to KG embedding tasks. Therefore, searching for f is modeled as an NAS problem in Definition 3. We use α to denote the architecture of the recurrent network f.
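To make the recurrence concrete, here is a minimal sketch of running a path distiller over a relational path; the embedding table, the dimension, and the TransE-like choice of f are illustrative assumptions, not the searched model:

```python
import numpy as np

DIM = 4  # toy embedding dimension, an assumption for this sketch

def distill_path(path, emb, f):
    """Run a path distiller f over a relational path (cf. Definition 2).
    f maps (h_prev, s_emb, r_emb) -> (h_t, y_t); each output y_t should
    approach the embedding of the object entity o_t."""
    h = np.zeros(DIM)
    outputs = []
    for (s, r, o) in path:
        h, y = f(h, emb[s], emb[r])
        outputs.append((y, emb[o]))   # (prediction, target) per step
    return outputs

# A TransE-like instance of f: ignore the hidden state, add s and r.
f_add = lambda h, s, r: (h, s + r)
emb = {"a": np.ones(DIM), "r": np.ones(DIM), "b": 2 * np.ones(DIM)}
out = distill_path([("a", "r", "b")], emb, f_add)
assert np.allclose(out[0][0], out[0][1])  # here s + r equals o exactly
```

Swapping in a recurrent f that also reads h is what turns this from a semantic model into a structural one; the search space in Sec. 3.2 is over exactly such choices.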
Definition 3 (NAS Problem).
Let the training set be S_tra and the validation set be S_val. F(α; S_tra) returns the embeddings trained on S_tra with the path distiller f whose architecture is α, and M(F, S_val) measures the performance of the embeddings on S_val. The problem is to find an architecture for the path distiller such that the validation performance is maximized, i.e.,

(2)    α* = argmax_{α ∈ A} M(F(α; S_tra), S_val),

where A is the search space of α (i.e., it contains all possible architectures of f).
The training data S_tra is a set of paths. The validation and testing data are task-dependent. In the next part, we introduce how to design a proper search space, which should both cover human wisdom and allow efficient search.
3.2 Search Space
The human-designed KG embedding models in Tab. 1 have various forms. To deal with the structural information, the network should be able to model non-recurrent architectures covering TransE and ComplEx, architectures without entity embeddings like PTransE, and fully recurrent architectures like ChainR and RSN. To allow different ways of processing semantic information, we need to determine how to use different combinators, such as adding, multiplying, and the Hermitian product, to make different transformations in the embedding space. Therefore, choosing an appropriate space for the distillers is non-trivial. Based on the various models in Tab. 1, we design the search space in Fig. 2.


Connections: In the left part, we focus on structural information by controlling the connections. First, a single operator OP1 is used to combine the entity embedding s_t with the distilled recurrent information h_{t-1}. As discussed in Sec. 3.1, we should deal with different connection types to distill the structural information. Specifically, to allow non-recurrent structures like TransE, the hidden state need not be involved in the architecture; OP1 can be left unconnected so that the entity embedding is ignored, as in PTransE. Therefore, we select one out of four candidate vectors, i.e., the entity embedding, the hidden state, the output of OP1, or none, to combine with the relation embedding r_t in OP2, and likewise to produce the output in OP3.

Operations (Table 2): Once the connections are fixed, each operator, i.e., OP1, OP2 and OP3, chooses among different combination functions: adding, multiplying or the Hermitian product. To further enhance the distiller's learning ability, we add the gated operator GRU [10] as an additional combinator for OP1 and OP2. After the combination function, OP1 and OP2 choose an activation function from identity, tanh, sigmoid and relu to enable nonlinear transformations.
Table 2: Candidate combination functions and activation functions.

operator   combination functions                     activation functions
OP1, OP2   adding, multiplying,                      identity, tanh,
             Hermitian product, GRU                    sigmoid, relu
OP3        adding, multiplying, Hermitian product    identity
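For concreteness, the three combination functions can be sketched as follows; treating the first and second halves of a vector as real and imaginary parts for the Hermitian (ComplEx-style) product is our assumption about the embedding layout:

```python
import numpy as np

def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

def hermitian(x, y):
    """Elementwise complex multiplication: the first/second halves of
    each vector are taken as real/imaginary parts (layout assumed)."""
    xr, xi = np.split(x, 2)
    yr, yi = np.split(y, 2)
    return np.concatenate([xr * yr - xi * yi, xr * yi + xi * yr])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])
assert np.allclose(add(x, y), [6, 8, 10, 12])
assert np.allclose(multiply(x, y), [5, 12, 21, 32])
# (1+3i)(5+7i) = -16+22i and (2+4i)(6+8i) = -20+40i
assert np.allclose(hermitian(x, y), [-16, -20, 22, 40])
```

The GRU combinator adds trainable gates on top of such elementwise operations, which is why it is reserved for OP1 and OP2.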
Each edge is either a trainable square matrix or an identity matrix. To summarize, there are three or four candidate operators for OP1-OP3, four activation functions for OP1-OP2, and four different connections for OP2-OP3, resulting in tens of thousands of possible models. More details of the search space are given in Appendix B.

3.3 Comparison with NAS for RNN
When forming the recurrent network f, a natural option is to search an RNN cell using standard NAS methods [42]. Even though both the distiller here and RNN cells target learning to control the information flow between recurrent steps, there are quite a few differences between them (Tab. 3). First, the search space for RNNs focuses more on different operators with fixed connections, whereas the search space in Fig. 2 focuses on combining structural and semantic information in KGs. The search space in our work can be explicitly categorized to fit different connections, such as non-recurrent architectures, architectures without entity embeddings, fully recurrent architectures, etc. These domain-specific considerations make our space better regularized, and it generalizes many human-designed models. Besides, directly searching RNN cells has not shown an advantage over human-designed models [42, 24], since it lacks domain understanding of specific tasks. In contrast, S2E consistently finds better architectures for various KG tasks.
Table 3: Comparison between general NAS for RNN and S2E.

perspective    NAS for RNN                        S2E
space          general                            KG-specific
algorithm      one-shot NAS                       natural gradient
performance    worse than human-designed models   consistently better
4 Search Algorithm
In the previous section, we introduced the search space A, which contains tens of thousands of different models. Therefore, how to search efficiently in A is an important problem, and designing an appropriate optimization algorithm for the discrete architecture parameters is a big challenge.
4.1 Failure of One-Shot Architecture Search
One-shot architecture search has been widely used for NAS [24, 39, 1]; it jointly optimizes the architecture parameters in a supernet, i.e., a network containing all possible architectures spanned by the selected operations. Even though the problem of searching for the recurrent function f can be regarded as a special kind of NAS, i.e., an RNN search problem, these methods are not applicable here for the following two reasons:


P1: the validation performance measurement M is not differentiable w.r.t. the architecture parameters;

P2: sharing embedding parameters leads to problematic gradients.
Regarding the first point, training and evaluation are quite different in KG embedding models. In the training procedure, the objective is to maximize the scores of positive triplets and minimize those of negative ones, while during validation the ranking metric [7], which is discrete and non-differentiable, is generally used to measure performance. Thus, algorithms that rely on differentiable relaxations of operations, like DARTS [24] and SNAS [39], are not applicable here. Regarding the second point, sharing embedding parameters is problematic. Different from typical NAS applications, the inputs here are the learned embeddings, and an architecture's performance highly depends on the state of the embeddings (see Sec. 5.5).
4.2 Searching Architectures by Natural Gradient
As one-shot architecture search methods cannot be applied, we may turn to derivative-free optimization methods [11], such as reinforcement learning [42, 4] and Bayesian optimization [6]. However, they are generally too slow [11]. In the sequel, we propose a search algorithm based on the natural gradient (NG) [2, 27], which guides a more efficient search by generating pseudo gradients.

Let p_θ(α) be a distribution over architectures α. To enable the usage of NG, we first use stochastic relaxation [39, 1] to transform (2) into a maximization problem over θ, i.e.,
(3)    max_θ J(θ) = E_{α ~ p_θ(α)} [ M(F(α; S_tra), S_val) ],

where the performance is averaged over the distribution of α. Then, NG updates θ by θ ← θ + ρ I(θ)^{-1} ∇_θ J(θ), where ρ is the step-size, I(θ) is the Fisher matrix at θ, and

(4)    ∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} M(F(α_i; S_tra), S_val) ∇_θ ln p_θ(α_i),

where each architecture α_i is sampled i.i.d. from p_θ(α), and m is the number of samples used to approximate the expectation. The key observation that lets NG avoid problem P1 is that we do not need to take gradients w.r.t. α in (4), which would be non-differentiable; we only need the validation performance, which is directly measured by M. In this way, (4) offers pseudo gradients of α in the space parameterized by θ. Then, for problem P2, the embeddings are not shared across different α's; instead, each F(α_i; S_tra) in (4) is obtained by training the model with architecture α_i separately.
Besides, NG itself has several nice properties. As discussed in [2, 27], NG descent is parameterization-invariant, has good generalization ability, and can moreover be regarded as a second-order method in the space of parameters θ. Details of the NG descent used in our work are given in Appendix A. These facts, together with NG's ability to overcome problems P1 and P2, make NG an ideal choice here. In Sec. 5.4, we also show that the proposed NG-based search method is much more efficient than other derivative-free optimization methods.
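To illustrate the update, here is a toy sketch of natural-gradient ascent on a categorical distribution p_θ over three candidate "architectures", using the score-function estimate of (4) and the Fisher matrix of the softmax distribution; the reward values, sample size and step-size are made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ng_step(theta, reward_fn, m=200, rho=0.5):
    """One natural-gradient step on a categorical p_theta (softmax over
    theta): a score-function gradient estimate as in Eq. (4), then
    preconditioning with the (regularized) Fisher matrix."""
    p = np.exp(theta) / np.exp(theta).sum()
    samples = rng.choice(len(theta), size=m, p=p)
    rewards = np.array([reward_fn(a) for a in samples])
    # grad of log-softmax likelihood for sample a: one_hot(a) - p
    grads = np.eye(len(theta))[samples] - p
    g = (rewards[:, None] * grads).mean(axis=0)            # Eq. (4)
    fisher = np.diag(p) - np.outer(p, p) + 1e-6 * np.eye(len(theta))
    return theta + rho * np.linalg.solve(fisher, g)

# Toy 'validation performance': architecture 2 is the best of three.
reward = lambda a: [0.1, 0.3, 0.9][a]
theta = np.zeros(3)
for _ in range(50):
    theta = ng_step(theta, reward)
assert np.argmax(theta) == 2   # the distribution concentrates on the best
```

In our actual algorithm each reward evaluation corresponds to training the embeddings with a sampled architecture and measuring M on the validation set, which is far more expensive than this toy reward.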
Remark 4.1.
Table 4: Entity alignment results (H@1, H@10 in %, and MRR).

                          DBP-WD            DBP-YG            EN-FR             EN-DE
models                 H@1  H@10  MRR    H@1  H@10  MRR    H@1  H@10  MRR    H@1  H@10  MRR
semantic    TransE     28.4 51.4  0.36   27.0 57.4  0.37   16.2 39.0  0.24   40.3 60.9  0.47
            TransD*    27.7 57.2  0.37   17.3 41.6  0.26   21.1 47.9  0.30   24.4 50.0  0.33
            PTransE    16.7 40.2  0.25    7.4 14.7  0.10    7.3 19.7  0.12   27.0 51.8  0.35
structural  BootEA*    32.3 63.1  0.42   31.3 62.5  0.42   31.3 62.9  0.42   44.2 70.1  0.53
            IPTransE*  23.1 51.7  0.33   22.7 50.0  0.32   25.5 55.7  0.36   31.3 59.2  0.41
            ChainR     32.2 60.0  0.42   35.3 64.0  0.45   31.4 60.1  0.41   41.3 68.9  0.51
            RSN*       38.8 65.7  0.49   40.0 67.5  0.50   34.7 63.1  0.44   48.7 72.0  0.57
            NAS-RNN    38.3 65.1  0.47   37.4 67.1  0.47   33.4 62.8  0.43   46.0 72.4  0.54
S2E (proposed)         41.1 71.0  0.51   40.6 73.4  0.52   36.1 67.5  0.46   50.2 75.7  0.59
5 Experiments
5.1 Experiment setup
Following [15, 16], we sample relational paths via biased random walks (details in Appendix C.2). For each dataset, the paths make up the training set and are not resampled when evaluating different models.
We use two basic KG tasks, i.e., entity alignment and link prediction. Following the literature [7, 41, 16], we use ranking-based metrics: mean reciprocal ranking (MRR) and Hit@k. MRR is the average of the reciprocal ranks, i.e., (1/|S|) Σ_{i∈S} 1/rank_i, where S is the validation or testing set and rank_i is the rank of the i-th ground truth among the predictions. Hit@k is the percentage of ranks appearing in the top k, namely (1/|S|) Σ_{i∈S} I(rank_i ≤ k), where I(·) is the indicator function. To avoid underestimating the different models, we report performance in the "filtered" setting, i.e., all correct pairs or triplets that exist in the training, validation or testing set are filtered out [7]. All models are compared with the same embedding dimension.
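The two metrics follow directly from their definitions; the ranks below are toy values, not from the experiments:

```python
def mrr(ranks):
    """Mean reciprocal rank over a list of integer ranks (1 = best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    """Fraction of queries whose ground truth is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 2, 5, 10, 100]
assert abs(mrr(ranks) - (1 + 0.5 + 0.2 + 0.1 + 0.01) / 5) < 1e-12
assert hits_at(ranks, 10) == 0.8   # 4 of the 5 ranks are <= 10
```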
5.2 Comparison with Humandesigned models
Entity alignment
In this task, two KGs are given. They share a set of aligned entity pairs, which is split into training/validation/testing sets. During training, the paths can walk through the aligned entity pairs in the training set so that the two KGs interact, and we evaluate the cosine similarity of each entity pair in the validation or testing set. We use the four cross-lingual and cross-database subsets of DBpedia and Wikidata generated by [16]: DBP-WD, DBP-YG, EN-FR, EN-DE. Details are given in Appendix C.1.

In this task, structural information is more important, since aligned entities usually have similar local topologies. The comparison of the testing performance of the models searched by S2E and the human-designed ones is given in Tab. 4. We roughly categorize the human-designed models into "semantic" and "structural" based on their main focus. We observe that models leveraging structural information generally perform better than those focusing only on semantics. BootEA and IPTransE win over TransE and PTransE respectively by iteratively aligning discovered entity pairs to implicitly leverage the structural information. PTransE performs poorly since it places too much emphasis on learning compositional relations and ignores the entities' topology along paths. ChainR, RSN and NAS-RNN outperform BootEA and IPTransE by explicitly processing both entities and relations along paths. NAS-RNN wins over ChainR by adapting the RNN architecture, but is inferior to RSN due to the lack of domain-specific constraints. The models searched by S2E are given in Appendix C.4.
Link prediction
To further demonstrate the effectiveness of S2E, we conduct the link prediction task, a general testbed for single-triplet models like TransE, ComplEx and SimplE. In this task, an incomplete KG is given, and the target is to predict the missing entities in unknown links. The evaluation is conducted as follows: for each triplet (s, r, o) in the validation or testing set, we compute the score of (s, r, o') for every candidate entity o' and obtain the rank of the true o; similarly for s, based on the scores of (s', r, o) over all candidates s'. We use two well-known benchmark datasets, WN18RR [12] and FB15k-237 [33] (statistics in Appendix C.1), which are more realistic than their supersets WN18 and FB15k [7].
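A minimal sketch of the filtered ranking protocol described above; the entity names and scores are hypothetical:

```python
def filtered_rank(scores, true_o, known_objects):
    """Rank of the true object among all candidates, after filtering out
    the other objects known to be correct for this (s, r) pair
    ('filtered' setting). scores maps candidate entity -> model score
    (higher = better)."""
    candidates = {e: sc for e, sc in scores.items()
                  if e == true_o or e not in known_objects}
    ordering = sorted(candidates, key=candidates.get, reverse=True)
    return ordering.index(true_o) + 1

scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
# 'a' is another correct answer seen in training, so it is filtered out
# and does not push the true object 'b' down the ranking.
assert filtered_rank(scores, true_o="b", known_objects={"a", "b"}) == 1
```

The raw ("unfiltered") rank of "b" here would be 2, which illustrates why the filtered setting avoids underestimating the models.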
Table 5: Link prediction results (H@1, H@10 in %, and MRR).

            WN18RR              FB15k-237
models   H@1  H@10  MRR     H@1  H@10  MRR
TransE   12.5 44.5  0.18    17.3 37.9  0.24
ComplEx  41.4 49.0  0.44    21.7 47.5  0.30
PTransE  27.2 46.4  0.34    20.3 45.1  0.29
RSN      38.0 44.8  0.40    19.2 41.8  0.27
S2E      42.1 52.4  0.45    21.9 48.1  0.30
Due to the high computation cost of this task, we compare the different models with a small dimension size of 64. As shown in Tab. 5, PTransE outperforms TransE by modeling compositional relations, but is worse than ComplEx, because the adding operation is inferior to the Hermitian product when modeling the interactions between entities and relations [34]. RSN is worse than ComplEx since it pays more attention to structure than to semantics. In contrast, S2E can find models that perform better than the human-designed ones by adaptively searching the architectures.
5.3 Case Study
In this part, we use the synthetic dataset Countries [8], which contains 271 countries and regions and 2 relations, neighbor and located-in. The dataset contains three tasks: S1 and S2 require inferring 2-hop relations, while S3 is harder and requires modeling 3-hop relations.
To show the difficulty of designing a proper recurrent function for given tasks, we extract four patterns within the search space in Fig. 2 and search within each pattern individually to fit S1-S3. Specifically, as shown in Fig. 3,


P1 represents single-hop embedding models;

P2 processes 2 successive steps;

P3 models long relational paths without entity embeddings;

P4 includes both entities and relations along the path.
For each task in S1-S3, we randomly generate 100 models per pattern and record the model with the best area under the precision-recall curve (AUC-PR) [9] on the validation set. The procedure is repeated 5 times for each pattern, and the testing performance is given in Tab. 6. We can see that no single pattern performs well on all tasks. For the easy tasks S1 and S2, semantic information is more important; incorporating entity embeddings along paths, as P4 does, hinders the ability to directly model 2-hop relationships. The two-step pattern P2 consistently performs well on S1 and S2 since it fits these two tasks best. For the harder task S3, P3 and P4 win over the others since they can model longer steps. P4 slightly outperforms P3 by incorporating structural information along the path.
Table 6: AUC-PR on the Countries tasks (mean ± std over 5 repeats).

       S1             S2             S3
P1     0.998±0.001    0.997±0.002    0.918±0.031
P2     1.000±0.000    0.999±0.001    0.923±0.023
P3     0.992±0.001    1.000±0.000    0.944±0.016
P4     0.977±0.028    0.984±0.010    0.949±0.015
S2E    1.000±0.000    1.000±0.000    0.968±0.007
We evaluate the best model searched for each task. As shown in the last line of Tab. 6, the model searched by S2E achieves good performance on the hard task S3. Besides, S2E prefers different candidates for S1-S3 over the same search space (see Appendix C.4). This verifies our analysis that distilling structural information and combining it with semantic information is difficult, and that no unified model adapts well to all KG tasks and datasets.
5.4 Comparison with NAS Methods
We compare the proposed algorithm with other search algorithms. We use random search (Rand) as the simplest baseline. Reinforcement learning (RL) has a similar objective to (3), but only uses the first-order gradient. Besides, we use the tree-structured Parzen estimator (TPE) [6] as a representative Bayesian optimization method, which is widely used in architecture search. Finally, our natural-gradient-based search algorithm is denoted S2E.

We show the top-1 and top-5 mean MRR values w.r.t. the number of sampled models in Fig. 4 to demonstrate the effectiveness of the natural gradient. The entity alignment task is evaluated on DBP-WD, and link prediction uses WN18RR. All sampled models are trained to convergence, and the MRR value of each model is the converged value on the validation set. As shown, all heuristic algorithms, i.e., S2E, RL and TPE, outperform random search. Among them, S2E converges faster and performs better than the other search algorithms, as indicated by the blue lines. The reason is that using the REINFORCE gradient [38] in RL leads to suboptimal solutions when the distribution is not well parameterized, and TPE easily falls into local optima without warm-start training [6]. In comparison, the natural gradient we use is parameterization-invariant and can be regarded as a second-order method, and thus performs best.

Figure 4: Efficiency comparison of S2E with various NAS methods. Each model is evaluated with the same number of epochs. The entity alignment task uses DBP-WD; link prediction uses WN18RR.
5.5 Problem on Embedding Sharing
In this part, we empirically show the inherent problems of training a supernet with shared embedding and model parameters. When searching CNN architectures in the one-shot framework, parameters can be shared in the supernet so that the model can focus on the operations that are most useful for generating good predictions [5]. However, the input training set is fixed for all operations in a CNN, while KG embeddings keep changing. As shown in Fig. 5(a), the distributions of entity embeddings trained by two different KG embedding models are rather distinct. Simultaneously updating the embeddings with different operations leads to the distorted performance shown in Fig. 5(b). Besides, comparing different operators is unreliable when the embedding distributions differ. Combining the discussion in Sec. 4.1 with the empirical analysis in this part, we conclude that one-shot architecture search is not suitable for our search problem.
6 Conclusion
In this paper, we propose an NAS method, S2E, to search for RNNs that learn from relational paths, which contain both structural and semantic information, in KGs. By designing a specific search space based on human-designed KG embedding models, S2E can adaptively distill structural information and combine it with semantic information for different KG tasks. Since the one-shot architecture search framework is not applicable to this problem, we propose an NG-based search algorithm that is efficient compared with the other baselines.
References
 [1] (2019) Adaptive stochastic natural gradient method for oneshot neural architecture search. In ICML, pp. 171–180. Cited by: §2.2, §4.1, §4.2, Remark 4.1.
 [2] (1998) Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276. Cited by: §4.2, §4.2.
 [3] (2007) Dbpedia: a nucleus for a web of open data. In The semantic web, pp. 722–735. Cited by: §1.
 [4] (2017) Designing neural network architectures using reinforcement learning. In ICLR, Cited by: §4.2.
 [5] (2018) Understanding and simplifying oneshot architecture search. In ICML, pp. 549–558. Cited by: §2.2, §5.5.
 [6] (2011) Algorithms for hyperparameter optimization. In NeurIPS, pp. 2546–2554. Cited by: §4.2, §5.4, §5.4.
 [7] (2013) Translating embeddings for modeling multirelational data. In NeurIPS, pp. 2787–2795. Cited by: §1, §1, §2.1, §4.1, §5.1, §5.2.
 [8] (2015) On approximate reasoning capabilities of lowrank vector spaces. In AAAI Spring Symposium Series, Cited by: §5.3.

[9]
(2013)
Area under the precisionrecall curve: point estimates and confidence intervals
. In Joint ECMLKDD, pp. 451–466. Cited by: §5.3. 
[10]
(2015)
Gated feedback recurrent neural networks
. In ICML, pp. 2067–2075. Cited by: §B.2, 2nd item.  [11] (2009) Introduction to derivativefree optimization. Vol. 8, Siam. Cited by: §4.2.
 [12] (2017) Convolutional 2D knowledge graph embeddings. In AAAI, Cited by: §1, §2.1, §5.2.
 [13] (2014) Knowledge vault: a webscale approach to probabilistic knowledge fusion. In SIGKDD, pp. 601–610. Cited by: §1.
 [14] (2019) Neural architecture search: a survey.. JMLR 20 (55), pp. 1–21. Cited by: §2.2, §3.1.
 [15] (2016) Node2vec: scalable feature learning for networks. In SigKDD, pp. 855–864. Cited by: §C.2, §5.1.
 [16] (2019) Learning to exploit longterm relational dependencies in knowledge graphs. In ICML, Cited by: §C.1, §C.2, §C.3, §1, §1, §1, §2.1, §3.1, §5.1, §5.1, §5.2, Definition 1.
 [17] (2006) A closer look at skip-gram modelling. In LREC, pp. 1222–1225. Cited by: §C.2.
 [18] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. Note: In press, available at http://automl.org/book. Cited by: §2.2.
 [19] (2015) Knowledge graph embedding via dynamic mapping matrix. In ACL, Vol. 1, pp. 687–696. Cited by: §1, §2.1, Table 4.
 [20] (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §C.2.
 [21] (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Cited by: §2.1.

 [22] (2018) Canonical tensor decomposition for knowledge base completion. In ICML, Cited by: §C.2.
 [23] (2015) Modeling relation paths for representation learning of knowledge bases. Technical report arXiv:1506.00379. Cited by: §1, §2.1.
 [24] (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §2.2, §2.2, §3.3, §4.1, §4.1.
 [25] (2017) Neural networkbased question answering over knowledge graphs on word and character level. In WWW, pp. 1211–1220. Cited by: §1.
 [26] (2016) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §3.
 [27] (2013) Revisiting natural gradient for deep networks. Technical report arXiv:1301.3584. Cited by: §4.2, §4.2.
 [28] (2017) Automatic differentiation in pytorch. In ICLR, Cited by: §5.1.
 [29] (2014) Deepwalk: online learning of social representations. In SIGKDD, pp. 701–710. Cited by: §3.
 [30] (2013) Reasoning with neural tensor networks for knowledge base completion. In NeurIPS, pp. 926–934. Cited by: §1.
 [31] (2018) Bootstrapping entity alignment with knowledge graph embedding. In IJCAI, pp. 4396–4402. Cited by: §1, Table 4.
 [32] (2012) LSTM neural networks for language modeling. In CISCA, Cited by: §3.1.
 [33] (2015) Observed versus latent features for knowledge base and text inference. In Workshop on CVSMC, pp. 57–66. Cited by: §5.2.
 [34] (2017) Knowledge graph completion via complex tensor factorization. JMLR 18 (1), pp. 4735–4772. Cited by: §B.1, §B.1, Table 1, §1, §2.1, §5.2, Proposition 1.
 [35] (2008) Visualizing data using t-SNE. JMLR. Cited by: Figure 5.
 [36] (2017) Knowledge graph embedding: a survey of approaches and applications. TKDE 29 (12), pp. 2724–2743. Cited by: §1, §1, §3.
 [37] (2018) Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pp. 349–357. Cited by: §1, §2.1.
 [38] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ 8 (3–4), pp. 229–256. Cited by: §5.4.
 [39] (2018) SNAS: stochastic neural architecture search. In ICLR, Cited by: §2.2, §4.1, §4.1, §4.2.
 [40] (2016) Collaborative knowledge base embedding for recommender systems. In SIGKDD, pp. 353–362. Cited by: §1.
 [41] (2017) Iterative entity alignment via joint knowledge embeddings. In IJCAI, pp. 4258–4264. Cited by: §2.1, Table 4, §5.1.
 [42] (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §2.2, §2.2, §3.3, §4.2.
Appendix A NG details
To get the optimization direction of the architecture $\alpha$, we use the natural gradient (NG) to generate a pseudo gradient. The key point is to use stochastic relaxation to obtain differentiable feedback. Specifically, we introduce a parametric family $\{p_\theta : \theta \in \Theta\}$ of probability distributions defined on the architecture space $\mathcal{A}$. Assume that for all $\theta \in \Theta$, the distribution admits a density $p_\theta(\alpha)$ w.r.t. the reference measure $\mathrm{d}\alpha$ on $\mathcal{A}$, that the density can approach a Dirac-Delta distribution, and that it is differentiable w.r.t. $\theta$. Then we define the stochastic relaxation of the objective $f$ given $\theta$ as
$$J(\theta) = \int_{\mathcal{A}} f(\alpha)\, p_\theta(\alpha)\, \mathrm{d}\alpha,$$
leading to the maximization problem in (3).
Since $p_\theta$ can converge to a Dirac-Delta distribution $\delta(\alpha - \alpha^*)$, maximizing $J(\theta)$ leads to maximizing $f(\alpha)$. The gradient is computed as
$$\nabla_\theta J(\theta) = \int_{\mathcal{A}} f(\alpha)\, \nabla_\theta \ln p_\theta(\alpha)\, p_\theta(\alpha)\, \mathrm{d}\alpha = \mathbb{E}_{p_\theta}\!\left[ f(\alpha)\, \nabla_\theta \ln p_\theta(\alpha) \right],$$
where the well-known log-trick is used in this deduction.
Then the NG algorithm updates $\theta$ by
$$\theta_{t+1} = \theta_t + \rho\, F(\theta_t)^{-1} \nabla_\theta J(\theta_t), \qquad (5)$$
where $\rho$ is the step-size and $F(\theta_t)$ is the Fisher information matrix at $\theta_t$. Usually, computing $F(\theta_t)^{-1}$ is time consuming. To reduce the computation cost, the exponential family $p_\theta(\alpha) = h(\alpha) \exp\!\left( \eta(\theta)^\top T(\alpha) - A(\theta) \right)$, where $h$, $\eta$, $T$ and $A$ are known functions depending on the target distribution, is used to parameterize the distribution $p_\theta$. For simplicity, we set $h(\alpha) = 1$ and choose the expectation parameters $\theta = \mathbb{E}_{p_\theta}[T(\alpha)]$. Under this parameterization, the natural gradient $F(\theta)^{-1} \nabla_\theta J(\theta)$ reduces to $\mathbb{E}_{p_\theta}\!\left[ f(\alpha)\, (T(\alpha) - \theta) \right]$. In this work, since the architecture parameters are all categorical, for a component with $K$ categories we use a $(K-1)$-dimensional vector $\theta$ to model the distribution: the probability of sampling the $k$-th category ($k < K$) is $\theta_k$, and that of the $K$-th category is $1 - \sum_{k=1}^{K-1} \theta_k$. Note that the natural gradient is only used to update these $K-1$ dimensions of $\theta$, and $T(\alpha)$ is the one-hot vector (without dimension $K$) of the sampled category.
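As an illustration, one NG step for a single categorical component can be sketched as follows. This is a minimal plain-Python version that keeps the full $K$-dimensional probability vector and renormalizes, rather than using the $(K-1)$-dimensional parameterization described above; the function name and step size are hypothetical:

```python
def ng_update_categorical(theta, samples, rewards, rho=0.1):
    """One natural-gradient step for a categorical distribution.

    theta   : list of K probabilities (expectation parameters).
    samples : sampled category indices alpha.
    rewards : scalar feedbacks f(alpha) for each sample.

    With expectation parameters, the natural gradient is estimated by
    the Monte-Carlo average of f(alpha) * (one_hot(alpha) - theta).
    """
    K = len(theta)
    grad = [0.0] * K
    for a, r in zip(samples, rewards):
        for k in range(K):
            grad[k] += r * ((1.0 if k == a else 0.0) - theta[k])
    grad = [g / len(samples) for g in grad]
    new = [t + rho * g for t, g in zip(theta, grad)]
    # keep a valid probability vector: clip and renormalize
    new = [max(t, 1e-6) for t in new]
    s = sum(new)
    return [t / s for t in new]
```

Categories that receive larger rewards gain probability mass, so repeated updates concentrate the distribution on well-performing architecture choices.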
Appendix B Search Space Details
B.1 Hermitian product
The Hermitian product is defined on the complex space, the same as in [34]. Denote two complex vectors $\mathbf{u} = \mathrm{Re}(\mathbf{u}) + i\,\mathrm{Im}(\mathbf{u})$ and $\mathbf{v} = \mathrm{Re}(\mathbf{v}) + i\,\mathrm{Im}(\mathbf{v})$, where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary parts respectively. The element-wise Hermitian product is
$$\mathbf{u} \circ \bar{\mathbf{v}} = \big( \mathrm{Re}(\mathbf{u}) \circ \mathrm{Re}(\mathbf{v}) + \mathrm{Im}(\mathbf{u}) \circ \mathrm{Im}(\mathbf{v}) \big) + i \big( \mathrm{Im}(\mathbf{u}) \circ \mathrm{Re}(\mathbf{v}) - \mathrm{Re}(\mathbf{u}) \circ \mathrm{Im}(\mathbf{v}) \big),$$
where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\circ$ is the element-wise product.
Based on the above equation, we can define the Hermitian product in the real space. Given two real-valued vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ of the same dimension $d$, we evenly split both vectors into two parts, i.e. $\mathbf{x} = [\mathbf{x}_1; \mathbf{x}_2]$ and $\mathbf{y} = [\mathbf{y}_1; \mathbf{y}_2]$, where $\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}_1, \mathbf{y}_2$ have the same dimension $d/2$. Then
$$\mathbf{x} \star \mathbf{y} = [\, \mathbf{x}_1 \circ \mathbf{y}_1 + \mathbf{x}_2 \circ \mathbf{y}_2;\; \mathbf{x}_2 \circ \mathbf{y}_1 - \mathbf{x}_1 \circ \mathbf{y}_2 \,]. \qquad (6)$$
Thus, we get a $d$-dimensional composition vector $\mathbf{x} \star \mathbf{y}$.
Proposition 1.
Given a triplet $(s, r, o)$ and the embeddings $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{R}^d$, let the score be $(\mathbf{s} \star \mathbf{r})^\top \mathbf{o}$; then the scoring function is the same as that of ComplEx [34] (up to conjugation of the relation embedding, which is a free parameter).
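Under one common sign convention (composing $\mathbf{x}$ with the conjugate of $\mathbf{y}$), the real-space Hermitian product can be sketched as follows; the function name is hypothetical:

```python
def hermitian_product(x, y):
    """Real-space Hermitian product of two d-dim vectors (d even).

    x = [x1; x2] and y = [y1; y2] are read as complex vectors
    x1 + i*x2 and y1 + i*y2; the result stacks the real and imaginary
    parts of the element-wise product x * conj(y).
    """
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    y1, y2 = y[:d], y[d:]
    real = [a * c + b * e for a, b, c, e in zip(x1, x2, y1, y2)]
    imag = [b * c - a * e for a, b, c, e in zip(x1, x2, y1, y2)]
    return real + imag
```

For example, with x read as (1+3i, 2+4i) and y as (5+7i, 6+8i), the element-wise product with conj(y) is (26+8i, 44+8i), so the stacked output is [26, 44, 8, 8].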
B.2 Space size reduction
As discussed, there are many possible architectures. However, not all of them need to be trained and evaluated: we can filter out the combinations that are not promising.
Proposition 2.
When the zero vector $\mathbf{0}$ is connected to OP2 and the combinator in OP2 is multiply or the Hermitian product, the output of OP2 will be $\mathbf{0}$.
It is easy to see that multiplying with $\mathbf{0}$ leads to a zero output. Based on the introduction of the Hermitian product in B.1, OP2 will also output $\mathbf{0}$ if $\mathbf{0}$ is one of the inputs combined with the Hermitian product. In this case, the recurrent signal is removed and no information is propagated recurrently. The same holds for OP3, since multiplying with $\mathbf{0}$ also yields $\mathbf{0}$.
Besides, the GRU [10] unit already contains weighted edges as well as activation functions. As a function specially designed for recurrent structures, OP2 is only allowed to choose the GRU when a recurrent input, such as the output of OP1, is connected; otherwise, a recurrent architecture does not exist.
Based on the above analysis, we manually avoid evaluating the architectures that are not promising, as in Prop. 2, or that do not make sense, such as a non-recurrent connection of the GRU.
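A minimal sketch of such a filter, assuming a hypothetical dictionary encoding of an architecture (the field names `OP2_input`, `OP2_combinator`, etc. are illustrative, not the paper's actual encoding):

```python
def is_promising(arch):
    """Filter out architectures ruled out by Prop. 2 and the GRU rule."""
    for op in ("OP2", "OP3"):
        # multiply / Hermitian with a zero-vector input always outputs 0,
        # so no information would be propagated recurrently.
        if arch[op + "_input"] == "zero" and \
           arch[op + "_combinator"] in ("multiply", "hermitian"):
            return False
    # GRU is only meaningful in a recurrent wiring: require OP2's input
    # to come from the recurrent branch (OP1) when GRU is chosen.
    if arch["OP2_combinator"] == "gru" and arch["OP2_input"] != "OP1":
        return False
    return True
```

Applying such a predicate before training prunes the doomed candidates, so the search budget is spent only on architectures that can propagate information recurrently.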
Appendix C Experiment details
C.1 Datasets
The datasets we use for the entity alignment task are subsets of DBpedia, Wikidata and YAGO3. They are cross-lingual or cross-KG pairs, and detailed statistics are given in Tab. 7. They are the same as the Normal version in [16], which is very sparse.
The data for link prediction are WN18RR and FB15k-237, which are subsets of WN18 and FB15k respectively. Compared with their supersets, the two selected sets are more realistic.
Dataset | data source        | # relations | # triplets
DBP-WD  | DBpedia (English)  | 253         | 38,421
        | Wikidata (English) | 144         | 40,159
DBP-YG  | DBpedia (English)  | 219         | 33,571
        | YAGO3 (English)    | 30          | 34,660
EN-FR   | DBpedia (English)  | 211         | 36,508
        | DBpedia (French)   | 177         | 33,532
EN-DE   | DBpedia (English)  | 225         | 38,281
        | DBpedia (German)   | 118         | 37,069
Dataset   | # entities | # relations | # train | # valid | # test
WN18RR    | 40,943     | 11          | 86,835  | 3,034   | 3,134
FB15k-237 | 14,541     | 237         | 272,115 | 17,535  | 20,466
C.2 Path sampling
Following [15, 16], we sample paths using biased random walks. In a conventional random walk, all neighbors have the same probability of being sampled as the next step. In a biased random walk, we sample with larger probability the neighbors that go deeper or cross to another KG. For a single KG, we use one parameter to give the depth bias: with this probability the next step goes to a neighbor that is two steps away from the previous entity, and with the complementary probability it goes to the other neighbors. Similarly, for two KGs we use another parameter to give the cross-KG bias, encouraging the next step to jump to the other KG through the aligned entity pairs. Since aligned pairs are usually rare in the training set, encouraging cross-KG walks helps to learn better representations of the aligned entities in the two KGs.
Since the KGs are usually large, we sample two paths for each triplet. The length of the paths is 7 for the entity alignment task on two KGs and 3 for the link prediction task on a single KG. The paths are fixed for each dataset to keep the comparison among different models fair. The parameters used for sampling paths are summarized in Tab. 9.
Task             | depth bias | cross-KG bias | length
entity alignment | 0.9        | 0.9           | 7
link prediction  | 0.7        | –             | 3
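The biased next-step selection described above can be sketched as follows. This is a simplified single-KG version under assumptions: the graph is a plain adjacency dict, the function name is hypothetical, and the cross-KG bias is omitted:

```python
import random

def biased_next_step(graph, prev, current, depth_bias=0.9, rng=random):
    """Pick the next entity of a biased random walk.

    graph : dict mapping each entity to the set of its neighbors.
    prev  : the entity visited before `current`.
    With probability `depth_bias` we sample among the neighbors that go
    deeper (two hops away from `prev`), otherwise among the rest.
    """
    neighbors = list(graph[current])
    deeper = [n for n in neighbors if n not in graph[prev] and n != prev]
    shallow = [n for n in neighbors if n not in deeper]
    if deeper and shallow:
        pool = deeper if rng.random() < depth_bias else shallow
    else:
        pool = deeper or shallow  # fall back when one group is empty
    return rng.choice(pool)
```

A full sampler would iterate this step to the target path length and additionally boost jumps through aligned entity pairs in the two-KG setting.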
Once the paths are sampled, we train based on them. The loss for each path is given in (8):
$$\mathcal{L}_{\text{path}} = -\sum_{t} \log \frac{\exp\!\left( \mathbf{h}_t^\top \mathbf{o}_t \right)}{\sum_{e \in \mathcal{E}} \exp\!\left( \mathbf{h}_t^\top \mathbf{e} \right)}, \qquad (8)$$
where $\mathcal{E}$ is the entity set.
In each recurrent step $t$, we focus on one triplet $(s_t, r_t, o_t)$. The subject embedding and the relation embedding are processed along with the distilled recurrent information to get the output $\mathbf{h}_t$, which is encouraged to approach the object embedding $\mathbf{o}_t$. Thus, the objective in (8) is a multi-class log-loss in which the object is the true label. Compared with the skip-gram model [17] based on negative samples, the multi-class loss introduced by [22] is more stable. Training is based on mini-batch gradient descent, and Adam [20] is used to update the model parameters.
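The multi-class log-loss over a path can be sketched as follows. This is a plain-Python illustration with hypothetical names; a real implementation would score all entities with a single matrix product:

```python
import math

def path_loss(outputs, objects, entity_emb):
    """Multi-class log-loss over a sampled path.

    outputs    : list of hidden vectors h_t, one per recurrent step.
    objects    : list of the true object-entity ids for each step.
    entity_emb : dict id -> embedding vector; every entity is a class.
    """
    loss = 0.0
    ids = list(entity_emb)
    for h, obj in zip(outputs, objects):
        # dot-product score of h_t against every entity embedding
        scores = [sum(a * b for a, b in zip(h, entity_emb[e])) for e in ids]
        m = max(scores)  # shift for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss -= scores[ids.index(obj)] - log_z  # negative log-likelihood
    return loss
```

When the hidden vector is close to the true object's embedding the per-step loss approaches zero, and it grows as the score of a wrong entity dominates.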
C.3 Hyperparameters and training details
We use RSN [16] as a standard baseline to tune hyperparameters. We use the HyperOpt package, which is based on the Tree-structured Parzen Estimator, to search the learning rate, L2 penalty, decay rate, batch size, and a dropout rate applied on the input embeddings. The tuning ranges are given in Tab. 10.
hyperparam  ranges 

The embedding dimension is 256 for the entity alignment task and 64 for link prediction. After hyperparameter tuning, the proposed S2E starts searching with this hyperparameter setting. For all tasks and datasets, we search and evaluate 100 models. Among them, we select the best model indicated by the MRR performance on the validation set and fine-tune the hyperparameters for this model.
As for the search time, tuning hyperparameters takes about half a GPU day, and searching the model takes about two GPU days on the entity alignment datasets. Link prediction takes longer since the evaluation process is more expensive: one GPU day for hyperparameters, two GPU days for WN18RR and three GPU days for FB15k-237. Note that the cost of searching a better model is of the same order of magnitude as tuning hyperparameters, which indicates that our search space is reasonable and the search algorithm is efficient.
C.4 Searched models
Entity alignment
The models searched by S2E, which consistently perform best on each dataset, are graphically illustrated in Fig. 6. As shown, all of the searched recurrent networks process the recurrent information, the entity embedding and the relation embedding together, but they have different connections, composition operators and activation functions. The architectures learned on DBP-WD and EN-FR join two successive triplets to learn from a small neighborhood, while the other two process longer steps.
Link prediction
The best models searched for the link prediction tasks are given in Fig. 7. We also report in Tab. 11 statistics of the distance between the two entities of each triplet in the validation and testing sets, i.e. the shortest-path distance when regarding the KG as a simple undirected graph. As shown, most of the triplets have a distance of at most 3. Besides, as indicated by the performance on the two datasets, we infer that triplets more than 3 hops apart are very challenging to model; at least in this task, triplets within 3 hops are the main focus of the different models. This also explains why RSN, which processes long paths, does not perform well on the link prediction task. The searched models in Fig. 7 do not consider long-term structures: the architecture on WN18RR models two successive triplets, and the architecture on FB15k-237 only models a single triplet. They focus more on the semantics and rely on gradient descent for the interaction among triplets.
Dataset   | Split      | Hops: 1 | 2     | 3     | >3
WN18RR    | validation | 35.5%   | 8.8%  | 22.2% | 33.5%
          | testing    | 35.0%   | 9.3%  | 21.4% | 34.3%
FB15k-237 | validation | 0%      | 73.2% | 26.1% | 0.7%
          | testing    | 0%      | 73.4% | 26.8% | 0.8%
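The hop-distance statistics above can be computed with a breadth-first search over the KG viewed as a simple undirected graph; a minimal sketch (function name hypothetical):

```python
from collections import deque

def hop_distance(graph, src, dst):
    """Shortest-path distance in an undirected graph given as an
    adjacency dict; returns None when dst is unreachable from src."""
    if src == dst:
        return 0
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nb in graph.get(node, ()):
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return None
```

Running this for every validation/testing triplet (with the triplet's own edge removed if it appears in the training graph) and bucketing the results yields the percentages in Tab. 11.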
Countries
We also give a graphical illustration of the architectures searched on the Countries dataset in Fig. 8. The search procedure is conducted in the whole search space rather than within the four patterns P1-P4. As shown in Fig. 8, (a) belongs to P1, (b) belongs to P2 and (d) belongs to P4. These results further verify that S2E can adaptively search architectures for specific KG tasks and datasets.