1 Introduction
The problem of private information retrieval (PIR), since its introduction [1], has attracted significant attention from researchers in theoretical computer science, cryptography, information theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of $K$ available messages from $N$ non-communicating databases, each of which holds a copy of all the messages. User privacy needs to be preserved during message retrieval, which requires that the identity of the desired message not be revealed to any single database. To accomplish the task efficiently, good codes should be designed to download the least amount of data per bit of desired message; the inverse of this download cost is referred to as the capacity of the PIR system. This capacity problem in the classical setting was settled recently [2].
In practical systems, the databases may suffer from failures and are also constrained in storage space. Erasure codes can be used to improve both storage efficiency and failure resistance. This consideration motivated the investigation of PIR from MDS-coded databases [3, 4, 5, 6] with coding parameter $(N, T)$, i.e., the messages can be recovered by accessing any $T$ of the $N$ databases. The capacity of PIR from MDS-coded databases (MDS-PIR) was characterized [4] as
(1) $C = \left(1 + \frac{T}{N} + \frac{T^2}{N^2} + \cdots + \frac{T^{K-1}}{N^{K-1}}\right)^{-1}.$
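Assuming the capacity expression in (1) is the inverted geometric series $C = (1 + T/N + \cdots + (T/N)^{K-1})^{-1}$, it can be evaluated exactly with rational arithmetic; the function name below is illustrative only:

```python
from fractions import Fraction

def mds_pir_capacity(N: int, T: int, K: int) -> Fraction:
    """Capacity of PIR from (N, T) MDS-coded databases with K messages,
    i.e., the inverse of 1 + T/N + ... + (T/N)^(K-1), per equation (1)."""
    rho = Fraction(T, N)
    return 1 / sum(rho ** k for k in range(K))

# Sanity check against the closed form (1 - T/N) / (1 - (T/N)^K).
rho = Fraction(2, 3)
assert mds_pir_capacity(3, 2, 2) == (1 - rho) / (1 - rho ** 2)
print(mds_pir_capacity(3, 2, 2))  # 3/5
```

The download cost per retrieved message of a capacity-achieving code is then $L/C$.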
In a given code, the smallest required number of symbols in each message is called the message size $L$ (sometimes also referred to as the sub-packetization factor), which is an important factor impacting the practicality and efficiency of the code. The capacity-achieving code given in [4] requires a message size exponential in the number of messages $K$, which can be extremely large for a system with even a moderate number of messages. The problem of reducing the message size of capacity-achieving codes was recently considered by Xu and Zhang [6], who showed that under the assumption that all answers are of the same length, the message size must likewise grow exponentially in $K$. These existing results may have left the impression that capacity-achieving codes necessitate a message size exponential in the number of messages.
In this work, we show that the minimum message size for capacity-achieving PIR codes can in fact be significantly smaller than previously believed, by providing capacity-achieving linear codes with message size $\mathrm{lcm}(T, N-T)$. Two linear code constructions, referred to as Construction A and Construction B, respectively, are given. The two constructions have the same download cost and message size; however, Construction B has a better upload cost (i.e., a lower communication cost for the user to send the queries), at the expense of being slightly more sophisticated than Construction A. The key difference between the two proposed constructions and existing codes in the literature is that the proposed codes reduce the reliance on symmetry, and the answers may be of different lengths. We further show that this is in fact the minimum message size when the number of messages $K$ is sufficiently large; the proof requires a careful analysis of the converse proof of the information-theoretic MDS-PIR capacity. The code constructions and converse proof reflect a reverse-engineering approach which further extends [7, 8]; however, the analysis technique here is considerably more involved due to the integer constraints. Finally, we show that, when $K$ is small, it is in fact possible to design codes with a message size even smaller than $\mathrm{lcm}(T, N-T)$.
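Taking the proposed message size to be $\mathrm{lcm}(T, N-T)$, equivalently $T(N-T)/\gcd(N,T)$, note that it depends only on $N$ and $T$ and not on the number of messages; a quick sketch:

```python
from math import gcd

def message_size(N: int, T: int) -> int:
    """Message size lcm(T, N - T) of the proposed constructions; note that
    gcd(T, N - T) = gcd(N, T), so this equals T * (N - T) // gcd(N, T)."""
    return T * (N - T) // gcd(T, N - T)

# The size depends only on (N, T), not on the number of messages K.
print(message_size(3, 2))   # lcm(2, 1) = 2
print(message_size(10, 6))  # lcm(6, 4) = 12
```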
The rest of the paper is organized as follows. In Section 2, a formal problem definition is given. Construction A and Construction B are then given in Section 3 and Section 4, respectively, where the correctness and performance are also proved and analyzed. The optimality of the message size $\mathrm{lcm}(T, N-T)$ is established by first identifying several critical properties of capacity-achieving codes in Section 5.1, and then lower-bounding the minimum message size in Section 5.2. A special code is given in Section 5.3 to show that when $K$ is small, the message size can be even lower than $\mathrm{lcm}(T, N-T)$. Finally, Section 6 concludes the paper. Several technical proofs are relegated to the Appendix.
2 System Model
There are a total of $K$ mutually independent messages $W_1, W_2, \ldots, W_K$ in the system. Each message is uniformly distributed over $\mathbb{F}_q^L$, i.e., the set of length-$L$ sequences in the finite alphabet $\mathbb{F}_q$. All the messages can be collected and written as a single length-$KL$ row vector $W = (W_1, W_2, \ldots, W_K)$. Each message is MDS-coded and then distributed to the $N$ databases, such that from any $T$ databases the messages can be fully recovered. Since the messages are MDS-coded, it is without loss of generality to assume that $L = T\tilde{L}$ for some integer $\tilde{L}$. When a user wishes to retrieve a particular message $W_k$, queries are sent to the databases, where $Q_n^{[k]}$ is the query for database $n$. The retrieval needs to be information-theoretically private, i.e., no single database should be able to infer any knowledge as to which message is being requested. For this purpose, a random key $F$ in the set $\mathcal{F}$ is used together with the desired message index $k$ to generate the set of queries $(Q_0^{[k]}, Q_1^{[k]}, \ldots, Q_{N-1}^{[k]})$. Each query $Q_n^{[k]}$ belongs to the set of allowed queries for database $n$, denoted as $\mathcal{Q}_n$. After receiving query $Q_n^{[k]}$, database $n$ responds with an answer $A_n^{[k]}$. Each symbol in the answers also belongs to the finite field $\mathbb{F}_q$, and the answers may have multiple (and different numbers of) symbols. Using the answers from all the databases, together with $F$ and $k$, the user then reconstructs $W_k$.
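To make the interface concrete, here is a toy instantiation of the query/answer/reconstruction flow (not the proposed construction, and far from download-optimal): storage is simply replicated, which is the $T = 1$ special case of MDS coding, each database returns everything it stores, and privacy holds trivially because the queries do not depend on the requested index. All names and parameters below are illustrative only.

```python
import secrets

K, N, L = 3, 4, 2          # hypothetical: 3 messages, 4 databases, length-2 messages
Q_FIELD = 5                # work over the prime field F_5

# Messages, and (trivially) replicated storage at every database.
messages = [[secrets.randbelow(Q_FIELD) for _ in range(L)] for _ in range(K)]
storage = [messages for _ in range(N)]   # database n stores a full copy

def query(k: int, key: int) -> list:
    # Trivial scheme: the query is independent of k, hence trivially private.
    return [None] * N

def answer(n: int, q) -> list:
    # Each database returns every symbol it stores, flattened.
    return [sym for msg in storage[n] for sym in msg]

def reconstruct(answers, k: int) -> list:
    # Read off message k from any single database's answer.
    return answers[0][k * L : (k + 1) * L]

k = 1
key = secrets.randbelow(2 ** 16)         # random key F (unused by this toy scheme)
qs = query(k, key)
answers = [answer(n, qs[n]) for n in range(N)]
assert reconstruct(answers, k) == messages[k]
```

The point of the paper is precisely to do much better than this in download cost while keeping the same interface.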
A more rigorous definition of the linear information retrieval process we consider in this work can be specified by a set of coding matrices and functions as follows. For notational simplicity, we denote the cardinality of a set $\mathcal{A}$ as $|\mathcal{A}|$.
Definition 1.
A linear private information retrieval code from $N$ linearly $(N, T)$ MDS-coded databases (a linear MDS-PIR code) consists of the following coding components:

A set of MDS encoding matrices:
$\{G_n^{(k)} : n \in [0:N-1],\ k \in [1:K]\},$
where each $G_n^{(k)}$ is an $L \times \tilde{L}$ matrix in the alphabet $\mathbb{F}_q$ for encoding message $W_k$, and each collection $G_n = (G_n^{(1)}, \ldots, G_n^{(K)})$ encodes the messages into the information to be stored at database $n$, denoted as $S_n$;

A set of MDS decoding functions:
(2)
for each $\mathcal{T} \subseteq [0:N-1]$ such that $|\mathcal{T}| = T$, whose outputs are denoted as $\varphi_{\mathcal{T}}(S_n : n \in \mathcal{T})$;

A query function
$\phi : [1:K] \times \mathcal{F} \to \mathcal{Q}_0 \times \mathcal{Q}_1 \times \cdots \times \mathcal{Q}_{N-1},$
i.e., for retrieving message $W_k$, the user sends the query $Q_n^{[k]} = \phi_n(k, F)$ to database $n$;

An answer length function
(3)
i.e., the length of the answer from each database, a nonnegative integer, is a deterministic function of the query, but not of the particular realization of the messages;

An answer generating matrix
(4)
i.e., the answer $A_n(q)$ is obtained by applying this matrix to the stored content $S_n$, when $q$ is the query received by database $n$;

A reconstruction function
(5)
i.e., after receiving the answers, the user reconstructs the message as $\hat{W}_k = \psi(A_0^{[k]}, \ldots, A_{N-1}^{[k]}, k, F)$.
These functions satisfy the following three requirements:

MDS recoverable: For any $\mathcal{T} \subseteq [0:N-1]$ such that $|\mathcal{T}| = T$, we have $\varphi_{\mathcal{T}}(S_n : n \in \mathcal{T}) = W$.

Retrieval correctness: For any $k \in [1:K]$, we have $\hat{W}_k = W_k$.

Privacy: For every $k, k' \in [1:K]$, $n \in [0:N-1]$, and $q \in \mathcal{Q}_n$,
(6) $\Pr(Q_n^{[k]} = q) = \Pr(Q_n^{[k']} = q).$
Note that $A_n^{[k]}$ is in fact a random variable, since $F$ is the random key. It follows that even when the messages are viewed as deterministic, $A_n^{[k]}$ is still not deterministic. In contrast, for any specific query realization $q$, the corresponding answer $A_n(q)$ is deterministic when the messages are viewed as deterministic. The distinction between $A_n^{[k]}$ and $A_n(q)$ is indicated by the bracket $[\cdot]$ and the parenthesis $(\cdot)$. In order to measure the performance of an MDS-PIR code, we consider the following two metrics, with the focus on minimizing the latter while keeping the former optimal:

The expected download cost, i.e., the expected total length of the answers per retrieval,
(7)
which should be kept at its minimum $L/C$, i.e., the code should be capacity-achieving.

The message size $L$, which is the number of symbols needed to represent each individual message. This quantity should be minimized, because in practical applications, a smaller message size implies a more versatile code.
A third metric, the upload cost, is also of interest in practical systems (particularly in the computer science literature, e.g., [1]), although it is not our main focus in this work. The upload cost can be defined as
(8) $U = \sum_{n=0}^{N-1} \lceil \log_2 |\mathcal{Q}_n| \rceil,$
which is roughly the total number of bits that the user needs to send to the servers during the query phase.
We will need several more parameters before proceeding. Define $d \triangleq \gcd(N, T)$; then
(9) $T = d\,p, \qquad N = d\,q,$
for some positive integers $p$ and $q$, which are coprime.
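Assuming (9) denotes the reduced representation $T = dp$, $N = dq$ with $d = \gcd(N, T)$ and $p, q$ coprime (the symbol names are ours), these parameters can be computed as:

```python
from math import gcd

def reduced_parameters(N: int, T: int):
    """Return (d, p, q) with d = gcd(N, T), T = d * p, N = d * q, gcd(p, q) = 1."""
    d = gcd(N, T)
    return d, T // d, N // d

d, p, q = reduced_parameters(10, 6)
assert (d, p, q) == (2, 3, 5) and gcd(p, q) == 1
```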
3 New MDS-PIR Code: Construction A
In this section, we provide the first MDS-PIR code construction with message size $\mathrm{lcm}(T, N-T)$, which we refer to as Construction A.
3.1 The Coding Components of Construction A
Each message is divided into $L/T$ submessages, denoted as $W_{k,1}, W_{k,2}, \ldots$, and each submessage contains $T$ symbols in the alphabet $\mathbb{F}_q$. The construction relies on two novel ingredients: a new indexing of the key (query), and the introduction of pseudo code symbols.
The first novel ingredient in the construction, which is different from previous ones in the literature, is the random key $F$, which is a vector uniformly distributed in the set
(10)
where $\langle \cdot \rangle$ indicates the modulo operation. In this code construction, we first need to choose a (in fact, any) linear $(N, T)$ MDS code $\mathcal{C}$ in the alphabet $\mathbb{F}_q$ as our base code. There are many known techniques to construct such codes, such as Reed-Solomon codes and Cauchy-matrix-based constructions; see [9]. The coding functions can then be given as follows:

Each submessage $W_{k,m}$, $k \in [1:K]$ and $m \in [1:L/T]$, is encoded by $\mathcal{C}$ into $N$ coded symbols, with the $n$-th coded symbol placed at database $n$: the $n$-th column of the generator matrix of code $\mathcal{C}$, operated on each submessage, produces the stored information at database $n$.

The decoding function is naturally induced by that of $\mathcal{C}$.

For any $k \in [1:K]$, the query generating function produces a column vector as
(11)
Database $n$ first produces a query matrix
(12)
where $\mathbf{1}$ is the all-one column vector and $(\cdot)^{\mathsf{T}}$ indicates matrix transpose; the element of the query matrix on the $i$-th row and $j$-th column is denoted accordingly. The query length function is then defined as
(13)
where $\mathbb{1}(\cdot)$ is the indicator function, i.e., the answer length is the number of columns of the query matrix which have an element below the pseudo-symbol threshold.

The second novel ingredient, which is different from previous ones in the literature, is the introduction of pseudo code symbols and pseudo message symbols in the submessages, which are set to zero. For $n \in [0:N-1]$, an intermediate answer vector is formed as
(14)
each component of which is the finite-field addition of the components of the stored vector indicated by the corresponding column of the query matrix. The eventual answer, whose length is given by (13), is formed by concatenating the components of the intermediate answer which are not constantly zero, i.e., those corresponding to the positions indicated in (13).

For any $k$, define the interference database set $\mathcal{N}_I$ as the set of databases whose answer components carry no symbol of the desired message. Each component of the intermediate answer at such a database can be written as a linear combination of the undesired messages, where the combining row vector is determined by the corresponding column of the query matrix. Note that this interference is not a function of $k$. Thus, as long as $|\mathcal{N}_I| \geq T$, the interference vector can be fully recovered by the MDS property of the code $\mathcal{C}$; see Fig. 1 for an illustration. Further note that each component of the intermediate answer at the remaining databases can be written as
(15)
from which, since the interference is known, we can recover
(16)
the coded symbols of the desired message. As long as at least $T$ databases provide such symbols, we can recover the desired message by again invoking the property of the MDS code $\mathcal{C}$.
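Construction A only requires some linear $(N, T)$ MDS base code $\mathcal{C}$. A minimal stand-in (one of many valid choices, not necessarily the paper's) is a Reed-Solomon-style code obtained from a Vandermonde generator over a prime field with more than $N$ elements; the sketch below encodes a length-$T$ submessage into $N$ symbols and recovers it from any $T$ of them:

```python
# A minimal (N, T) MDS base code over a prime field, via a Vandermonde generator.
# This is only a stand-in for the generic base code C; any linear MDS code works.
P = 11  # field size; must exceed N

def inv(a):
    # Inverse in F_P by Fermat's little theorem.
    return pow(a, P - 2, P)

def encode(u, N):
    """Encode a length-T submessage u into N symbols: c_n = sum_i u_i * (n+1)^i."""
    return [sum(ui * pow(n + 1, i, P) for i, ui in enumerate(u)) % P for n in range(N)]

def decode(pairs, T):
    """Recover u from any T pairs (n, c_n) by solving the T x T Vandermonde
    system with Gauss-Jordan elimination over F_P."""
    A = [[pow(n + 1, i, P) for i in range(T)] + [c % P] for n, c in pairs]
    for col in range(T):
        piv = next(r for r in range(col, T) if A[r][col])  # nonzero pivot exists
        A[col], A[piv] = A[piv], A[col]
        f = inv(A[col][col])
        A[col] = [x * f % P for x in A[col]]
        for r in range(T):
            if r != col and A[r][col]:
                A[r] = [(x - A[r][col] * y) % P for x, y in zip(A[r], A[col])]
    return [A[i][T] for i in range(T)]

u = [3, 7]                                        # a length-T submessage, T = 2
c = encode(u, 5)                                  # N = 5 coded symbols
assert decode([(4, c[4]), (1, c[1])], 2) == u     # recover from any T symbols
```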
[Fig. 1: Illustration of the stored contents and answer symbols at databases 0, 1, and 2.]
3.2 An Example for Construction A
Let us first consider an example with $N = 3$ and $T = 2$, which induces message size $L = \mathrm{lcm}(T, N-T) = 2$ in the code. The queries and the answers are shown in Table 1, with the answers indicated by the encoded symbols. We omit the submessage index since each message consists of a single submessage here. Consider the case of retrieving message $W_1$ with a particular realization of the key. Then the queries are
(17)
The corresponding queries (and the query matrices they induce) and answers are marked in bold in Table 1. In these matrices, each marked column has at least one element below the threshold, and the marked columns determine the total number of transmitted symbols. It is seen that the interference symbol can be recovered from the answers by the MDS property, and thus canceled; similarly, the coded symbols of the desired message can then be recovered. Using both, we can recover the original message by decoding the MDS code $\mathcal{C}$.
3.3 Correctness, Privacy, and Download Cost
According to the last coding component (the reconstruction function) in Construction A, the correctness of the proposed code hinges on two conditions, which we establish in the following lemma; the proof can be found in Appendix A.
Lemma 1.
In Construction A, for any request of message $W_k$ and any random key $F$,

for any ;

for any .
We also have the following main theorem for Construction A.
Theorem 1.
Codes obtained by Construction A are both private and capacity-achieving.
Proof.
The fact that the code is private is immediate, by observing that $Q_n^{[k]}$ is uniformly distributed on the set given in
(18)
regardless of the value of $k$.
The expected total length of the answers is
(19)
assuming an arbitrary message $W_k$ is being requested. The probabilities involved in the summand depend on
(20)
By the definition of the answer length function, it is clear that if any item of the key-induced vector in (20) falls below the threshold, then the same holds for all databases, which induces the larger number of transmitted symbols in the retrieval from all the databases; this event occurs with a fixed probability. On the other hand, when this event does not occur, the number of elements below the threshold, and thus the number of transmitted symbols, is strictly smaller. Therefore
(21)
from which it follows that the code is indeed capacity-achieving, by taking into account (9). ∎
The following lemma on the upload cost is also immediate.
Lemma 2.
The upload cost of Construction A is given by evaluating (8) with the query sets $\mathcal{Q}_n$ above.
4 New MDS-PIR Code: Construction B
In this section, we provide an alternative code construction, namely Construction B. This construction incurs a lower upload cost than Construction A; however, it relies on two different coding strategies for the cases of high-rate codes and low-rate codes. The high-rate code construction is essentially built on a product code, while the low-rate construction bears more similarity to Construction A.
4.1 Construction B for High-Rate Codes
In this construction, the same random key as in Construction A is used, and the MDS encoding matrices and decoding functions are also exactly the same as in Construction A. We need a second generic linear MDS code $\mathcal{C}'$ in the alphabet $\mathbb{F}_q$ in this construction. Construction B essentially utilizes a product code with row code $\mathcal{C}$ and column code $\mathcal{C}'$ [9]. In this context, it is more convenient to view each message as being represented as a matrix, denoted as
(22) 
Next we provide the coding components of Construction B.

The query generating function at database $n$ produces the following query vector
(23)
where the mapping in (23) operates on a vector by operating on each component individually.

Define a query pattern matrix as
(24)
where the first row has a run of 1's followed by 0's, and each remaining row is obtained by cyclically shifting the leading elements of the first row while keeping the trailing elements in place. The query length function is then defined as
(25)
i.e., it is the number of columns in the pattern matrix, selected by the query vector, that have nonzero elements.

Recall that the coded message at database $n$ is a vector
(26)
In order to generate the answer, each such vector is encoded by $\mathcal{C}'$ into an intermediate code vector
(27)
where the generator matrix of the code $\mathcal{C}'$ is used. An intermediate answer vector is then produced as
(28)
The eventual answer, whose length is given by (25), is formed by concatenating the components of the intermediate answer which are not constantly zero, i.e., those indicated by (25).

For any $k$, define the interference database set $\mathcal{N}_I$ as in Construction A. For $n \in \mathcal{N}_I$, the symbols in the intermediate answer are pure interference, determined by a matrix which is not a function of $k$. Thus, as long as $|\mathcal{N}_I| \geq T$, the interference can be fully recovered by the MDS property of the code $\mathcal{C}$. Further note that the symbols in the intermediate answers of the remaining databases can be written as
(29)
from which, since the interference is known, we can recover
(30)
Consider the set of databases that provide at least the required number of symbols of the requested message in the form (30). For any such database, the corresponding column can be recovered by invoking the property of code $\mathcal{C}'$. Then, as long as sufficiently many columns are available, the message can be fully recovered by invoking the property of code $\mathcal{C}$.
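The cyclic-shift rule stated for the query pattern matrix in (24) can be sketched generically; the dimensions, the length of the initial run of 1's, and the size of the fixed tail are left as free parameters here, since their exact values depend on the code parameters:

```python
def pattern_matrix(rows: int, width: int, ones: int, fixed_tail: int):
    """Build a pattern matrix per the stated rule: the first row has `ones` 1's
    followed by 0's; each later row cyclically shifts the first
    (width - fixed_tail) entries of the previous row, keeping the last
    `fixed_tail` entries in place."""
    first = [1] * ones + [0] * (width - ones)
    M = [first]
    for _ in range(rows - 1):
        prev = M[-1]
        head, tail = prev[:width - fixed_tail], prev[width - fixed_tail:]
        M.append([head[-1]] + head[:-1] + tail)   # cyclic right shift of the head
    return M

for row in pattern_matrix(3, 5, 2, 1):
    print(row)
# [1, 1, 0, 0, 0]
# [0, 1, 1, 0, 0]
# [0, 0, 1, 1, 0]
```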
4.1.1 An Uncompressed Description of Construction B
The description of the coding components above is in a compressed form and offers little intuition. The following equivalent description, on the other hand, provides better intuition at the expense of more redundant items. Let the extended pattern matrix be defined as
(31)
i.e., by expanding the original pattern matrix with an appended constant block. The same query answer can now be equivalently produced at each database by using the following auxiliary query
(32)
i.e., without using the function mapping, and then following the same answer-generation procedure with the extended pattern matrix. The stored contents of each message across all the databases can be visualized as
(33)
where each column corresponds to a database, and the contents below the horizontal line are not stored physically, but can be generated as part of the answer computation.
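The product-code structure behind this visualization can be sketched with toy single-parity row and column codes standing in for $\mathcal{C}$ and $\mathcal{C}'$ (the real construction uses MDS codes); the appended parity row plays the role of the contents below the horizontal line, computed on the fly rather than stored:

```python
P = 11  # toy prime field

def encode_product(msg):
    """Toy product code over F_P: a single-parity row code and a single-parity
    column code, stand-ins for the MDS codes C and C'. Each column of the
    result would be held by one database; the last row is not stored but
    generated during answer computation."""
    rows = [r + [sum(r) % P] for r in msg]            # row code on each row
    parity_row = [sum(c) % P for c in zip(*rows)]     # column code on each column
    return rows + [parity_row]

cw = encode_product([[1, 2], [3, 4]])
print(cw)  # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```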
It is straightforward to verify that with the auxiliary query