Introduction

This repository / book describes the process for proposing changes to Graph Protocol in the form of RFCs and Engineering Plans.

It also includes all approved, rejected and obsolete RFCs and Engineering Plans. For more details, see the sections below.

RFCs

What is an RFC?

An RFC describes a change to Graph Protocol, for example a new feature. Any substantial change goes through the RFC process: the change is described in an RFC, proposed as a pull request to the rfcs repository, reviewed (currently by the core team), and ultimately either approved or rejected.

RFC process

1. Create a new RFC

RFCs are numbered, starting at 0001. To create a new RFC, create a new branch of the rfcs repository. Check the existing RFCs to identify the next number to use. Then, copy the RFC template to a new file in the rfcs/ directory. For example:

cp rfcs/0000-template.md rfcs/0015-fulltext-search.md

Write the RFC, commit it to the branch and open a pull request in the rfcs repository.

In addition to the RFC itself, the pull request must include the following changes:

  • a link to the RFC on the Approved RFCs page, and
  • a link to the RFC under Approved RFCs in SUMMARY.md.

2. RFC review

After an RFC has been submitted through a pull request, it is reviewed. At the time of writing, every RFC needs to be approved by

  • at least one Graph Protocol founder, and
  • at least one member of the core development team.

3. RFC approval

Once an RFC is approved, the RFC metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.

Approved RFCs

RFC-0001: Subgraph Composition

Author
Jannis Pohlmann
RFC pull request
https://github.com/graphprotocol/rfcs/pull/1
Obsoletes
-
Date of submission
2019-12-08
Date of approval
-
Approved by
-

Summary

Subgraph composition enables referencing, extending and querying entities across subgraph boundaries.

Goals & Motivation

The high-level goal of subgraph composition is to be able to compose subgraph schemas and data hierarchically. Imagine umbrella subgraphs that combine all the data from a domain (e.g. DeFi, job markets, music) through one unified, coherent API. This could allow reuse and governance at different levels and go all the way to the top, fulfilling the vision of The Graph.

The ability to reference, extend and query entities across subgraph boundaries enables several use cases:

  1. Linking entities across subgraphs.
  2. Extending entities defined in other subgraphs by adding new fields.
  3. Breaking down data silos by composing subgraphs and defining richer schemas without indexing the same data over and over again.

Subgraph composition is needed to avoid duplicated work, both in developing subgraphs and in indexing them. It is an essential part of the overall vision behind The Graph, as it allows combining isolated subgraphs into a complete, connected graph of the (decentralized) world's data.

Subgraph developers will benefit from the ability to reference data from other subgraphs, saving them development time and enabling richer data models. dApp developers will be able to leverage this to build more compelling applications. Node operators will benefit from subgraph composition by having better insight into which subgraphs are queried together, allowing them to make more informed decisions about which subgraphs to index.

Urgency

Due to the high impact of this feature and its important role in fulfilling the vision behind The Graph, it would be good to start working on this as early as possible.

Terminology

The feature is referred to as query-time subgraph composition, or subgraph composition for short.

Terms introduced and used in this RFC:

  • Imported schema: The schema of another subgraph from which types are imported.
  • Imported type: An entity type imported from another subgraph schema.
  • Extended type: An entity type imported from another subgraph schema and extended in the subgraph that imports it.
  • Local schema: The schema of the subgraph that imports from another subgraph.
  • Local type: A type defined in the local schema.

Detailed Design

The sections below make the assumption that there is a subgraph with the name ethereum/mainnet that includes an Address entity type.

Composing Subgraphs By Importing Types

In order to reference entity types from another subgraph, a developer would first import these types from the other subgraph's schema.

Types can be imported either from a subgraph name or from a subgraph ID. Importing from a subgraph name means that the exact version of the imported subgraph will be identified at query time and its schema may change in arbitrary ways over time. Importing from a subgraph ID guarantees that the schema will never change but also means that the import points to a subgraph version that may become outdated over time.

Let's say a DAO subgraph contains a Proposal type that has a proposer field that should link to an Ethereum address (think: Ethereum accounts or contracts) and a transaction field that should link to an Ethereum transaction. The developer would then write the DAO subgraph schema as follows:

type _Schema_
  @import(
    types: ["Address", { name: "Transaction", as: "EthereumTransaction" }],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
  transaction: EthereumTransaction!
}

This would then allow queries that follow the references to addresses and transactions, like

{
  proposals { 
    proposer {
      balance
      address
    }
    transaction {
      hash
      block {
        number
      }
    }
  }
}

Extending Types From Imported Schemas

Extending types from another subgraph involves several steps:

  1. Importing the entity types from the other subgraph.
  2. Extending these types with custom fields.
  3. Managing (e.g. creating) extended entities in subgraph mappings.

Let's say the DAO subgraph wants to extend the Ethereum Address type to include the proposals created by each respective account. To achieve this, the developer would write the following schema:

type _Schema_
  @import(
    types: ["Address"],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
}

extend type Address {
  proposals: [Proposal!]! @derivedFrom(field: "proposer")
}

This makes queries like the following possible, where the query can go "back" from addresses to proposal entities, despite the Ethereum Address type originally being defined in the ethereum/mainnet subgraph.

{
  addresses {
    id
    proposals {
      id
      proposer {
        id
      }
    }
  }
}

In the above case, the proposals field on the extended type is derived, which means that an implementation wouldn't have to create a local extension type in the store. However, if proposals was defined as

extend type Address {
  proposals: [Proposal!]!
}

then the subgraph mappings would have to create partial Address entities with id and proposals fields for all addresses from which proposals were created. At query time, these entity instances would have to be merged with the original Address entities from the ethereum/mainnet subgraph.
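
To illustrate the kind of merge this implies, here is a minimal sketch (not part of the RFC) of combining a locally stored partial entity with the original entity from the imported subgraph; the Entity alias and merge_extension function are hypothetical stand-ins for the real store types:

use std::collections::BTreeMap;

// Hypothetical, simplified entity representation: field name -> value
// (the real store uses a richer value type).
type Entity = BTreeMap<String, String>;

// Merge a locally stored partial entity (e.g. an extended Address holding only
// id and proposals) into the original entity fetched from the imported subgraph.
// Locally defined fields win on conflict.
fn merge_extension(original: &Entity, extension: &Entity) -> Entity {
    let mut merged = original.clone();
    for (field, value) in extension {
        merged.insert(field.clone(), value.clone());
    }
    merged
}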

Subgraph Availability

In the decentralized network, queries will be split and routed through the network based on which indexers are available and which subgraphs they index. At that point, failure to find an indexer for a subgraph that types were imported from will result in a query error. Such an error, where a non-nullable field resolves to null, bubbles up to the next nullable parent, in accordance with the GraphQL spec.

Until the network is a reality, we are dealing with individual Graph Nodes, and querying subgraphs whose imported entity types are not also indexed on the same node should be handled with more tolerance. This RFC proposes that entity reference fields referring to imported types are made optional in the generated API schema. If the subgraph that a type is imported from is not available on a node, such fields should resolve to null.

Interfaces

Subgraph composition also supports interfaces in the ways outlined below.

Interfaces Can Be Imported From Other Subgraphs

The syntax for this is the same as that for importing types:

type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })

Local Types Can Implement Imported Interfaces

This is achieved by importing the interface from another subgraph schema and implementing it in entity types:

type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })

type MyToken implements ERC20 @entity {
  # ...
}

Imported Types Can Be Extended To Implement Local Interfaces

This is achieved by importing the types from another subgraph schema, defining a local interface and using extend to implement the interface on the imported types:

type _Schema_
  @import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })

interface Token {
  id: ID!
  balance: BigInt!
}

extend type LPT implements Token {
  # ...
}
extend type Rep implements Token {
  # ...
}

Imported Types Can Be Extended To Implement Imported Interfaces

This is a combination of importing an interface, importing the types and extending them to implement the interface:

type _Schema_
  @import(types: ["Token"], from: { name: "graphprotocol/token" })
  @import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })

extend type LPT implements Token {
  # ...
}
extend type Rep implements Token {
  # ...
}

Implementation Concerns For Interface Support

Querying across types from different subgraphs that implement the same interface may require a smart algorithm, especially when it comes to pagination. For instance, if the first 1000 entities for an interface are queried, this range of 1000 entities may be divided up between different local and imported types arbitrarily.

A naive algorithm could request 1000 entities from each subgraph, applying the selected filters and order, combine the results, and cut off everything after the first 1000 items. This would generate a minimal number of requests but would involve significant overfetching.

Another algorithm could fetch just the first item from each subgraph and, based on that information, divide up the range in more optimal ways than the previous algorithm, satisfying the query with more requests but less overfetching.
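
As a rough sketch of the naive combine-and-truncate strategy described above (names and types are illustrative, not taken from the implementation), assuming each source already returns its matches in query order:

// Illustrative only: each inner Vec holds the first `first` matching entities
// from one local or imported type, already sorted by the query's order.
struct InterfaceEntity {
    sort_key: String,
    // ... other fields elided
}

fn naive_paginate(per_source: Vec<Vec<InterfaceEntity>>, first: usize) -> Vec<InterfaceEntity> {
    let mut all: Vec<InterfaceEntity> = per_source.into_iter().flatten().collect();
    // Apply the query's order across all sources, then cut off after `first` items.
    all.sort_by(|a, b| a.sort_key.cmp(&b.sort_key));
    all.truncate(first);
    all
}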

Compatibility

Subgraph composition is a purely additive, non-breaking change. Existing subgraphs remain valid without any migrations being necessary.

Drawbacks And Risks

Reasons that could speak against implementing this feature:

  • Schema parsing and validation become more complicated. In particular, validation of imported schemas may not always be possible, depending on whether and when the referenced subgraph is available on the Graph Node.

  • Query execution becomes more complicated. The subgraph a type belongs to must be identified, and local as well as imported versions of extended entities have to be queried separately and merged before returning data to the client.

Alternatives

No alternative designs have been considered in detail.

There are other ways to compose subgraph schemas using GraphQL technologies such as schema stitching or Apollo Federation. However, schema stitching is being deprecated, and Apollo Federation requires a centralized server to extend and merge the GraphQL APIs. Both of these solutions also slow down queries.

Another reason not to use these is that GraphQL will only be one of several query languages supported in the future. Composition therefore has to be implemented in a query-language-agnostic way.

Open Questions

  • Right now, interfaces require unique IDs across all the concrete entity types that implement them. This is not something we can guarantee any longer if these concrete types live in different subgraphs. So we have to handle this at query time (or must somehow disallow it, returning a query error).

    It is also unclear what an individual interface entity lookup would look like if IDs are no longer guaranteed to be unique:

    someInterface(id: "?????") {
    }
    

RFC-0002: Ethereum Tracing Cache

Author
Zac Burns
RFC pull request
https://github.com/graphprotocol/rfcs/pull/4
Obsoletes (if applicable)
None
Date of submission
2019-12-13
Date of approval
2019-12-20
Approved by
Jannis Pohlmann

Summary

This RFC proposes the creation of a local Ethereum tracing cache to speed up indexing of subgraphs which use block and/or call handlers.

Motivation

When indexing a subgraph that uses block and/or call handlers, it is necessary to extract calls from the trace of each block that a Graph Node indexes. It is expensive to acquire and process traces from Ethereum nodes in both money and time.

When developing a subgraph it is common to make changes and deploy those changes to a production Graph Node for testing. Each time a change is deployed, the Graph Node must re-sync the subgraph using the same traces that were used for the previous sync of the subgraph. The cost of acquiring the traces each time a change is deployed impacts a subgraph developer's ability to iterate and test quickly.

Urgency

None

Terminology

Ethereum cache: The new API proposed here.

Detailed Design

There is an existing EthereumCallCache for caching eth_call results built into Graph Node today. This cache will be extended to support traces and renamed to EthereumCache.
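
As a rough illustration of the resulting API (all names and signatures below are hypothetical stand-ins, not the actual Graph Node interface), the extended cache might expose trace lookups alongside the existing call cache:

use std::ops::RangeInclusive;

// Hypothetical stand-ins for types that exist in Graph Node / web3.
type ContractAddress = [u8; 20];
struct Trace;
struct CacheError;

// Illustrative shape only: the existing eth_call cache plus trace lookups.
pub trait EthereumCache {
    // Existing behaviour: cached eth_call results, keyed by contract,
    // encoded call data and block number.
    fn cached_call(&self, contract: ContractAddress, call_data: &[u8], block: u64)
        -> Result<Option<Vec<u8>>, CacheError>;

    // New behaviour: cached traces for a range of blocks, optionally
    // filtered by contract address.
    fn cached_traces(&self, contract: Option<ContractAddress>, blocks: RangeInclusive<u64>)
        -> Result<Option<Vec<Trace>>, CacheError>;
}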

Compatibility

This change is backwards compatible. Existing code can continue to use the parity tracing API. Because the cache is local, each indexing node may delete the cache should the format or implementation of caching change. In the case of an invalidated cache, the code will fall back to the existing methods for retrieving a trace and repopulate the cache.

Drawbacks and Risks

Subgraphs which are not being actively developed will incur the overhead of storing traces, but will never reap the benefit of reading them back from the cache.

If this drawback is significant, it may be necessary to extend EthereumCache to provide a custom score for cache invalidation other than the current date. For example, trace_filter calls could be invalidated based on the latest update time of the subgraph requiring the trace. It is expected that a subgraph which has been updated recently is more likely to be updated again soon than a subgraph which has not been recently updated.

Alternatives

None

Open Questions

None

Obsolete RFCs

Obsolete RFCs are moved to the rfcs/obsolete directory in the rfcs repository. They are listed below for reference.

  • No RFCs have been obsoleted yet.

Rejected RFCs

Rejected RFCs can be found by filtering open and closed pull requests in the rfcs repository by the rejected label.

Engineering Plans

What is an Engineering Plan?

Engineering Plans are plans to turn an RFC into an implementation in the core Graph Protocol tools like Graph Node, Graph CLI and Graph TS. Every substantial development effort that follows an RFC is planned in the form of an Engineering Plan.

Engineering Plan process

1. Create a new Engineering Plan

Like RFCs, Engineering Plans are numbered, starting at 0001. To create a new plan, create a new branch of the rfcs repository. Check the existing plans to identify the next number to use. Then, copy the Engineering Plan template to a new file in the engineering-plans/ directory. For example:

cp engineering-plans/0000-template.md engineering-plans/0015-fulltext-search.md

Write the Engineering Plan, commit it to the branch and open a pull request in the rfcs repository.

In addition to the Engineering Plan itself, the pull request must include the following changes:

  • a link to the Engineering Plan on the Approved Engineering Plans page, and
  • a link to the Engineering Plan under Approved Engineering Plans in SUMMARY.md.

2. Engineering Plan review

After an Engineering Plan has been submitted through a pull request, it is reviewed. At the time of writing, every Engineering Plan needs to be approved by

  • the Tech Lead, and
  • at least one member of the core development team.

3. Engineering Plan approval

Once an Engineering Plan is approved, the Engineering Plan metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.

Approved Engineering Plans

PLAN-0001: GraphQL Query Prefetching

Author
David Lutterkort
Implements
No RFC - no user visible changes
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/2
Date of submission
2019-11-27
Date of approval
2019-12-10
Approved by
Jannis Pohlmann, Leo Yvens

This is not really a plan, as it was written and discussed before we adopted the RFC process, but it contains important implementation details of how we process GraphQL queries.

Contents

Implementation Details for prefetch queries

Goal

For a GraphQL query of the form

query {
  parents(filter) {
    id
    children(filter) {
      id
    }
  }
}

we want to generate only two SQL queries: one to get the parents, and one to get the children for all those parents. The fact that children is nested under parents requires that we add a filter to the children query that restricts children to those related to the parents we fetched in the first query. How exactly we filter the children query depends on how the relationship between parents and children is modeled in the GraphQL schema, and on whether one (or both) of the types involved are interfaces.

The rest of this writeup is concerned with how to generate the query for children, assuming we already retrieved the list of all parents.

The bulk of the implementation of this feature can be found in graphql/src/store/prefetch.rs, store/postgres/src/jsonb_queries.rs, and store/postgres/src/relational_queries.rs.

Handling first/skip

We never get all the children for a parent; instead we always have a first and skip argument in the children filter. Those arguments need to be applied to each parent individually by ranking the children for each parent according to the order defined by the children query. If the same child matches multiple parents, we need to make sure that it is considered separately for each parent as it might appear at different ranks for different parents. In SQL, we use the rank() window function for this:

select *
  from (
    select c.*,
           rank() over (partition by parent_id order by ...) as pos
      from (query to get children) c)
 where pos > skip and pos <= skip + first

Handling interfaces

If parents or children (or both) are interfaces, we resolve the interfaces into the concrete types implementing them, produce a query for each combination of parent/child concrete type and combine those queries via union all.

Since implementations of the same interface will generally differ in the schema they use, we can not form a union all of all the data in the tables for these concrete types, but have to first query only attributes that we know will be common to all entities implementing the interface, most notably the vid (a unique identifier that identifies the precise version of an entity), and then later fill in the details of each entity by converting it directly to JSON.

That means that when we deal with children that are an interface, we will first select only the following columns (where exactly they come from depends on how the parent/child relationship is modeled)

select '{__typename}' as entity, c.vid, c.id, parent_id

and form the union all of these queries. We then use that union to rank children as described above.

Handling parent/child relationships

How we get the children for a set of parents depends on how the relationship between the two is modeled. The interesting parameters there are whether parents store a list or a single child, and whether that field is derived, together with the same for children.

There are a total of 16 combinations of these four boolean variables; four of them, when both parent and child derive their fields, are not permissible. It also doesn't matter whether the child derives its parent field: when the parent field is not derived, we need to use that since that is the only place that contains the parent -> child relationship. When the parent field is derived, the child field can not be a derived field.

That leaves us with the following combinations of whether the parent and child store a list or a scalar value, and whether the parent is derived:

For details on the GraphQL schema for each row in this table, see the section at the end. The Join cond indicates how we can find the children for a given parent. There are four different join conditions in this table.

When we query children, we need to have the id of the parent that child is related to (and list the child multiple times if it is related to multiple parents) since that is the field by which we window and rank children.

For join conditions of type C and D, the id of the parent is not stored in the child, which means we need to join with the parents table.

Let's work out the details of these queries; the implementation uses the EntityLink type in graph/src/components/store.rs to distinguish between the different types of joins and queries.

Case  Parent list?  Parent derived?  Child list?  Join cond                    Type
1     TRUE          TRUE             TRUE         child.parents ∋ parent.id    A
2     FALSE         TRUE             TRUE         child.parents ∋ parent.id    A
3     TRUE          TRUE             FALSE        child.parent = parent.id     B
4     FALSE         TRUE             FALSE        child.parent = parent.id     B
5     TRUE          FALSE            TRUE         child.id ∈ parent.children   C
6     TRUE          FALSE            FALSE        child.id ∈ parent.children   C
7     FALSE         FALSE            TRUE         child.id = parent.child      D
8     FALSE         FALSE            FALSE        child.id = parent.child      D

Type A

Use when parent is derived and child is a list

select c.*, parent_id
 from {children} c join lateral unnest(c.{parent_field}) parent_id
where parent_id = any($parent_ids)

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parents field (array) in child table

The implementation uses an EntityLink::Direct for joins of this type.

Type B

Use when parent is derived and child is not a list

select c.*, c.{parent_field} as parent_id
 from {children} c
where c.{parent_field} = any($parent_ids)

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parent field (scalar) in child table

The implementation uses an EntityLink::Direct for joins of this type.

Type C

Use when parent is a list and not derived

select c.*, p.id as parent_id
 from {children} c, {parents} p
where p.id = any($parent_ids)
  and c.id = any(p.{child_field})

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parents: name of parent table
  • child_field: name of child field (array) in parent table

The implementation uses an EntityLink::Parent for joins of this type.

Type D

Use when parent is not a list and not derived

select c.*, p.id as parent_id
 from {children} c, {parents} p
where p.id = any($parent_ids)
  and c.id = p.{child_field}

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parents: name of parent table
  • child_field: name of child field (scalar) in parent table

The implementation uses an EntityLink::Parent for joins of this type.
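
The four join types map naturally onto the two EntityLink variants mentioned above. A hedged sketch of how that distinction might be expressed (the actual definition in graph/src/components/store.rs may differ):

// Sketch only; variant and field names are illustrative.
enum EntityLink {
    // Types A and B: the child stores the parent id(s) directly
    // (parent_field is an array for A, a scalar for B).
    Direct { parent_field: String, parent_is_list: bool },
    // Types C and D: the parent stores the child id(s), so the query joins
    // against the parents table (child_field is an array for C, a scalar for D).
    Parent { parents_table: String, child_field: String, child_is_list: bool },
}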

Putting it all together

Note that in all of these queries, we ultimately return the typename of each entity, together with a JSONB representation of that entity. We do this for two reasons: first, because different child tables might have different schemas, which precludes us from taking the union of these child tables, and second, because Diesel does not let us execute queries where the type and number of columns in the result is determined dynamically.

We need to be careful, though, not to convert to JSONB too early, as that is slow when done for large numbers of rows. Deferring the conversion is responsible for some of the complexity in these queries.

In the following, we only go through the queries for relational storage; for JSONB storage, there are similar considerations, though they are somewhat simpler as the union all in the below queries turns into an entity = any(..) clause with JSONB storage, and because we do not need to convert to JSONB data.

Note that for the windowed queries below, the entity we return will have parent_id and pos attributes. The parent_id is necessary to attach the query result to the right parents we already have in memory. The JSONB queries need to somehow insert the parent_id field into the JSONB data they return.

In the most general case, we have an EntityCollection::Window with multiple windows. The query for that case is

with matches as (
  -- Limit the matches for each parent
  select c.*
    from (
      -- Rank matching children for each parent
      select c.*,
             rank() over (partition by c.parent_id order by {query.order}) as pos
        from (
          {window.children_uniform(sort_key, block)}
          union all
            ... range over all windows) c) c
   where c.pos > {skip} and c.pos <= {skip} + {first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, m.parent_id, m.pos
  from matches m, {window.child_table()} c
 where c.vid = m.vid and m.entity = '{window.child_type}'
 union all
       ... range over all windows
 -- Make sure we return the children for each parent in the correct order
 order by parent_id, pos

When there is only one window, we can simplify the above query. The simplification basically inlines the matches CTE. That is important, as CTEs in Postgres before version 12 are optimization fences, even when they are only used once. We therefore reduce the two queries that Postgres executes above to one for the fairly common case that the children are not an interface.

select '{window.child_type}' as entity, to_jsonb(c.*) as data
  from (
    -- Rank matching children
    select c.*,
          rank() over (partition by c.parent_id order by {query.order}) as pos
     from ({window.children_detailed()}) c) c
 where c.pos > {window.skip} and c.pos <= {window.skip} + {window.first}
 order by c.parent_id,c.pos

When we do not have to window, but only deal with an EntityCollection::All with multiple entity types, we can simplify the query by avoiding ranking and just using an ordinary order by clause:

with matches as (
  -- Get uniform info for all matching children
  select '{entity_type}' as entity, id, vid, {sort_key}
    from {entity_table} c
   where {query_filter}
   union all
     ... range over all entity types
   order by {sort_key} offset {query.skip} limit {query.first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, c.id, c.{sort_key}
  from matches m, {entity_table} c
 where c.vid = m.vid and m.entity = '{entity_type}'
 union all
       ... range over all entity types
 -- Make sure we return the children for each parent in the correct order
     order by c.{sort_key}, c.id

And finally, for the very common case of a GraphQL query without nested children that uses a concrete type, not an interface, we can further simplify this, again by essentially inlining the matches CTE to:

select '{entity_type}' as entity, to_jsonb(c.*) as data
  from {entity_table} c
 where {query.filter()}
 order by {query.order} offset {query.skip} limit {query.first}

Boring list of possible GraphQL models

These are the eight ways in which a parent/child relationship can be modeled. For brevity, I left the id attribute on each parent and child type out.

This list assumes that parent and child types are concrete types, i.e., that any interfaces involved in this query have already been resolved into their implementations and we are dealing with one pair of concrete parent/child types.

# Case 1
type Parent {
  children: [Child] @derived
}

type Child {
  parents: [Parent]
}

# Case 2
type Parent {
  child: Child @derived
}

type Child {
  parents: [Parent]
}

# Case 3
type Parent {
  children: [Child] @derived
}

type Child {
  parent: Parent
}

# Case 4
type Parent {
  child: Child @derived
}

type Child {
  parent: Parent
}

# Case 5
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 6
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 7
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

# Case 8
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

PLAN-0002: Ethereum Tracing Cache

Author
Zachary Burns
Implements
RFC-0002 Ethereum Tracing Cache
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/9
Date of submission
2019-12-20
Date of approval
2020-01-07
Approved by
Jannis Pohlmann, Leo Yvens

Summary

Implements RFC-0002: Ethereum Tracing Cache

Implementation

These changes happen within or near ethereum_adapter.rs, store.rs and db_schema.rs.

Limitations

The problem of reorgs turns out to be a particularly tricky one for the cache, mostly because ranges of blocks are requested rather than individual block hashes. To sidestep this problem, only blocks that are older than the reorg threshold will be eligible for caching.

Additionally, there are some subgraphs which may require traces from all or a substantial number of blocks and do not make effective use of filtering. In particular, subgraphs which specify a call handler without a contract address fall into this category. In order to prevent the cache from bloating, any use of Ethereum traces which does not filter on a contract address will bypass the cache.
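
Taken together, the two limitations amount to a simple gate before the cache is consulted. A minimal sketch (not part of the plan), assuming inclusive block numbers and a configured reorg threshold:

// Sketch only: the cache is used only when the request filters on a contract
// address and the whole requested range is older than the reorg threshold.
fn is_cacheable(
    contract_address: Option<[u8; 20]>,
    end_block: u64,
    chain_head: u64,
    reorg_threshold: u64,
) -> bool {
    contract_address.is_some() && end_block + reorg_threshold <= chain_head
}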

EthereumTraceCache

The implementation introduces the following trait, which is implemented primarily by Store.


use std::ops::RangeInclusive;

// `Trace` and `H160` are the web3/Ethereum types already used by Graph Node;
// `Future` and `Error` come from the futures and error-handling crates in use.
struct TracesInRange {
    range: RangeInclusive<u64>,
    traces: Vec<Trace>,
}

pub trait EthereumTraceCache: Send + Sync + 'static {
    /// Attempts to retrieve traces from the cache. Returns ranges which were retrieved.
    /// The results may not cover the entire range of blocks. It is up to the caller to decide
    /// what to do with ranges of blocks that are not cached.
    fn traces_for_blocks(&self, contract_address: Option<H160>, blocks: RangeInclusive<u64>)
        -> Box<dyn Future<Output = Result<Vec<TracesInRange>, Error>>>;

    /// Adds traces to the cache for the given contract address.
    fn add(&self, contract_address: Option<H160>, traces: Vec<TracesInRange>);
}
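
Because traces_for_blocks may return only parts of the requested range, the caller has to work out which gaps still need to be fetched from Ethereum. A minimal sketch of that gap computation (illustrative, assuming the cached ranges come back sorted and non-overlapping):

use std::ops::RangeInclusive;

// Given the requested range and the cached sub-ranges (sorted, disjoint),
// return the ranges that still have to be fetched from an Ethereum node.
fn uncached_ranges(
    requested: RangeInclusive<u64>,
    cached: &[RangeInclusive<u64>],
) -> Vec<RangeInclusive<u64>> {
    let mut gaps = Vec::new();
    let mut next = *requested.start();
    for range in cached {
        if *range.start() > next {
            gaps.push(next..=*range.start() - 1);
        }
        next = next.max(*range.end() + 1);
    }
    if next <= *requested.end() {
        gaps.push(next..=*requested.end());
    }
    gaps
}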

Block schema

Each cached block will exist as its own row in the database in an eth_traces_cache table.


eth_traces_cache(id) {
    id -> Integer,
    network -> Text,
    block_number -> Integer,
    contract_address -> Bytea,
    traces -> Jsonb,
}

A multi-column index will be added on network, block_number, and contract_address.

It can be noted that the network column in the eth_traces_cache table has very low cardinality. It is inefficient, for example, to store the string mainnet millions of times and to consider this value when querying. A data-oriented approach would be to partition these tables on the value of network. Hash partitioning, available in Postgres 11, is expected to be useful here, but the necessary dependencies won't be ready in time for this RFC. This may be revisited in the future.

Valid Cache Range

Because the absence of trace data for a block is a valid cache result, the database must maintain a data structure in an eth_traces_meta table indicating which ranges of the cache are valid. This table also makes it possible to eventually clean out old data.

This is the schema for that structure:


eth_traces_meta(id) {
    id -> Integer,
    network -> Text,
    start_block -> Integer,
    end_block -> Integer,
    contract_address -> Nullable<Bytea>,
    accessed_at -> Date,
}

When inserting data into the cache, removing data from the cache, or reading the cache, a serialized transaction must be used to preserve atomicity between the valid cache range structure and the cached blocks. Care must be taken to not rely on any data read outside of the serialized transaction, and for the extent of the serialized transaction to not span any async contexts that rely on any Future outside of the database itself. The definition of the EthereumTraceCache trait is designed to uphold these guarantees.

In order to preserve space in the database, whenever the valid cache range is added it will be added such that adjacent and overlapping ranges are merged into it.
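
A minimal sketch of that merge step (assuming inclusive block ranges for a single contract address, and that the existing ranges are already pairwise disjoint and non-adjacent):

use std::ops::RangeInclusive;

// Fold a newly validated range into the existing valid-cache ranges so that
// overlapping and adjacent ranges collapse into a single row.
fn merge_range(existing: &mut Vec<RangeInclusive<u64>>, new: RangeInclusive<u64>) {
    let (mut start, mut end) = (*new.start(), *new.end());
    // Keep only ranges that neither overlap nor touch the new one; absorb the rest.
    existing.retain(|r| {
        let disjoint = *r.end() + 1 < start || *r.start() > end + 1;
        if !disjoint {
            start = start.min(*r.start());
            end = end.max(*r.end());
        }
        disjoint
    });
    existing.push(start..=end);
    existing.sort_by_key(|r| *r.start());
}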

Cache usage

The primary user of the cache is EthereumAdapter<T> in the traces function.

The correct algorithm for retrieving traces from the cache is surprisingly nuanced. The complication arises from the interaction between multiple subgraphs which may require overlapping sets of contract addresses. The rate at which indexing of these subgraphs proceeds can cause different ranges of the cache to be valid for a contract address within a single query.

We want to minimize the cost of external requests for trace data. It is likely that it is better to...

  • Make fewer requests
  • Not ask for trace data that is already cached
  • Ask for trace data for multiple contract addresses within the same block when possible.

There is one flow of data which upholds these invariants. In doing so it makes a tradeoff of increasing latency for the execution of a specific subgraph, but increases throughput of the whole system.

Within this graph:

  • Edges which are labelled refer to some subset of the output data.
  • Edges which are not labelled refer to the entire set of the output data.
  • Each node executes once for each contiguous range of blocks. That is, it merges all incoming data before executing, and executes the minimum possible number of times.
  • The example given is just for 2 addresses. The actual code must work on sets of addresses.
graph LR;
   A[Block Range for Contract A & B]
   A --> |Above Reorg Threshold| E
   D[Get Cache A]
   A --> |Below Reorg Threshold A| D
   A --> |Below Reorg Threshold B| H
   E[Ethereum A & B]
   F[Ethereum A]
   G[Ethereum B]
   H[Get Cache B]
   D --> |Found| M
   H --> |Found| M
   M[Result]
   D --> |Missing| N
   H --> |Missing| N
   N[Overlap]
   N --> |A & B| E
   N --> |A| F
   N --> |B| G
   E --> M
   K[Set Cache A]
   L[Set Cache B]
   E --> |B Below Reorg Threshold| L
   E --> |A Below Reorg Threshold| K
   F --> K
   G --> L
   F --> M
   G --> M

This construction is designed to make the fewest number of the most efficient calls possible. It is not as complicated as it looks. The actual construction can be expressed as sequential steps with a set of filters preceding each step.
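
One way to picture the "Overlap" step (purely illustrative; the real code works with H160 addresses and block ranges rather than block sets) is to group the blocks that are missing from the cache by the exact set of addresses that still need them, so that a single request to Ethereum can cover several addresses at once:

use std::collections::{BTreeMap, BTreeSet};

// For each address, the set of block numbers whose traces were not found in the cache.
type Missing = BTreeMap<String, BTreeSet<u64>>;

// Group missing blocks by the set of addresses that need them; each group can be
// satisfied with one request to Ethereum covering all of those addresses.
fn group_missing(missing: &Missing) -> BTreeMap<BTreeSet<String>, BTreeSet<u64>> {
    // Invert: block number -> addresses that are missing it.
    let mut by_block: BTreeMap<u64, BTreeSet<String>> = BTreeMap::new();
    for (addr, blocks) in missing {
        for block in blocks {
            by_block.entry(*block).or_default().insert(addr.clone());
        }
    }
    // Regroup blocks that share the same address set.
    let mut grouped: BTreeMap<BTreeSet<String>, BTreeSet<u64>> = BTreeMap::new();
    for (block, addrs) in by_block {
        grouped.entry(addrs).or_default().insert(block);
    }
    grouped
}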

Useful dependencies

The feature deals a lot with ranges and sets. Operations like sum, subtract, merge, and find overlapping are used frequently. nested_intervals is a crate which provides some of these operations.

Tests

Benchmark

A temporary benchmark will be added for indexing a simple subgraph which uses call handlers. The benchmark will be run in these scenarios:

  • Sync before changes
  • Re-sync before changes
  • Sync after changes
  • Re-sync after changes

Ranges

Due to the complexity of the resource minimizing data workflow, it will be useful to have mocks for the cache and database which record their calls, and check that expected calls are made for tricky data sets.

Database

A real database integration test will be added to test the add/remove from cache implementation to verify that it correctly merges blocks, handles concurrency issues, etc.

Migration

None

Documentation

None, aside from code comments

Implementation Plan:

These estimates are inflated to account for the author's lack of experience with Postgres, Ethereum, futures 0.1, and The Graph in general.

  • (1) Create benchmarks
  • Postgres Cache
    • (0.5) Block Cache
    • (0.5) Trace Serialization/Deserialization
    • (1.0) Ranges Cache
    • (0.5) Concurrency/Transactions
    • (0.5) Tests against Postgres
  • Data Flow
    • (3) Implementation
    • (1) Unit tests
  • (0.5) Run Benchmarks

Total: 8

PLAN-0003: Remove JSONB Storage

Author
David Lutterkort
Implements
No RFC - no user visible changes
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/7
Date of submission
2019-12-18
Date of approval
2019-12-20
Approved by
Jess Ngo, Jannis Pohlmann

Summary

Remove JSONB storage from graph-node. That means that we want to remove the old storage scheme, and only use relational storage going forward. At a high level, removal has to touch the following areas:

  • user subgraphs in the hosted service
  • user subgraphs in self-hosted graph-node instances
  • subgraph metadata in subgraphs.entities (see this issue)
  • the graph-node code base

Because it touches so many different areas, JSONB storage removal will need to happen in several steps, the last being the actual removal of the JSONB code. The first three areas above are independent of each other and can be addressed in parallel.

Implementation

User Subgraphs in the Hosted Service

We will need to communicate to users that they need to update their subgraphs if they still use JSONB storage. Currently, there are ~580 subgraphs belonging to 220 different organizations using JSONB storage. It is quite likely that the vast majority of them are no longer needed and simply left over from somebody trying something out.

We should contact users and tell them that we will delete their subgraph after a certain date (say 2020-02-01) unless they deploy a new version of the subgraph (with an explanation of why, of course). Redeploying their subgraph is all that is needed for these updates.

Self-hosted User Subgraphs

We will need to tell users that the 'old' JSONB storage is deprecated and support for it will be removed as of some target date, and that they need to redeploy their subgraph.

Users will need some documentation/tooling to help them understand

  • which of their deployed subgraphs still use JSONB storage
  • how to remove old subgraphs
  • how to remove old deployments

Subgraph Metadata in subgraphs.entities

We can treat the subgraphs schema like a normal subgraph, with the exception that some entities must not be versioned. For that, we will need to adapt the code so that it is possible to write entities to the store without recording their version (or, more generally, so that there will only ever be one version of the entity, tagged with the block range [0,)).

We will manually create the DDL for the subgraphs.graphql schema and run that as part of a database migration. In that migration, we will also copy the existing metadata from subgraphs.entities and subgraphs.entity_history into their new tables.

The Code Base

Delete all code handling JSONB storage. This will mostly affect entities.rs and jsonb_queries.rs in graph-store-postgres, but there are also smaller things, such as the fact that we no longer need the annotations on Entity that serialize it to the JSON format JSONB uses.

Tests

Most of the code-level changes are covered by the existing test suite. The major exception is that the migration of subgraph metadata needs to be tested and checked manually, using a recent dump of the production database.

Migration

See above on migrating data in the subgraphs schema.

Documentation

No user-facing documentation is needed.

Implementation Plan

No estimates yet, as we should first agree on this general course of action.

  • Notify hosted users to update their subgraph or have it deleted by date X
  • Mark JSONB storage as deprecated and announce when it will be removed
  • Provide tool to ship with graph-node to delete unused deployments and unneeded subgraphs
  • Add affordance to not version entities to relational storage code
  • Write SQL migrations to create new subgraph metadata schema and copy existing data
  • Delete old JSONB code
  • On start of graph-node, add check for any deployments that still use JSONB storage and log warning messages telling users to redeploy (once the JSONB code has been deleted, this data can not be accessed any more)

Open Questions

None

PLAN-0004: Subgraph Schema Merging

Author
Jorge Olivero
Implements
RFC-0001 Subgraph Composition
Engineering Plan pull request
Engineering Plan PR
Obsoletes (if applicable)
-
Date of submission
2019-12-09
Date of approval
TBD
Approved by
TBD

Summary

Subgraph composition allows a subgraph to import types from another subgraph. Imports are designed to be loosely coupled throughout the deployment and indexing phases, and tightly coupled during query execution for the subgraph.

To generate an API schema for the subgraph, the subgraph schema needs to be merged with the imported subgraph schemas. The merging process needs to take the following into account:

  1. An imported schema not being available on the Graph Node
  2. An imported schema missing a type imported by the subgraph schema
  3. A schema imported by subgraph name changing over time
  4. Ability to tell which subgraph/schema each type belongs to

Implementation

The schema merging implementation consists of two parts:

  1. A cache of subgraph schemas
  2. The schema merging logic

Schema Merging

Add a merged_schema(&Schema, HashMap<SchemaReference, Arc<Schema>>) -> Schema function to the graphql::schema crate which will add each of the imported types to the provided document with a @subgraphId directive denoting which subgraph the type came from. If any of the imported types have non-scalar fields, import those types as well.

The HashMap<SchemaReference, Arc<Schema>> includes all of the schemas in the subgraph's import graph which are available on the Graph Node. For each @import directive on the subgraph, find the imported types by tracing their path along the import graph.

  • If any schema node along that path is missing, or if a type is missing in the schema, add a type definition to the subgraph's merged schema with the proper name, a @subgraphId(id: "...") directive (if available), and a @placeholder directive denoting that the type was not found.
  • If the type is found, copy it and add a @subgraphId(id: "...") directive, as sketched below.
  • If the type is imported with the { name: "", as: "" } format, the merged type will include an @originalName(name: "...") directive preserving the type name from the original schema.
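
A rough sketch of the per-type resolution described above, using simplified stand-ins for the real Schema and SchemaReference types (the actual implementation operates on parsed GraphQL documents, not strings):

use std::collections::HashMap;

// Simplified stand-ins for the real types.
struct Schema {
    id: String,
    // type name -> textual type definition
    types: HashMap<String, String>,
}

// Resolve one imported type against the (possibly unavailable) source schema.
fn resolve_imported_type(name: &str, source: Option<&Schema>) -> String {
    if let Some(schema) = source {
        if let Some(def) = schema.types.get(name) {
            // Type found: copy it and tag it with the subgraph it came from.
            return format!("{} @subgraphId(id: \"{}\")", def, schema.id);
        }
    }
    // Schema or type not available: emit a placeholder definition.
    format!("type {} @entity @placeholder {{\n  id: ID!\n}}", name)
}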

The api_schema function will add all the necessary types and fields for the imported types without requiring any changes.

Example #1: Complete merge

Local schema before calling merged_schema:

type _Schema_
  @import(
    types: ["B"],
    from: { id: "X" }
  )

type A @entity {
  id: ID!
  foo: B!
}

Imported Schema X:

type B @entity {
  id: ID!
  bar: String
}

Schema after calling merged_schema:

type A @entity {
  id: ID!
  foo: B!
}

type B @entity @subgraphId(id: "X") {
  id: ID!
  bar: String
}

Example #2: Incomplete merge

Schema before calling merged_schema:

type _Schema_
  @import(
    types: ["B"],
    from: { id: "X" }
  )

type A @entity {
  id: ID!
  foo: B!
}

Imported Schema X:

NOT AVAILABLE

Schema after calling merged_schema

type A @entity @subgraphId(id: "...") {
  id: ID!
  foo: B!
}

type B @entity @placeholder {
  id: ID!
}

Example #3: Complete merge with { name: "...", as: "..." }

Schema before calling merged_schema

type _Schema_
  @import(
    types: [{ name: "B", as: "BB" }],
    from: { id: "X" }
  )

type B @entity {
  id: ID!
  foo: BB!
}

Imported Schema X:

type B @entity {
  id: ID!
  bar: String
}

Schema after calling merged_schema

type B @entity {
  id: ID!
  foo: BB!
}

type BB @entity @subgraphId(id: "X") @originalName(name: "B") {
  id: ID!
  bar: String
}

Example #4: Complete merge with nested types

Schema before calling merged_schema

type _Schema_
  @import(
    types: [{ name: "B", as: "BB" }],
    from: { id: "X" }
  )

type B @entity {
  id: ID!
  foo: BB!
}

Imported Schema X:

type _Schema_
  @import(
    types: [{ name: "C", as: "CC" }],
    from: { id: "Y" }
  )

type B @entity {
  id: ID!
  bar: CC!
  baz: DD!
}

type DD @entity {
  id: ID!
}

Imported Schema Y:

type C @entity {
  id: ID!
}

Schema after calling merged_schema

type B @entity {
  id: ID!
  foo: BB!
}

type BB @entity @subgraphId(id: "X") @originalName(name: "B") {
  id: ID!
  bar: CC!
  baz: DD!
}

type CC @entity @subgraphId(id: "Y") @originalName(name: "C") {
  id: ID!
}

type DD @entity @subgraphId(id: "X") {
  id: ID!
}

After the schema document is merged, the api_schema function will be called.

Cache Invalidation

For each schema in the cache, keep a vector of subgraph pointers containing an element for each schema in the subgraph's import graph which was imported by name, along with the subgraph ID that was used during the schema merge. When a schema is accessed from the schema cache (and possibly only if this check hasn't happened in the last N seconds), check the current version of each of these schemas and run a diff against the versions used for the most recent schema merge. If there are any new versions, re-merge the schema.

Currently the schema_cache in the Store is a Mutex<LruCache<SubgraphDeploymentId, SchemaPair>>. A SchemaPair consists of two fields: input_schema and api_schema. To support the refresh flow, SchemaPair would be extended to be a SchemaEntry, with the fields input_schema, api_schema, schemas_imported (Vec<(SchemaReference, SubgraphDeploymentId)>), and a last_refresh_check timestamp.
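
A sketch of the extended cache entry, with simplified stand-ins for the graph-node types:

use std::time::Instant;

// Simplified stand-ins for the real graph-node types.
struct Schema;
type SchemaReference = String;
type SubgraphDeploymentId = String;

// The proposed replacement for SchemaPair in the store's schema cache.
struct SchemaEntry {
    input_schema: Schema,
    api_schema: Schema,
    // For every schema imported by name: the deployment that was actually
    // used when the merge was performed.
    schemas_imported: Vec<(SchemaReference, SubgraphDeploymentId)>,
    // When the cached entry was last checked against the current versions.
    last_refresh_check: Instant,
}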

A more performant invalidation solution would be to have the cache maintain a listener notifying it every time a subgraph's current version changes. Upon receiving the notification the listener scans the schemas in the cache for those which should be remerged.

Tests

  1. Schemas are merged correctly when all schemas and imported types are available.

  2. Placeholder types are properly inserted into the merged schema when a schema is not available.

  3. Placeholder types are properly inserted into the merged schema when the relevant schemas are available but the types are not.

  4. The cache is invalidated when the store returns an updated version of a cache entry's dependency.

Migration

Subgraph composition is an additive feature which doesn't require a special migration plan.

Documentation

Documentation on https://thegraph.com/docs needs to outline:

  1. The reserved Schema type and how to define imports on it.
  2. The semantics of importing by subgraph ID vs. subgraph name, i.e. what happens when a subgraph imported by name removes expected types from the schema.
  3. How queries are processed when imported subgraphs' schemas or types are not available on the graph-node processing the query.

Implementation Plan

  • Implement the merged_schema function (2d)
  • Write tests for the merged_schema function (1d)
  • Integrate merged_schema into Store::cached_schema and update the cache to include the relevant information for imported schemas and types (1d)
  • Add cache invalidation logic to Store::cached_schema (2d)

Open Questions

  • The execution of queries and subscriptions needs to be updated to leverage the types in a merged schema.

Obsolete Engineering Plans

Obsolete Engineering Plans are moved to the engineering-plans/obsolete directory in the rfcs repository. They are listed below for reference.

  • No Engineering Plans have been obsoleted yet.

Rejected Engineering Plans

Rejected Engineering Plans can be found by filtering open and closed pull requests in the rfcs repository by the rejected label.