Introduction
This repository / book describes the process for proposing changes to Graph Protocol in the form of RFCs and Engineering Plans.
It also includes all approved, rejected and obsolete RFCs and Engineering Plans. For more details, see the following pages:
RFCs
What is an RFC?
An RFC describes a change to Graph Protocol, for example a new feature. Any substantial change goes through the RFC process: the change is described in an RFC, proposed in a pull request to the rfcs repository, reviewed (currently by the core team), and ultimately either approved or rejected.
RFC process
1. Create a new RFC
RFCs are numbered, starting at 0001. To create a new RFC, create a new branch of the rfcs repository. Check the existing RFCs to identify the next number to use. Then, copy the RFC template to a new file in the rfcs/ directory. For example:
cp rfcs/0000-template.md rfcs/0015-fulltext-search.md
Write the RFC, commit it to the branch and open a pull request in the rfcs repository.
In addition to the RFC itself, the pull request must include the following changes:
- a link to the RFC on the Approved RFCs page, and
- a link to the RFC under Approved RFCs in SUMMARY.md.
2. RFC review
After an RFC has been submitted through a pull request, it is reviewed. At the time of writing, every RFC needs to be approved by
- at least one Graph Protocol founder, and
- at least one member of the core development team.
3. RFC approval
Once an RFC is approved, the RFC metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.
Approved RFCs
RFC-0001: Subgraph Composition
- Author
- Jannis Pohlmann
- RFC pull request
- https://github.com/graphprotocol/rfcs/pull/1
- Obsoletes
- -
- Date of submission
- 2019-12-08
- Date of approval
- -
- Approved by
- -
Summary
Subgraph composition enables referencing, extending and querying entities across subgraph boundaries.
Goals & Motivation
The high-level goal of subgraph composition is to be able to compose subgraph schemas and data hierarchically. Imagine umbrella subgraphs that combine all the data from a domain (e.g. DeFi, job markets, music) through one unified, coherent API. This could allow reuse and governance at different levels and go all the way to the top, fulfilling the vision of the Graph.
The ability to reference, extend and query entities across subgraph boundaries enables several use cases:
- Linking entities across subgraphs.
- Extending entities defined in other subgraphs by adding new fields.
- Breaking down data silos by composing subgraphs and defining richer schemas without indexing the same data over and over again.
Subgraph composition is needed to avoid duplicated work, both in terms of developing subgraphs as well as indexing them. It is an essential part of the overall vision behind The Graph, as it allows combining isolated subgraphs into a complete, connected graph of the (decentralized) world's data.
Subgraph developers will benefit from the ability to reference data from other subgraphs, saving them development time and enabling richer data models. dApp developers will be able to leverage this to build more compelling applications. Node operators will benefit from subgraph composition by having better insight into which subgraphs are queried together, allowing them to make more informed decisions about which subgraphs to index.
Urgency
Due to the high impact of this feature and its important role in fulfilling the vision behind The Graph, it would be good to start working on this as early as possible.
Terminology
The feature is referred to as query-time subgraph composition, or subgraph composition for short.
Terms introduced and used in this RFC:
- Imported schema: The schema of another subgraph from which types are imported.
- Imported type: An entity type imported from another subgraph schema.
- Extended type: An entity type imported from another subgraph schema and extended in the subgraph that imports it.
- Local schema: The schema of the subgraph that imports from another subgraph.
- Local type: A type defined in the local schema.
Detailed Design
The sections below make the assumption that there is a subgraph with the name ethereum/mainnet that includes an Address entity type.
Composing Subgraphs By Importing Types
In order to reference entity types from another subgraph, a developer would first import these types from the other subgraph's schema.
Types can be imported either from a subgraph name or from a subgraph ID. Importing from a subgraph name means that the exact version of the imported subgraph will be identified at query time and its schema may change in arbitrary ways over time. Importing from a subgraph ID guarantees that the schema will never change but also means that the import points to a subgraph version that may become outdated over time.
Let's say a DAO subgraph contains a Proposal type that has a proposer field that should link to an Ethereum address (think: Ethereum accounts or contracts) and a transaction field that should link to an Ethereum transaction. The developer would then write the DAO subgraph schema as follows:
type _Schema_
  @import(
    types: ["Address", { name: "Transaction", as: "EthereumTransaction" }],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
  transaction: EthereumTransaction!
}
This would then allow queries that follow the references to addresses and transactions, like
{
  proposals {
    proposer {
      balance
      address
    }
    transaction {
      hash
      block {
        number
      }
    }
  }
}
Extending Types From Imported Schemas
Extending types from another subgraph involves several steps:
- Importing the entity types from the other subgraph.
- Extending these types with custom fields.
- Managing (e.g. creating) extended entities in subgraph mappings.
Let's say the DAO subgraph wants to extend the Ethereum Address type to include the proposals created by each respective account. To achieve this, the developer would write the following schema:
type _Schema_
  @import(
    types: ["Address"],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
}

extend type Address {
  proposals: [Proposal!]! @derivedFrom(field: "proposer")
}
This makes queries like the following possible, where the query can go "back" from addresses to proposal entities, despite the Ethereum Address type originally being defined in the ethereum/mainnet subgraph.
{
  addresses {
    id
    proposals {
      id
      proposer {
        id
      }
    }
  }
}
In the above case, the proposals field on the extended type is derived, which means that an implementation wouldn't have to create a local extension type in the store. However, if proposals was defined as

extend type Address {
  proposals: [Proposal!]!
}

then the subgraph mappings would have to create partial Address entities with id and proposals fields for all addresses from which proposals were created. At query time, these entity instances would have to be merged with the original Address entities from the ethereum/mainnet subgraph.
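As an illustration of that query-time merge, here is a hedged Rust sketch (not from the RFC; entities are simplified to string attribute maps), where the imported original entity supplies the base fields and the local partial extension contributes the extra fields:

use std::collections::BTreeMap;

// Simplified stand-in for an entity: a map of attribute name to value.
type Entity = BTreeMap<String, String>;

// Start from the imported (original) Address entity and layer the fields of
// the locally stored partial extension entity on top, keyed by the shared id.
fn merge_extension(original: &Entity, extension: &Entity) -> Entity {
    let mut merged = original.clone();
    for (field, value) in extension {
        merged.insert(field.clone(), value.clone());
    }
    merged
}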
Subgraph Availability
In the decentralized network, queries will be split and routed through the network based on what indexers are available and which subgraphs they index. At that point, failure to find an indexer for a subgraph that types were imported from will result in a query error. As with any non-nullable field that resolves to null, the error bubbles up to the next nullable parent, in accordance with the GraphQL spec.
Until the network is a reality, we are dealing with individual Graph Nodes, and querying subgraphs whose imported entity types are not also indexed on the same node should be handled with more tolerance. This RFC proposes that entity reference fields that refer to imported types are converted to being optional in the generated API schema. If the subgraph that the type is imported from is not available on a node, such fields should resolve to null.
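As a rough illustration of that relaxation rule, the sketch below (a hypothetical helper, not part of the RFC) drops the outer non-null marker from fields whose type is imported:

// Make reference fields to imported types optional in the generated API schema.
// (List types such as [Address!]! would need similar handling in real code.)
fn relax_imported_field(field_type: &str, imported_types: &[&str]) -> String {
    let base = field_type.trim_end_matches('!');
    if imported_types.contains(&base) {
        // e.g. "Address!" becomes "Address" so it may resolve to null
        base.to_string()
    } else {
        field_type.to_string()
    }
}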
Interfaces
Subgraph composition also supports interfaces in the ways outlined below.
Interfaces Can Be Imported From Other Subgraphs
The syntax for this is the same as that for importing types:
type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })
Local Types Can Implement Imported Interfaces
This is achieved by importing the interface from another subgraph schema and implementing it in entity types:
type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })

type MyToken implements ERC20 @entity {
  # ...
}
Imported Types Can Be Extended To Implement Local Interfaces
This is achieved by importing the types from another subgraph schema, defining a
local interface and using extend
to implement the interface on the imported
types:
type _Schema_
  @import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })

interface Token {
  id: ID!
  balance: BigInt!
}

extend type LPT implements Token {
  # ...
}

extend type Rep implements Token {
  # ...
}
Imported Types Can Be Extended To Implement Imported Interfaces
This is a combination of importing an interface, importing the types and extending them to implement the interface:
type _Schema_
  @import(types: ["Token"], from: { name: "graphprotocol/token" })
  @import(types: [{ name: "Token", as: "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as: "Rep" }], from: { name: "augur/augur" })

extend type LPT implements Token {
  # ...
}

extend type Rep implements Token {
  # ...
}
Implementation Concerns For Interface Support
Querying across types from different subgraphs that implement the same interface may require a smart algorithm, especially when it comes to pagination. For instance, if the first 1000 entities for an interface are queried, this range of 1000 entities may be divided up between different local and imported types arbitrarily.
A naive algorithm could request 1000 entities from each subgraph, applying the selected filters and order, combine the results and cut off everything after the first 1000 items. This would generate a minimum of requests but would involve significant overfetching.
Another algorithm could fetch just the first item from each subgraph and, based on that information, divide up the range in more optimal ways than the previous algorithm, satisfying the query with more requests but with less overfetching.
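A hedged sketch of the naive strategy, assuming each per-subgraph result is already filtered and sorted by a shared sort key (all names are illustrative, not from the RFC):

use std::collections::BTreeMap;

type Entity = BTreeMap<String, String>;

// Request `first` entities from every subgraph, merge the results by the
// common sort key and keep only the first `first` items overall.
fn naive_paginate(per_subgraph: Vec<Vec<Entity>>, first: usize) -> Vec<Entity> {
    let mut all: Vec<Entity> = per_subgraph.into_iter().flatten().collect();
    all.sort_by(|a, b| a.get("sortKey").cmp(&b.get("sortKey")));
    all.truncate(first);
    all
}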
Compatibility
Subgraph composition is a purely additive, non-breaking change. Existing subgraphs remain valid without any migrations being necessary.
Drawbacks And Risks
Reasons that could speak against implementing this feature:
-
Schema parsing and validation becomes more complicated. Especially validation of imported schemas may not always be possible, depending on whether and when the referenced subgraph is available on the Graph Node or not.
-
Query execution becomes more complicated. The subgraph a type belongs to must be identified and local as well as imported versions of extended entities have to be queried separately and be merged before returning data to the client.
Alternatives
There are other ways to compose subgraph schemas using GraphQL technologies such as schema stitching or Apollo Federation. However, schema stitching is being deprecated, and Apollo Federation requires a centralized server to extend and merge the GraphQL APIs. Both of these solutions slow down queries.
Another reason not to use them is that GraphQL will only be one of several query languages supported in the future. Composition therefore has to be implemented in a query-language-agnostic way.
Open Questions
-
Right now, interfaces require unique IDs across all the concrete entity types that implement them. This is something we can no longer guarantee if these concrete types live in different subgraphs, so we have to handle it at query time (or somehow disallow it and return a query error).
It is also unclear what an individual interface entity lookup would look like if IDs are no longer guaranteed to be unique:
someInterface(id: "?????") { }
RFC-0002: Ethereum Tracing Cache
- Author
- Zac Burns
- RFC pull request
- https://github.com/graphprotocol/rfcs/pull/4
- Obsoletes (if applicable)
- None
- Date of submission
- 2019-12-13
- Date of approval
- 2019-12-20
- Approved by
- Jannis Pohlmann
Summary
This RFC proposes the creation of a local Ethereum tracing cache to speed up indexing of subgraphs which use block and/or call handlers.
Motivation
When indexing a subgraph that uses block and/or call handlers, it is necessary to extract calls from the trace of each block that a Graph Node indexes. It is expensive to acquire and process traces from Ethereum nodes in both money and time.
When developing a subgraph it is common to make changes and deploy those changes to a production Graph Node for testing. Each time a change is deployed, the Graph Node must re-sync the subgraph using the same traces that were used for the previous sync of the subgraph. The cost of acquiring the traces each time a change is deployed impacts a subgraph developer's ability to iterate and test quickly.
Urgency
None
Terminology
Ethereum cache: The new API proposed here.
Detailed Design
There is an existing EthereumCallCache for caching eth_call results, built into Graph Node today. This cache will be extended to support traces and renamed to EthereumCache.
Compatibility
This change is backwards compatible. Existing code can continue to use the Parity tracing API. Because the cache is local, each indexing node may delete the cache should the format or implementation of caching change. In the case of an invalidated cache, the code will fall back to existing methods for retrieving a trace and will repopulate the cache.
Drawbacks and Risks
Subgraphs which are not being actively developed will incur the overhead of storing traces but will never reap the benefit of reading them back from the cache.
If this drawback is significant, it may be necessary to extend EthereumCache to provide a custom score for cache invalidation other than the current date. For example, trace_filter calls could be invalidated based on the latest update time for a subgraph requiring the trace. It is expected that a subgraph which has been updated recently is more likely to be updated again soon than a subgraph which has not been updated recently.
Alternatives
None
Open Questions
None
Obsolete RFCs
Obsolete RFCs are moved to the rfcs/obsolete directory in the rfcs repository. They are listed below for reference.
- No RFCs have been obsoleted yet.
Rejected RFCs
Rejected RFCs can be found by filtering open and closed pull requests by those that are labeled with rejected. This list can be found here.
Engineering Plans
What is an Engineering Plan?
Engineering Plans are plans to turn an RFC into an implementation in the core Graph Protocol tools like Graph Node, Graph CLI and Graph TS. Every substantial development effort that follows an RFC is planned in the form of an Engineering Plan.
Engineering Plan process
1. Create a new Engineering Plan
Like RFCs, Engineering Plans are numbered, starting at 0001. To create a new plan, create a new branch of the rfcs repository. Check the existing plans to identify the next number to use. Then, copy the Engineering Plan template to a new file in the engineering-plans/ directory. For example:
cp engineering-plans/0000-template.md engineering-plans/0015-fulltext-search.md
Write the Engineering Plan, commit it to the branch and open a pull request in the rfcs repository.
In addition to the Engineering Plan itself, the pull request must include the following changes:
- a link to the Engineering Plan on the Approved Engineering Plans page, and
- a link to the Engineering Plan under Approved Engineering Plans in SUMMARY.md.
2. Engineering Plan review
After an Engineering Plan has been submitted through a pull request, it is reviewed. At the time of writing, every Engineering Plan needs to be approved by
- the Tech Lead, and
- at least one member of the core development team.
3. Engineering Plan approval
Once an Engineering Plan is approved, the Engineering Plan metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.
Approved Engineering Plans
- PLAN-0001: GraphQL Query Prefetching
- PLAN-0002: Ethereum Tracing Cache
- PLAN-0003: Remove JSONB Storage
- PLAN-0004: Subgraph Schema Merging
PLAN-0001: GraphQL Query Prefetching
- Author
- David Lutterkort
- Implements
- No RFC - no user visible changes
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/2
- Date of submission
- 2019-11-27
- Date of approval
- 2019-12-10
- Approved by
- Jannis Pohlmann, Leo Yvens
This is not really a plan as it was written and discussed before we adopted the RFC process, but it contains important implementation details of how we process GraphQL queries.
Contents
Implementation Details for prefetch queries
Goal
For a GraphQL query of the form
query {
  parents(filter) {
    id
    children(filter) {
      id
    }
  }
}
we want to generate only two SQL queries: one to get the parents, and one to get the children for all those parents. The fact that children is nested under parents requires that we add a filter to the children query that restricts children to those that are related to the parents we fetched in the first query. How exactly we filter the children query depends on how the relationship between parents and children is modeled in the GraphQL schema, and on whether one (or both) of the types involved are interfaces.
The rest of this writeup is concerned with how to generate the query for children, assuming we already retrieved the list of all parents.
The bulk of the implementation of this feature can be found in graphql/src/store/prefetch.rs, store/postgres/src/jsonb_queries.rs, and store/postgres/src/relational_queries.rs.
Handling first/skip
We never get all the children for a parent; instead we always have a first and skip argument in the children filter. Those arguments need to be applied to each parent individually by ranking the children for each parent according to the order defined by the children query. If the same child matches multiple parents, we need to make sure that it is considered separately for each parent, as it might appear at different ranks for different parents. In SQL, we use the rank() window function for this:
select *
  from (
    select c.*,
           rank() over (partition by parent_id order by ...) as pos
      from (query to get children) c) c
 where pos >= skip and pos < skip + first
Handling interfaces
If parents or children (or both) are interfaces, we resolve the interfaces into the concrete types implementing them, produce a query for each combination of parent/child concrete type and combine those queries via union all.
Since implementations of the same interface will generally differ in the schema they use, we can not form a union all of all the data in the tables for these concrete types, but have to first query only attributes that we know will be common to all entities implementing the interface, most notably the vid (a unique identifier that identifies the precise version of an entity), and then later fill in the details of each entity by converting it directly to JSON.
That means that when we deal with children that are an interface, we will first select only the following columns (where exactly they come from depends on how the parent/child relationship is modeled):

select '{__typename}' as entity, c.vid, c.id, parent_id

and form the union all of these queries. We then use that union to rank children as described above.
Handling parent/child relationships
How we get the children for a set of parents depends on how the relationship between the two is modeled. The interesting parameters there are whether parents store a list or a single child, and whether that field is derived, together with the same for children.
There are a total of 16 combinations of these four boolean variables; four of them, where both parent and child derive their fields, are not permissible. It also doesn't matter whether the child's parent field is derived: when the parent field is not derived, we need to use it since it is the only place that stores the parent -> child relationship, and when the parent field is derived, the child field can not itself be a derived field.
That leaves us with the following combinations of whether the parent and child store a list or a scalar value, and whether the parent field is derived. For details on the GraphQL schema for each row in this table, see the section at the end. The Join cond column indicates how we can find the children for a given parent; there are four different join conditions in this table.
When we query children, we need to have the id of the parent each child is related to (and list the child multiple times if it is related to multiple parents), since that is the field by which we window and rank children. For join conditions of type C and D, the id of the parent is not stored in the child, which means we need to join with the parents table.
Let's work out the details of these queries; the implementation uses struct EntityLink in graph/src/components/store.rs to distinguish between the different types of joins and queries.
Case | Parent list? | Parent derived? | Child list? | Join cond | Type |
---|---|---|---|---|---|
1 | TRUE | TRUE | TRUE | child.parents ∋ parent.id | A |
2 | FALSE | TRUE | TRUE | child.parents ∋ parent.id | A |
3 | TRUE | TRUE | FALSE | child.parent = parent.id | B |
4 | FALSE | TRUE | FALSE | child.parent = parent.id | B |
5 | TRUE | FALSE | TRUE | child.id ∈ parent.children | C |
6 | TRUE | FALSE | FALSE | child.id ∈ parent.children | C |
7 | FALSE | FALSE | TRUE | child.id = parent.child | D |
8 | FALSE | FALSE | FALSE | child.id = parent.child | D |
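The table collapses to the four join types as follows; this is a hedged Rust sketch mirroring the table, not the actual graph-node code (which distinguishes these cases via EntityLink, as noted above):

// Returns the join type (A-D) from the table above. The child's own
// "derived?" flag does not appear because it never changes the outcome.
fn join_type(parent_derived: bool, parent_list: bool, child_list: bool) -> char {
    match (parent_derived, child_list, parent_list) {
        (true, true, _) => 'A',   // child.parents ∋ parent.id
        (true, false, _) => 'B',  // child.parent = parent.id
        (false, _, true) => 'C',  // child.id ∈ parent.children
        (false, _, false) => 'D', // child.id = parent.child
    }
}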
Type A
Use when parent is derived and child is a list
select c.*, parent_id
from {children} c join lateral unnest(c.{parent_field}) parent_id
where parent_id = any($parent_ids)
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parent_field: name of parents field (array) in child table
The implementation uses an EntityLink::Direct for joins of this type.
Type B
Use when parent is derived and child is not a list
select c.*, c.{parent_field} as parent_id
from {children} c
where c.{parent_field} = any($parent_ids)
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parent_field: name of parent field (scalar) in child table
The implementation uses an EntityLink::Direct for joins of this type.
Type C
Use when parent is a list and not derived
select c.*, p.id as parent_id
from {children} c, {parents} p
where p.id = any($parent_ids)
and c.id = any(p.{child_field})
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parents: name of parent table
- child_field: name of child field (array) in parent table
The implementation uses an EntityLink::Parent for joins of this type.
Type D
Use when parent is not a list and not derived
select c.*, p.id as parent_id
from {children} c, {parents} p
where p.id = any($parent_ids)
and c.id = p.{child_field}
Data needed to generate:
- children: name of child table
- parent_ids: list of parent ids
- parents: name of parent table
- child_field: name of child field (scalar) in parent table
The implementation uses an EntityLink::Parent for joins of this type.
Putting it all together
Note that in all of these queries, we ultimately return the typename of each entity, together with a JSONB representation of that entity. We do this for two reasons: first, because different child tables might have different schemas, which precludes us from taking the union of these child tables, and second, because Diesel does not let us execute queries where the type and number of columns in the result is determined dynamically.
We need to be careful though to not convert to JSONB too early, as that is slow when done for large numbers of rows. Deferring the conversion is responsible for some of the complexity in these queries.
In the following, we only go through the queries for relational storage; for JSONB storage, there are similar considerations, though they are somewhat simpler as the union all in the queries below turns into an entity = any(..) clause with JSONB storage, and because we do not need to convert the data to JSONB.
Note that for the windowed queries below, the entity we return will have parent_id and pos attributes. The parent_id is necessary to attach the query result to the right parents we already have in memory. The JSONB queries need to somehow insert the parent_id field into the JSONB data they return.
In the most general case, we have an EntityCollection::Window with multiple windows. The query for that case is:
with matches as (
  -- Limit the matches for each parent
  select c.*
    from (
      -- Rank matching children for each parent
      select c.*,
             rank() over (partition by c.parent_id order by {query.order}) as pos
        from (
          {window.children_uniform(sort_key, block)}
          union all
          ... range over all windows) c) c
   where c.pos > {skip} and c.pos <= {skip} + {first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, m.parent_id, m.pos
  from matches m, {window.child_table()} c
 where c.vid = m.vid and m.entity = '{window.child_type}'
union all
... range over all windows
-- Make sure we return the children for each parent in the correct order
order by parent_id, pos
When there is only one window, we can simplify the above query. The simplification basically inlines the matches CTE. That is important as CTEs in Postgres before Postgres 12 are optimization fences, even when they are only used once. We therefore reduce the two queries that Postgres executes above to one for the fairly common case that the children are not an interface.
select '{window.child_type}' as entity, to_jsonb(c.*) as data
  from (
    -- Rank matching children
    select c.*,
           rank() over (partition by c.parent_id order by {query.order}) as pos
      from ({window.children_detailed()}) c) c
 where c.pos >= {window.skip} and c.pos <= {window.skip} + {window.first}
 order by c.parent_id, c.pos
When we do not have to window, but only deal with an EntityCollection::All with multiple entity types, we can simplify the query by avoiding ranking and just using an ordinary order by clause:
with matches as (
  -- Get uniform info for all matching children
  select '{entity_type}' as entity, id, vid, {sort_key}
    from {entity_table} c
   where {query_filter}
  union all
  ... range over all entity types
  order by {sort_key} offset {query.skip} limit {query.first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, c.id, c.{sort_key}
  from matches m, {entity_table} c
 where c.vid = m.vid and m.entity = '{entity_type}'
union all
... range over all entity types
-- Make sure we return the children for each parent in the correct order
order by c.{sort_key}, c.id
And finally, for the very common case of a GraphQL query without nested children that uses a concrete type, not an interface, we can further simplify this, again by essentially inlining the matches CTE, to:
select '{entity_type}' as entity, to_jsonb(c.*) as data
  from {entity_table} c
 where {query.filter()}
 order by {query.order} offset {query.skip} limit {query.first}
Boring list of possible GraphQL models
These are the eight ways in which a parent/child relationship can be modeled. For brevity, I left the id attribute on each parent and child type out.
This list assumes that parent and child types are concrete types, i.e., that any interfaces involved in this query have already been resolved into their implementations and we are dealing with one pair of concrete parent/child types.
# Case 1
type Parent {
children: [Child] @derived
}
type Child {
parents: [Parent]
}
# Case 2
type Parent {
child: Child @derived
}
type Child {
parents: [Parent]
}
# Case 3
type Parent {
children: [Child] @derived
}
type Child {
parent: Parent
}
# Case 4
type Parent {
child: Child @derived
}
type Child {
parent: Parent
}
# Case 5
type Parent {
children: [Child]
}
type Child {
# doesn't matter
}
# Case 6
type Parent {
children: [Child]
}
type Child {
# doesn't matter
}
# Case 7
type Parent {
child: Child
}
type Child {
# doesn't matter
}
# Case 8
type Parent {
child: Child
}
type Child {
# doesn't matter
}
PLAN-0002: Ethereum Tracing Cache
- Author
- Zachary Burns
- Implements
- RFC-0002 Ethereum Tracing Cache
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/9
- Date of submission
- 2019-12-20
- Date of approval
- 2020-01-07
- Approved by
- Jannis Pohlmann, Leo Yvens
Summary
Implements RFC-0002: Ethereum Tracing Cache
Implementation
These changes happen within or near ethereum_adapter.rs, store.rs and db_schema.rs.
Limitations
The problem of reorgs turns out to be a particularly tricky one for the cache, mostly due to ranges of blocks being requested rather than individual hashes. To sidestep this problem, only blocks that are older than the reorg threshold will be eligible for caching.
Additionally, there are some subgraphs which may require traces from all or a substantial number of blocks and don't make effective use of filtering. In particular, subgraphs which specify a call handler without a contract address fall into this category. In order to prevent the cache from bloating, any use of Ethereum traces which does not filter on a contract address will bypass the cache.
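Taken together, the two rules above amount to a simple eligibility check. A minimal sketch (names and signature are illustrative, not from the plan):

// A trace request goes through the cache only if the block is already final
// (older than the reorg threshold) and the request is scoped to a contract address.
fn is_cacheable(
    block_number: u64,
    chain_head: u64,
    reorg_threshold: u64,
    contract_address: Option<[u8; 20]>,
) -> bool {
    let old_enough = block_number.saturating_add(reorg_threshold) <= chain_head;
    old_enough && contract_address.is_some()
}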
EthereumTraceCache
The implementation introduces the following trait, which is implemented primarily by Store.
use std::ops::RangeInclusive;

struct TracesInRange {
    range: RangeInclusive<u64>,
    traces: Vec<Trace>,
}

pub trait EthereumTraceCache: Send + Sync + 'static {
    /// Attempts to retrieve traces from the cache. Returns ranges which were retrieved.
    /// The results may not cover the entire range of blocks. It is up to the caller to decide
    /// what to do with ranges of blocks that are not cached.
    fn traces_for_blocks(
        contract_address: Option<H160>,
        blocks: RangeInclusive<u64>,
    ) -> Box<dyn Future<Output = Result<Vec<TracesInRange>, Error>>>;

    fn add(contract_address: Option<H160>, traces: Vec<TracesInRange>);
}
Block schema
Each cached block will exist as its own row in the database in an eth_traces_cache table.
eth_traces_cache(id) {
    id -> Integer,
    network -> Text,
    block_number -> Integer,
    contract_address -> Bytea,
    traces -> Jsonb,
}
A multi-column index will be added on network, block_number, and contract_address.
It can be noted that in the eth_traces_cache table, there is a very low cardinality for the value of the network column. It is inefficient, for example, to store the string mainnet millions of times and consider this value when querying. A data-oriented approach would be to partition these tables on the value of the network. It is expected that the hash partitioning available in Postgres 11 would be useful here, but the necessary dependencies won't be ready in time for this RFC. This may be revisited in the future.
Valid Cache Range
Because the absence of trace data for a block is a valid cache result, the database must maintain a data structure indicating which ranges of the cache are valid, in an eth_traces_meta table. This table also enables eventually implementing cleaning out old data.
This is the schema for that structure:
id -> Integer,
network -> Text,
start_block -> Integer,
end_block -> Integer,
contract_address -> Nullable<Bytea>,
accessed_at -> Date,
When inserting data into the cache, removing data from the cache, or reading the cache, a serialized transaction must be used to preserve atomicity between the valid cache range structure and the cached blocks. Care must be taken to not rely on any data read outside of the serialized transaction, and for the extent of the serialized transaction to not span any async contexts that rely on any Future outside of the database itself. The definition of the EthereumTraceCache trait is designed to uphold these guarantees.
In order to preserve space in the database, whenever a valid cache range is added, adjacent and overlapping ranges will be merged into it.
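A minimal sketch of that merge-on-insert behaviour over inclusive block ranges (illustrative only; the actual implementation operates on the eth_traces_meta rows within the serialized transaction):

use std::ops::RangeInclusive;

// Insert `new` into `ranges`, collapsing any range that overlaps or is
// directly adjacent to it into a single entry.
fn insert_merged(ranges: &mut Vec<RangeInclusive<u64>>, new: RangeInclusive<u64>) {
    let (mut start, mut end) = (*new.start(), *new.end());
    ranges.retain(|r| {
        let touches = *r.start() <= end.saturating_add(1)
            && start <= r.end().saturating_add(1);
        if touches {
            start = start.min(*r.start());
            end = end.max(*r.end());
        }
        !touches
    });
    ranges.push(start..=end);
}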
Cache usage
The primary user of the cache is EthereumAdapter<T> in the traces function.
The correct algorithm for retrieving traces from the cache is surprisingly nuanced. The complication arises from the interaction between multiple subgraphs which may require overlapping sets of contract addresses. The rate at which indexing of these subgraphs proceeds can cause different ranges of the cache to be valid for a contract address in a single query.
We want to minimize the cost of external requests for trace data. It is likely that it is better to...
- Make fewer requests
- Not ask for trace data that is already cached
- Ask for trace data for multiple contract addresses within the same block when possible.
There is one flow of data which upholds these invariants. In doing so it makes a tradeoff of increasing latency for the execution of a specific subgraph, but increases throughput of the whole system.
Within this graph:
- Edges which are labelled refer to some subset of the output data.
- Edges which are not labelled refer to the entire set of the output data.
- Each node executes once for each contiguous range of blocks. That is, it merges all incoming data before executing, and executes the minimum possible number of times.
- The example given is just for 2 addresses. The actual code must work on sets of addresses.
graph LR;
  A[Block Range for Contract A & B]
  A --> |Above Reorg Threshold| E
  D[Get Cache A]
  A --> |Below Reorg Threshold A| D
  A --> |Below Reorg Threshold B| H
  E[Ethereum A & B]
  F[Ethereum A]
  G[Ethereum B]
  H[Get Cache B]
  D --> |Found| M
  H --> |Found| M
  M[Result]
  D --> |Missing| N
  H --> |Missing| N
  N[Overlap]
  N --> |A & B| E
  N --> |A| F
  N --> |B| G
  E --> M
  K[Set Cache A]
  L[Set Cache B]
  E --> |B Below Reorg Threshold| L
  E --> |A Below Reorg Threshold| K
  F --> K
  G --> L
  F --> M
  G --> M
This construction is designed to make the fewest number of the most efficient calls possible. It is not as complicated as it looks. The actual construction can be expressed as sequential steps with a set of filters preceding each step.
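For the "Missing" edges in the flow above, the uncached portion of a requested block range has to be computed before any Ethereum request is made. A hedged sketch, assuming the cached ranges are already sorted and non-overlapping:

use std::ops::RangeInclusive;

// Subtract the already-cached ranges from the requested range and return
// the sub-ranges that still have to be fetched from Ethereum.
fn uncached_ranges(
    requested: RangeInclusive<u64>,
    cached: &[RangeInclusive<u64>],
) -> Vec<RangeInclusive<u64>> {
    let mut missing = Vec::new();
    let mut next = *requested.start();
    for c in cached {
        if *c.end() < next || *c.start() > *requested.end() {
            continue; // lies entirely outside what is still needed
        }
        if *c.start() > next {
            missing.push(next..=*c.start() - 1);
        }
        next = c.end() + 1;
        if next > *requested.end() {
            break;
        }
    }
    if next <= *requested.end() {
        missing.push(next..=*requested.end());
    }
    missing
}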
Useful dependencies
The feature deals a lot with ranges and sets. Operations like sum, subtract, merge, and find overlapping are used frequently. nested_intervals is a crate which provides some of these operations.
Tests
Benchmark
A temporary benchmark will be added for indexing a simple subgraph which uses call handlers. The benchmark will be run in these scenarios:
- Sync before changes
- Re-sync before changes
- Sync after changes
- Re-sync after changes
Ranges
Due to the complexity of the resource-minimizing data workflow, it will be useful to have mocks for the cache and database which record their calls, and to check that the expected calls are made for tricky data sets.
Database
A real database integration test will be added to test the add/remove from cache implementation to verify that it correctly merges blocks, handles concurrency issues, etc.
Migration
None
Documentation
None, aside from code comments
Implementation Plan:
These estimates are inflated to account for the author's lack of experience with Postgres, Ethereum, Futures 0.1, and The Graph in general.
- (1) Create benchmarks
- Postgres Cache
  - (0.5) Block Cache
  - (0.5) Trace Serialization/Deserialization
  - (1.0) Ranges Cache
  - (0.5) Concurrency/Transactions
  - (0.5) Tests against Postgres
- Data Flow
  - (3) Implementation
  - (1) Unit tests
- (0.5) Run Benchmarks
Total: 8
PLAN-0003: Remove JSONB Storage
- Author
- David Lutterkort
- Implements
- No RFC - no user visible changes
- Engineering Plan pull request
- https://github.com/graphprotocol/rfcs/pull/7
- Date of submission
- 2019-12-18
- Date of approval
- 2019-12-20
- Approved by
- Jess Ngo, Jannis Pohlmann
Summary
Remove JSONB storage from graph-node. That means that we want to remove the old storage scheme and only use relational storage going forward. At a high level, removal has to touch the following areas:
- user subgraphs in the hosted service
- user subgraphs in self-hosted graph-node instances
- subgraph metadata in subgraphs.entities (see this issue)
- the graph-node code base
Because it touches so many areas and different things, JSONB storage removal will need to happen in several steps, the last being actual removal of JSONB code. The first three steps above are independent of each other and can be done in parallel.
Implementation
User Subgraphs in the Hosted Service
We will need to communicate to users that they need to update their subgraphs if they still use JSONB storage. Currently, there are ~580 subgraphs (list) belonging to 220 different organizations using JSONB storage. It is quite likely that the vast majority of them are not needed anymore and are simply left over from somebody trying something out.
We should contact users and tell them that we will delete their subgraph after a certain date (say 2020-02-01) unless they deploy a new version of the subgraph (with an explanation why, etc., of course). Redeploying their subgraph is all that is needed for those updates.
Self-hosted User Subgraphs
We will need to tell users that the 'old' JSONB storage is deprecated and support for it will be removed as of some target date, and that they need to redeploy their subgraph.
Users will need some documentation/tooling to help them understand
- which of their deployed subgraphs still use JSONB storage
- how to remove old subgraphs
- how to remove old deployments
Subgraph Metadata in subgraphs.entities
We can treat the subgraphs schema like a normal subgraph, with the exception that some entities must not be versioned. For that, we will need to adapt code to make it possible to write entities to the store without recording their version (or, more generally, so that there will only be one version of the entity, tagged with a block range [0,)).
We will manually create the DDL for the subgraphs.graphql schema and run that as part of a database migration. In that migration, we will also copy the existing metadata from subgraphs.entities and subgraphs.entity_history into their new tables.
The Code Base
Delete all code handling JSONB storage. This will mostly affect entities.rs and jsonb_queries.rs in graph-store-postgres, but there are also smaller things, like the annotations on Entity for serializing entities to the JSON format that JSONB uses, which will no longer be needed.
Tests
Most of the code-level changes are covered by the existing test suite. The major exception is that the migration of subgraph metadata needs to be tested and checked manually, using a recent dump of the production database.
Migration
See above on migrating data in the subgraphs schema.
Documentation
No user-facing documentation is needed.
Implementation Plan
No estimates yet as we should first agree on this general course of action.
- Notify hosted users to update their subgraph or have it deleted by date X
- Mark JSONB storage as deprecated and announce when it will be removed
- Provide a tool to ship with graph-node to delete unused deployments and unneeded subgraphs
- Add affordance to not version entities to relational storage code
- Write SQL migrations to create the new subgraph metadata schema and copy existing data
- Delete old JSONB code
- On start of graph-node, add a check for any deployments that still use JSONB storage and log warning messages telling users to redeploy (once the JSONB code has been deleted, this data can not be accessed any more)
Open Questions
None
PLAN-0004: Subgraph Schema Merging
- Author
- Jorge Olivero
- Implements
- RFC-0001 Subgraph Composition
- Engineering Plan pull request
- Engineering Plan PR
- Obsoletes (if applicable)
- -
- Date of submission
- 2019-12-09
- Date of approval
- TBD
- Approved by
- TBD
Summary
Subgraph composition allows a subgraph to import types from another subgraph. Imports are designed to be loosely coupled throughout the deployment and indexing phases, and tightly coupled during query execution for the subgraph.
To generate an API schema for the subgraph, the subgraph schema needs to be merged with the imported subgraph schemas. The merging process needs to take the following into account:
- An imported schema not being available on the Graph Node
- An imported schema missing a type imported by the subgraph schema
- A schema imported by subgraph name changing over time
- The ability to tell which subgraph/schema each type belongs to
Implementation
The schema merging implementation consists of two parts:
- A cache of subgraph schemas
- The schema merging logic
Schema Merging
Add a merged_schema(&Schema, HashMap<SchemaReference, Arc<Schema>>) -> Schema function to the graphql::schema crate which will add each of the imported types to the provided document with a @subgraphId directive denoting which subgraph the type came from. If any of the imported types have non-scalar fields, import those types as well.
The HashMap<SchemaReference, Arc<Schema>> includes all of the schemas in the subgraph's import graph which are available on the Graph Node. For each @import directive on the subgraph, find the imported types by tracing their path along the import graph.
- If any schema node along that path is missing, or if a type is missing in the schema, add a type definition to the subgraph's merged schema with the proper name, a @subgraphId(id: "...") directive (if available), and a @placeholder directive denoting that the type was not found.
- If the type is found, copy it and add a @subgraphId(id: "...") directive.
- If the type is imported with the { name: "", as: "" } format, the merged type will include an @originalName(name: "...") directive preserving the type name from the original schema.
The api_schema function will add all the necessary types and fields for the imported types without requiring any changes.
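As a rough, hypothetical sketch of these rules (schemas are reduced here to maps from type name to definition text; the real function operates on parsed GraphQL documents and the actual Schema and SchemaReference types):

use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins: a schema is a map of type name to its definition
// source, and a SchemaReference names the subgraph a type comes from.
type Schema = HashMap<String, String>;
type SchemaReference = String;

fn merged_schema(
    local: &Schema,
    imports: &[(SchemaReference, String)], // (source subgraph, imported type name)
    available: &HashMap<SchemaReference, Arc<Schema>>,
) -> Schema {
    let mut merged = local.clone();
    for (source, type_name) in imports {
        let def = available
            .get(source)
            .and_then(|schema| schema.get(type_name));
        let merged_def = match def {
            // Type found: copy it and tag it with @subgraphId.
            Some(d) => format!("{} @subgraphId(id: \"{}\")", d, source),
            // Schema or type missing: insert a @placeholder stub instead.
            None => format!("type {} @entity @placeholder {{ id: ID! }}", type_name),
        };
        merged.insert(type_name.clone(), merged_def);
    }
    merged
}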
Example #1: Complete merge
Local schema before calling merged_schema:
type _Schema_
@import(
types: ["B"],
from: { id: "X" }
)
type A @entity {
id: ID!
foo: B!
}
Imported Schema X:
type B @entity {
id: ID!
bar: String
}
Schema after calling merged_schema:
type A @entity {
id: ID!
foo: B!
}
type B @entity @subgraphId(id: "X") {
id: ID!
bar: String
}
Example #2: Incomplete merge
Schema before calling merged_schema:
type _Schema_
@import(
types: ["B"],
from: { id: "X" }
)
type A @entity {
id: ID!
foo: B!
}
Imported Schema X:
NOT AVAILABLE
Schema after calling merged_schema
type A @entity @subgraphId(id: "...") {
id: ID!
foo: B!
}
type B @entity @placeholder {
id: ID!
}
Example #3: Complete merge with { name: "...", as: "..." }
Schema before calling merged_schema
type _Schema_
@import(
types: [{ name: "B", as: "BB" }],
from: { id: "X" }
)
type B @entity {
id: ID!
foo: BB!
}
Imported Schema X:
type B @entity {
id: ID!
bar: String
}
Schema after calling merged_schema
type B @entity {
id: ID!
foo: BB!
}
type BB @entity @subgraphId(id: "X") @originalName(name: "B") {
id: ID!
bar: String
}
Example #4: Complete merge with nested types
Schema before calling merged_schema
type _Schema_
@import(
types: [{ name: "B", as: "BB" }],
from: { id: "X" }
)
type B @entity {
id: ID!
foo: BB!
}
Imported Schema X:
type _Schema_
@import(
types: [{ name: "C", as: "CC" }],
from: { id: "Y" }
)
type B @entity {
id: ID!
bar: CC!
baz: DD!
}
type DD @entity {
id: ID!
}
Imported Schema Y:
type C @entity {
id: ID!
}
Schema after calling merged_schema
type B @entity {
id: ID!
foo: BB!
}
type BB @entity @subgraphId(id: "X") @originalName(name: "B") {
id: ID!
bar: CC!
baz: DD!
}
type CC @entity @subgraphId(id: "Y") @originalName(name: "C") {
id: ID!
}
type DD @entity @subgraphId(id: "X") {
id: ID!
}
After the schema document is merged, the api_schema function will be called.
Cache Invalidation
For each schema in the cache, keep a vector of subgraph pointers containing an element for each schema in the subgraph's import graph which was imported by name, together with the subgraph ID which was used during the schema merge. When a schema is accessed from the schema cache (and possibly only if this check hasn't happened in the last N seconds), check the current version of each of these schemas and run a diff against the versions used for the most recent schema merge. If there are any new versions, re-merge the schema.
Currently the schema_cache in the Store is a Mutex<LruCache<SubgraphDeploymentId, SchemaPair>>. A SchemaPair consists of two fields: input_schema and api_schema. To support the refresh flow, SchemaPair would be extended to become a SchemaEntry, with the fields input_schema, api_schema, schemas_imported (Vec<(SchemaReference, SubgraphDeploymentId)>), and a last_refresh_check timestamp.
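A sketch of that extended cache entry (the placeholder type aliases and the concrete timestamp type are assumptions, not from the plan):

use std::sync::Arc;
use std::time::Instant;

// Placeholders for the corresponding graph-node types.
type Schema = String;
type SchemaReference = String;
type SubgraphDeploymentId = String;

struct SchemaEntry {
    input_schema: Arc<Schema>,
    api_schema: Arc<Schema>,
    // One element per schema in the import graph that was imported by name,
    // with the deployment id used for the most recent merge.
    schemas_imported: Vec<(SchemaReference, SubgraphDeploymentId)>,
    last_refresh_check: Instant,
}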
A more performant invalidation solution would be to have the cache maintain a listener notifying it every time a subgraph's current version changes. Upon receiving the notification, the listener scans the schemas in the cache for those which should be re-merged.
Tests
-
Schemas are merged correctly when all schemas and imported types are available.
-
Placeholder types are properly inserted into the merged schema when a schema is not available.
-
Placeholder types are properly inserted into the merged schema when the relevant schemas are available but the types are not.
-
The cache is invalidated when the store returns an updated version of a cache entry's dependency.
Migration
Subgraph composition is an additive feature which doesn't require a special migration plan.
Documentation
Documentation on https://thegraph.com/docs needs to outline:
- The reserved Schema type and how to define imports on it.
- The semantics of importing by subgraph ID vs. subgraph name, i.e. what happens when a subgraph imported by name removes expected types from the schema.
- How queries are processed when imported subgraph schemas or types are not available on the graph-node processing the query.
Implementation Plan
- Implement the merged_schema function (2d)
- Write tests for the merged_schema function (1d)
- Integrate merged_schema into Store::cached_schema and update the cache to include the relevant information for imported schemas and types (1d)
- Add cache invalidation logic to Store::cached_schema (2d)
Open Questions
- The execution of queries and subscriptions needs to be updated to leverage the types in a merged schema.
Obsolete Engineering Plans
Obsolete Engineering Plans are moved to the engineering-plans/obsolete directory in the rfcs repository. They are listed below for reference.
- No Engineering Plans have been obsoleted yet.
Rejected Engineering Plans
Rejected Engineering Plans can be found by filtering open and closed pull requests by those that are labeled with rejected. This list can be found here.