In the era of big data and interconnected systems, organizations face the challenge of effectively managing and analyzing complex relationships within their data. Graph databases have emerged as a powerful solution for representing and querying highly connected data, enabling businesses to gain valuable insights. GUAC (Graph for Understanding Artifact Composition) faces a similar challenge: with large amounts of data from SBOMs, SLSA attestations, and other ITE-6 based attestations, it is essential that we effectively manage and create the edges that represent the relationships between these documents.
As we work to meet the goals of persistence in GUAC, we are running a series of analyses and comparisons among the many different graph database options. GUAC has a few critically important requirements for the backend: efficient data ingestion, performant execution of complex queries, a well-designed schema in which the data is stored, and query optimization for each database's specific query language.
## Evaluating the Requirements
- Efficient Data Ingestion: A graph database that excels at data ingestion is essential for managing large-scale datasets. The speed and efficiency of data ingestion directly impact the ability to update and integrate new information into the database seamlessly. This is important as software supply chain metadata will constantly be growing. Consider, for example, the time it takes to ingest packages and create the dependency relationships between them. In GraphQL, this relationship would look like:
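A sketch of the ingestion mutations (modeled on GUAC's GraphQL API; the package coordinates and input field values here are illustrative):

```graphql
mutation {
  # Ingest the two package nodes (values are illustrative)
  pkg: ingestPackage(
    pkg: { type: "golang", namespace: "github.com/example", name: "app" }
  ) {
    id
  }
  depPkg: ingestPackage(
    pkg: { type: "golang", namespace: "github.com/example", name: "lib" }
  ) {
    id
  }
  # Then ingest the dependency node that links the two packages
  ingestDependency(
    pkg: { type: "golang", namespace: "github.com/example", name: "app" }
    depPkg: { type: "golang", namespace: "github.com/example", name: "lib" }
    dependency: { versionRange: ">=1.0.0", justification: "declared in go.mod" }
  ) {
    id
  }
}
```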
- First, we would have to ingest the packages (both the package and its dependent) and after that ingest the actual dependency node that connects the two packages together. We have to reduce the time it takes to complete this series of steps in the most efficient manner.
- Quick and Complex Query Execution: Graph databases provide powerful querying capabilities for traversing complex relationships and uncovering valuable insights. The efficiency of query execution is crucial, especially when dealing with large datasets. For GUAC, being able to answer specific queries quickly allows for proactive decision-making. While complex queries might still take more time than a fast-paced industry leader can afford, we can work around this by pre-computing results from these queries and creating “proceed” or “not proceed” type attestations that are updated on a set schedule. For example, for GUAC we would want to quickly determine all the vulnerabilities (even transitive ones) for a particular package or artifact:
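A sketch of such a query against GUAC's GraphQL API (the exact field names shown are illustrative and may differ from the current schema):

```graphql
query {
  # Find vulnerability certifications attached to a specific package
  CertifyVuln(certifyVulnSpec: { package: { name: "app" } }) {
    vulnerability {
      vulnerabilityIDs {
        vulnerabilityID
      }
    }
  }
}
```

Finding transitive vulnerabilities then means repeating this lookup across every package reached by walking the dependency edges, which is where traversal performance matters.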
- Such a query has to traverse the relationships between the packages and the vulnerability information to produce the list of affected packages and their vulnerabilities.
- Schema: Data ingestion and retrieval are not reliant solely on the database you choose to work with; they also depend on the schema in which the data is stored. A complex schema with unnecessarily fragmented data can result in poor ingestion and query performance. Simplifying the schema greatly improves performance by reducing the number of reads and writes that need to take place. In GraphQL, the schema for a dependency between two packages would look like the following:
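A sketch of the relevant type definition (closely following GUAC's GraphQL schema, trimmed for brevity):

```graphql
"IsDependency is an attestation that a package depends on another package."
type IsDependency {
  id: ID!
  "The package that has the dependency"
  package: Package!
  "The package that is depended upon"
  dependentPackage: Package!
  "Version range of the dependent package that satisfies the dependency"
  versionRange: String!
  "Justification for the attested relationship"
  justification: String!
  "Document from which this attestation is derived"
  origin: String!
  "GUAC collector that ingested the document"
  collector: String!
}
```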
- The above schema contains the necessary information to create the dependency relationship. We can map this schema, or a representation of it, onto the graph database to store and retrieve the data needed to recreate the “isDependency” node.
- Optimizing the query language: Many graph databases (such as ArangoDB or Neo4j) come with their own query language for ingesting and reading data. Optimizing queries for the most efficient reads and writes can be challenging and may require evaluation and inspection to determine where the costly operations occur. For example, a query to ingest dependencies in Arango Query Language (AQL) can be quite complex:
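A simplified AQL sketch (the collection name and bind parameters are assumptions for illustration, not GUAC's actual implementation):

```aql
// Upsert the dependency edge between two already-ingested package vertices.
// @pkgId and @depPkgId are the _id values of the package documents.
UPSERT { _from: @pkgId, _to: @depPkgId }
  INSERT {
    _from: @pkgId,
    _to: @depPkgId,
    versionRange: @versionRange,
    justification: @justification
  }
  UPDATE {}
  IN isDependency
RETURN NEW
```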
- While in Neo4j's Cypher, it would be:
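A comparable Cypher sketch (node labels, relationship type, and property names are likewise assumptions):

```cypher
// MERGE matches an existing node or edge, or creates it if absent.
MERGE (p:Package { purl: $pkgPurl })
MERGE (d:Package { purl: $depPurl })
MERGE (p)-[dep:IS_DEPENDENCY]->(d)
  ON CREATE SET dep.versionRange = $versionRange,
                dep.justification = $justification
RETURN dep
```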
- Both queries differ in syntax but achieve the same operation: ingestion of the “isDependency” node. For each language, the query must traverse efficiently, use proper indexing, and limit the number of nodes that have to be filtered. We also need to remove unnecessary “UPSERT” (AQL) or “MERGE” (Cypher) operations, which either insert or update depending on whether the node already exists and can be costly. These queries have to be carefully crafted for each language, utilizing its own unique syntax.
As we work towards providing a persistent database for GUAC, the above characteristics need to be measured and quantified. Currently, we are in the process of evaluating ArangoDB, Neo4j, and Cloud Spanner. The GUAC community is also evaluating AWS Neptune and Ent (an ORM framework that supports multiple backends such as SQLite, MySQL, and others). Because GUAC exposes a GraphQL interface, we are not limited to a single database; we have the ability to support multiple backends depending on the use case, but maintaining them all can be troublesome. Therefore, we must do our due diligence to determine which databases should be incorporated as the persistent backend for GUAC. Stay tuned for our next blog, where we go into further detail about how each of these databases fares against the characteristics above, with real-world metrics around ingestion and query execution times.
To learn more about GUAC, visit the official site at https://guac.sh, the docs at https://docs.guac.sh, or the source code repository at https://github.com/guacsec/guac (Give us a star while you’re there!).