
web_graph

Description:

This dataset contains a sparse graph representing web link structure for a small subset of the Web.

It is a processed version of a single crawl performed by CommonCrawl in 2021, from which we strip everything except the link->outlinks structure. The final dataset is essentially in int -> List[int] format, with each integer id representing a URL.

Also, to increase the value of this resource, we created 6 different versions of WebGraph, each varying in sparsity pattern and locale. We took the following processing steps, in order:

We started with WAT files from the June 2021 crawl.
Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
To study locale-specific graphs, we further filter based on 2 top-level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, so this is still an approximation, i.e. the resulting graph might have nodes with fewer than K links.
Using both locale and count filters, we finalize 6 versions of the WebGraph dataset, summarized in the following table.

Version	Top level domain	Min count	Num nodes	Num edges
sparse		10	365.4M	30B
dense		50	136.5M	22B
de-sparse	de	10	19.7M	1.19B
de-dense	de	50	5.7M	0.82B
in-sparse	in	10	1.5M	0.14B
in-dense	in	50	0.5M	0.12B
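
The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names (`resolve_outlinks`, `has_tld`, `filter_min_links`) and the validation rule (keep only http/https links that have a host) are assumptions, and the sketch works on a small in-memory dict rather than on WAT files.

```python
from collections import Counter
from urllib.parse import urljoin, urlparse


def resolve_outlinks(page_url, raw_links):
    """Turn relative outlinks into absolute URLs, dropping invalid ones.

    Assumed validation rule: keep only http/https links with a host.
    """
    resolved = []
    for link in raw_links:
        absolute = urljoin(page_url, link)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.hostname:
            resolved.append(absolute)
    return resolved


def has_tld(url, tld):
    """Return True if the URL's host ends with the given top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith("." + tld)


def filter_min_links(graph, k):
    """Single-pass filter: keep nodes with at least k inlinks and k outlinks.

    Degrees are counted once on the input graph, so after pruning some
    surviving nodes may end up with fewer than k links.
    """
    out_deg = {src: len(dsts) for src, dsts in graph.items()}
    in_deg = Counter(dst for dsts in graph.values() for dst in dsts)
    keep = {n for n in graph if out_deg[n] >= k and in_deg[n] >= k}
    return {
        src: [dst for dst in dsts if dst in keep]
        for src, dsts in graph.items()
        if src in keep
    }
```

Because degrees are counted only once, removing a node can push its neighbours below K, which is the approximation noted in the steps.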

All versions of the dataset have the following features:

"row_tag": a unique identifier of the row (source link).
"col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
"gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.
Homepage: https://arxiv.org/abs/2112.02194
Source code: tfds.structured.web_graph.WebGraph
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Auto-cached (documentation): No
Feature structure:

FeaturesDict({
    'col_tag': Sequence(int64),
    'gt_tag': Sequence(int64),
    'row_tag': int64,
})
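
As a rough illustration of this feature structure, the sketch below flattens examples into (source, destination) edge pairs. `rows_to_edges` and `load_rows` are hypothetical helper names; `load_rows` relies on the standard `tfds.load` and `tfds.as_numpy` calls, and preparing the data may need considerable disk space and setup that this page does not describe.

```python
def rows_to_edges(examples):
    """Flatten row_tag -> col_tag records into (source, destination) pairs."""
    for ex in examples:
        src = int(ex["row_tag"])
        for dst in ex["col_tag"]:
            yield src, int(dst)


def load_rows(config="in-dense", split="train"):
    """Stream examples from TFDS as NumPy values.

    Calling this triggers dataset preparation if the data is not on disk.
    """
    import tensorflow_datasets as tfds  # imported lazily; only needed here

    ds = tfds.load(f"web_graph/{config}", split=split)
    return tfds.as_numpy(ds)


# Toy records with the same keys as the FeaturesDict above.
toy_examples = [
    {"row_tag": 1, "col_tag": [2, 3], "gt_tag": []},
    {"row_tag": 2, "col_tag": [], "gt_tag": []},
]
toy_edges = list(rows_to_edges(toy_examples))
```

With real data, `rows_to_edges(load_rows("in-dense", "train"))` would stream edges for the smallest config without holding the whole graph in memory.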

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
col_tag	Sequence(Tensor)	(None,)	int64
gt_tag	Sequence(Tensor)	(None,)	int64
row_tag	Tensor		int64
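
This page does not say how `gt_tag` is meant to be scored. One plausible use, assuming the ground-truth outlinks of a test row are the targets a model should rank highly, is a recall-at-k check. `recall_at_k` and the `ranked_cols` input are hypothetical names for this sketch.

```python
def recall_at_k(ranked_cols, gt_tag, k):
    """Fraction of ground-truth columns found in the top-k ranked columns.

    Returns None when gt_tag is empty, as it is for train/train_t rows.
    """
    truth = set(gt_tag)
    if not truth:
        return None
    top_k = set(ranked_cols[:k])
    return len(truth & top_k) / len(truth)


# Toy test row and a made-up model ranking of candidate columns.
row = {"row_tag": 7, "col_tag": [3, 9], "gt_tag": [4, 11]}
ranked = [11, 5, 4, 8]
recall_top2 = recall_at_k(ranked, row["gt_tag"], 2)  # one of two targets in top 2
recall_top3 = recall_at_k(ranked, row["gt_tag"], 3)  # both targets in top 3
```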

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{mehta2021alx,
    title={ALX: Large Scale Matrix Factorization on TPUs},
    author={Harsh Mehta and Steffen Rendle and Walid Krichene and Li Zhang},
    year={2021},
    eprint={2112.02194},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

web_graph/sparse (default config)

Config description: WebGraph-sparse contains around 30B edges and around 365M nodes.
Dataset size: 273.38 GiB
Splits:

Split	Examples
`'test'`	39,871,321
`'train'`	372,049,054
`'train_t'`	410,867,007

web_graph/dense

Config description: WebGraph-dense contains around 22B edges and around 136.5M nodes.
Dataset size: 170.87 GiB
Splits:

Split	Examples
`'test'`	13,256,496
`'train'`	122,815,749
`'train_t'`	136,019,364

web_graph/de-sparse

Config description: WebGraph-de-sparse contains around 1.19B edges and around 19.7M nodes.
Dataset size: 10.25 GiB
Splits:

Split	Examples
`'test'`	1,903,443
`'train'`	17,688,633
`'train_t'`	19,566,045

web_graph/de-dense

Config description: WebGraph-de-dense contains around 0.82B edges and around 5.7M nodes.
Dataset size: 5.90 GiB
Splits:

Split	Examples
`'test'`	553,270
`'train'`	5,118,902
`'train_t'`	5,672,473

web_graph/in-sparse

Config description: WebGraph-in-sparse contains around 0.14B edges and around 1.5M nodes.
Dataset size: 960.57 MiB
Splits:

Split	Examples
`'test'`	140,313
`'train'`	1,309,063
`'train_t'`	1,445,042

web_graph/in-dense

Config description: WebGraph-in-dense contains around 0.12B edges and around 0.5M nodes.
Dataset size: 711.72 MiB
Splits:

Split	Examples
`'test'`	47,894
`'train'`	443,786
`'train_t'`	491,634