| gptkbp:instanceOf | gptkb:nonprofit_organization 
 | 
                        
                            
                                | gptkbp:archiveFrequency | monthly 
 | 
                        
                            
                                | gptkbp:category | non-profit, open data, web archiving, internet organization 
 | 
                        
                            
                                | gptkbp:dataLanguage | multilingual 
 | 
                        
                            
                                | gptkbp:datasetAvailability | public domain 
 | 
                        
                            
                                | gptkbp:format | gptkb:WAT, gptkb:WARC, WET 
 | 
                        
                            
                                | gptkbp:founded | 2007 
 | 
                        
                            
                                | gptkbp:founder | gptkb:Gil_Elbaz 
 | 
                        
                            
                                | gptkbp:headquartersLocation | gptkb:United_States 
 | 
                        
                            
                                | gptkbp:includes | web pages, metadata, HTTP headers, crawled URLs, outlinks, text content 
 | 
                        
                            
                                | gptkbp:license | gptkb:Creative_Commons_CC0 
 | 
                        
                            
                                | gptkbp:mission | to crawl the web and freely provide its archives and datasets to the public 
 | 
                        
                            
                                | gptkbp:notableProject | gptkb:Common_Crawl_Corpus 
 | 
                        
                            
                                | gptkbp:notableUser | gptkb:Google, gptkb:OpenAI, gptkb:Allen_Institute_for_AI, gptkb:EleutherAI, gptkb:Meta_AI 
 | 
                        
                            
                                | gptkbp:numberOfVolumes | petabytes 
 | 
                        
                            
                                | gptkbp:partner | gptkb:Internet_Archive, gptkb:Amazon_Web_Services, gptkb:Archive-It, gptkb:Data_Commons, gptkb:Web_Data_Commons 
 | 
                        
                            
                                | gptkbp:storage | gptkb:Amazon_S3 
 | 
                        
                            
                                | gptkbp:supportedBy | grants, donations 
 | 
                        
                            
                                | gptkbp:twitter | @CommonCrawl 
 | 
                        
                            
                                | gptkbp:type | web crawl data 
 | 
                        
                            
                                | gptkbp:usedBy | gptkb:researchers, developers, companies 
 | 
                        
                            
                                | gptkbp:usedFor | gptkb:machine_learning, data analysis, natural language processing, web mining, search engine development 
 | 
                        
                            
                                | gptkbp:website | https://commoncrawl.org/ 
 | 
                        
                            
                                | gptkbp:bfsParent | gptkb:WARC 
 | 
                        
                            
                                | gptkbp:bfsLayer | 5 
 | 
                        
                            
                                | https://www.w3.org/2000/01/rdf-schema#label | Common Crawl 
 |