Search results

  1. List of datasets for machine-learning research - Wikipedia

    en.wikipedia.org/wiki/List_of_datasets_for...

    Dataset of legal contracts with rich expert annotations (~13,000 labels; CSV and PDF; natural language processing, Q&A; 2021; The Atticus Project). Vietnamese Image Captioning Dataset (UIT-ViIC): 19,250 captions for 3,850 images (CSV and PDF; natural language processing, computer vision; 2020 [113]; Lam et al.).

  2. Hugging Face - Wikipedia

    en.wikipedia.org/wiki/Hugging_Face

    …datasets, mainly in text, images, and audio; and web applications ("spaces" and "widgets") intended for small-scale demos of machine learning applications. There are numerous pre-trained models that support common tasks in different modalities (a minimal usage sketch follows this entry).
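    A minimal usage sketch for such models, assuming the Hugging Face `transformers` pipeline API (the default checkpoints it downloads are whatever the library currently selects):

    ```python
    from transformers import pipeline

    # Text modality: downloads a default sentiment-analysis checkpoint
    # on first use.
    classifier = pipeline("sentiment-analysis")
    print(classifier("Hugging Face hosts many datasets and models."))

    # Other modalities follow the same pattern, e.g.
    # pipeline("image-classification") for images, or
    # pipeline("automatic-speech-recognition") for audio.
    ```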

  3. List of datasets in computer vision and image processing

    en.wikipedia.org/wiki/List_of_datasets_in...

    Aerial Image Segmentation Dataset: 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0 m, manually segmented (80 images; aerial; classification, object detection; 2013 [160] [161]; J. Yuan et al.). KIT AIS Data Set: multiple labeled training and evaluation datasets of aerial images of crowds.

  4. List of large language models - Wikipedia

    en.wikipedia.org/wiki/List_of_large_language_models

    …a 363-billion-token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets [66]; proprietary; trained on financial data from proprietary sources, for financial tasks. PanGu-Σ (March 2023, Huawei): 1,085 billion parameters, 329 billion tokens [67]; proprietary. OpenAssistant [68] (March 2023, LAION): 17 billion parameters, 1.5 trillion tokens; Apache 2.0.

  5. Stable Diffusion - Wikipedia

    en.wikipedia.org/wiki/Stable_Diffusion

    Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, where 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, a predicted likelihood of containing a watermark, and predicted ...
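    As a hedged sketch of the kind of per-pair metadata filtering this describes: LAION distributes metadata alongside the image URLs, and subsets were derived by thresholding fields like language, resolution, and predicted watermark probability. The column names (e.g. `pwatermark`) and thresholds below are illustrative assumptions, not the exact LAION pipeline.

    ```python
    import pandas as pd

    # Toy stand-in for LAION-style per-pair metadata.
    meta = pd.DataFrame({
        "url": ["a.jpg", "b.jpg", "c.jpg"],
        "caption": ["a cat", "ein Hund", "a boat"],
        "language": ["en", "de", "en"],
        "width": [512, 640, 200],
        "height": [512, 480, 200],
        "pwatermark": [0.05, 0.90, 0.10],  # predicted watermark probability
    })

    # Keep English pairs at >= 512x512 with low watermark probability,
    # mirroring the language/resolution/watermark filtering described above.
    subset = meta[
        (meta["language"] == "en")
        & (meta["width"] >= 512)
        & (meta["height"] >= 512)
        & (meta["pwatermark"] < 0.5)
    ]
    print(subset)
    ```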

  6. GPT-2 - Wikipedia

    en.wikipedia.org/wiki/GPT-2

    GPT-2 was pre-trained on a dataset of 8 million web pages. [2] It was partially released in February 2019, followed by full release of the 1.5-billion-parameter model on November 5, 2019. [3] [4] [5] GPT-2 was created as a "direct scale-up" of GPT-1 [6] with a ten-fold increase in both its parameter count and the size of its training dataset. [5]

  7. The Pile (dataset) - Wikipedia

    en.wikipedia.org/wiki/The_Pile_(dataset)

    The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. [1] [2] It is composed of 22 smaller datasets, including 14 new ones. [1]

  8. Large language model - Wikipedia

    en.wikipedia.org/wiki/Large_language_model

    …That is an "image token". Text tokens and image tokens can then be interleaved, and the compound model is fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model; for example, the image encoder may be frozen to improve stability. [80]
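    A minimal sketch of this construction, assuming PyTorch; the module names, dimensions, and the simple "prepend image tokens to the text sequence" interleaving are illustrative assumptions, not any particular paper's exact method:

    ```python
    import torch
    import torch.nn as nn

    class ImageToTokens(nn.Module):
        """Project image-encoder features into the text model's embedding
        space so they can be interleaved as "image tokens"."""
        def __init__(self, vision_dim=768, text_dim=1024, n_tokens=16):
            super().__init__()
            self.proj = nn.Linear(vision_dim, text_dim)
            self.n_tokens = n_tokens

        def forward(self, feats):
            # feats: (batch, patches, vision_dim) from the image encoder
            return self.proj(feats[:, : self.n_tokens])

    # Stand-in for a pre-trained vision backbone, frozen for stability
    # as the snippet notes.
    vision_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
        num_layers=2,
    )
    for p in vision_encoder.parameters():
        p.requires_grad = False

    adapter = ImageToTokens()
    images = torch.randn(2, 64, 768)        # fake patch features
    text_embeds = torch.randn(2, 20, 1024)  # fake text-token embeddings

    image_tokens = adapter(vision_encoder(images))
    # Interleave in the simplest way: prepend the image tokens, then feed
    # the combined sequence to the language part of the compound model.
    sequence = torch.cat([image_tokens, text_embeds], dim=1)
    print(sequence.shape)  # torch.Size([2, 36, 1024])
    ```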