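AI Datasets for Machine Learning

Off-the-shelf and labeled datasets for AI development

Appen's labeled datasets

Appen offers more than 800 certified datasets covering more than 80 languages and dialects, including 100,000 hours of audio data, 500,000 images, and 100 million words of text.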

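ASR datasets

100,000 hours of audio data in more than 90 languages

More than 90 conversational speech datasets totaling over 10,000 hours

More than 120 read speech datasets totaling over 70,000 hours

20 spontaneous speech datasets totaling over 20,000 hours

Specialized datasets, including 70 hours of baby cries, 70 hours of dog and cat vocalizations, and children's voices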

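Large language model datasets

Large language model datasets totaling 8.1 billion tokens

Multimodal dataset of 5 million image-text pairs (Japanese, English, and Korean)

Multimodal dataset of 1 million video-text pairs (Japanese, English, and Korean)

1 million Chain-of-Thought examples (Japanese, English, and Korean)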

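Text datasets

Pronunciation dictionaries with 5.23 million entries across 98 languages

Part-of-speech dictionaries with 3.26 million entries across 22 languages

NER datasets totaling more than 1 million entries across 8 languages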

Image datasets

6 million images in total

12,000 multilingual OCR images

A multi-label image database of 2,196 images

680 portraits with varied poses and lighting

Video datasets

100 videos of crying infants (1 minute each)

Videos with subtitles in 3 languages

Speech synthesis datasets

400 voice actors of more than 20 nationalities

Voices covering a range of emotions and application scenarios

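Dataset use cases

Autonomous driving systems

Driver risky-behavior recognition dataset: for detecting driving position, dangerous behavior, and fatigue level.

Passenger safety monitoring dataset: for identifying children, pets, and hazardous objects left in the vehicle.

In-cabin speech dataset: for voice navigation and intelligent driving experiences.

Vehicle exterior dataset: for recognizing the environment outside the vehicle, such as lanes, obstacles, and parking spaces.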

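Customer support

Natural language processing datasets: for building chat programs and delivering efficient online customer support.

Speech synthesis datasets: for real-time text conversion and for speech synthesis that turns text into natural-sounding speech.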

Smart finance

OCR datasets for the finance industry: for reviewing contracts in the financial and insurance industries, automating OCR, and automating efficient, accurate text transcription.

Smart home

Speech recognition datasets: for functional, smart interaction with household electronics.

Obstacle image datasets: for object identification and obstacle avoidance in robot vacuum cleaners.

Smart devices

Face recognition and speech recognition datasets: for deploying smart device applications.

Smart security

Face recognition and dangerous-behavior tracking datasets: for building AI-powered smart security.

Dataset list

If you are interested in a dataset, click Download. A representative will contact you.


Dataset name	Dataset ID	Type	Language	Country/Area	Common application
Japanese (Japan) Pronunciation Dictionary	jpn_JPN_PHON		Japanese	Japan		Download
Japanese (Japan) Part of Speech Dictionary	jpn_JPN_POS	Dictionary	Japanese	Japan	ASR, Language modeling, TTS	Download
Japanese OCR invoice Dataset	IMG_JP_OCR Invoices_CN	Image	Japanese	Japan	Image recognition	Download
Japanese NER news text	JPY_NER001	Text	Japanese	Japan	Language modeling, LLM	Download
Japanese Inverse text normalisation	JPN_ITN001	Text	Japanese	Japan	Language modeling, Semantic Analysis, LLM	Download
English (United States) conversational smartphone	USE_ASR003	ASR	English	America	Speech analysis, Virtual assistant, ASR	Download
Thai telephone channel	THA_ASR003_CN	ASR	Thai	Thailand	Speech analysis, Virtual assistant, ASR	Download
Arabic image Dataset with annotation	IMG_OCR_ARU002_CN	Image	Arabic	Arab	Image recognition	Download
Japanese Free Speaking Speech/Business/daily conversation Dataset	JAP_ASR001_CN	ASR	Japanese	Japan	Speech analysis, Virtual assistant, ASR	Download
Indonesian Dialogue Dataset	IND_DH_ASR001_CN	ASR	Indonesian	Indonesia	Speech analysis, Virtual assistant, ASR	Download

Japanese (Japan) Pronunciation Dictionary

Dataset ID: jpn_JPN_PHON
Description: The lexicon is a plain TXT file encoded in UTF-8. It contains the following columns, each separated by a <tab> character: 1. Word/Name 2. Transcription 3. Rank 4. Comment (optional)
Language: Japanese
Country/Area: Japan
Unit: 262,000 words
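
The lexicon layout described for jpn_JPN_PHON (UTF-8 text, tab-separated columns Word/Name, Transcription, Rank, and an optional Comment) can be read with a few lines of Python. The following is a minimal sketch, not Appen's own tooling: the `LexiconEntry` and `parse_lexicon` names are hypothetical, the sample rows and their transcription notation are invented for illustration, and Rank is kept as text because its type is not documented here.

```python
import io
from typing import Iterable, List, NamedTuple, Optional


class LexiconEntry(NamedTuple):
    word: str               # column 1: Word/Name
    transcription: str      # column 2: Transcription
    rank: str               # column 3: Rank (kept as text; type not documented)
    comment: Optional[str]  # column 4: Comment (optional)


def parse_lexicon(lines: Iterable[str]) -> List[LexiconEntry]:
    """Parse tab-separated lexicon lines into LexiconEntry records."""
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {line_no}: expected at least 3 columns, got {len(fields)}"
            )
        # The fourth column is optional; treat a missing or empty value as None.
        comment = fields[3] if len(fields) > 3 and fields[3] else None
        entries.append(LexiconEntry(fields[0], fields[1], fields[2], comment))
    return entries


# Hypothetical sample rows; the real file's transcription notation may differ.
sample = io.StringIO(
    "東京\tt o: k j o:\t1\n"
    "大阪\to: s a k a\t2\tplace name\n"
)
entries = parse_lexicon(sample)
for entry in entries:
    print(entry.word, entry.transcription, entry.rank, entry.comment)
```

For a real file, the same function accepts an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to the delivered lexicon.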

Japanese (Japan) Part of Speech Dictionary

Dataset ID: jpn_JPN_POS
Type: Dictionary
Language: Japanese
Country/Area: Japan
Unit: 265,000 words
Common application: ASR, Language modeling, TTS

Japanese OCR invoice Dataset

Dataset ID: IMG_JP_OCR Invoices_CN
Description: 326 different receipt (領収書) formats, 332 different quotation (見積書) formats, and 334 different purchase order (注文書) formats.
Type: Image
Language: Japanese
Country/Area: Japan
Collection equipment: Mobile phone/tablet/camera
Collection environment: Multiple lighting options
Unit: 992 images
With transcription/annotation or not: Yes
Common application: Image recognition

Japanese NER news text

Dataset ID: JPY_NER001
Description: The file contains about 21,000 sentences annotated for named entities. It is in XML format and includes annotations for person, title, organization, location, facility, religion, nationality, and geo-political entity.
Type: Text
Language: Japanese
Country/Area: Japan
Unit: 20,629 sentences
With transcription/annotation or not: Yes
Common application: Language modeling, LLM

Japanese Inverse text normalisation

Dataset ID: JPN_ITN001
Description: This dataset contains 5,363 test cases across 14 categories, including address, alphanumeric, cardinal, currency, date, fraction, and identifier.
Type: Text
Language: Japanese
Country/Area: Japan
Unit: 5,363 test cases
Common application: Language modeling, Semantic Analysis, LLM

English (United States) conversational smartphone

Dataset ID: USE_ASR003
Description: This database contains speech recorded in 928 sessions with 928 unique speakers. Each pair of speakers recorded about 60 minutes of conversation on average and could record up to 14 conversations on different topics. Speakers were given a topic for each conversation.
Type: ASR
Language: English
Country/Area: America
Collection equipment: Mobile phone
Collection environment: Low background noise (home/office)
Unit: 1,000 hours
With transcription/annotation or not: Yes
Common application: Speech analysis, Virtual assistant, ASR

Thai telephone channel

Dataset ID: THA_ASR003_CN
Description: The Thai telephone channel dataset mainly covers topics such as electronic technology, digital time, education, politics, economy, sports, and shopping.
Type: ASR
Language: Thai
Country/Area: Thailand
Collection equipment: Telephone
Collection environment: Low background noise (home/office)
Unit: 1,000 hours
With transcription/annotation or not: Yes
Common application: Speech analysis, Virtual assistant, ASR

Arabic image Dataset with annotation

Dataset ID: IMG_OCR_ARU002_CN
Description: Mainly includes the following types of images: billboards, business memos, lists, maps, packaging, slogans, store signs, and posters.
Type: Image
Language: Arabic
Country/Area: Arab
Collection equipment: Mobile phone/tablet/camera
Collection environment: Multiple lighting options
Unit: 15,054 images
With transcription/annotation or not: Yes
Common application: Image recognition

Japanese Free Speaking Speech/Business/daily conversation Dataset

Dataset ID: JAP_ASR001_CN
Description: Japanese Free Speaking Speech Database
Type: ASR
Language: Japanese
Country/Area: Japan
Collection equipment: Mobile phone
Collection environment: Low background noise (home/office)
Unit: 11.88 hours
With transcription/annotation or not: Yes
Common application: Speech analysis, Virtual assistant, ASR

Indonesian Dialogue Dataset

Dataset ID: IND_DH_ASR001_CN
Description: IND_DH_ASR001_CN contains recordings of conversations between Indonesian locals who speak Indonesian as their native language. Topics include financial consumption, communication, social hot spots, tourism and shopping, sports and entertainment, digital time, local names, education and learning, medical topics (COVID-19), and science, technology, and digital games. The database includes text transcriptions, and labels have been added to the text.
Type: ASR
Language: Indonesian
Country/Area: Indonesia
Collection equipment: Mobile phone
Collection environment: Low background noise (home/office)
Unit: 300 hours
With transcription/annotation or not: Yes
Common application: Speech analysis, Virtual assistant, ASR


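Why choose Appen

More than 700 text, image, video, and audio datasets and labeled datasets

Rapid deployment

Labeled datasets give strong support to AI and machine learning training

High cost-effectiveness

Using off-the-shelf datasets can improve cost-effectiveness

Expertise

A team of experts with more than 20 years of experience in data collection and datasets

Wide range of data formats

Supports a wide range of data formats, including image, video, audio, and text

Large-scale data

Train models efficiently with large volumes of high-quality data

High-quality data

Improves machine learning model quality and reduces data bias

Data collection and annotation

If none of the datasets listed above fits your needs, we can provide custom data tailored to your specific use case.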
