Formats of the resources

This page lists the formats used in Cocoon to represent audio, video and text information: those accepted as input, those used for preservation and those used for broadcasting.

Formats accepted as input

  • For audio documents: WAV and FLAC.

    WAV is preferred, except for files that would exceed the practical or theoretical size limit of the format (2 to 4 GB), in which case FLAC is used. WAV files must be encoded in PCM (pulse-code modulation), i.e. without compression. Other formats and encodings may also be submitted, but because these two formats express their content without loss or addition of information, they are favoured, and files in other formats will be converted into them.

  • For video documents: MPEG-4 and MKV.

    The MPEG-4 container should hold a video stream encoded in H.264 (also called AVC) and, optionally, an audio stream encoded in AAC. The MKV container (also called Matroska) should hold a video stream encoded in H.264 and, optionally, an audio stream encoded in FLAC. Other formats will be converted into these target formats.

  • For annotation documents:

    These annotations may include (but are not limited to) transcriptions, translations, staging information, timecodes and physiological measurements related to the voice, such as electroglottograms. For this information to be useful in understanding the recording and the analyses made of it, it must be given in as explicit and standardised a format as possible. The formatting options are, in order of preference:

    • an XML document encoded in UTF-8, preferably using standards (TEI, TalkBank...) or schemas or DTDs handled by widely used tools (ELAN, Transcriber...);
    • a plain-text document encoded in UTF-8, preferably using known conventions (such as the CHAT format of the CHILDES project, used in the CLAN tool);
    • a PDF document, used as a container format for scanned images of paper originals;
    • for electroglottograms (EGG), the WAV/PCM format, used as a container format.
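
As a rough illustration of two of the input rules above, the following Python sketch checks that a WAV file declares PCM encoding in its header and that an annotation file is well-formed XML whose bytes are valid UTF-8. The function names are hypothetical and are not part of Cocoon's tooling. The WAV check only recognises the plain PCM format tag, and the XML check does not inspect the encoding declaration.

```python
import struct
import xml.etree.ElementTree as ET

WAVE_FORMAT_PCM = 0x0001  # format tag for uncompressed PCM in the "fmt " chunk


def wav_is_pcm(data: bytes) -> bool:
    """Return True if data is a RIFF/WAVE file whose fmt chunk declares PCM.

    Assumption: WAVE_FORMAT_EXTENSIBLE files (which may also carry PCM)
    are not recognised by this simplified check.
    """
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        return False
    pos = 12
    # Walk the chunk list until the "fmt " chunk is found.
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        if chunk_id == b"fmt ":
            if pos + 10 > len(data):
                return False
            (tag,) = struct.unpack("<H", data[pos + 8:pos + 10])
            return tag == WAVE_FORMAT_PCM
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    return False


def is_utf8_xml(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 and parses as well-formed XML."""
    try:
        data.decode("utf-8")
        ET.fromstring(data)
    except (ValueError, LookupError, ET.ParseError):
        return False
    return True
```

Both functions take the file content as bytes, for example `wav_is_pcm(open(path, "rb").read())`; for very large audio files, reading only the first few kilobytes is normally enough to reach the fmt chunk.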

Preservation formats

The preservation formats are those recommended as input formats (listed above). Other accepted formats will go through a conversion step to these formats and only the resulting file will be retained.

In the case of audio files, as the responsibility for preservation only concerns the audio aspects of the data, it is not advisable to place other types of information (metadata, timecodes, etc.) in these documents. This other information should be made explicit in other documents (metadata, annotations). The same applies to video files, where only the audio and video aspects will be retained.
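
The conversion step and the choice between WAV and FLAC could be sketched as follows. This is an illustration under stated assumptions, not Cocoon's actual pipeline: the function names are hypothetical, the cut-off is set at the conservative end of the 2 to 4 GB range, the `pcm_s16le` codec assumes 16-bit source material, and the command is only built, not executed. The ffmpeg options used (`-vn`, `-map_metadata -1`, `-c:a`) ask ffmpeg to drop video streams and not to copy the source metadata.

```python
WAV_SIZE_LIMIT = 2 * 1024**3  # assumed cut-off, in bytes


def pcm_size_bytes(duration_s: float, sample_rate: int, channels: int, bits: int) -> int:
    """Estimate the size of the uncompressed PCM audio data."""
    return int(duration_s * sample_rate * channels * (bits // 8))


def preservation_audio_command(src: str, stem: str, pcm_bytes: int) -> list[str]:
    """Build an ffmpeg command converting src to WAV/PCM, or FLAC if too large."""
    if pcm_bytes < WAV_SIZE_LIMIT:
        codec, ext = "pcm_s16le", "wav"   # assumes 16-bit audio
    else:
        codec, ext = "flac", "flac"
    return [
        "ffmpeg", "-i", src,
        "-vn",                   # keep the audio stream only
        "-map_metadata", "-1",   # do not copy metadata from the source
        "-c:a", codec,
        f"{stem}.{ext}",
    ]
```

For example, one minute of 16-bit stereo audio at 44.1 kHz is about 10.6 MB of PCM data and would be kept as WAV, whereas five hours of 24-bit stereo audio at 96 kHz exceeds the assumed limit and would be stored as FLAC.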

Broadcasting formats

  • The data are disseminated in their preservation format, which represents the highest quality of information available, even though these files can sometimes be very large.

  • For audio data, a broadcast file in MP3 format with degraded quality is automatically derived from the preservation file. This format was chosen because of its good support by current browsers in their HTML5 implementations.

  • In the same way as for audio data, video data is broadcast using the MPEG-4 format (with a low bitrate and reduced size), a format chosen because of its good support by current browsers in their implementations of HTML5.

  • For annotations, the broadcast formats are:

    • the preservation format
    • if possible, an XML format (Transcriber or Pangloss DTD) allowing online consultation
    • for formats from CLAN, Transcriber, ELAN and Praat: an XML/TEI format produced by the TEI-CORPO software to facilitate interoperability. Data in XML/Pangloss format are also distributed in XML/TEI format using Cocoon tools.
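
The derivation of the audio and video broadcast files described above might look like the following sketch. It only builds ffmpeg command lines. The function names, bitrates and target height are assumptions chosen for illustration and are not the settings actually used by Cocoon.

```python
def mp3_derivative_command(src: str, dst: str, bitrate: str = "128k") -> list[str]:
    """Build an ffmpeg command producing a reduced-quality MP3 from src."""
    return ["ffmpeg", "-i", src, "-vn", "-c:a", "libmp3lame", "-b:a", bitrate, dst]


def mp4_derivative_command(src: str, dst: str, height: int = 360,
                           video_bitrate: str = "500k") -> list[str]:
    """Build an ffmpeg command producing a small H.264/AAC MPEG-4 file."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-b:v", video_bitrate,
        "-vf", f"scale=-2:{height}",   # reduce the frame size, keep aspect ratio
        "-c:a", "aac", "-b:a", "96k",
        "-movflags", "+faststart",     # move the index to the start for web playback
        dst,
    ]
```

A returned command can be executed with `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed with the corresponding encoders.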

References

  • Text Encoding Initiative (TEI):
  • Child Language Data Exchange System (CHILDES):
  • "Guide méthodologique pour le choix de formats numériques pérennes dans un contexte de données orales et visuelles" on the FranceArchives site
  • Loïc Liégeois, Carole Etienne, Christophe Parisse, Christophe Benzitoun, Christian Chanard. Using the TEI as a pivot format for oral and multimodal language corpora. Text Encoding Initiative Conference and Member's meeting 2015, Oct 2015, Lyon, France.
  • Modèles, Dynamiques, Corpus - UMR 7114 (MoDyCo) (2016). teicorpo [Tool]. ORTOLANG (Open Resources and TOols for LANGuage) - www.ortolang.fr, https://hdl.handle.net/11403/teicorpo.