Frequently Asked Questions (FAQ)

Organizational issues

What is Cocoon?

Cocoon is a specialized data repository for speech recordings. The data come from research activities such as fieldwork, interviews, and laboratory or professional experiments. The "primary" resource is always an audio or video recording. It may or may not be accompanied by annotation resources (e.g., transcripts, translations, timecodes, electroglottograms, or other physiological measures related to the recorded speech).

What are the depositor's responsibilities?

Deposited data is and remains the responsibility of the depositor, who must therefore ensure beforehand that they hold all the rights needed to deposit it. Consent forms and personal data processing declarations are the depositor's responsibility and must be handled on the depositor's own initiative.

It is up to the depositor to determine what data may be deposited, what descriptive information must be associated with it, under what conditions the data may be accessed by others, and under what conditions it may be reused. In particular, it is the depositor who must anonymize or pseudonymize the information if necessary. The depositor may be guided in these choices by a data protection officer, ethics committees, funders, their institution's policy, etc.

What are the responsibilities of the Cocoon repository?

As a service provider, Cocoon strives to ensure the availability of the repository, particularly the data access service. However, Cocoon reserves the right to interrupt its services for maintenance or for any other reason deemed necessary.

Cocoon never interferes with depositors' data or metadata except at their request and to assist them with implementation. Cocoon does, however, reserve the right to standardize descriptions when necessary and to enrich them by aligning them with reference datasets.

Who can deposit data?

Any member of the French higher education and research community in the humanities and social sciences can deposit in Cocoon, whatever their discipline (linguistics, anthropology, ethnomusicology, history, etc.).

Are deposits moderated?

The first step in moderation is to check that the depositor is a member of the community (see "Who can deposit data?") and understands and accepts both the depositor's responsibilities and those of the repository. Once this step is complete, the depositor is free to deposit the data they wish to publish in a preparation environment accessible only to them. Publication of any data goes through a control phase in which an administrator checks that the data and metadata are well formed and that the description is complete and accurate, standardizes the form of the data and metadata if necessary, and enriches the description by aligning it with reference datasets. For published data, any modification of data or metadata is subject to the same type of control. In all cases (publication or modification), an exchange may take place to request additional information.

Data issues

What types of data can be deposited?

Data that can be deposited must come from research activities, for example fieldwork, interviews, or laboratory or professional experiments. The "primary" resource is always an audio or video recording. It may or may not be accompanied by annotation resources (e.g., transcripts, translations, etc.).

Cocoon distinguishes three main types of resources: recordings, annotations, and collections.

  • Recordings: a recording can be audio or video.
  • Annotations: an annotation is any document (text, image, PDF, etc.) that provides commentary or direct information about an audio or video recording. Generally, these are transcriptions, translations, timecodes, or "scenographic" indications, but they can also be recordings of physiological measurements such as electroglottograms, nasal pressure measurements, etc. Publications, documentation, and illustrations are not annotations: they must be deposited elsewhere (for example in HAL) and can be mentioned in the metadata. The metadata themselves are also excluded, since they are already kept outside the annotation documents.
  • Collections: audio or video recordings, annotations, and other collections can be grouped together within collections. A collection is a grouping of resources that can be described as a whole. In particular, collections are used to delimit corpora, projects, or holdings. The hierarchy of sub-collections in a collection represents its classification scheme. Thematic classification schemes (by language, place, subject, etc.) are not advisable, since they can easily be obtained by faceting on metadata. Structural classification schemes are preferable, such as dividing a collection into assignments or an activity into projects.

What are the accepted formats?

For audio and video recordings as well as for annotations, Cocoon distinguishes between preservation formats and broadcast formats. Input files must either already be in a target preservation format or be convertible to one. All of these formats are described in more detail on the Formats page.

Repository functionality issues

What are the services of the repository?

Each landing page and its associated files (broadcast and preservation files) are accessible on the web through URLs. All of these URLs are described in more detail on the identifiers page.

Access to these files may be unrestricted or may require authentication. These access conditions are specified in the metadata.

Data can be deposited through an administration interface accessible to anyone with a user account on the repository. For large batch deposits, instructions for preparing data and metadata are presented on the Deposit guidelines page.

Data (metadata and files) are exposed on the repository web site, which provides multimedia consultation interfaces as well as search functions (faceted search on metadata, full-text search on annotations, geographic search on recording locations, and search within collections).

Metadata is also exposed through an OAI-PMH endpoint, a SPARQL endpoint, and a "linked open data" web publication.

The metadata is automatically referenced by several service providers through the OAI-PMH protocol, including Isidore, OLAC, CLARIN, and OpenAIRE.

The data (in their preservation format) are automatically copied to the CINES archiving system. For more details, see the preservation page.
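
As an illustration of how a harvester addresses an OAI-PMH endpoint, here is a minimal Python sketch that only builds request URLs. The endpoint address is a placeholder, not Cocoon's real one. The verb and argument names (ListRecords, metadataPrefix, resumptionToken) and the mandatory oai_dc prefix come from the OAI-PMH specification. The token value is invented.

```python
from urllib.parse import urlencode

# Placeholder address (assumption): use the endpoint given on the repository's pages.
OAI_ENDPOINT = "https://repository.example.org/oai"


def oai_request_url(verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = {"verb": verb}
    # Arguments left as None are omitted, so optional ones can be skipped.
    query.update({key: value for key, value in arguments.items() if value is not None})
    return OAI_ENDPOINT + "?" + urlencode(query)


# First page of a harvest in simple Dublin Core (oai_dc is required by the protocol).
first_page = oai_request_url("ListRecords", metadataPrefix="oai_dc")

# Following pages: the resumptionToken returned by the server is the only argument.
# The token below is invented for the example.
next_page = oai_request_url("ListRecords", resumptionToken="token-from-previous-response")

print(first_page)
print(next_page)
```

The resulting URLs can then be fetched with any HTTP client, repeating the second form until the server stops returning a token.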

Can deposited data be deleted?

Once published, data cannot be deleted except in cases of force majeure. Even when data is deleted, a trace of its existence remains through its DOI declaration, whose metadata is then reduced to a citation and the date of deletion. On the other hand, when the situation justifies it, the access rule for the files can be modified.

Can deposited data be modified?

Metadata can be changed at any time. The associated files, on the other hand, cannot be modified, but new versions can be deposited. All versions remain accessible, but the latest version is the default and is the one presented by all services (consultation, search).

Metadata issues

How to describe your data?

Metadata should be described following the Open Language Archives Community (OLAC) model, which is an extension of the qualified Dublin Core model (see the metadata page for more information). Required metadata includes the title, depositor, publisher, creation date, type, and recording location. Depositors are of course strongly encouraged to provide more. A cataloging guide can be consulted to understand the available categories and how Cocoon interprets them.

Metadata can be entered and edited through an administration interface accessible to anyone with a user account on the repository. For batch deposits, instructions for preparing the data and its description are presented on the Deposit guidelines page. For batch changes, contact a repository administrator.

How are metadata exposed?

  • in a classic web interface, on the landing pages of the web site.
  • in a "linked open data" interface.

How can metadata be retrieved?

  • by simply reading the content of the landing pages of the web site.
  • from the source code of the landing pages, in the form of <meta> tags expressed in different semantic web vocabularies (Dublin Core,
  • using a tool like Zotero and passing the DOIs to it.
  • using the OAI-PMH endpoint. Metadata can be requested in different models (OLAC, simple Dublin Core, qualified Dublin Core, DataCite).
  • using the SPARQL endpoint. The metadata is then expressed in the Europeana Data Model (EDM). In addition to the enrichment provided by the reference datasets used (VIAF, RAMEAU, Lexvo, GeoNames, DBpedia...), this model makes it possible to gather within a single entity (Cultural Heritage Object) its different representations (recordings and annotations in preservation or broadcast formats). Finally, it allows the reference datasets to be used as pivots for fetching additional information from other data repositories.
  • using the "linked open data" publication, which allows metadata expressed in the EDM model to be retrieved in various syntaxes.
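
To illustrate the second option (reading the <meta> tags of a landing page), here is a minimal sketch using Python's standard html.parser module. The sample page is invented. The "DC.title" naming style is a common Dublin Core convention in HTML, but the exact attribute names used on Cocoon's pages are an assumption here.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect (name, content) pairs from the <meta> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Tags such as <meta charset="..."> carry no name/content pair and are skipped.
        if name and content:
            self.metadata.append((name, content))


# Invented sample standing in for the source code of a landing page.
SAMPLE_PAGE = """
<html><head>
<meta charset="utf-8">
<meta name="DC.title" content="Sample recording">
<meta name="DC.language" content="fr">
</head><body></body></html>
"""

collector = MetaTagCollector()
collector.feed(SAMPLE_PAGE)
for name, content in collector.metadata:
    print(name, "=", content)
```

In practice the page source would be downloaded first, and the list of pairs could be filtered on the vocabulary prefix of interest.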

How to describe people (contributors, speakers)?

To identify and describe people (mainly depositors and researchers), Cocoon uses the VIAF (Virtual International Authority File) authority file. For other people who have not necessarily published anything, such as speakers, Cocoon maintains an internal reference list that identifies and describes them using classic vocabularies such as Dublin Core and FOAF. Using this internal list requires contacting a repository administrator.

Data file issues

How to expose files associated with data outside of Cocoon?

Files associated with data can exist in several formats (one for preservation and one for broadcasting). A preservation file that has been modified over time may also exist in several versions. Each of these files has its own URL. Specific URLs, derived from the OAI identifier of the data (see the permanent identifiers page for more explanations), give access to these different formats:

  • preservation: [identifierOAI] (current version), or [identifierOAI].version[n] (specific version, with n = 1, 2, 3, ...)
  • broadcasting: [identifierOAI].diffusion (current version)

These URLs can be used directly in the HTML code of your pages within suitable tags. For example for a video: <video src="[identifierOAI].diffusion" autoplay="true" preload="auto" controls></video>

Another solution, for recordings (audio or video), is to insert into the HTML code of your pages the embed code offered on the record pages, which displays a suitable viewer and minimal metadata: <iframe src="[identifierOAI]" height="320" width="600"></iframe>
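
The URL patterns listed in this section can be sketched in Python as follows. This is only an illustration. It assumes that the square brackets in the patterns are placeholders, so that version 2 of a file ends in ".version2". The identifier used is an invented placeholder, not a real Cocoon identifier.

```python
def preservation_url(identifier, version=None):
    """URL of the preservation file: current version, or version n when given."""
    if version is None:
        return identifier
    if version < 1:
        raise ValueError("version numbers start at 1")
    # Assumption: "[n]" in the documented pattern is a placeholder for the number.
    return f"{identifier}.version{version}"


def broadcast_url(identifier):
    """URL of the broadcast file (current version)."""
    return f"{identifier}.diffusion"


# Invented placeholder standing in for an OAI identifier URL.
identifier = "https://repository.example.org/id/sample-recording"

print(preservation_url(identifier))
print(preservation_url(identifier, version=2))
print(broadcast_url(identifier))
```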

How to download the files associated with the data?

  • After harvesting the repository with the OAI-PMH protocol, you can pick out the relevant identifiers in the results, build URLs that target the desired format and version, and then download those URLs (e.g., with tools such as wget or curl).
  • After performing a targeted search using the right facets, you can download a CSV file that lists, for this selection, some criteria including the URLs of the files. All that remains is to download these URLs (for example with tools like wget or curl).
  • You can express your targeted search using the SPARQL protocol and retrieve, for example, the list of URLs to download. All that remains, as with the previous solutions, is to download these URLs (for example with tools like wget or curl).
  • For a collection of recordings and annotations, you can directly request a zip of all the files in that collection. Only recordings in broadcast format and current versions of annotations in preservation format will be downloaded. Please note that the volumes to be downloaded can be substantial and may take varying amounts of time depending on the quality of your connection.
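
For the CSV route, a minimal Python sketch of the step between the export and the download tool might look like this. The sample export, its column names, and its comma delimiter are all assumptions for the sake of the example. The real export may be laid out differently.

```python
import csv
import io

# Invented export: the column names and the delimiter are assumptions.
SAMPLE_EXPORT = """title,url
Sample recording,https://repository.example.org/id/rec-1.diffusion
Sample annotation,https://repository.example.org/id/ann-1
"""


def urls_from_export(csv_text, column="url"):
    """Return the non-empty values of the URL column of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if row.get(column)]


urls = urls_from_export(SAMPLE_EXPORT)

# One URL per line is the layout that wget reads with its -i option.
url_list = "\n".join(urls) + "\n"
print(url_list, end="")
```

Saving url_list to a file such as urls.txt and running `wget -i urls.txt` would then fetch each file in turn.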

Miscellaneous issues

How to cite your data?

A digital object identifier (DOI) is assigned to each data item. These identifiers can be cited in publications to reference the data. The DOIs are displayed on the landing pages, where citations are offered in different styles (APA, Harvard, Chicago...). For a wider range of citation forms it is possible to use the web site

How does authentication work in Cocoon?

There are currently two types of accounts used in Cocoon:

  • The Cocoon account is an account that allows you to edit the metadata of your documents.
  • The Huma-Num account is used by Cocoon to grant depositors the right to deposit files on the server. Authentication is therefore requested on the fly when a file is uploaded to the server. This is also the account requested when someone accesses a recording or annotation file that is subject to controlled access.

Data integrity management in Cocoon

To ensure the integrity of the data in Cocoon, the files, once standardised and checked, are subject to a hash code computation. These hash codes are stored separately from the files and checked regularly. Alerts are sent and recovery procedures can be triggered if an alteration is detected.

The algorithm used by Cocoon to compute hash codes is MD5. Once the files are deposited at CINES, new hash codes, this time computed with the SHA-256 algorithm, are added.
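
The kind of check described above can be sketched with Python's standard hashlib module. This is an illustration of the general technique (compute a fingerprint, store it apart from the file, recompute and compare later), not Cocoon's or CINES's actual procedure. The function name and the sample file are invented.

```python
import hashlib
import os
import tempfile


def file_digests(path, chunk_size=1024 * 1024):
    """Compute the MD5 and SHA-256 hash codes of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


# Demonstration on a throwaway file standing in for a deposited recording.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"sample recording bytes")
    sample_path = tmp.name

# Reference hash codes, to be stored separately from the file itself.
reference = file_digests(sample_path)

# A later check on an intact file yields the same hash codes.
unchanged = file_digests(sample_path) == reference

# Simulate an alteration: a single extra byte changes both hash codes.
with open(sample_path, "ab") as handle:
    handle.write(b"!")
altered = file_digests(sample_path) != reference

os.remove(sample_path)
print(unchanged, altered)
```

Reading in chunks keeps memory use constant, which matters for large audio and video files.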

Wikipedia article on hash function

What licenses apply?

Cocoon recommends, where appropriate, using "Creative Commons" licenses. These licenses allow for fine-grained specification of what a user may or may not do in terms of resource reuse (recording, annotation). The URL of the chosen license is to be specified in the metadata in the Dublin-Core element of the same name.

The metadata, on the other hand, is systematically covered by the CC-BY-NC-ND-2.5 license, as declared in the metadataPolicy field exposed through OAI-PMH.

Creative Commons licenses are standard contracts for making works available online. They are non-exclusive permissions given by rights holders to the public. These authorizations specify the conditions of use of the works. In particular, these licenses make it possible to restrict commercial use and derivative works and to make the redistribution of works dependent on the mention of their authorship. Any rights not explicitly assigned in the license can be negotiated directly with the rights holders.

This movement is inspired by the free software movement, the "open source" movement, and the "open access" movement. The Creative Commons organization was founded in 2001 at Stanford Law School under the impetus of law professor Lawrence Lessig (cf. Lawrence Lessig's book Free Culture: How Big Media Uses Technology and the Law to Lock Down Culture and Control Creativity, 2004, available in PDF format and, of course, under a Creative Commons license). The licenses were initially written in English and with reference to American copyright law. Subsequently, the "International Commons" project was set up to translate and adapt the license texts so that they can be applied throughout the world while taking into account the specificities of national legislation.