This is a draft document; please comment and edit. I have borrowed extensively from the IU Digital Library Program http://wiki.dlib.indiana.edu/confluence/display/INF/Filename+Requirements+for+Digital+Objects.
MIT Libraries File Naming Scheme
As we create new collections, it is useful if there is consistency in the filenames assigned to digital objects. Each collection will impose its own restrictions on filenames, but following these requirements will ensure basic consistency across collections and make later processing of the files much easier.
The need for standardized filenames
The needs to be fulfilled by enforcing requirements on filenames are (in order of importance):
- Ease of identification. During the life of a digital file, it moves through various locations. These moves may be due to manual processing or automatic processing. A file may stay in a given location for an arbitrary length of time, during which human memory of its origin fades. At all points in the life of the file, humans and automatic processes must be able to easily identify which digital object the file belongs to and the position it occupies within that object. As a side effect of this, the filename will facilitate the process of locating metadata about the file.
- Ease of automatic processing. Automatic processing systems must be able to make basic assumptions about the filenames they will process. Developers should not have to write complex code to handle special characters. (As a security matter, though, it is good practice to verify that a filename conforms to these requirements before starting an automatic process.)
Requirements for filenames
Absolute requirements
All files must conform to the following requirements:
- Each filename must contain an identifier that uniquely specifies a single digital object within the parent collection.
- If a digital object consists of multiple files, each filename must contain the object's identifier, along with a unique sequence number.
- Each filename must be fully specified. It cannot just be a sequence number that is dependent on location within a directory structure for context.
Rationale: Files are often moved between locations for processing or testing. It is not always feasible to move an entire directory structure with the file, so all necessary information must be in the filename itself.
- Filenames must not include spaces.
Rationale: There are many instances where using a space in a filename can cause programs to misbehave. Automatic processing as well as human access to the file becomes more difficult when spaces are involved.
- The first character of the filename must be an ASCII letter ('a' through 'z' or 'A' through 'Z').
Rationale: Many programming and metadata languages place this restriction on their identifiers. Filenames should be usable as identifiers in these languages (e.g., section IDs in a METS document).
- The "base" filename may include only ASCII letters ('a' through 'z' and 'A' through 'Z'), ASCII digits ('0' through '9'), hyphens, underscores, and periods. No other characters are permitted. See "Best practices" regarding the use of periods and uppercase letters.
Rationale: Characters from other character sets can be difficult to read, depending on program support and available fonts. Many operating systems and programs are unable to correctly process non-ASCII characters. Punctuation and other ASCII characters not listed here may have special meanings, depending on the context; files using these characters may cause unexpected problems.
- The "base" filename must be followed by a single period and a suitable extension to specify the type of file. The extension should consist of three letters (e.g., jpg, txt, xml, tif), but longer extensions are permissible if they are widely used (e.g., html, tiff, djvu, aiff).
Rationale: Whenever possible, the extension should make sense to a human. On systems where the file extension dictates automatic behaviors, the file should exhibit the expected behavior.
- A derivative file must have the same name as the master file, except the "base" filename should have an indication of the derivative's type appended (e.g., "full" or "screen" for images, an indication of the bitrate for audio files). Derivative files will typically have a different file type, and therefore a different extension, than the master file.
Rationale: It should always be easy to identify files with master-derivative relationships.
- Directory (folder) names should not include periods.
- Document the file naming scheme for each project and explain the decisions that were made. (U South Carolina)
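The character and extension rules above lend themselves to a mechanical check before files enter an automatic process. The following is a minimal Python sketch, not a definitive implementation. The function name `check_filename` and the decision to accept extensions of three or four alphanumeric characters (so that an extension such as mj2 passes) are assumptions made for illustration. It does not check identifier uniqueness or sequence numbers, which depend on the collection.

```python
import re


def check_filename(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if not re.match(r"[A-Za-z]", name):
        problems.append("first character is not an ASCII letter")
    # Spaces are reported separately above, so they are tolerated here.
    if re.search(r"[^A-Za-z0-9_. -]", name):
        problems.append("contains a character other than letters, digits, '-', '_', '.'")
    # Everything after the last period is treated as the extension.
    base, dot, ext = name.rpartition(".")
    # Assumption: 3 or 4 alphanumeric characters count as a suitable extension.
    if not dot or not base or not re.fullmatch(r"[A-Za-z0-9]{3,4}", ext):
        problems.append("missing base name or a 3-4 character extension")
    return problems
```

For example, `check_filename("MC025_nb41_017.tif")` returns an empty list, while `check_filename("my file.tif")` reports the space. Applied strictly, the first-character rule also flags names that begin with a digit, such as names that start with an Aleph system number.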
Best practices
The following "best practices" should be followed whenever possible. If one of these practices is not followed, the change should be well documented, with a description of the reasons for not following the practice.
- While periods are permissible in "base" filenames, it is highly recommended that they be avoided.
Rationale: Some programs assume that there is only a single period in a filename, and will behave strangely if multiple periods are present.
- It is preferable that all letters in a filename be lowercase. If a filename includes consecutive human-readable words, they may be denoted by CamelCase (e.g., wnp-04-RoyalSociety-ncn-t123.tif). This is expected to be relatively rare, though.
Rationale: Lowercase letters aid human readability and make it easier to type the filename. In collections where filenames contain many human-readable words, CamelCase aids readability.
- Portions of the filename should indicate more specific detail as they are read from left to right. That is, the far left portion of the name should indicate the class of item, the next portion should be the item-specific ID, followed by a page/section number, and ending with the indication of derivative size. (Any of these portions that do not apply to the current file may be omitted.)
Rationale: Alphabetical listings of files make more sense with this organization.
- Distinct portions of the filename should be separated by underscores.
Rationale: Separating the portions makes the filename both easier to read and easier to process automatically. Note that it is reasonable for the "identifier" portion of the filename to retain hyphens in identifiers from external sources, as in ihs-SHMU_01_13-01-05.tif. This reduces confusion when locating items provided by other institutions.
- File names should be limited to 31 characters or fewer (including the period and file extension). Total path length (directories + file name) should not exceed 256 characters. (BCR-CDP; controlledvocabulary.com)
- While it is permissible for two different collections to contain files with identical names, this should be avoided.
Rationale: It will not be possible to know of all filenames in use. Nonetheless, identical names can be confusing, and care should be taken to reduce the probability of identical names.
- Page numbers should be padded with leading zeros so that all filenames in a collection have the same number of characters for the page number portion. In most cases, this will be two or three digits.
Rationale: This forces pages to display in the correct order when listed alphabetically, and provides more visual consistency when scanning a long list of files.
- When creating filename standards for a new collection, the standards should be based on existing collections/objects with similar characteristics.
Rationale: Minimizing the variability in filename standards eases both automatic and manual processing.
- Whenever possible, the digital object's "primary" identifier (the identifier appearing in the filenames) should correspond to an identifier in use for the original (physical) object, such as the Aleph Bibliographic record number, the OCLC number, or the Archives collection number. If the format of the primary identifier conflicts with the absolute filename requirements, appropriate changes should be made. If the format of the primary identifier conforms to the absolute filename requirements but violates best practices, it may be left intact.
Rationale: It should be easy to determine the relationship between digital files and physical objects. This is easier if the identifier in the filename is as similar as possible to the identifier associated with the physical object.
- For derivative files intended primarily for Web display, one consideration for naming is that images may need to be cited by users in order to retrieve other higher-quality versions. If so, the derivative file name should contain enough descriptive or numerical meaning to allow for easy retrieval of the original or other digital versions. (NARA)
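Several of these practices (left-to-right specificity, underscore separators, zero-padded page numbers, and the 31-character limit) can be combined in a small helper. The sketch below is illustrative only: the function name `build_filename`, its parameters, and the choice to attach a derivative tag with a hyphen are assumptions, although the hyphen mirrors the sample filenames in this document.

```python
MAX_NAME_LENGTH = 31  # recommended limit, including the period and extension


def build_filename(parts, sequence, extension, pad=3, derivative=None):
    """Join name portions left to right, from most general to most specific.

    parts      -- e.g. ["MC025", "nb41"] (class of item, then item ID)
    sequence   -- page or image number, zero-padded to `pad` digits
    derivative -- optional tag such as "tn"; joining it with a hyphen is an
                  illustrative choice, not a stated rule
    """
    seq = str(sequence).zfill(pad)
    if derivative:
        seq = f"{seq}-{derivative}"
    name = "_".join([*parts, seq]) + "." + extension
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(
            f"{name!r} is {len(name)} characters; limit is {MAX_NAME_LENGTH}"
        )
    return name
```

With these assumptions, `build_filename(["MC025", "nb41"], 17, "tif")` yields `MC025_nb41_017.tif`. Because every sequence number has the same width, an alphabetical listing of the results matches page order.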
Sample filenames
Collection | Filename | Notes
---|---|---
Edgerton Collection | MC025_nb41_017.tif | MC025 is the collection number for the Edgerton Collection in the Institute Archives. nb stands for notebook; nb41 stands for notebook #41. 017 stands for the 17th sequential image of notebook #41 (which may or may not be exactly page #17). In this case, the source for the digital image is the notebook itself.
Edgerton Collection | MC025_nb41-mf_017.tif | As above, except that the source for the digital image is the microfilm of the notebook. nb41-mf stands for the microfilm (mf) of notebook #41. 017 stands for the image of the 17th whole frame on the microfilm.
Edgerton Collection | MC025_nb41-mf-split_017.tif | As above, except that the digital images have been split and cropped so that they no longer represent a whole frame from the microfilm. nb41-mf-split stands for split images made from the microfilm of notebook #41. 017 is the 17th sequential image.
Edgerton Collection | MC025_nb41-mf-split_017-tn.jpg | As above, except this is a derivative file. 017-tn stands for the thumbnail (tn) of the 17th image made from splitting digital images of the microfilm frames.
Edgerton Collection | MC025_nb41_017-tn.jpg | A derivative file based on the first example, at the top of the table. 017-tn stands for the thumbnail (tn) of the 17th image. The source for the digital image is the original notebook.
Science Journals | 000291693_v001_0001.tif | 000291693 is the system number for the Aleph record on which this volume of Transactions of the American Society for Steel Treating appears. v001 stands for volume 1, and 0001 stands for the first image of this volume.
Book (Off Campus Collection) | 001020855_0064.tif | 001020855 is the Aleph system number for this edition of The Principle of Relativity. 0064 stands for the 64th image of the book. The source for the digital image is the book itself.
Book (Off Campus Collection) | 001020855_0064-tn.jpg | As above, except this is a derivative file. 0064-tn stands for the thumbnail (tn) of image 0064.
What is Engineering? Freshman Lecture Series | 000565646_pt01.mj2 | 000565646 is the Aleph system number for lecture no.2 from the 1977 What is Engineering freshman lecture series. pt01 and pt02 stand for two Motion JPEG 2000 files, one for each part of the lecture. (The original lecture was recorded on two tapes, parts 1 and 2, and this division has been maintained.) [Hypothetical example; these mj2 files do not exist]
What is Engineering? Freshman Lecture Series | WIE77_n02_000565646_pt01.mj2 | Alternate file name for the above. Allows sorting by collection, year, and lecture number. WIE77 stands for the What is Engineering? Freshman Lecture Series from 1977. n02 stands for the lecture number. How does having the Aleph system number *not* in the first position affect our ability to connect image files to Aleph metadata?
What is Engineering? Freshman Lecture Series | 000566595_pt01.mj2 | 000566595 is the Aleph system number for lecture no.2 from the 1978 What is Engineering freshman lecture series. pt01 and pt02 stand for two Motion JPEG 2000 files, one for each part of the lecture.
What is Engineering? Freshman Lecture Series | WIE78_n02_000566595_pt01.mj2 | Alternate file name for the above. Allows sorting by collection, year, and lecture number. WIE78 stands for the What is Engineering? Freshman Lecture Series from 1978. n02 stands for the lecture number.
What is Engineering? Freshman Lecture Series | 000566603_pt01.mj2 | 000566603 is the Aleph system number for lecture no.3 from the 1978 What is Engineering freshman lecture series. pt01 and pt02 stand for two Motion JPEG 2000 files, one for each part of the lecture.
What is Engineering? Freshman Lecture Series | WIE78_n03_000566603_pt01.mj2 | Alternate file name for the above. Allows sorting by collection, year, and lecture number. WIE78 stands for the What is Engineering? Freshman Lecture Series from 1978. n03 stands for the lecture number.
RVC or Rotch example | |
Archives collection using folder numbers? | MC###_f021_003_003.tif | MC### stands for the collection number. f021 stands for folder #21. 003 stands for the third item in folder #21. The last 003 stands for the page/image number from the third item in folder #21.
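Because a derivative shares its master's base name plus an appended tag, the master's name can be recovered mechanically from names like the samples above. The following Python sketch assumes the tag is attached with a hyphen after the final numeric portion, as in the sample thumbnails. The function name `split_derivative` and the tag list (tn, full, screen, drawn from examples in this document) are illustrative assumptions to be adjusted per collection.

```python
import re

# Matches a trailing "-<tag>" after the final numeric portion, e.g. "017-tn".
# The tag list is an assumption; extend it for each collection.
_DERIVATIVE_RE = re.compile(r"^(?P<stem>.+_\d+)-(?P<tag>tn|full|screen)$")


def split_derivative(filename):
    """Return (master_stem, derivative_tag); the tag is None for a master file.

    Assumes the filename ends with a period and an extension.
    """
    stem, _, _ext = filename.rpartition(".")
    match = _DERIVATIVE_RE.match(stem)
    if match:
        return match.group("stem"), match.group("tag")
    return stem, None
```

For instance, `split_derivative("MC025_nb41-mf-split_017-tn.jpg")` returns `("MC025_nb41-mf-split_017", "tn")`, so the master can be looked up by that stem regardless of its extension.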
Directory Structure
When a vendor sends files on hard disk:
- Shipment number (MIT generated)
  - Collection number or Aleph system number, with volume number if applicable
    - Image files, with derivatives interfiled
    - checksum.md5
  - Collection number or Aleph system number, with volume number if applicable
    - Image files, with derivatives interfiled
    - checksum.md5
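On receipt, each directory's checksum.md5 can be used to confirm that the files arrived intact. This document does not specify the layout of checksum.md5, so the Python sketch below assumes the common layout written by the md5sum utility: one line per file, with the hex digest followed by the filename. The function names `md5_of` and `verify_directory` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory, manifest_name="checksum.md5"):
    """Return names of files that are missing or whose MD5 does not match.

    Assumes each manifest line is "<hex digest> <filename>", optionally
    with a "*" before the filename (the md5sum layout).
    """
    directory = Path(directory)
    failures = []
    for line in (directory / manifest_name).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")
        target = directory / name
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_directory` means every file listed in the manifest is present and matches. Files that exist in the directory but are not listed in the manifest are not reported by this sketch.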