Software for extracting file format information
This page covers software that identifies, validates, or extracts information from files according to their format. Software that performs these functions only incidentally, as part of other file processing, is mentioned only if it offers significant features in these areas as a separate operation. The notations for each application are:
- I: Identification. This tells you what type a file appears to be. The software may just examine the file extension or the first few bytes.
- V: Validation. This attempts to determine if the file conforms to the format. Software varies in how exhaustive its checks are and how nit-picking its criteria are.
- M: Metadata extraction. Software in this category makes information about the file's origin, creation details, copyright status, relationship to other information, etc., available.
- E: Error correction. Attempts to fix files.
- D: Desktop or command line application.
- L: Library, suited for incorporation into other software.
- O: Online application, which may be accessed through a website or a dedicated application that interacts with a remote service.
- OS: Open source.
- F: Freeware.
- F1: Free to some classes of users (e.g., personal, nonprofit).
- C: Commercial software requiring payment.
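To make the difference between identification and validation concrete, here is a minimal Python sketch. It is not how any tool listed on this page is implemented. The signature table, the function names identify and validate_png, and the choice of PNG as the example format are all assumptions made for illustration. Identification only looks at the first few bytes; the validation shown is deliberately shallow and checks just the chunk layout and CRCs.

```python
import struct
import zlib
from typing import List, Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# A tiny, illustrative signature table; real tools use far larger ones.
SIGNATURES = [
    (PNG_SIGNATURE, "PNG"),
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def identify(data: bytes) -> Optional[str]:
    """Identification: guess the format from the leading bytes only."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def validate_png(data: bytes) -> List[str]:
    """Shallow validation: walk the PNG chunks and verify each CRC.

    Returns a list of problems found; an empty list means these
    (limited) checks passed, not that the file is fully conformant.
    """
    if not data.startswith(PNG_SIGNATURE):
        return ["missing PNG signature"]
    problems = []
    pos = len(PNG_SIGNATURE)
    seen_iend = False
    while pos + 12 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            problems.append(f"chunk {chunk_type!r} is truncated")
            break
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and the chunk data.
        if zlib.crc32(data[pos + 4:end]) != stored_crc:
            problems.append(f"chunk {chunk_type!r} has a bad CRC")
        pos = end + 4
        if chunk_type == b"IEND":
            seen_iend = True
            break
    if not seen_iend:
        problems.append("no IEND chunk found")
    return problems
```

A file with a damaged chunk would still be identified as PNG by identify, while validate_png would report the bad CRC. That gap is why the I and V notations are listed separately.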
The software
- Adobe Bridge (M,D,C): Part of Adobe Creative Suite, this provides a convenient way to get information on Adobe and image files.
- Apache PDFBox (V,L,OS,F): The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
- Aperture (I,L,OS,F): A Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.
- BWF MetaEdit (D,M,V,OS,F): BWF MetaEdit permits embedding, validating, and exporting of metadata in Broadcast WAVE Format (BWF) files.
- DROID (I,D,L,OS,F): Identifies file formats using signature data from PRONOM, the format registry maintained by the UK National Archives.
- ExifTool (M,D,L,OS,F): Tool for reading and writing metadata. In spite of its name, not restricted to Exif.
- ExifViewer (M,O,F): Website where you can upload images and see their Exif metadata.
- ffident (I,L,OS,F): Java library to extract information from files and identify their formats.
- FIDO (I,D,OS,F): A Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.
- file (I,D,OS,F): A Unix/Linux shell command for identifying files. Open source in Linux implementations.
- FITS (I,V,M,D,L,OS,F): File Information Tool Set, developed at Harvard to run multiple validation and extraction tools in a single operation.
- Firefox Dublin Core Viewer Extension (M,L,F): Firefox extension that lets you access an overview list of Dublin Core Metadata embedded in HTML/XHTML documents with META and LINK elements.
- FlightCheck (V,D,C): Does detailed validation of Adobe formats and some others.
- ImageMagick (I,M,D,F): The identify command in ImageMagick identifies a file's type and extracts information about it.
- Jaudiotagger (L,OS,F): A Java library for tagging a variety of audio file formats.
- JHOVE (I,V,M,D,L,OS,F): Written mostly by me at Harvard, under the direction of Stephen Abrams.
- JHOVE2 (I,V,M,D,L,OS,F): A ground-up rewrite of JHOVE.
- libmagic (I,L,OS,F): The library which underlies the Linux file command and can be called from other software.
- MP3 Validator (V,E,D,F): Checks and repairs MP3 files. Mac OS X.
- NLNZ Metadata Extraction Tool (M,D,OS,F): Application from the National Library of New Zealand to programmatically extract preservation metadata from the headers of a range of file formats.
- TIFF Scrubber (V,E,D,C): Fixes problems in TIFF files.
- TagLib (M,L,OS,F): TagLib is a library for reading and editing the metadata of several popular audio formats. It currently supports ID3v1 and ID3v2 for MP3 files, Ogg Vorbis comments, and ID3 tags and Vorbis comments in FLAC, MPC, Speex, WavPack, TrueAudio, WAV, AIFF, MP4 and ASF files.
- Tika (M,D,OS,F): The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
- Validome XML Validator (V,O,F): Online XML validation.
- W3C HTML Validator (V,O,F): Online HTML validation.
- XML Validation (V,O,F): Online XML validation.
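Several of the command-line tools above can be driven from a script. As one hedged example, the following Python sketch calls the file command listed above and returns the MIME type it reports. The function name mime_type_via_file is hypothetical. The sketch assumes a reasonably recent file that accepts the --brief and --mime-type options, and it returns None when the command is not installed or the path is not a regular file.

```python
import os
import shutil
import subprocess
from typing import Optional


def mime_type_via_file(path: str) -> Optional[str]:
    """Ask the Unix `file` command for a MIME type; None if unavailable."""
    # Check the path ourselves rather than relying on the exit status
    # of `file` to signal an unreadable or missing path.
    if not os.path.isfile(path):
        return None
    exe = shutil.which("file")
    if exe is None:
        return None  # `file` is not installed on this system
    result = subprocess.run(
        [exe, "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

Calling mime_type_via_file on a PDF would typically give a string such as application/pdf, though the exact answer depends on the installed version and its magic database. Software that needs the same result without spawning a process can call libmagic directly instead.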
The information provided here doesn't constitute an endorsement and isn't guaranteed to be correct. It should be considered just a first step toward finding more detail. I have received no compensation for listing these items here; however, I was paid to write JHOVE.