Format of SoftBook .imp files

from http://krausyaoj.tripod.com/reb1200.htm

An .imp file consists of a header, book properties, name of .RES directory, table of contents, and the contents of each file. Version 1 is created by Softbook Publisher 1.0. Version 2 is created by Softbook Publisher 1.5. The header is the same for both versions except for the value of the version field. The versions differ in their table of contents.

Header which totals 48 bytes
offset 0x00: 2 bytes, version
offset 0x02: 8 bytes, constant "BOOKDOUG"
offset 0x0A: 8 bytes, unknown
offset 0x12: 2 bytes, count of included files
offset 0x14: 2 bytes, length of directory name
offset 0x16: 2 bytes, count of bytes remaining for header and book properties
offset 0x18: 4 bytes, unknown
offset 0x1C: 4 bytes, unknown
offset 0x20: 4 bytes, compression off=0, on=1
offset 0x24: 4 bytes, encryption plain=0, encrypted=2
offset 0x28: 4 bytes, zoom state, both=0, small=1, large=2
offset 0x2C: 4 bytes, unknown
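
The header layout above maps directly onto a fixed-size binary structure. The following is a minimal Python sketch, not a reference implementation: it assumes big-endian byte order (the description does not state one), and the names `IMP_HEADER` and `parse_imp_header` are made up for illustration.

```python
import struct

# Assumption: fields are big-endian; the description above does not state the byte order.
IMP_HEADER = struct.Struct(">H8s8sHHHIIIIII")  # 2+8+8+2+2+2+6*4 = 48 bytes

def parse_imp_header(data: bytes) -> dict:
    """Unpack the 48-byte .imp header; unknown fields are read but not returned."""
    if len(data) < IMP_HEADER.size:
        raise ValueError("an .imp header needs at least 48 bytes")
    (version, signature, _unknown_0a, file_count, dir_name_len, remaining,
     _unknown_18, _unknown_1c, compression, encryption, zoom,
     _unknown_2c) = IMP_HEADER.unpack_from(data)
    if signature != b"BOOKDOUG":
        raise ValueError("missing BOOKDOUG signature")
    return {
        "version": version,
        "file_count": file_count,
        "dir_name_len": dir_name_len,
        "remaining": remaining,            # bytes remaining for header and book properties
        "compressed": compression == 1,
        "encrypted": encryption == 2,
        "zoom": {0: "both", 1: "small", 2: "large"}.get(zoom, zoom),
    }
```

If the values look implausible on a real file (for example a version of 0x0100 or 0x0200), the byte-order assumption is the first thing to revisit.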

Book properties start at offset 0x30
ID: null terminated C string
Bookshelf Category: null terminated C string
Subcategory: null terminated C string, not displayed on REB1200
Title: null terminated C string
Last name: null terminated C string
Middle name: null terminated C string
First name: null terminated C string

Note: Name fields are displayed in order "first middle last". Most REB1200 eBooks place the entire name in First.
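
Reading the book properties amounts to walking seven null-terminated strings starting at offset 0x30. A small Python sketch under stated assumptions: the text encoding is assumed to be Latin-1 (the source does not say), and `read_c_strings`, `read_book_properties` and `display_name` are hypothetical helper names.

```python
PROPERTY_NAMES = ("id", "category", "subcategory", "title", "last", "middle", "first")

def read_c_strings(data: bytes, offset: int, count: int):
    """Read `count` null-terminated strings; return them and the offset after the last one."""
    strings = []
    for _ in range(count):
        end = data.index(b"\x00", offset)                   # ValueError if unterminated
        strings.append(data[offset:end].decode("latin-1"))  # assumed encoding
        offset = end + 1
    return strings, offset

def read_book_properties(data: bytes, offset: int = 0x30) -> dict:
    """Map the seven property strings to their names, in file order."""
    values, _ = read_c_strings(data, offset, len(PROPERTY_NAMES))
    return dict(zip(PROPERTY_NAMES, values))

def display_name(props: dict) -> str:
    """Join the name fields in display order "first middle last", skipping empty ones."""
    return " ".join(p for p in (props["first"], props["middle"], props["last"]) if p)
```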

Name of .RES directory

Offset to start of table of contents is 48 plus length of directory name and length of book properties. Files are included in the same order in the table of contents as the contents of the files following the table of contents. A file name of four spaces (0x20) becomes "DATA.FRK" when the .imp file is expanded into a .RES directory. The 12 bytes following "BOOKDOUG" are removed from the header and the book properties are appended and become the "RSRC.INF" file.

Version 1

Table of contents for each included file except RSRC.INF:
4 bytes: file name
2 bytes: 0x0000 or 0x0001
4 bytes: file size

For each included file except RSRC.INF
4 bytes: file name
2 bytes: 0x0000 or 0x0001
4 bytes: file size
variable: contents of files

Version 2

Table of contents for each included file except RSRC.INF:
4 bytes: file name
4 bytes: 0x00000000
4 bytes: file size
4 bytes: file type
4 bytes: 0x00000000

For each included file except RSRC.INF:
4 bytes: file name
4 bytes: 0x00000000
4 bytes: file size
4 bytes: file type
4 bytes: 0x00000000
variable: contents of files
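
The two table-of-contents layouts differ only in entry size (10 bytes for version 1, 20 bytes for version 2). The sketch below reads either one; it assumes big-endian fields, and `parse_toc` is an illustrative name rather than anything taken from the original tools.

```python
import struct

# Assumption: big-endian fields; entry layouts follow the two listings above.
TOC_V1 = struct.Struct(">4sHI")     # name, 0x0000/0x0001, size     -> 10 bytes
TOC_V2 = struct.Struct(">4sII4sI")  # name, zero, size, type, zero  -> 20 bytes

def parse_toc(data: bytes, offset: int, count: int, version: int) -> list:
    """Return (file name, size, file type) for `count` table-of-contents entries."""
    entries = []
    for _ in range(count):
        if version == 1:
            name, _flag, size = TOC_V1.unpack_from(data, offset)
            file_type = None            # version 1 entries carry no file type
            offset += TOC_V1.size
        else:
            name, _zero1, size, file_type, _zero2 = TOC_V2.unpack_from(data, offset)
            offset += TOC_V2.size
        # A name of four spaces expands to DATA.FRK in the .RES directory.
        text = "DATA.FRK" if name == b"    " else name.decode("ascii")
        entries.append((text, size, file_type))
    return entries
```

The same entry layout precedes each file's contents, so the unpacking code can be reused while walking the file bodies.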

RSRC.INF File

This file is composed of the header and book properties sections of the .imp file with 12 bytes removed after the constant "BOOKDOUG". The identifier comes from the <dc:Identifier> element after the <dc:Identifier> element whose id attribute equals the unique-identifier attribute of the package element. If this element is not present, the identifier will be a null string and the category and title of the Settings command will be changed to those of this ebook.

This file is modified after download by a number of Set-Cookie HTTP headers. The values of cookies SOURCE_ID and SOURCE_TYPE are prepended to the BOOK_ID. The bytes following first name are added including cookies ISSUE_NUMBER and CONTENT_FEED. SOURCE_ID is usually "3" and SOURCE_TYPE is usually "B".

32 byte header
offset 0x00: 2 bytes, version constant 0x0001
offset 0x02: 8 bytes, signature constant "BOOKDOUG"
offset 0x0A: 2 bytes, offset 12 bytes before end of file
offset 0x0C: 4 bytes, constant 0x00000001 unknown
offset 0x10: 4 bytes, constant 0x00000007 unknown
offset 0x14: 4 bytes, compression off=0, on=1
offset 0x18: 4 bytes, encryption plain=0, encrypted=2
offset 0x1C: 4 bytes, zoom state, both=0, small=1, large=2
offset 0x20: 4 bytes, unknown
C string: identifier, format SOURCE_ID:SOURCE_TYPE:BOOK_ID
C string: category
C string: subcategory (not used on REB1200), always 0x00
C string: title
C string: last name
C string: middle name
C string: first name (usually entire name in this field on REB1200)
padding bytes so next record is 4-byte aligned
4 bytes: constant 0x00000002
4 bytes: constant 0xFFFFFFFF for books, ISSUE_NUMBER for issues of periodicals
C string: null string for books, CONTENT_FEED for issues of periodicals
C string: SOURCE_ID:SOURCE_TYPE:None
4 bytes: unknown

DATA.FRK File

Element text is extracted and placed in this file. Element tags are replaced with control characters. This file can be compressed and encrypted, with compression occurring before encryption. This file is compressed when the element <meta name="x-SBP-compress" content="on"/> is included in the <x-metadata> element of the package file. The compression algorithm used is LZSS. This file is encrypted when the element <meta name="x-SBP-encrypt" content="on"/> is included in the <x-metadata> element of the package file. The encryption algorithm used is DES. The 8 byte encryption key is in the SoftBook Edition Encryption Key File (.key) at offset 0x0C.

Characters less than 0x20 are removed except for line break, which is replaced with 0x20. Multiple 0x20 characters are replaced with a single 0x20.

Control characters
0x0A end of document, forced page break
0x0B start of element except <span>
0x0D line break element <br />
0x0E start of table element <table>
0x0F image element <img />
0x13 end of table cell </td> tag
0x14 horizontal rule element <hr />
0x15 before and after page header content
0x16 before and after page footer content
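
The control-character table lends itself to a simple lookup when inspecting an uncompressed, unencrypted DATA.FRK. A minimal Python sketch; the dictionary only restates the table above, and `describe_controls` is a hypothetical helper name.

```python
# Control characters as listed above (uncompressed, unencrypted DATA.FRK only).
CONTROL_CHARS = {
    0x0A: "end of document, forced page break",
    0x0B: "start of element except <span>",
    0x0D: "line break element <br />",
    0x0E: "start of table element <table>",
    0x0F: "image element <img />",
    0x13: "end of table cell </td> tag",
    0x14: "horizontal rule element <hr />",
    0x15: "before and after page header content",
    0x16: "before and after page footer content",
}

def describe_controls(data: bytes):
    """Yield (offset, description) for every listed control character in `data`."""
    for pos, byte in enumerate(data):
        if byte in CONTROL_CHARS:
            yield pos, CONTROL_CHARS[byte]
```

The offsets this yields are the same DATA.FRK offsets that the resource records (for example image and anchor records) point at.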

.key File

When an encrypted .imp file is created the DES encryption key is stored in a file with extension .key. The DES key is used to encrypt the content of file DATA.FRK after compression. This key is encrypted with the REB1200 device key and stored in resource !!ky.

offset 0x00: 4 bytes, constant 0x00000001
offset 0x04: 4 bytes, constant 0x00000000
offset 0x08: 4 bytes, constant 0x00000001
offset 0x0C: 8 bytes, DES key
offset 0x14: 4 bytes, constant 0x00000000
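
Extracting the DES key from a .key file is a single fixed-layout read. The sketch below assumes big-endian constants and uses the hypothetical name `read_des_key`; it only pulls out the 8 key bytes and does not attempt any decryption.

```python
import struct

# Assumption: big-endian constants. Layout: 1, 0, 1, 8-byte DES key, 0 -> 24 bytes.
KEY_FILE = struct.Struct(">III8sI")

def read_des_key(data: bytes) -> bytes:
    """Return the 8-byte DES key stored at offset 0x0C of a .key file."""
    c0, c4, c8, key, c14 = KEY_FILE.unpack_from(data)
    if (c0, c4, c8, c14) != (1, 0, 1, 0):
        raise ValueError("unexpected constants; not a .key file or wrong byte order")
    return key
```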

Resource Files

A Softbook Edition file (.imp) is split into the files of a .RES directory for loading onto an REB1200 using its compact flash card. These files are named RSRC.INF, DATA.FRK and a number of files with names of four (4) uppercase letters.

Many of the file types follow a common pattern of having a 32 byte header, a variable length data section, followed by an index. A 4 byte pointer at offset 0x0A of the header points to the start of the index. The index entries point to the data in the variable length section in reverse order. Each index entry has an offset to the beginning of its data and the length of that data.

standard header, 32 bytes
offset 0x00: 2 bytes, file version, currently version 0x0001.
offset 0x02: 4 bytes, file type
offset 0x06: 4 bytes, null
offset 0x0A: 4 bytes, offset to start of index
offset 0x0E: 18 bytes, null
offset 0x20: start of variable length data

index entry, 12 bytes each
offset 0x00: 2 bytes resource ID
offset 0x02: 4 bytes length of resource
offset 0x06: 4 bytes offset to start of resource
offset 0x0A: 2 bytes constant 0x0000
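
Because most resource files share this header-data-index pattern, one generic reader covers many of the file types described later. The following Python sketch rests on two assumptions that the source does not confirm: fields are big-endian (as on the REB1200; the forum notes further down suggest EBW1150 files differ), and the index runs from the index offset to the end of the file. The name `parse_resource_file` is illustrative.

```python
import struct

HEADER = struct.Struct(">H4s4xI18x")  # version, file type, null, index offset, null -> 32 bytes
INDEX_ENTRY = struct.Struct(">HIIH")  # resource ID, length, offset, constant 0     -> 12 bytes

def parse_resource_file(data: bytes) -> dict:
    """Return the version, file type and a {resource ID: bytes} map."""
    version, file_type, index_offset = HEADER.unpack_from(data)
    resources = {}
    # Assumption: index entries fill the file from index_offset to the end.
    last_start = len(data) - INDEX_ENTRY.size
    for pos in range(index_offset, last_start + 1, INDEX_ENTRY.size):
        res_id, length, offset, _zero = INDEX_ENTRY.unpack_from(data, pos)
        resources[res_id] = data[offset:offset + length]
    return {"version": version, "type": file_type, "resources": resources}
```

Types with a different index entry size (such as !!sw with 16-byte entries) or trailing data after the index (such as !!ky) would need their own handling.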

File types

Each of the files whose name is four characters long has a 4-byte file type at offset 0x02. The minimal set of file types for a plain text file is: !!cm, !!sw, AncT, BGcl, BPgz, BPgZ, ESts, Mrgn, pInf, PPic, StRn, Styl

These types include the following:
!!cm : compression
!!ky : encryption key container
!!sw
AncT : anchor <a> elements
Ano2 : annotations
AtTp
BGcl : background color
BPgz : book pagination
BPgZ
BPos : book position
Clos
Dict
eLnk : external links containing the href attributes of the <a> elements
ESts : extended styles
Glos
HfPz : header and footer pages
HfPZ
HRle : horizontal rule
Hyp2
ImRn : image run
Lnks : internal and external links
MASK
Mrgn : margin
MRPs : markups?
Offs
Pcz0
PcZ0
Pcz1 : table border
PcZ1
PNG : PNG images
PIC2
PICT
pInf
PPic
SKtb
StR2 : extended string runs
StRn : string runs linking element text to styles
STR#
Styl : styles
Tabl : tables
TagS
TCel : table cells
TGNt
TRow : table rows

!!cm format

Present only in compressed files. One method of compression supported is LZSS. Resource 0x65 contains one or more 10-byte records. If resource 0x65 has only one record, its format is that of the last record. Resource 0x64 is 10 bytes.

standard header, 32 bytes
offset 0x00: constant 0x0001 version
offset 0x02: constant "!!cm" file type
offset 0x0A: offset to start of index

resource 0x65 records, 0x0A (10) bytes each
offset 0x00: 4 bytes, byte position in uncompressed data
offset 0x04: 4 bytes, byte position in compressed data
offset 0x08: 2 bytes, bit position past byte position in compressed data
0x01: 7 bits, 0x02: 6 bits, 0x04: 5 bits, 0x08: 4 bits, 0x10: 3 bits, 0x20: 2 bits, 0x40: 1 bit, 0x80: exactly at byte position

last record
offset 0x00: 4 bytes, size of file DATA.FRK before compression
offset 0x04: 4 bytes, constant 0x00000000
offset 0x08: 2 bytes, constant 0x0000

resource 0x64, 0x0A (10) bytes
offset 0x00: 2 bytes, constant 0x0001 unknown
offset 0x02: 2 bytes, constant 0x0000 unknown
offset 0x04: 2 bytes, constant 0x0001 unknown
offset 0x06: 2 bytes, compression window size, default value of 0x000E (14) bits
offset 0x08: 2 bytes, look-ahead buffer size, default value of 0x0003

standard index, 2 index entries
2 bytes: index
4 bytes: record length
4 bytes: offset to start of record
2 bytes: constant 0x0000
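
The records of resource 0x65 appear to act as checkpoints that pair a position in the uncompressed text with a byte and bit position in the compressed stream. A Python sketch of decoding them, assuming big-endian fields; `bit_offset` and `parse_cm_checkpoints` are hypothetical names, and the LZSS decompression itself is not shown.

```python
import struct

CM_RECORD = struct.Struct(">IIH")  # uncompressed pos, compressed pos, bit flag -> 10 bytes

def bit_offset(flag: int) -> int:
    """Map the one-hot flag to bits past the byte position (0x80 -> 0, 0x01 -> 7)."""
    if flag not in (0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80):
        raise ValueError(f"unexpected bit flag {flag:#x}")
    return 8 - flag.bit_length()

def parse_cm_checkpoints(resource: bytes):
    """Return ([(uncompressed, compressed, bits)], uncompressed size) for resource 0x65."""
    records = [CM_RECORD.unpack_from(resource, pos)
               for pos in range(0, len(resource) - CM_RECORD.size + 1, CM_RECORD.size)]
    *checkpoints, last = records           # the last record holds the uncompressed size
    points = [(u, c, bit_offset(flag)) for u, c, flag in checkpoints]
    return points, last[0]
```

With a single record, the list of checkpoints is empty and only the uncompressed size of DATA.FRK is returned.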

!!ky format

When a Softbook Edition file (.imp) is encrypted, only the DATA.FRK file is encrypted. When the same ebook is downloaded to two different REB1200 ebook readers, only the 8 bytes at offset 0x2C are different. These 8 bytes are not the encryption key for the file. They are the book encryption key from file .key encrypted with the REB1200 device key. If the format version is 3, then the encrypted book key is not stored in this resource and the value of the device specific key is 0. The value is 3 for books downloaded with version 3.1 of the viewer software.

32 bytes: standard header
offset 0x00: 2 bytes constant 0x0001 version
offset 0x02: 4 bytes constant "!!ky" file type
offset 0x0A: 4 bytes offset to start of index

4 bytes: constant 0x00000001
4 bytes: constant 0x00000000
4 bytes: format version constant 0x00000002 or 0x00000003
8 bytes: device specific key
16 bytes: book specific key

12 bytes: standard index

9 bytes: constant "Data fork"

!!sw format

Standard header, 32 bytes
offset 0x00: constant 0x0001
offset 0x02: constant "!!sw"
offset 0x0A: offset to beginning of index

Content

Index, each entry is 16 bytes long
2 bytes: sequence number
4 bytes: length of item
4 bytes: offset to beginning of item
2 bytes: constant 0x0004
4 bytes: file type

AncT format

These are the targets of <a> tags and can be any element that has an id property. Resource ID 1 is for small view and resource ID 0 is for large view.

32 bytes: header,
offset 0x00: 2 bytes constant 0x0001
offset 0x02: 4 bytes constant "AncT"
offset 0x0A: 4 bytes offset to index

4 bytes: count of anchor tags
anchor tags, 8 bytes each
offset 0x00: 4 bytes offset to control character 0x0F for anchor tag in DATA.FRK
offset 0x04: 4 bytes page number of anchor tag

12 bytes: index
2 bytes: resource ID
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000

BGcl format

A single resource with background color set by attribute bgcolor of element body.

32 bytes: standard header
offset 0x00: 2 bytes constant 0x0001
offset 0x02: 4 bytes constant "BGcl"
offset 0x0A: 4 bytes offset to start of index

8 bytes: resource
offset 0x00: 2 bytes constant 0xFFFF
offset 0x02: 1 byte color red
offset 0x03: 1 byte 0x00 if background color set or 0xFF
offset 0x04: 1 byte color green
offset 0x05: 1 byte 0x00 if background color set or 0xFF
offset 0x06: 1 byte color blue
offset 0x07: 1 byte 0x00 if background color set or 0xFF

12 bytes: standard index
2 bytes: resource ID, constant 0x0080
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000
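
The 8-byte BGcl resource interleaves the color components with per-component "set" bytes. A short Python sketch of reading it; `parse_bgcl` is a hypothetical name, and treating any non-zero flag byte as "no background color" is an assumption.

```python
def parse_bgcl(resource: bytes):
    """Return (red, green, blue), or None when no background color is set."""
    if len(resource) != 8:
        raise ValueError("a BGcl resource is 8 bytes long")
    # Bytes 0-1 are the constant 0xFFFF; then color and flag bytes alternate.
    red, red_flag, green, green_flag, blue, blue_flag = resource[2:8]
    if (red_flag, green_flag, blue_flag) != (0x00, 0x00, 0x00):
        return None                      # assumption: any 0xFF flag means "not set"
    return red, green, blue
```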

BPgz and BPgZ formats

Encodes page and line information. BPgz for small view and BPgZ for large view. Each page has a pair of offsets into file DATA.FRK and each line has a relative offset and length into the text of the page.

standard header, 32 bytes
offset 0x00: 2 bytes constant 0x0001
offset 0x02: 4 bytes constant "BPgz" for small view or "BPgZ" for large view
offset 0x0A: offset to index

content, subrecords for each page
offset 0x00: 2 bytes, constant 0x1003
offset 0x02: 4 bytes, constant 0x00000000
offset 0x06: 2 bytes, count of line geometry records

line geometry records, 8 bytes each
2 bytes: offset from left edge of screen
2 bytes: offset from top, 0x8000 is top of screen, 0 or positive is offset from previous line, negative is offset from top of screen
2 bytes: line width, note: changing value does not alter display
2 bytes: line height

4 bytes: count of page index records

page index records, 4 bytes each
4 bytes: offset to corresponding page record

4 bytes: count of page records

page records, 8 bytes each
4 bytes: offset into DATA.FRK of first character of page record
4 bytes: offset into DATA.FRK of last character of page record

4 bytes: 0x0000FFFF without header or page number within HfPz or HfPZ resource with display:oeb-page-head
4 bytes: 0x0000FFFF without footer or page number within HfPz or HfPZ resource with display:oeb-page-foot
4 bytes: count of page region records

page region records, 20 bytes each
4 bytes: constant 0x00000000
4 bytes: screen top, value with one page region 0x00000000
4 bytes: screen right, value with one page region 0x000001D8 (472)
4 bytes: screen bottom, value with one page region 0x00000253 (595)
4 bytes: screen left, value with one page region 0x00000000

4 bytes: count of line records

line records, 3 bytes each
1 byte: high nybble: flags, low nybble: upper offset into page record
1 byte: lower offset into page record
1 byte: number of characters in line

index of line geometry records, remaining bytes
Each entry starts with a byte that selects one of three encodings. If the two most significant bits are set (0xC0), the lower nybble (0x0F) plus the following byte identify the line geometry record for a single line. Otherwise, if only the most significant bit is set (0x80), the remaining bits identify the line geometry record used for a single line. Otherwise this byte identifies the line geometry record and the following byte counts the number of lines using this line geometry record.
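
The 3-byte line records and the variable-length index of line geometry records are the two bit-packed parts of this format, so a worked decoder may help. This Python sketch follows one reading of the description: the low nybble and the next byte form a 12-bit value, and the 0xC0 case is tested before the 0x80 case. Both points are interpretations, and the function names are hypothetical.

```python
def parse_line_records(data: bytes, count: int) -> list:
    """Split 3-byte line records into (flags, offset within page, character count)."""
    lines = []
    for i in range(count):
        first, low, length = data[3 * i:3 * i + 3]
        lines.append((first >> 4, ((first & 0x0F) << 8) | low, length))
    return lines

def decode_geometry_index(data: bytes) -> list:
    """Expand the index into one line geometry record number per line."""
    out, i = [], 0
    while i < len(data):
        byte = data[i]
        if byte & 0xC0 == 0xC0:          # 12-bit record number, single line
            out.append(((byte & 0x0F) << 8) | data[i + 1])
            i += 2
        elif byte & 0x80:                # 7-bit record number, single line
            out.append(byte & 0x7F)
            i += 1
        else:                            # record number followed by a line count
            out.extend([byte] * data[i + 1])
            i += 2
    return out
```

The number of decoded entries should equal the count of line records for the page, which gives a quick sanity check on this interpretation.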

4 bytes: count of border records

border records, 4 bytes each
offset 0x00: 4 bytes, number of border record in Pcz1 or PcZ1

4 bytes: count of image records

image records, 4 bytes each
offset 0x00: 4 bytes, number of image record in Pcz0 or PcZ0

standard index, 12 bytes each
2 bytes: resource id
4 bytes: length of resource
4 bytes: offset to start of resource
2 bytes: constant 0x0000

BPos format

standard header, 32 bytes
offset 0x00: constant 0x0001
offset 0x02: constant "BPos"
offset 0x06: offset to start of record A
offset 0x0A: offset to index

Record A, variable length

Record B, usually 0x0A bytes long
offset 0x00: 2 bytes current page

standard index
2 bytes: constant 0x0080
4 bytes: length of record B, usually 0x0A
4 bytes: offset to start of record B
2 bytes: constant 0x0000

eLnk format

Links to external sites are stored in "eLnk" files. This file has the standard header, the contents of the href attribute of <a> elements in reverse order and an index table. Each item in the index table is twelve (12) bytes long.

Standard header, 32 bytes

Content
Text of href attributes of <a> elements without delimiters. Each attribute has a corresponding index entry.

Standard index, each entry is 12 bytes
2 bytes: ID used by Lnks, offset 0x1A
4 bytes: offset to the href attribute value
4 bytes: length
2 bytes: constant 0x0000

ESts format

Extended style information for CSS properties x-sbp-widow-push and x-sbp-orphan-pull.

standard header, 32 bytes
offset 0x00: 2 bytes, version constant 0x0001
offset 0x02: 4 bytes, file type constant "ESts"
offset 0x0A: 4 bytes, offset to start of index

Content
resource 1, x-sbp-orphan-pull
resource 2, x-sbp-widow-push
resource 3, 0x0C bytes constant

standard index, each entry 12 bytes long
offset 0x00: 2 bytes, index
offset 0x02: 4 bytes, resource length
offset 0x06: 4 bytes, offset to start of resource
offset 0x0A: 2 bytes, constant 0x0000

HfPz and HfPZ format

These resources are used for the header and footer portions of pages and have the same format as BPgz and BPgZ.

HRle format

Horizontal rule elements in a single resource with ID of 0x0080. Each hr element is a record.

32 bytes: header
offset 0x00: 2 bytes constant 0x0001
offset 0x02: 4 bytes constant "HRle"
offset 0x0A: 4 bytes offset to index

record, size 0x0C (12) bytes each
offset 0x00: 2 bytes, attribute size
offset 0x02: 2 bytes, attribute width, positive value width in pixels, negative value width in percent
offset 0x04: 2 bytes, attribute align 0xFFFD=justify (default), 0xFFFF=right, 0xFFFE=left, 0x0001=center
offset 0x06: 2 bytes, constant 0x0100 unknown
offset 0x08: 4 bytes, offset into DATA.FRK

12 bytes, index
2 bytes, resource ID constant 0x0080
4 bytes, size of resource
4 bytes, offset to resource
2 bytes, constant 0x0000

ImRn format

Indexes all of the images used in the book along with their height and width. Image records are referenced by resources BPgz and BPgZ.

32 bytes: standard header
offset 0x00: constant 0x0001
offset 0x02: constant "ImRn"
offset 0x0A: 4 bytes offset to index

image records, each 0x20 (32) bytes
offset 0x00: 8 bytes constant 0xFFFFFFFFFFFFFFFF unknown
offset 0x08: 2 bytes image width
offset 0x0A: 2 bytes image height
offset 0x0C: 4 bytes constant 0x00000000 unknown
offset 0x10: 2 bytes constant 0xFFFB unknown
offset 0x12: 4 bytes offset within file DATA.FRK of control character 0x0F for this image
offset 0x16: 4 bytes unknown
offset 0x1A: 4 bytes image type: "PNG ", "GIF ", "PICT"
offset 0x1E: 2 bytes image resource ID

12 bytes: standard index
2 bytes: constant 0x0080
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000

Lnks format

Internal links in the same order as the <a> elements with an href attribute. The target of links is encoded by resource AncT. If the link is the special table of contents link, the value at offset 0x00 is 0x7FFFFFFF, the value at offset 0x04 is 0xFFFFFFFF, and the link type is internal.

32 bytes, standard header
offset 0x00: constant 0x0001
offset 0x02: constant "Lnks"
offset 0x0A: offset to start of index

link record, each 0x22 (34) bytes
offset 0x00: 4 bytes offset into DATA.FRK of start of link text
offset 0x04: 4 bytes offset into DATA.FRK of end of link text including the following added space (0x20) character
offset 0x08: 4 bytes link type, internal=0xFFFFFFFF or external=0xFFFFFFFC
offset 0x0C: 4 bytes constant 0x00000000 unknown
offset 0x10: 4 bytes offset into DATA.FRK of link target
offset 0x14: 4 bytes constant 0x00000000 unknown
offset 0x18: 2 bytes constant 0x0000 unknown
offset 0x1A: 2 bytes internal link=0x0000, external link=resource ID of type eLnk
offset 0x1C: 4 bytes constant 0x00000000 unknown
offset 0x20: 2 bytes constant 0x0000 unknown

12 bytes, standard index
2 bytes: constant 0x0080
4 bytes: length of data
4 bytes: offset to start of data
2 bytes: constant 0x0000
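
A link record can be unpacked in one step once its 34-byte layout is written as a struct format. The sketch below assumes big-endian fields; `parse_links` and the dictionary keys are illustrative names, and the unknown fields are read but discarded.

```python
import struct

# start, end, type, unknown, target, unknown, unknown, eLnk ID, unknown, unknown -> 34 bytes
LINK = struct.Struct(">IIIIIIHHIH")
INTERNAL, EXTERNAL = 0xFFFFFFFF, 0xFFFFFFFC
TOC_LINK = (0x7FFFFFFF, 0xFFFFFFFF)      # special table of contents link

def parse_links(resource: bytes) -> list:
    """Decode every 34-byte link record in a Lnks resource."""
    links = []
    for pos in range(0, len(resource) - LINK.size + 1, LINK.size):
        start, end, kind, _, target, _, _, elnk_id, _, _ = LINK.unpack_from(resource, pos)
        links.append({
            "start": start,
            "end": end,
            "external": kind == EXTERNAL,
            "toc": (start, end) == TOC_LINK,
            "target": target,
            "elnk_id": elnk_id if kind == EXTERNAL else None,
        })
    return links
```

For external links, `elnk_id` is the resource ID to look up in the eLnk file to recover the href text.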

Mrgn format

This format may have been for margins, but changing any of the four bytes from 0xFF to 0x00 and even deleting the file has no visible impact on the ebook.

standard header, 32 bytes

records, each 0x02 bytes long, two records

standard index, 12 bytes each entry, two entries

Pcz0 and PcZ0 formats

If the same image is used more than once it will have multiple records in this resource that all refer to the same picture resource. Image width and height are not those of the image but those set by the width and height attributes of the img element.

32 bytes: header
offset 0x00: 2 bytes version constant 0x0001
offset 0x02: 4 bytes constant "Pcz0" or "PcZ0"
offset 0x0A: 4 bytes offset to index

image position record each 0x2E (46) bytes long
offset 0x00: 4 bytes constant "imgp"
offset 0x04: 4 bytes horizontal offset
offset 0x08: 4 bytes vertical offset
offset 0x0C: 4 bytes image width
offset 0x10: 4 bytes image height
offset 0x14: 2 bytes constant 0xFFFB unknown
offset 0x16: 4 bytes offset into DATA.FRK of the image control character 0x0F for this image
offset 0x1A: 4 bytes constant 0x00000001 unknown
offset 0x1E: 4 bytes type of image: "PICT"
offset 0x22: 2 bytes constant 0x0001 unknown
offset 0x24: 2 bytes unknown
offset 0x26: 2 bytes image resource ID
offset 0x28: 6 bytes constant 0x000000000000 unknown

12 bytes: index
2 bytes: resource ID
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000

Pcz1 and PcZ1 formats

Single resource composed of records for each cell and for each table.

32 bytes: standard header
offset 0x00: 2 bytes version constant 0x0001
offset 0x02: 4 bytes constant "Pcz1" or "PcZ1"
offset 0x0A: 4 bytes offset to index

border record, 0x1E (30) bytes each
offset 0x00: 4 bytes, constant "borp"
offset 0x04: 4 bytes, 0x40 + left position
offset 0x08: 4 bytes, 0x40 + top position
offset 0x0C: 4 bytes, width
offset 0x10: 4 bytes, height
offset 0x14: 2 bytes constant 0xFFF9 unknown
offset 0x16: 4 bytes constant 0xFFFFFFFF unknown
offset 0x1A: 4 bytes constant 0x01000001 unknown

12 bytes: standard index
2 bytes: resource ID
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000

PIC2 format

One or more images in PNG format.

32 bytes: standard header with location of index at offset 0x0A.
variable length data: one image for each index entry
index: each index entry is 12 bytes long.
2 bytes: ID of picture
4 bytes: length of image
4 bytes: offset to beginning of image
2 bytes: constant 0x0000

pInf format

Page information with two 10-byte resources. The first resource is for the small font view and the second resource is for the large font view.

standard header, 32 bytes
offset 0x00: constant 0x0001
offset 0x02: constant "pInf"
offset 0x0A: offset to index

page information resource, each 10 (0x0A) bytes
offset 0x00: 2 bytes, constant 0x0103
offset 0x02: 2 bytes, constant 0x0000 unknown
offset 0x04: 2 bytes, last page number
offset 0x06: 2 bytes, count of image resources
offset 0x08: 2 bytes, constant 0x0032 unknown

standard index, 12 bytes each
2 bytes: resource ID, either 0 or 1
4 bytes: length of resource
4 bytes: offset to start of resource
2 bytes: constant 0x0000

PPic format

The two resources have the same values. The first resource has ID 1 and the second resource has ID 0.

32 bytes: standard header
offset 0x00: version 0x0001
offset 0x02: constant "PPic"
offset 0x0A: offset to index

resource, each 0x12 (18) bytes long
offset 0x00: 2 bytes constant 0x0003
offset 0x02: 4 bytes count of cell and table borders encoded by Pcz1 and PcZ1 resources
offset 0x06: 4 bytes constant 0x00000000 with no border or 0x00000064 with borders
offset 0x0A: 4 bytes count of images encoded by Pcz0 and PcZ0 resources
offset 0x0E: 4 bytes constant 0x00000000 with no pictures or 0x00000064 with pictures

12 bytes: standard index
2 bytes: resource id
4 bytes: resource length
4 bytes: offset to resource
2 bytes: constant 0x0000

StR2 format

Either StRn or StR2 will be used to link text in file DATA.FRK to styles in Styl. If StR2 is used, one resource with ID 0x8001 is an index to the other two or more resources. These other resources have ID 0x8002 and higher.

32 bytes, header
offset 0x00: 2 bytes constant 0x0001
offset 0x02: 4 bytes constant "StR2"
offset 0x0A: 4 bytes offset to index

Index resource, ID 0x8001
offset 0x00: 4 bytes unknown
offset 0x04: 4 bytes unknown
offset 0x08: 4 bytes unknown

For each additional resource there is an additional four byte value. The first value is 0x00000000, the next value is that of the first offset into DATA.FRK of the first string run resource, the next value is that of the second string run resource, and the last value is 0xFFFFFFFF.

String run resource, ID 0x8002 and higher
offset 0x00: 4 bytes offset into file DATA.FRK
offset 0x04: 4 bytes record number of Styl

12 bytes, index for each resource
2 bytes: resource ID
4 bytes: offset to start of resource
4 bytes: length of resource
2 bytes: constant 0x0000

StRn format

Links element text and the styles applied to that text. Styl 0 is the body style, Styl 1 and 2 appear unused. Styl 3 is the first user defined Styl.

standard header, 32 bytes
offset 0x00: constant 0x0001 version
offset 0x02: constant "StRn" file type
offset 0x0A: 4 bytes offset to start of index

single record composed of 8 byte subrecords
4 bytes: offset into DATA.FRK of start of string
4 bytes: index into Styl table

standard index, 12 bytes
1 index record

Styl format

Styles defined but not used are not included. The styles are in the order they are used rather than the order they are defined.

32 bytes standard header
offset 0x00: 2 bytes version, constant 0x0001
offset 0x02: 4 bytes file type, constant "Styl"
offset 0x0A: 4 bytes offset to start of index

Content 46 (0x2E) bytes each style
offset 0x00: 2 bytes unknown
offset 0x02: 2 bytes element sub=0x0002, element sup=0x0001, "text-decoration:line-through"=0x0004
offset 0x04: 2 bytes unknown
offset 0x06: 2 bytes font-family
offset 0x08: 2 bytes style, "font-weight:bold"=1, "font-style:italic"=2, "font-weight:bold; font-style:italic"=3; "text-decoration:underline"=4, "font-weight:bold;text-decoration:underline"=5, "font-style:italic;text-decoration:underline"=6
offset 0x0A: 2 bytes font-size
offset 0x0C: 2 bytes text-align: none=0000, left=FFFE, right=FFFF,center=0001, justify=FFFD
offset 0x0E: 3 bytes text color, RRGGBB. Softbook Publisher supports only numbered colors
offset 0x11: 3 bytes background-color
offset 0x14: 2 bytes margin-top or 0xFFFF when not defined
offset 0x16: 2 bytes, text-indent
offset 0x18: 2 bytes, margin-right
offset 0x1A: 2 bytes, margin-left
offset 0x1C: 2 bytes unknown
offset 0x1E: 2 bytes, oeb-column-number
offset 0x20: 14 bytes unknown constant 0x000000000000000000003B23FFFF

12 bytes standard index
2 bytes: constant 0x0080
4 bytes: length of resource
4 bytes: offset to start of resource
2 bytes: constant 0x0000

font-family: serif=14, sans-serif=15, smallfont=3, monospace=4
font-size: xx-small=1, x-small=2, small=3, medium=4, large=5, x-large=6, xx-large=7
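
The 46-byte style record and the value tables above can be combined into a small decoder. This Python sketch assumes big-endian fields and treats the style field at offset 0x08 as a bit mask (bold=1, italic=2, underline=4), which matches the listed values but is an inference; `parse_style` and its dictionary keys are hypothetical names.

```python
import struct

# 7 shorts, two 3-byte colors, 6 shorts, 14 trailing bytes -> 46 (0x2E) bytes
STYLE = struct.Struct(">HHHHHHH3s3sHHHHHH14s")
FONT_FAMILY = {14: "serif", 15: "sans-serif", 3: "smallfont", 4: "monospace"}
FONT_SIZE = {1: "xx-small", 2: "x-small", 3: "small", 4: "medium",
             5: "large", 6: "x-large", 7: "xx-large"}
TEXT_ALIGN = {0x0000: None, 0xFFFE: "left", 0xFFFF: "right",
              0x0001: "center", 0xFFFD: "justify"}

def parse_style(record: bytes) -> dict:
    """Decode the documented fields of one Styl record; unknown fields are skipped."""
    (_u0, script, _u4, family, bits, size, align, color, background,
     margin_top, indent, margin_right, margin_left, _u1c, columns,
     _tail) = STYLE.unpack(record)
    return {
        "script": script,                               # sub=2, sup=1, line-through=4
        "font_family": FONT_FAMILY.get(family, family),
        "bold": bool(bits & 1),                         # assumption: bit mask
        "italic": bool(bits & 2),
        "underline": bool(bits & 4),
        "font_size": FONT_SIZE.get(size, size),
        "text_align": TEXT_ALIGN.get(align, align),
        "color": color.hex(),                           # RRGGBB
        "background": background.hex(),
        "margin_top": None if margin_top == 0xFFFF else margin_top,
        "text_indent": indent,
        "margin_right": margin_right,
        "margin_left": margin_left,
        "columns": columns,
    }
```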

The default font-size is x-small. If all the values of a style equal those of the default style a record is not created for that style.

Changing style margin-left of the <body> element changed only files of type BPgz, BPgZ, and Styl. This changed offsets 23 and 27 of the style records except for the first three. Style records are created for elements html, head, and title. A separate style record is not created for the body element.

Tabl format

All tables are in a single resource with each table composed of a record of length 0x18

32 bytes, header
offset 0x00: constant 0x0001
offset 0x02: constant "Tabl"
offset 0x0A: offset to index

table record, 0x18 (24) bytes each
offset 0x00: 2 bytes, attribute align, not specified 0xFFFA, center 0x0001, right 0xFFFF, left 0xFFFE; justify 0xFFFD
offset 0x02: 2 bytes, attribute width, negative values width in percentage, example 0xFFA5=90%, not specified 0xFFFF, absolute value 0x0000
offset 0x04: 2 bytes, border: 0x0000 for single, 0xFFFF for double
offset 0x06: 2 bytes, cellspacing 0xFFFF for not set
offset 0x08: 2 bytes, cellpadding 0xFFFF for not set
offset 0x0A: 4 bytes, element caption present 0x00000001 otherwise 0xFFFFFFFF
offset 0x0E: 4 bytes, length of caption
offset 0x12: 2 bytes, constant 0x0000 unknown
offset 0x14: 2 bytes, list-style-type: default 0xFFFF, decimal 0x0000, lower-alpha 0x0000, upper-alpha 0x0000
offset 0x16: 2 bytes, TRow ID, reference to rows of this table

12 bytes, index
2 bytes: constant 0x0080
4 bytes: length of resource
4 bytes: offset to resource
2 bytes: constant 0x0000

TCel format

The cells of a row are in a single resource composed of cell records each of length 0x1A bytes

32 bytes standard header
offset 0x00: constant 0x0001
offset 0x02: constant "TCel"
offset 0x0A: offset to index

Contents, 0x1A (26) bytes for each record
offset 0x00: 2 bytes, constant 0x0000 unknown
offset 0x02: 2 bytes, constant 0x0001 unknown
offset 0x04: 2 bytes, constant 0x0001 unknown
offset 0x06: 2 bytes, 0xFFFA table cell, 0xFFFF definition list cell
offset 0x08: 2 bytes, vertical-align: middle 0xFFFA, top 0xFFFC, bottom 0xFFFB
offset 0x0A: 2 bytes, width 0x0000 is not set
offset 0x0C: 2 bytes, height 0x0000 is not set
offset 0x0E: 4 bytes, offset into DATA.FRK of cell content
offset 0x12: 4 bytes, length of cell content
offset 0x16: 4 bytes, no background color is 0xFFFFFFFF, background color is 0x00RRGGBB

Standard index, each index entry is 12 bytes long
2 bytes: ID, not sequential
4 bytes: length of record which for this type is 0x34
4 bytes: offset to start of record
2 bytes: constant 0x0000

TRow format

The rows of a table are in a single resource composed of row records each of length 0x10 bytes. Lists are equivalent to tables.

32 bytes standard header
offset 0x00: constant 0x0001
offset 0x02: constant "TRow"
offset 0x0A: offset to start of index

Each record is 16 (0x10) bytes long
offset 0x00: 4 bytes 0xFFFAFFFA table row, 0xFFFFFFFC definition list row
offset 0x04: 2 bytes, border: 0x0000 for single, 0xFFFF for double
offset 0x06: 2 bytes, TCel ID, reference to cells of this row
offset 0x08: 4 bytes, offset into DATA.FRK of row content
offset 0x0C: 4 bytes, length of row content including terminating control character 0x13

Standard index, each index entry is 12 bytes long
2 bytes: ID number, not sequential
4 bytes: length of record group
4 bytes: offset to beginning of record group
2 bytes: constant 0x0000

Online Bookshelf

The online bookshelf is built when the REB1200 sends an HTTP GET request to address bookshelf.softbook.net/bookshelf/default.asp. It is sent back a list of books in an HTTP response. The format of this list has Content-Type of text/x-booklist. Each line of the list is terminated by a line feed. The first line of the list is "1", which may be the format version. The last line is empty, so the list ends with two line feeds. The items of each line are separated by tabs and appear in this order:

SOURCE_ID ":" SOURCE_TYPE ":" BOOK_ID
Title
Author
Category
Size of .imp file in bytes
Download URL
Constant 1
File type: 17 for books and 21 for issues of periodicals

Example
3:B:0-7410-0284-1&#9;Random House Webster's Pocket American Dictionary&#9;Random House&#9;General Interest&#9;3357685&#9;http://bookshelf.softbook.net/bookshelf/default.asp?BOOK%5FID=0%2D7410%2D0284%2D1&SOURCE%5FID=3&SOURCE%5FTYPE=B&#9;1&#9;17

Note: &#9; is the entity for the tab character.
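
Parsing a text/x-booklist response is a matter of splitting on line feeds and tabs. A minimal Python sketch, with hypothetical field and function names; it works on an already-received response body and makes no network request.

```python
FIELDS = ("id", "title", "author", "category", "size", "url", "constant", "file_type")

def parse_booklist(text: str):
    """Return (first line, list of book dicts) from a text/x-booklist body."""
    lines = text.split("\n")
    version, rows = lines[0], lines[1:]
    books = []
    for line in rows:
        if not line:                     # the list ends with an empty line
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} tab-separated items: {line!r}")
        book = dict(zip(FIELDS, values))
        book["size"] = int(book["size"])                 # size of the .imp file in bytes
        book["is_periodical"] = book["file_type"] == "21"
        books.append(book)
    return version, books
```

Splitting the `id` field on ":" then yields SOURCE_ID, SOURCE_TYPE and BOOK_ID, provided BOOK_ID itself contains no colon.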


A primer on the .IMP specification has never been published, but a very detailed explanation of the .IMP file format can be found on Jeffrey Kraus-yao's website. He reverse-engineered the format back in 2002. Jeffrey indicated that he did so by building .oeb test ebooks with eBook Publisher and then dropping each .oeb onto a desktop shortcut of the imp viewer.exe. While the viewer was still running, he would examine his temp folder, where he noticed that one of four .RES folders was complete. He wrote down the changed bits and then repeatedly made small changes to the .oeb ebook. He started this even though he didn't own the REB1200 yet.

It was quite the accomplishment! Now please realize that back then only the REB1200 used the .IMP format, as there weren't a lot of GEB1150's (predecessor to the EBW1150) in 2002/3. Oh, how the tables have turned: now for every REB1200 in use there are tens or hundreds more EBW1150's!

Jeffrey's website is a great start to understanding the .IMP file format, but lacks such basic information as:

You may have even noticed that the EBW1150 .imp is slightly bigger (filesize-wise) than the same REB1200 .imp when no images are present. That is because the EBW1150 .imp uses two additional (irrelevant/unused) bytes for most records, which, multiplied by thousands of records, results in a larger file! I think a different "programming team" came out with the EBW1150 .imp file format, as most .RES filetype records are reversed byte-wise (i.e. BE vs LE). It makes this .IMP reverse-engineering unnecessarily more difficult!

I would like to build here a knowledge-base for the "definitive" understanding of the .IMP file format. As others have already told me about their own forays into .imp "nuts & bolts" investigations, I propose to start off this knowledge-base with my preliminary findings, written as a Perl script. That script is imp_dump.pl (along with its required support files) and can be used to explode any un-encrypted .imp ebook into its (decompressed) text and image components.

Now, take note, that I said text and images NOT .html and images.

The original html is not stored in the .imp file. Only the basic components are: one record tells you where all the font/style changes are located in the file, another indicates where to end each line so that it doesn't spill over the screen size of that .imp, and other records store the images, hyperlinks used, etc. Basically all the building blocks are there (scattered), and those components need to be re-assembled somehow into .html!

BTW, release v4.0 of EBook-Tools should have basic .imp support for .html generation with image linking, but will initially lack table/hyperlink/styles support. Those are planned for future releases.

I plan to collect postings from this thread and compile a wiki page with the relevant parts of the .IMP file format specification as reverse-engineered by ALL of us!

Below are all the .RES filetypes that exist (thus far) and volunteers can pick the un-documented .RES filetypes on a "first come, first serve" basis.

Code:

.IMP file comprises these groups of .RES filetypes:

where:

text:
'!!cm'
'!!ky'
DATA.FRK - decompressor written in Perl, C and soon to be C#.

page_line:
'BPgz'
'BPgZ'
'ImRn' - written in Perl (see imp_dump.pl)
'Pcz0' - written in Perl (see imp_dump.pl)
'PcZ0' - written in Perl (see imp_dump.pl)
'Pcz1' - written in Perl (see imp_dump.pl)
'PcZ1' - written in Perl (see imp_dump.pl)

page_header_footer:
'HfPz'
'HfPZ'

links:
'AncT' - written in Perl (see imp_dump.pl)
'AnTg'
'Lnks'
'eLnk'

misc_info:
'Batr'
'Binf'
'BGcl' - written in Perl (see imp_dump.pl)
'BPos'
'Clos'
'Devm' - written in Perl (see imp_dump.pl)
'Dict'
'FRgs'
'Glos'
'MASK'
'Mrgn' - written in Perl (see imp_dump.pl)
'Hyp2'
'Hyph'
'Offs'
'pInf' - written in Perl (see imp_dump.pl)
'Pc31'
'PPic' - written in Perl (see imp_dump.pl)
'SKtb'
'SMnu'
'stbd'
'!!sw' - written in Perl (see imp_dump.pl)

formatting:
'ESts' - written in Perl (see imp_dump.pl)
'HRle'
'Styl'
'StRn' - written in Perl (see imp_dump.pl)
'StR#'
'StR2'

tables:
'Tabl'
'TCel'
'TRow'

images:
'GIF ' - written in Perl (see imp_dump.pl)
'JPEG' - written in Perl (see imp_dump.pl)
'PIC2' - written in Perl (see imp_dump.pl)
'PICT' - written in Perl (see imp_dump.pl)
'PNG ' - written in Perl (see imp_dump.pl)

markups:
'MRPs'
'Ano2'
'Hlts'
'BTok'
'BMks'

form_data:
'TGNt'
'Form'
'FItm'
'FIDt'
'FrDt'

One of the component records in the .RES directory, when the .IMP file is exploded with unimp.exe, is the DATA.FRK file. It contains the basic text used in the ebook and is the same for both the Color VGA (REB 1200) & Grayscale Half-VGA (EBW 1150) .IMP files. If DATA.FRK was (LZSS) compressed when the ebook was created, deimp.exe decompresses it and also substitutes/expands the control characters (see below).

DATA.FRK File

Element text is extracted and placed in this file. Element tags are replaced with control characters. This file can be compressed and encrypted, with compression occurring before encryption. This file is compressed when the element <meta name="x-SBP-compress" content="on"/> is included in the <x-metadata> element of the package file. The compression algorithm used is LZSS. This file is encrypted when the element <meta name="x-SBP-encrypt" content="on"/> is included in the <x-metadata> element of the package file. The encryption algorithm used is DES. The 8 byte encryption key is in the SoftBook Edition Encryption Key File (.key) at offset 0x0C.
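
Because compression happens before encryption when the file is written, a reader has to undo the steps in the opposite order: decrypt first, then decompress. The Python sketch below only illustrates that ordering and the key location given above. The function names read_des_key and decode_steps and the step labels are hypothetical; the actual DES and LZSS routines are not shown, since their parameters are not given here.

```python
KEY_OFFSET = 0x0C  # offset of the key inside the .key file, per the text
KEY_LENGTH = 8     # DES key length in bytes, per the text


def read_des_key(key_file: bytes) -> bytes:
    """Return the 8-byte key stored at offset 0x0C of a .key file image."""
    key = key_file[KEY_OFFSET:KEY_OFFSET + KEY_LENGTH]
    if len(key) != KEY_LENGTH:
        raise ValueError("key file too short to contain an 8-byte key")
    return key


def decode_steps(compressed: bool, encrypted: bool) -> list:
    """List the steps needed to recover DATA.FRK text, in the order to apply.

    Writing order is compress -> encrypt, so reading order is the reverse.
    """
    steps = []
    if encrypted:
        steps.append("des_decrypt")
    if compressed:
        steps.append("lzss_decompress")
    return steps
```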

Characters less than 0x20 are removed, except for line breaks, which are replaced with 0x20. Multiple 0x20 characters are replaced with a single 0x20.
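
The whitespace rule above can be sketched in Python as follows. This is only an illustration of the described behaviour, not the publisher's actual code: the function name normalize_text is hypothetical, and treating both LF (0x0A) and CR (0x0D) as "line break" is an assumption.

```python
import re

# Assumption: a "line break" in the source text is LF or CR.
LINE_BREAKS = (0x0A, 0x0D)


def normalize_text(raw: bytes) -> bytes:
    """Drop characters below 0x20, turn line breaks into spaces,
    then collapse runs of spaces into a single space."""
    out = bytearray()
    for b in raw:
        if b in LINE_BREAKS:
            out.append(0x20)   # line break becomes a space
        elif b >= 0x20:
            out.append(b)      # ordinary character, kept as-is
        # any other character below 0x20 is dropped
    return re.sub(rb" {2,}", b" ", bytes(out))
```

For instance, normalize_text(b"a\tb\r\nc  d") drops the tab, turns the CR and LF into spaces, and collapses the resulting runs of spaces.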

Control characters

Code:
0x0A end of document, forced page break
0x0B start of element except <span>
0x0D line break element <br />
0x0E start of table element <table>
0x0F image element <img />
0x13 end of table cell </td> tag
0x14 horizontal rule element <hr />
0x15 before and after page header content
0x16 before and after page footer content
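
The control characters listed above can be put into a lookup table and used to split decompressed DATA.FRK text into text runs and control markers. The Python sketch below is only an illustration; the names CONTROL_CHARS and tokenize and the tuple layout are hypothetical.

```python
# Control characters and their meanings, as listed above.
CONTROL_CHARS = {
    0x0A: "end of document, forced page break",
    0x0B: "start of element except <span>",
    0x0D: "line break element <br />",
    0x0E: "start of table element <table>",
    0x0F: "image element <img />",
    0x13: "end of table cell </td> tag",
    0x14: "horizontal rule element <hr />",
    0x15: "before and after page header content",
    0x16: "before and after page footer content",
}


def tokenize(data: bytes):
    """Yield ("text", bytes) for text runs and ("ctrl", code) for markers."""
    run = bytearray()
    for b in data:
        if b in CONTROL_CHARS:
            if run:
                yield ("text", bytes(run))
                run.clear()
            yield ("ctrl", b)
        else:
            run.append(b)
    if run:
        yield ("text", bytes(run))
```

A caller could then look up CONTROL_CHARS[code] for each ("ctrl", code) token to decide which HTML element to emit.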
As previously stated, my deimp.exe program uses as its base the lzss-0.6 code by Michael Dipperstein (http://michael.dipperstein.com/lzss), with tweaks by me to get it to decode the .imp text. I added the ability to insert/substitute some characters that are not part of the lzss decompression so that the resulting .imp text looks better. If you want a pure decompressor, just remove those tweaks and do the substitutions as a separate step after decompression.

In addition to those control characters above, characters to "substitute/convert" would be:
Code:
HEX  => Should be  (actual char)
0x8E => "&eacute;" (i.e. "é"),
0xA0 => "&nbsp;"   (i.e. " "),
0xA5 => "&bull;"   (i.e. "•"),
0xA8 => "&reg;"    (i.e. "®"),
0xA9 => "&copy;"   (i.e. "©"),
0xAA => "&trade;"  (i.e. "™"),
0xAE => "&AElig;"  (i.e. "Æ"),
0xC7 => "&laquo;"  (i.e. "«"),
0xC8 => "&raquo;"  (i.e. "»"),
0xC9 => "&hellip;" (i.e. "…"),
0xD0 => "&ndash;"  (i.e. "–"),
0xD1 => "&mdash;"  (i.e. "—"),
0xD2 => "&ldquo;"  (i.e. "“"),
0xD3 => "&rdquo;"  (i.e. "”"),
0xD4 => "&lsquo;"  (i.e. "‘"),
0xD5 => "&rsquo;"  (i.e. "’"),
0xE1 => "&middot;" (i.e. "·"),
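
As a post-decompression step, the table above could be applied with something like the following Python sketch. The names SUBSTITUTIONS and to_html_text are hypothetical, and passing every byte not in the table through unchanged (as the character with the same code) is an assumption made for this example, not something the table establishes.

```python
# Byte value -> HTML entity, taken from the table above.
SUBSTITUTIONS = {
    0x8E: "&eacute;", 0xA0: "&nbsp;",  0xA5: "&bull;",
    0xA8: "&reg;",    0xA9: "&copy;",  0xAA: "&trade;",
    0xAE: "&AElig;",  0xC7: "&laquo;", 0xC8: "&raquo;",
    0xC9: "&hellip;", 0xD0: "&ndash;", 0xD1: "&mdash;",
    0xD2: "&ldquo;",  0xD3: "&rdquo;", 0xD4: "&lsquo;",
    0xD5: "&rsquo;",  0xE1: "&middot;",
}


def to_html_text(data: bytes) -> str:
    """Replace the listed byte values with HTML entities.

    Assumption: bytes not in the table are emitted unchanged.
    """
    return "".join(SUBSTITUTIONS.get(b, chr(b)) for b in data)
```

Most of these byte values coincide with Mac OS Roman code points (0xA0 is an exception, since Mac OS Roman has a dagger there), which may hint at the underlying encoding, but that is not confirmed here.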
I attach the source code to my deimp.exe (and original lzss-0.6) for your use and further study. Please excuse the coding hacks as this was a work-in-progress until I "nailed" the decompression algorithm. It didn't lend itself to good programming style.