BM307–File Organization
Gazi University
Computer Engineering Department
9/24/2014
1
Index

Sequential File Organization




Binary Search
Interpolation Search
Self-Organizing Sequential Search
Direct File Organization




9/24/2014
Locating Information
Hashing Functions
Collision Resolution
Coalesced Hashing
2
File Organization

Goal  Organizing files efficiently in terms of both
space and performance

File Organization
sequential
indexed sequential
direct
9/24/2014
File Access
sequential
sequential & direct
direct (random)
3
File Access Types

Sequential – accessing multiple records (often an entire
file) and usually according to a predefined order

Direct (random) – locating a single record

Question  How can we have an effective organization?
 Answer  matching the type of organization with the
type of intended access
9/24/2014
4
Sequential File Organization
Background





Fields (eg.: Employee name, number)
Records  contain data about individual entities
Files (eg.: employee list)
Primary Key  field(s) which uniquely distinguishes a record
from all others
Secondary Key  all the remaining fields
9/24/2014
5
Sequential File Organization

File  consists of records of the same format
 Fixed-length records
 Variable-length records

Sequential File Organization  (i+1)st element of a file
is stored immediately after the ith element.
9/24/2014
6
Sequential File Organization

Sequential access  moving from one record in the
file to the next by incrementing the address of the
current record by the record size

Direct access  processing a single record directly if
we know subscript
9/24/2014
7
Sequential File Organization



Probe  access to a distinct location
Sequential Search
In an entire file of N records

N/2 probes are needed in average
Need to probe entire file for an unseccessful retrieval

Computational complexity O(N)



Appropriate when N is small
Performance improvement?

9/24/2014
Sorting
8
Eg. - Sequential Search








100000 records, each record size is 400 bytes,
block size is 2400 bytes.
Sequential search time for retrieving 10000 records?
Each probe  one block of data
(100000*400)/2400 = 16667 blocks
Reading time for one block  0.84ms (IBM 3380)
Time requirement for each record 
(16667/2)*0.84 = 7 sec.
For 10000 records  7sec * 10000 = 19 hours
Better organization is needed!!
9/24/2014
9
Sequential File Organization

Binary Search




Requires sorting
Compares the key of the sought record with the middle record of the file
Half of the file is eliminated in each turn
Computational complexity O(log2n)
Eg. the key of the sought record  17
9/24/2014
10
Sequential File Organization
Binary Search (Algoritma)
9/24/2014
11
Sequential File Organization
Interpolation Search

Approximate relative position
 Eg.: Searching a name in a telephone book
 Choses the next position for a comparison based upon the
estimated position of the sought keyrelative to the remainder
of the file to be searched
key[sought] – key [LOWER]
NEXT := LOWER + ––––––––––––––––––––––––––––––––––––––––– (UPPER-LOWER)
key[UPPER] – key [LOWER]



Worst case computational complexity O(n)
Average case computational complexity O(log2 log2n)
Its performance improves as the distribution of keys becomes
more uniform
9/24/2014
12

binary search should be preferred when the data is stored in
primary memory


Why?
interpolation search should be preferred when the data is stored in
auxilary memory

9/24/2014
Why?
13

binary search should be preferred when the data is stored in
primary memory


The additional calculations needed for the interpolation search cancel
any savings gained from fewer probes
interpolation search should be preferred when the data is stored in
auxilary memory

9/24/2014
An access of auxiliary storage is an order of magnitude greater than
the time required for the additional calculations
14
Sequential File Organization
Self-Organizing Sequential Search


Modifies the order of records
Moves the most frequently retrieved records to the beginning
of the file
Most popular algorithms:
 Move_to_front
 Transpose
 Count
9/24/2014
15
Sequential File Organization

Move_to_front

The sought record is moved to the front position of the file

Potential of making big mistakes if a record accessed , moved to the
front of the file, and then rarely if ever accessed again!

A linked implementation is preferable even though it takes more storage

Appropriate when space is not limited and locality of access is important

Essentially the same as the LRU (least recently used) paging algorithm
used by operating systems
9/24/2014
16
Eg. - Move_to_front

The records are accessed in the order of “fileediting”

abcdefghijklmnoprqstvwyz
fabcdeghijklmnoprqstvwyz
ifabcdeghjklmnoprqstvwyz
lifabcdeghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
delifabcghjkmnoprqstvwyz
idelfabcghjkmnoprqstvwyz
tidelfabcghjkmnoprqsvwyz
itdelfabcghjkmnoprqsvwyz
nitdelfabcghjkmoprqsvwyz
gnitdelfabchjkmoprqsvwyz











9/24/2014
17
Sequential File Organization

Transpose

Interchanges the sought record with its immediate predecessor

More stable than the Move_to_front algorithm

A record needs to be accessed many times before it is moved to the
front of the list

Easily implemented

Does not need additional space

Should be used when space is premium
9/24/2014
18
Eg. - Transpose

The records are accessed in the order of “fileediting”

abcdefghijklmnoprqstvwyz
abcdfeghijklmnoprqstvwyz
abcdfegihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcedfgihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcdefighjklmnoprqstvwyz
abcdefighjklmnoprqtsvwyz
abcdeifghjklmnoprqtsvwyz
abcdeifghjklnmoprqtsvwyz
abcdeigfhjklnmoprqtsvwyz










9/24/2014
19
Sequential File Organization

Count

Keeps count of the number of accesses of each record

The file is always ordered in a decreasing order of frequency of
access

Requires extra sorage to keep the count

Use it only when the counts are needed for another purpose
9/24/2014
20
Direct File Organization


Ideally, we want to go directly to the address where the record is stored
A key can be unique address  one probe
0
0
1–1
Key
space
correspondence
999-99-9999

Address
space
999-99-9999
More address space than needed
 Eg.1 billion addresses for 300 million people
9/24/2014
21
Direct File Organization

Converting information into a unique address

Eg. : Airline reservation system



Flight numbers from 1 to 999
Days are numbered from 1 to 366
Flight number and day of the year could be concatenated to determine the
location
Location = flight number || day of the year, address range  001001-999366
(???367 - ???999 would not exist)
Location = day of the year || flight number , address range  001001-366999
9/24/2014
22
Direct File Organization

The key converts to a probable address


If we remove most of the empty spaces in the address space, we have
lost the 1-1 correspondence btw keys & addresses
Hashing functions are used to map the wider range of key values into
the narrower range of address values
Hash (key)


probable address
Initial probable address  home address
Hashing function should


Evenly distribute the keys among the addresses
Executes efficiently
9/24/2014
23
Direct File Organization

A collision occurs when two distinct keys map to the same
address
0
0
Key
space
Address
space
1200
999-99-9999

Hashing is then composed of two aspects;


The function
The collision resolution method
9/24/2014
24
Direct File Organization
Hashing Functions
9/24/2014
25
Direct File Organization
Hashing Functions

Squaring


Taking square of a key and then substringing or truncating a portion
of the result
Radix conversion




9/24/2014
The key is considered to be in a base other than 10 ans is then
converted into a number in base 10
Eg.: Base 11
1234 = 1 * 113 + 2 * 112 + 3 * 111 + 4 * 110= 1331 + 242 + 33 + 4
= 1610
substringing or truncation could then be used
26
Direct File Organization
Hashing Functions

Polynomial hashing



The key is divided by a polynomial
f(information area)
cyclic check bytes
Alphabetic keys

9/24/2014
Alphabetic or alphanumeric key values can be input to a hashing
function if the values are interpreted as integers
27
Direct File Organization
Collisions

For a given set of data, one hashing function may distribute the keys
more evenly over the address space than another

A hashing function that has a large number of collisions is said to
exhibit primary clustering

It is better to have a slightly more expensive hashing function for data
that need to be stored on auxiliary storage

Another method for reducing collisions is reducing the packing factor
Packing factor =
total number of storage locations
9/24/2014
collisions
number of records stored
storage
28
Direct File Organization
Collision Resolution

Collision resolution with links

Collision resolution without links
 Static positioning of records
 Dynamic positioning of records

Collision resolution with pseudolinks
9/24/2014
29
Direct File Organization
Collision resolution with links

If multiple synonyms occur for a
particular home address, we form
a chain of synonym records

Disadvantage  extra storage is
needed
Collision resolution without links

We can use implied links by
applying a convention , or set of
rules for deciding where to go
next

A simple convention is to look at
the next location in memory

Advantage  NO extra storage is
needed
9/24/2014
30
Direct File Organization
Coalesced Hashing

Occurs when we attempt to insert a record with a home
address that is already occupied by a record from a chain with
a different home address

The two chains with records having different home addresses
coalesce or grow together
X ,D, Y were inserted
9/24/2014
31
Direct File Organization
Coalesced Hashing
(Eg.)


Hash (key) = key mod 11
27, 18, 29, 28, 39, 13, 16
Average # of probes  1.8
42 & 17 added
9/24/2014
32
Direct File Organization
Coalesced Hashing
Discussion

Packing factor of the final table = 9/11 (82%)

One method of reducing coalescing is to reduce the packing factor

It would be advisable to place the most frequently accessed records early in
the insertion process

Deleting a record is complicated

If coalescing has occurred,
a simple deletion procedure is
to move a record later in the
probe chain into the position of
the deleted record
Final table after deleting 39 ---------->
9/24/2014
33
Direct File Organization
Coalesced Hashing
Variants



Table organization (whether or not a seperate
overflow area is used)
The manner of linking a colliding item into a chain
The manner of choosing unoccupied locations
Table Organization



Table  primary area + overflow area
Adres factor = (primary area ) / (total table size)
Best performance  when the adres factor is 0.86
9/24/2014
34
Direct File Organization
Coalesced Hashing
Variants
Late Insertion Standart Colesced Hashing (LISCH)


New records are inserted at the end ofa probe chain

Lack of a cellar
Late Insertion Coalesced Hashing (LICH)

Uses a cellar


Eg. Keys: 27, 18, 29, 28, 39, 13, 16, 42, 17
hashing function: key mod 7

Average # of probes  1.3
(It was 1.8 for LISCH)


In general, for a 90 percent packing factor,
using a cellar will reduce the number of
probes by about 6 percent compared
with LISCH
9/24/2014
35
Direct File Organization
Coalesced Hashing
Variants
Early Insertion Standart Colesced Hashing (EISCH)

İnserts a new record into a position on the probe chain immediately after the record
srored at its home address



İnsertion of the record with key 17 according to EISCH algorithm:
Hash (key) = key mod 11
9/24/2014
36
Direct File Organization
Coalesced Hashing
Variants
Random Early Insertion Standart Colesced Hashing (REISCH)

Choosing a random unoccupied location for the new insertion
Gives only a 1% improvement over EISCH



Random Late Insertion Standart Colesced Hashing (RLISCH)

Bidirectional Late Insertion Standart Colesced Hashing (BLISCH)
Choosing the overflow location for a collision insertion by alternating the
selection between the top and bottom of the table


Bidirectional Early Insertion Standart Colesced Hashing (BEISCH)
9/24/2014
37
Direct File Organization
Coalesced Hashing
Comparison
9/24/2014
38
Download

BM307_W2