Distributed Database


Hello students, welcome to the lecture on distributed databases. After this lecture we will be able to: understand the concept of distributed data storage, discuss network transparency, explain query processing in distributed databases, understand the concept of a commit protocol, explain the application of two-phase commit, and describe concurrency control.

Let us first discuss the meaning of a distributed database. A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database system allows applications to access data from local and remote databases. In a homogeneous distributed database system, each database is an Oracle database; in a heterogeneous distributed database system, at least one of the databases is not an Oracle database. Distributed databases use a client-server architecture to process information requests. In a homogeneous distributed database, all sites have identical software, are aware of each other, and agree to cooperate in processing user requests; each site surrenders part of its autonomy in terms of the right to change schemas or software, and the system appears to the user as a single system.

Distributed databases versus distributed processing: a distributed database is a set of databases in a distributed system that can appear to applications as a single data source, whereas distributed processing refers to the operations that occur when an application distributes its tasks among different computers in a network.

Distributed databases versus replicated databases: the terms distributed database system and database replication are related, yet distinct. In a pure (that is, not replicated) distributed database, the system manages a single copy of all data and supporting database objects. The term replication refers to the operation of copying and maintaining database objects in multiple databases belonging to a distributed system. While replication relies on distributed database technology, database replication offers applications benefits that are not possible within a pure distributed database environment.

Heterogeneous distributed database systems: in a heterogeneous distributed database system, at least one of the databases is a non-Oracle database system. To the application, the heterogeneous distributed database system appears as a single local Oracle database.

Advantages: preservation of investment in existing hardware, system software, and applications; local autonomy and administrative control; and a step towards a unified homogeneous DBMS. Full integration into a homogeneous DBMS faces technical difficulties and the cost of conversion, as well as organizational and political difficulties: organizations do not want to give up control of their data, and local databases wish to retain a great deal of autonomy.

Disadvantages:
- Complexity: extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent.
- Economics: increased complexity and a more extensive infrastructure mean extra labour costs.
- Security: remote database fragments must be secured, and since they are not centralized, the remote sites must be secured as well.
- Integrity is difficult to maintain: in a distributed database, enforcing integrity over a network may require too much of the network's resources to be feasible.
- Inexperience: distributed databases are difficult to work with, and as a young field there is not much readily available experience on proper practice.
- Lack of standards: there are no tools or methodologies yet to help users convert a centralized DBMS into a distributed DBMS.
- Database design is more complex: besides the normal difficulties, the design of a distributed database has to consider fragmentation of data, allocation of fragments to specific sites, and data replication.
- Additional software is required, and the operating system should support the distributed environment.
- Concurrency control is a major issue; it can be addressed by locking and time-stamping. Related issues are distributed access to data and analysis of distributed data.

Distributed data storage and replication: replication is the process of copying and maintaining database objects in the multiple databases that make up a distributed database system. The following sections explain the basic components of a replication system: replication objects, replication groups, and replication sites. A replication object is a database object existing on multiple servers in a distributed database system. In a replication environment, Oracle manages replication objects using replication groups; by organizing related database objects within a replication group, it is easier to administer many objects together. A replication group can exist at multiple replication sites. Replication environments support two basic types of sites: master sites and snapshot sites. A master site maintains a complete copy of all objects in a replication group, while a snapshot site supports read-only and updatable snapshots of the table data at an associated master site.

Advantages of replication: availability, since failure of a site containing relation R does not make R unavailable if replicas exist; parallelism, since queries on R may be processed by several nodes in parallel; and reduced data transfer, since R is available locally at each site containing a replica of R. Disadvantages of replication: increased cost of updates, since each replica of R must be updated, and increased complexity of concurrency control, since concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency-control mechanisms are implemented.

Data fragmentation: in horizontal fragmentation, each tuple of R is assigned to one or more fragments; in vertical fragmentation, the schema for relation R is split into several smaller schemas. All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property; a special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.

Advantages of fragmentation: horizontal fragmentation allows parallel processing on fragments of a relation and allows a relation to be split so that tuples are located where they are most frequently accessed; vertical fragmentation allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed, while the tuple-id attribute allows efficient joining of the vertical fragments and parallel processing on a relation. Vertical and horizontal fragmentation can be mixed, and fragments may be successively fragmented to an arbitrary depth.

Data transparency is the degree to which a system user may remain unaware of the details of how and where the data items are stored in a distributed system. Network transparency, in its most general sense, refers to the ability of a protocol to transmit data over the network in a manner which is transparent (invisible) to those using the applications that use the protocol.
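The horizontal and vertical fragmentation rules described above can be sketched in a few lines of code. The following is only an illustration, not part of the lecture: the ACCOUNT relation, its attributes, and the split predicate are hypothetical examples.

```python
# Sketch of horizontal and vertical fragmentation of a relation,
# represented here as a list of dicts. The ACCOUNT relation and the
# selection predicate are hypothetical examples.
account = [
    {"acct_no": "A-101", "branch": "Hillside", "balance": 500},
    {"acct_no": "A-215", "branch": "Valleyview", "balance": 700},
    {"acct_no": "A-305", "branch": "Hillside", "balance": 350},
]

def horizontal_fragment(relation, predicate):
    """Each tuple is assigned to a fragment by a selection predicate."""
    return [t for t in relation if predicate(t)]

def vertical_fragment(relation, attributes, key="acct_no"):
    """The schema is split; the candidate key is kept in every fragment
    so the original relation can be rebuilt by a lossless join."""
    return [{a: t[a] for a in [key] + attributes} for t in relation]

hillside = horizontal_fragment(account, lambda t: t["branch"] == "Hillside")
balances = vertical_fragment(account, ["balance"])

print(len(hillside))   # number of tuples stored at the Hillside site
print(balances[0])     # key plus balance only
```

Note how the vertical fragment always carries the key: without it, the fragments could not be joined back losslessly.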

The term is often applied in the context of the X Window System, which is able to transmit graphical data over the network and integrate it seamlessly with applications running and displaying locally. In firewall technology, transparency can be defined at the networking (IP, or Internet) layer or at the application layer: transparency at the IP layer means the client targets the real IP address of the server, while transparency at the application layer means the client application uses the protocol in a different way.

Naming of data items - criteria:
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.

Centralized scheme - name server: a name server assigns all names; each site maintains a record of its local data items, and sites ask the name server to locate non-local data items. Advantages: it satisfies naming criteria 1, 2, and 3. Disadvantages: it does not satisfy naming criterion 4, the name server is a potential performance bottleneck, and the name server is a single point of failure.

Use of aliases: aliases are used to give meaningful alternate names to existing database tables. These can refer to tables which are local to the system as well as those which are remote. If a table is dropped, the associated aliases are not dropped on their own. An alias is a substitute for the three-part name of a table or view. An alias can be a maximum of 128 characters, qualified by an owner ID. You use the CREATE ALIAS and DROP ALIAS statements to manage aliases.

Creating and using alias names: to create an alias, use the CREATE ALIAS statement. You can create an alias for a table, a view, or a member of a table. For example, if there is a multiple-member file MYLIB.MYFILE with members MBR1 and MBR2, an alias can be created for the second member so that SQL can easily refer to it:

CREATE ALIAS MYLIB.MYMBR2_ALIAS FOR MYLIB.MYFILE (MBR2)

Invocation: this statement can be embedded in an application program or issued through the use of dynamic SQL statements. Authorization: the privileges held by the authorization ID of the statement must include at least one of the following: IMPLICIT_SCHEMA authority on the database, if the implicit or explicit schema name of the alias does not exist; CREATEIN privilege on the schema, if the schema name of the alias refers to an existing schema; or SYSADM or DBADM authority. Syntax description: alias-name names the alias; FOR table-name, view-name, nickname, or alias-name identifies the table, view, nickname, or alias for which alias-name is defined.

Query processing in distributed databases: for centralized systems, the primary criterion for measuring the cost of a particular strategy is the number of disk accesses. In a distributed system, other issues must be taken into account: the cost of data transmission over the network, and the potential gain in performance from having several sites process parts of the query in parallel.

Distributed transaction model: a transaction may access data at several sites. Each site has a local transaction manager, responsible for maintaining a log for recovery purposes and for participating in coordinating the concurrent execution of the transactions executing at that site.
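The cost trade-off just described for distributed query processing (disk accesses versus network transmission versus parallelism) can be made concrete with a toy calculation. All cost constants and block counts below are hypothetical illustration values, not figures from the lecture.

```python
# Sketch: in a distributed system, query cost is not only disk accesses
# but also network transmission. All figures here are hypothetical.
DISK_ACCESS_MS = 10.0      # cost per block read
NETWORK_MS_PER_MB = 8.0    # cost to ship one MB between sites

def strategy_cost(blocks_read, mb_shipped, parallel_sites=1):
    """Disk work divides across sites scanning in parallel;
    shipped data always pays the full network cost."""
    disk = blocks_read * DISK_ACCESS_MS / parallel_sites
    net = mb_shipped * NETWORK_MS_PER_MB
    return disk + net

# Strategy A: ship the whole relation to one site, then scan it there.
ship_all = strategy_cost(blocks_read=1000, mb_shipped=80)
# Strategy B: let 4 sites scan their own fragments in parallel and
# ship only the small result.
scan_local = strategy_cost(blocks_read=1000, mb_shipped=2, parallel_sites=4)

print(ship_all, scan_local)
```

The parallel, low-transfer plan wins by a wide margin here, which is exactly why a distributed optimizer cannot rank plans by disk accesses alone.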

Each site also has a transaction coordinator, which is responsible for starting the execution of transactions that originate at the site, distributing subtransactions to appropriate sites for execution, and coordinating the termination of each transaction that originates at the site.

System failure modes: failure of a site; loss of messages, handled by network transmission control protocols such as TCP/IP; failure of a communication link, handled by network protocols by routing messages via alternative links; and network partition, where a network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them (a subsystem may consist of a single node).

Commit protocols: commit protocols are used to ensure atomicity across sites. A transaction which executes at multiple sites must either be committed at all the sites or aborted at all the sites; it is not acceptable to have a transaction committed at one site and aborted at another.

Two-phase commit protocol (2PC): the protocol involves all the local sites at which the transaction executed. Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.

Phase 1 (obtaining a decision): the coordinator asks all participants to prepare to commit transaction T. Ci adds the record <prepare T> to the log, forces the log to stable storage, and sends prepare T messages to all sites at which T executed. Upon receiving the message, the transaction manager at a site determines whether it can commit the transaction. If not, it adds a record <no T> to the log and sends an abort T message to Ci. If the transaction can be committed, it adds the record <ready T> to the log, forces all records to stable storage, and sends a ready T message to Ci.

Phase 2 (recording the decision): T can be committed only if Ci received a ready T message from all the participating sites; otherwise T must be aborted. The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage the decision is irrevocable, even if failures occur. The coordinator then sends a message to each participant informing it of the decision (commit or abort), and the participants take the appropriate action locally.

Handling of failures - coordinator failure: if the coordinator fails while the commit protocol for T is executing, the participating sites must decide on T's fate. Handling of failures - network partition: if the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol.

Recovery and concurrency control: in-doubt transactions have a <ready T> log record but neither a <commit T> nor an <abort T> record. The recovering site must determine the commit-abort status of such transactions by contacting other sites; this can be slow and can potentially block recovery. Recovery algorithms can note lock information in the log.

Three-phase commit (3PC) - assumptions: no network partitioning; at any point, at least one site must be up; at most K sites (participants as well as coordinator) can fail. Phase 1 (obtaining a preliminary decision) is identical to 2PC phase 1: every site is ready to commit if instructed to do so. Phase 2 of 2PC is split into two phases, phase 2 and phase 3 of 3PC. In phase 2, the coordinator makes a decision as in 2PC.
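The two phases of 2PC just described can be sketched as a toy coordinator loop. This is only an illustration: the Participant class and its in-memory log lists are hypothetical, and a real implementation must force each log record to stable storage before sending the corresponding message, and must handle timeouts and site failures.

```python
# Toy sketch of two-phase commit. Participants and the in-memory logs
# are hypothetical; real 2PC forces each record to stable storage
# before the next message and handles timeouts and failures.
class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def on_prepare(self, t):
        # Phase 1: log the vote first, then answer <ready T> or <no T>.
        if self.can_commit:
            self.log.append(f"<ready {t}>")
            return "ready"
        self.log.append(f"<no {t}>")
        return "abort"

    def on_decision(self, t, decision):
        # Phase 2: record the coordinator's decision locally.
        self.log.append(f"<{decision} {t}>")

def two_phase_commit(t, coordinator_log, participants):
    coordinator_log.append(f"<prepare {t}>")
    votes = [p.on_prepare(t) for p in participants]        # phase 1
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"<{decision} {t}>")            # phase 2
    for p in participants:
        p.on_decision(t, decision)
    return decision

log = []
sites = [Participant("s1", True), Participant("s2", True)]
print(two_phase_commit("T1", log, sites))   # all ready -> commit
sites.append(Participant("s3", False))
print(two_phase_commit("T2", log, sites))   # one <no T> vote -> abort everywhere
```

The second call shows the key property of the protocol: a single negative vote forces the same abort decision at every site, so the transaction is never committed at one site and aborted at another.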

This decision, called the precommit decision, is recorded at multiple (at least K) sites. In phase 3, the coordinator sends a commit or abort message to all participating sites. Under 3PC, knowledge of the precommit decision can be used to commit despite coordinator failure, which avoids the blocking problem as long as no more than K sites fail. Drawbacks: higher overheads, and the assumptions may not be satisfied in practice.

Alternative models: carry out transactions by sending messages. Code to handle messages must be carefully designed to ensure atomicity and durability properties for updates and to deal with error conditions; even assuming guaranteed message delivery, the code has to take care of a variety of failure situations. Persistent messaging and workflows: workflows provide a general model of transactional processing involving multiple sites and possibly human processing of certain steps.

Concurrency control: we modify concurrency-control schemes for use in a distributed environment. We assume that each site participates in the execution of a commit protocol to ensure global transaction atomicity, and we assume that all replicas of any item are updated.

Single-lock-manager approach: the system maintains a single lock manager that resides in a single chosen site, say Si. When a transaction needs to lock a data item, it sends a lock request to Si, and the lock manager determines whether the lock can be granted immediately. If yes, the lock manager sends a message to the site which initiated the request; if no, the request is delayed until it can be granted, at which time a message is sent to the initiating site. Advantages of the scheme: simple implementation and simple deadlock handling. Disadvantages: the lock-manager site becomes a bottleneck, and the system is vulnerable to lock-manager site failure.

Distributed lock manager: in this approach, the functionality of locking is implemented by lock managers at each site. Lock managers control access to local data items and cooperate for deadlock detection. There are several variants of this approach.

Primary copy: choose one replica of a data item to be the primary copy; the site containing that replica is called the primary site for the data item, and different data items can have different primary sites.

Majority protocol: the local lock manager at each site administers lock and unlock requests for the data items stored at that site. When a transaction wishes to lock an unreplicated data item Q residing at site Si, a message is sent to Si's lock manager.

Biased protocol: the local lock manager at each site works as in the majority protocol; however, requests for shared locks are handled differently than requests for exclusive locks. Shared locks: when a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site containing a replica of Q. Exclusive locks: when a transaction needs to exclusively lock data item Q, it requests a lock on Q from the lock managers at all sites containing a replica of Q.

Quorum consensus: a generalization of both the majority and biased protocols. Each site is assigned a weight; let S be the total of all site weights. Choose two values, a read quorum Qr and a write quorum Qw, such that Qr + Qw > S and 2 × Qw > S. Quorums can be chosen, and S computed, separately for each item.
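The two quorum conditions above (Qr + Qw > S, so reads see the latest write, and 2 × Qw > S, so two writes cannot proceed on disjoint sets of sites) can be checked mechanically. A minimal sketch, with hypothetical site weights:

```python
# Sketch: verify the quorum-consensus conditions Qr + Qw > S and
# 2 * Qw > S. The site weights below are hypothetical.
def valid_quorums(weights, qr, qw):
    """A read quorum must overlap every write quorum (Qr + Qw > S),
    and any two write quorums must overlap (2 * Qw > S)."""
    s = sum(weights)
    return qr + qw > s and 2 * qw > s

weights = [1, 1, 1, 1, 1]                  # five equally weighted sites, S = 5
print(valid_quorums(weights, qr=2, qw=4))  # True: cheap reads, expensive writes
print(valid_quorums(weights, qr=3, qw=3))  # True: simple majority on both
print(valid_quorums(weights, qr=1, qw=3))  # False: 1 + 3 is not greater than 5
```

Choosing a small Qr and a large Qw favours read-heavy workloads; Qr = Qw = majority reproduces the majority protocol, which is the sense in which quorum consensus generalizes it.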

Deadlock handling: consider the following two transactions and history, with item X and transaction T1 at site 1, and item Y and transaction T2 at site 2, where T1 writes X and then Y while T2 writes Y and then X. Centralized approach: a global wait-for graph is constructed and maintained at a single site, the deadlock-detection coordinator. The real graph is the real but unknown state of the system; the constructed graph is the approximation generated by the controller during the execution of its algorithm.

Multidatabase systems: a unified view of data requires agreement on a common data model (typically the relational model) and on a common conceptual schema, since different names may be used for the same relation or attribute, and the same relation or attribute name may mean different things. It also requires agreement on a single representation of shared data (for example, data types, precision, character sets such as ASCII versus EBCDIC, and sort-order variations), agreement on units of measure, and handling of name variations (for example, Köln versus Cologne, or Mumbai versus Bombay).

Query processing: there are several issues in query processing in a heterogeneous database. Schema translation: write a wrapper for each data source to translate its data to a global schema; wrappers must also translate updates on the global schema into updates on the local schema. Limited query capabilities: some data sources allow only restricted forms of selections, so queries have to be broken up and processed partly at the source and partly at a different site. Other issues are the removal of duplicate information when sites have overlapping information, deciding at which sites to execute the query, and global query optimization. Mediator systems are systems that integrate multiple heterogeneous data sources by providing an integrated global view and providing query facilities on that global view.

Directory systems: typical kinds of directory information include employee information such as name, ID, email, phone, and office address, and even personal information, to be accessed from multiple places. Entries are organized by name or identifier, or organized by properties for reverse lookup, to find entries matching specific requirements. Distributed directory trees: organizational information may be split into multiple directory information trees (DITs). The suffix of a DIT gives the RDN to be tagged onto all entries to get an overall DN. Organizations often split up DITs based on geographical location or on organizational structure. A node in a DIT may be a referral to a node in another DIT.

Summary: now, in the end, let us summarize what we have learnt in this lecture. A distributed DBMS consists of a single logical database that is split into a number of partitions or fragments. Each partition is stored on one or more computers under the control of a separate DBMS, with the computers connected by a communication network. A distributed DBMS provides a number of advantages over a centralized DBMS, but it also has several disadvantages. A distributed system can be classified as a homogeneous or a heterogeneous distributed DBMS: if all sites use the same DBMS product, it is called a homogeneous distributed database system; in a heterogeneous distributed system, different sites may run different DBMS products, which need not be based on the same underlying data model. A distributed DBMS consists of the following components: computer workstations, a computer network, communication media, a transaction processor, and a data processor.
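As a closing illustration of the centralized deadlock-detection approach discussed earlier: a deadlock corresponds to a cycle in the global wait-for graph maintained by the coordinator. A minimal sketch, where the graph contents are hypothetical and edges mean "waits for":

```python
# Sketch: centralized deadlock detection as cycle detection in a
# global wait-for graph. An edge A -> B means "A waits for B".
def has_cycle(wait_for):
    """Depth-first search with a recursion stack over the graph."""
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in wait_for.get(node, []):
            if nxt in visiting:
                return True            # back edge: a cycle, i.e. deadlock
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(t not in done and dfs(t) for t in wait_for)

# T1 waits for T2 (which holds Y) while T2 waits for T1 (which holds X):
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True - deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))      # False - T2 can finish
```

The coordinator runs exactly this kind of test on its constructed graph; because that graph is only an approximation of the real, unknown state, false detections are possible when sites report edges at different times.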