Archive for March, 2014

Fixing slony error (invalid input for txid_snapshot) without rebuilding slony replication

This is probably a slony bug that I will log with the slony community, but for now I have figured out a workaround that keeps slony running after the procedure below:

Environment:
OS: CentOS 6
PostgreSQL version: 9.3.3
Slony version: 2.2.2 (I have seen this issue in earlier versions as well, at least as far back as 2.1.3 and possibly before)

NOTE: This fix is specifically for the error "invalid input for txid_snapshot". It has worked for me numerous times and I thought it might be useful for someone. Please be aware that manually updating slony catalog tables is not recommended and may cause further issues in slony replication. Use the procedure below at your own risk.

The Error:

ERROR:  invalid input for txid_snapshot: "233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372"

Description:
The error itself does not make it obvious what is invalid about the string in the slony event logs, but a little more digging through the slony schema suggested that there is a duplicate transaction id in the event's snapshot that slony cannot parse. Looking at the above error we see:

"233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372"

The duplicate entry is 233478360,233478360: the same transaction id appears twice in the snapshot's in-progress list, so the event cannot be processed.
NOTE: a txid_snapshot is formatted as xmin:xmax:xip_list, so we are only interested in the transaction ids listed after the second colon. In the example below, that in-progress list (shown in bold on the original post) is what we are concerned about:
"233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372"

Procedure to fix the above error:

1. Connect to the master replication node with psql

#psql testdb

2. Search the transaction id in the sl_event table to find the ev_seqno like below

testdb=# select * from _slonyschema.sl_event where ev_snapshot::text like '233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372';
 ev_origin |  ev_seqno  |         ev_timestamp          |                             ev_snapshot                              | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+-----------------------------------------------------------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         2 | 5002155051 | 2014-03-11 16:27:20.155651+00 | 233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372 | SYNC    |          |          |          |          |          |          |          | 
(1 row)

NOTE: make sure the query returns exactly one row so that you update only that row when removing the duplicate transaction id. In this case the duplicate is 233478360,233478360.

3. Update the row in sl_event table

testdb=# UPDATE _slonyschema.sl_event set ev_snapshot='233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478371,233478372' where ev_origin=2 and ev_seqno=5002155051;

NOTE: I updated the sl_event table and took out the duplicate value from ev_snapshot.

4. Restart the slony process. Restarting slony clears the error, as slony retries the event with the updated information. There may be more duplicates in the sl_event table; they will show up in the slony log file, and you would need to repeat the above procedure for each of them (see the query sketch below for a way to find them all at once). In most cases, though, slony starts up fine and replication is back online.

Remember this is a temporary solution (be careful while making the above change); updating the wrong row might mess up slony. Then again, if you are seeing this error your replication is broken anyway, so you might as well give the above procedure a try and let me know if it worked for you.


PGREPLAY (replicating production load in a test environment)

Ever wondered why all the testing passes, yet things start to look bad on the database as soon as real production load hits? If only we could replay the production database load in a test environment, we would have some real data to analyze. I was given exactly that task, and pgreplay came to the rescue.
I find this tool very helpful when:

–> I want to test a new version of the PostgreSQL database
–> I need some statistics on how the database would behave on new hardware
–> I make a configuration change to the database and want to know how it would behave under real production load

All of the above scenarios are worth spending time on, and they give you a clear idea of how your database will behave once it is live in production.

About the module:

pgreplay reads the SQL statements in a PostgreSQL log file, extracts them and executes them in the same order and relative time against a PostgreSQL database cluster.

Installation:

Assuming you have a test box with the same data as production and are using PostgreSQL as your database, download and install pgreplay from the link below (I would install this on a test box, not on production).

su - postgres
wget http://pgfoundry.org/frs/download.php/3345/pgreplay-1.2.0.tar.gz
tar -xvf pgreplay-1.2.0.tar.gz
cd pgreplay-1.2.0
./configure --with-postgres=/opt/PostgreSQL/9.2.4/bin
make
sudo make install

Configuration:

pgreplay needs to read a postgres log file, and that log has to be in a specific format. The change requires a restart of your postgres database. Make a copy of the postgresql.conf file before you go ahead with this change.

log_destination = 'stderr'
logging_collector = on
log_directory = '/somemountpoint/pg_log' (this is the location where your postgres log file will be generated. Make sure postgres user has write privs)
log_min_duration_statement = 100 (this value is set to milliseconds. If you want to log all statements put 0)
log_min_messages = error  (or more)
   (if you know that you have no cancel requests, 'log' will do)
log_min_error_statement = log  (or more)
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'  (if you don't use CSV logging)
log_statement = 'all'
lc_messages must be set to English (the encoding does not matter)
bytea_output = escape  (from version 9.0 on, only if you want to replay
                        the log on 8.4 or earlier)

That's it! Restart postgres and logging will start in a format that pgreplay can parse.

Usage:

There are two ways to replay the postgres logs to another postgres database

1. pgreplay [<parse options>] [<replay options>] [<infile>] 

This parses a PostgreSQL log file and replays the
statements against a database.

2. pgreplay -f [<parse options>] [-o <outfile>] [<infile>]

This parses a PostgreSQL log file and writes the
contents to a “replay file” that can be replayed later with -r.

I like the second option better because I can then copy the parsed file to any database host and do whatever testing is necessary.
Now that you have a couple of postgres log files in the log location you specified in postgresql.conf, all you have to do is copy the logs over to a test box that has pgreplay installed and run the procedure below:

I first create a parse file:

pgreplay -f -o test.replay /somemountpoint/pg_log/postgresql.log
Notice that the test.replay file will be created in the current working directory.

Now just replay the test.replay file and watch the postgres database replay all the statements from production on the test box:

pgreplay -r -j test.replay

There are a lot more options that you can try out with pgreplay; I will not explain them all here, but feel free to ask any questions about the module. More information on pgreplay is available here.


WAL-E (Incremental backups with S3 support)

It has been a while since I wrote on my blog, and I thought this would be a good addition to the knowledge base. I came across this online backup tool that I believe is worth writing about. Incremental backups are taken care of, and base backups are compressed and sent across to S3. No more writing shell scripts to ship them to S3, which I thought was pretty neat. I did struggle a bit to get it installed on a CentOS box and did not find much online help for it. Hopefully this will help someone get this module up and going in no time.

Dependencies:

python (>= 2.6)
lzop
pv
And of course we are talking about the postgres database here, so any postgres version >= 8.4 should work with it

First we have to get Python 2.6 or greater, which for some reason I could not get from the CentOS repo even after a yum update.

1. Download Python from the link below

wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz

Extract the file and compile it:
tar -xvf Python-2.7.6.tar.xz
cd Python-2.7.6
./configure
make
sudo make install

NOTE: if you are having trouble extracting the tar.xz file, install xz first:
wget http://tukaani.org/xz/xz-5.0.5.tar.gz
tar -xvf xz-5.0.5.tar.gz
cd xz-5.0.5
./configure
make
sudo make install

2. We will need Setuptools as well to have our module compiled

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py

Then install it for the Python 2.7 you installed above:
sudo /usr/local/bin/python2.7 ez_setup.py

NOTE: Here I was set back again because the ez_setup.py script was trying to download setuptools with a certificate check. All I did was add --no-check-certificate to the wget command in the script, like this:

def download_file_wget(url, target):
    # --no-check-certificate added so wget skips the SSL certificate verification
    cmd = ['wget', url, '--no-check-certificate', '--quiet', '--output-document', target]
    _clean_check(cmd, target)

3. Install pip using the newly installed setuptools:

sudo easy_install-2.7 pip

4. Install virtualenv for Python 2.7

pip2.7 install virtualenv

5. Install wal-e itself

sudo pip install wal-e

Now that we have wal-e installed we are going to make a couple of configuration changes that will enable wal-e to work with S3.

1. Create an environment directory to use wal-e

mkdir -p /etc/wal-e.d/env
chown -R postgres:postgres /etc/wal-e.d
echo "secret_key_goes_here"> /etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
echo "access_id_for_s3_goes_here"> /etc/wal-e.d/env/AWS_ACCESS_KEY_ID
echo 's3://specify_bucket_name/directory_if_you_have_created_one_in_the_bucket'> /etc/wal-e.d/env/WALE_S3_PREFIX

2. Since this is going to be an incremental backup setup, we have to turn archiving on in postgresql.conf:

wal_level = archive
archive_mode = yes
archive_command = 'envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-push %p'
archive_timeout = 60

NOTE: you will have to restart your postgres database for these changes to take effect

That's it! Now you can take a base backup and forget about the incrementals, as wal-e automatically ships the WAL files to S3 :) thanks to the archive_command we set up in postgresql.conf.

1. To take a base backup:

su postgres
envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-push /path/to/your/datadir

You can always list the backups that you have on S3:
envdir /etc/wal-e.d/env /usr/local/bin/wal-e backup-list

name last_modified expanded_size_bytes wal_segment_backup_start wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
base_00000001000000AD000000C7_00000040 2014-03-06T17:51:26.000Z 00000001000000AD000000C7 00000040

2. Deleting old backups, or retaining only a certain number of them, is easy as well (prefix these commands with envdir /etc/wal-e.d/env and the full path to wal-e, as above).

If you want to delete a specific backup:
wal-e delete [--confirm] base_00000004000002DF000000A6_03626144

Or you can delete all backups older than a given base backup by using the before clause:
wal-e delete [--confirm] before base_00000004000002DF000000A6_03626144

To retain only a number of recent backups:
wal-e delete [--confirm] retain 5

3. Restoring using backup-fetch

To restore the complete database on a separate box (backup-fetch expects the target data directory and a backup name; LATEST picks the most recent base backup):
envdir /etc/wal-e.d/env wal-e backup-fetch /path/to/your/datadir LATEST

WAL fetch can also be accomplished with wal-e; it takes the WAL segment name and a destination path, and is normally wired into restore_command (see the example below):
envdir /etc/wal-e.d/env wal-e wal-fetch <wal_segment> <destination_path>

There are a couple more things that can be done with wal-e, like encrypting backups, managing tablespace backups (this is important if you have user-defined tablespaces in your database), controlling the I/O of the base backup, increasing the throughput of wal-push, etc. You might want to look into those options before putting this in production, as base backup I/O can take a decent amount of CPU overhead if not configured properly. Here is the link that will help with further information on this module.

Feel free to ask questions, and I hope this helped.
