Fixing slony error (invalid input for txid_snapshot) without rebuilding slony replication

This is probably a slony bug that I will log with the slony community but for now I have figured out a workaround that would keep slony running after the below procedure:

Environment:
OS: CentOS 6
PostgreSQL version: 9.3.3
Slony version: 2.2.2 (I have seen this issue in earlier versions also since the origin of 2.1.3 or maybe prior)

NOTE: This fix is specifically for the error invalid input for txid_snapshot. It has worked for me numurous times and I thought this might be useful for someone. Please be aware that manually updating slony catalog tables is not recommended and may cause further issues in slony replication. Use the below procedure at your own risk.

The Error:

ERROR:  invalid input for txid_snapshot: "233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372"

Description:
The above error although is not clear as to what the invalid string is in the slony event logs but a little more searching for it through the slony schema suggested that there is a duplication entry for a transaction id in the event log that slony is not happy about. Looking at the above error we see:

"233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372"

The duplicate entry for the transaction id is 233478360,233478360. It looks like it has two entries of the same transaction id in the event log and cannot process the request.
NOTE: we are only looking at the transaction id’s that are listed after the colon entries for the transaction list. For eg the entries below in bold is what we are concerned about.
“233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372

Procedure to fix the above error:

1. Connect to the master replication node with psql

#psql testdb

2. Search the transaction id in the sl_event table to find the ev_seqno like below

testdb=# select * from _slonyschema.sl_event where ev_snapshot::text like '233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372';
 ev_origin |  ev_seqno  |         ev_timestamp          |                                             ev_snapshot                                             | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_
data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8 
-----------+------------+-------------------------------+-----------------------------------------------------------------------------------------------------+---------+----------+----------+----------+----
------+----------+----------+----------+----------
         2 | 5002155051 | 2014-03-11 16:27:20.155651+00 | 233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478360,233478371,233478372 | SYNC    |          |          |          |    
     
(1 row)

NOTE: make sure you have one row listed in there so that you can update the row by removing the duplicate value of the transaction id. In this case it is 233478360,233478360

3. Update the row in sl_event table

testdb=# UPDATE _slonyschema.sl_event set ev_snapshot='233477269:233478374:233477269,233477846,233477978,233478062,233478210,233478287,233478308,233478314,233478360,233478371,233478372' where ev_origin=2 and ev_seqno=5002155051;

NOTE: I updated the sl_event table and took out the duplicate value from ev_snapshot.

4. Restart slony process, restarting slony will clear out the errors as slony retries with the updated information. There could be more duplicates in the sl_event table that the slony log file may show up and you would need to follow the above procedure for all of them. But in most cases slony starts up fine and replication is back online.

Remember this is a temperory solution(be careful while you make the above change) and updating the wrong row might mess up slony but I guess if you have the above error in replication it just means you have to rebuild it anyway. So might as well give the above procedure a try and let me know if it worked for you.

Advertisements
  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: