HA Cluster (corosync/pacemaker) VirtualDomain Problem

Roonix · Post by **Roonix** » 12.12.2019 11:23:09

Hello dear forum,

we have set up a test system with corosync/pacemaker, one DRBD resource and one VirtualDomain. The whole thing is managed through the LCMC frontend. A VirtualDomain was created there and its VHD stored on the DRBD device, which is mounted at /var/lib/libvirt/images on the respective node. The domain is known on both nodes.

Initially, after the active Node1 had gone into standby, the domain would not start on Node2. The DRBD resource, however, was mounted on Node2 without any problems.

After a

Code:

crm resource cleanup

it suddenly worked.

So I created a new VM on Node2, stored the VHD on the DRBD device, copied the XML file to Node1 and registered it there with virsh define. Then I brought Node1 back online and put Node2 into standby. The first domain starts correctly as expected, but the second one now does not. A resource cleanup did not help.

crm_mon reports the following error:

Code:

Failed Resource Actions:
* res_VirtualDomain_debian10-lcmctest_start_0 on CHUCK 'unknown error' (1): call=331, status=complete, exitreason='Failed to start virtual domain debian10-lcmctest.',
    last-rc-change='Thu Dec 12 11:01:39 2019', queued=0ms, exec=2177ms

Unfortunately I cannot draw any conclusions about the cause from this. Can anyone help me?

Regards and many thanks
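
The `exitreason` in the crm_mon output only says that the start failed, not why. Since the XML file was copied to the other node by hand, one cheap check is whether the config file and the disk images it references are actually present on the node that is supposed to start the domain. A minimal sketch, assuming the disk images appear as `<source file='...'/>` entries with one entry per line in the XML; `check_domain_files` is a hypothetical helper, not part of libvirt or of the VirtualDomain resource agent:

```shell
#!/bin/sh
# check_domain_files: hypothetical helper, not part of libvirt or pacemaker.
# Prints one line per problem and returns non-zero if the domain XML or any
# disk image it references is missing on this node.
# Assumption: image paths contain no whitespace and each <source file=...>
# sits on its own line, as in typical libvirt domain XML.
check_domain_files() {
    xml="$1"
    if [ ! -r "$xml" ]; then
        echo "missing config: $xml"
        return 1
    fi
    rc=0
    # Pull every <source file='...'/> path out of the XML (either quote style).
    for img in $(sed -n "s/.*<source file=[\"']\([^\"']*\)[\"'].*/\1/p" "$xml"); do
        if [ ! -e "$img" ]; then
            echo "missing disk image: $img"
            rc=1
        fi
    done
    return $rc
}
```

Running `check_domain_files /etc/libvirt/qemu/debian10-lcmctest.xml` on CHUCK while the DRBD filesystem is mounted there, and then starting the domain by hand with `virsh -c qemu:///system start debian10-lcmctest`, would normally surface libvirt's own error message, which tends to be more specific than the cluster's.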

Colttt · Post by **Colttt** » 13.12.2019 09:13:18

First of all, the config would be interesting, i.e. the output of

Code:

crm conf sh

and then the log from /var/log/corosync/corosync.log. Ideally, empty it and run a resource cleanup; then you can see quite clearly what is going wrong.

Roonix · Post by **Roonix** » 27.12.2019 14:28:23

Colttt wrote:
13.12.2019 09:13:18
First of all, the config would be interesting, i.e. the output of
Code:
crm conf sh
and then the log from /var/log/corosync/corosync.log. Ideally, empty it and run a resource cleanup; then you can see quite clearly what is going wrong.

Hello Colttt,
thank you for your quick reply, and sorry for my very late one. I will be back from vacation on 6 January and can post the information then. Thanks a lot!

Roonix · Post by **Roonix** » 07.01.2020 10:51:43

Colttt wrote:
13.12.2019 09:13:18
First of all, the config would be interesting, i.e. the output of
Code:
crm conf sh
and then the log from /var/log/corosync/corosync.log. Ideally, empty it and run a resource cleanup; then you can see quite clearly what is going wrong.

Hello Colttt,
I am back in the office now and went straight to the shell. Here is the output:

crm conf sh Node1

Code:

node 1: CHUCK \
        attributes standby=off
node 2: NORRIS \
        attributes standby=on
primitive res_Filesystem_r0 Filesystem \
        params device="/dev/drbd/by-res/r0/0" directory="/drbd/r0" fstype=none \
        operations $id=res_Filesystem_r0-operations \
        op start interval=0 timeout=60 \
        op stop interval=0 timeout=60 \
        op monitor interval=20 timeout=40 start-delay=0 \
        op notify interval=0 timeout=60 \
        meta target-role=started
primitive res_VirtualDomain_VCAT VirtualDomain \
        params config="/etc/libvirt/qemu/VCAT.xml" hypervisor="qemu:///system" save_config_on_stop=false sync_config_on_stop=false \
        operations $id=res_VirtualDomain_VCAT-operations \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=90 \
        op monitor interval=10 timeout=30 start-delay=0 \
        op migrate_from interval=0 timeout=60 \
        op migrate_to interval=0 timeout=120 \
        meta target-role=started \
        utilization cpu=6 hv_memory=8192
primitive res_VirtualDomain_debian10-lcmctest VirtualDomain \
        params config="/etc/libvirt/qemu/debian10-lcmctest.xml" hypervisor="qemu:///system" save_config_on_stop=false sync_config_on_stop=false \
        operations $id=res_VirtualDomain_debian10-lcmctest-operations \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=90 \
        op monitor interval=10 timeout=30 start-delay=0 \
        op migrate_from interval=0 timeout=60 \
        op migrate_to interval=0 timeout=120 \
        meta target-role=started \
        utilization cpu=1 hv_memory=1024
primitive res_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 unfence_extra_args=false \
        operations $id=res_drbd_r0-operations \
        op start interval=0 timeout=240 \
        op promote interval=0 timeout=90 \
        op demote interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=10 timeout=20 start-delay=0 \
        op reload interval=0 timeout=30 \
        op notify interval=0 timeout=90 \
        meta target-role=started
ms ms_drbd_r0 res_drbd_r0 \
        meta clone-max=2 notify=true interleave=true target-role=started
colocation col_res_Filesystem_r0_ms_drbd_r0 inf: res_Filesystem_r0 ms_drbd_r0:Master
colocation col_res_VirtualDomain_VCAT_res_Filesystem_r0 inf: res_VirtualDomain_VCAT res_Filesystem_r0
colocation col_res_VirtualDomain_debian10-lcmctest_res_Filesystem_r0 inf: res_VirtualDomain_debian10-lcmctest res_Filesystem_r0
location loc_res_VirtualDomain_VCAT_CHUCK res_VirtualDomain_VCAT 0: CHUCK
location loc_res_VirtualDomain_VCAT_NORRIS res_VirtualDomain_VCAT 2: NORRIS
order ord_ms_drbd_r0_res_Filesystem_r0 inf: ms_drbd_r0:promote res_Filesystem_r0:start
order ord_res_Filesystem_r0_res_VirtualDomain_VCAT inf: res_Filesystem_r0 res_VirtualDomain_VCAT
order ord_res_Filesystem_r0_res_VirtualDomain_debian10-lcmctest inf: res_Filesystem_r0 res_VirtualDomain_debian10-lcmctest
property cib-bootstrap-options: \
        stonith-enabled=false \
        dc-version=2.0.1-9e909a5bdd \
        have-watchdog=false \
        cluster-infrastructure=corosync
rsc_defaults rsc-options: \
        target-role=started

crm conf sh Node2

Code:

node 1: CHUCK \
        attributes standby=off
node 2: NORRIS \
        attributes standby=on
primitive res_Filesystem_r0 Filesystem \
        params device="/dev/drbd/by-res/r0/0" directory="/drbd/r0" fstype=none \
        operations $id=res_Filesystem_r0-operations \
        op start interval=0 timeout=60 \
        op stop interval=0 timeout=60 \
        op monitor interval=20 timeout=40 start-delay=0 \
        op notify interval=0 timeout=60 \
        meta target-role=started
primitive res_VirtualDomain_VCAT VirtualDomain \
        params config="/etc/libvirt/qemu/VCAT.xml" hypervisor="qemu:///system" save_config_on_stop=false sync_config_on_stop=false \
        operations $id=res_VirtualDomain_VCAT-operations \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=90 \
        op monitor interval=10 timeout=30 start-delay=0 \
        op migrate_from interval=0 timeout=60 \
        op migrate_to interval=0 timeout=120 \
        meta target-role=started \
        utilization cpu=6 hv_memory=8192
primitive res_VirtualDomain_debian10-lcmctest VirtualDomain \
        params config="/etc/libvirt/qemu/debian10-lcmctest.xml" hypervisor="qemu:///system" save_config_on_stop=false sync_config_on_stop=false \
        operations $id=res_VirtualDomain_debian10-lcmctest-operations \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=90 \
        op monitor interval=10 timeout=30 start-delay=0 \
        op migrate_from interval=0 timeout=60 \
        op migrate_to interval=0 timeout=120 \
        meta target-role=started \
        utilization cpu=1 hv_memory=1024
primitive res_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 unfence_extra_args=false \
        operations $id=res_drbd_r0-operations \
        op start interval=0 timeout=240 \
        op promote interval=0 timeout=90 \
        op demote interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=10 timeout=20 start-delay=0 \
        op reload interval=0 timeout=30 \
        op notify interval=0 timeout=90 \
        meta target-role=started
ms ms_drbd_r0 res_drbd_r0 \
        meta clone-max=2 notify=true interleave=true target-role=started
colocation col_res_Filesystem_r0_ms_drbd_r0 inf: res_Filesystem_r0 ms_drbd_r0:Master
colocation col_res_VirtualDomain_VCAT_res_Filesystem_r0 inf: res_VirtualDomain_VCAT res_Filesystem_r0
colocation col_res_VirtualDomain_debian10-lcmctest_res_Filesystem_r0 inf: res_VirtualDomain_debian10-lcmctest res_Filesystem_r0
location loc_res_VirtualDomain_VCAT_CHUCK res_VirtualDomain_VCAT 0: CHUCK
location loc_res_VirtualDomain_VCAT_NORRIS res_VirtualDomain_VCAT 2: NORRIS
order ord_ms_drbd_r0_res_Filesystem_r0 inf: ms_drbd_r0:promote res_Filesystem_r0:start
order ord_res_Filesystem_r0_res_VirtualDomain_VCAT inf: res_Filesystem_r0 res_VirtualDomain_VCAT
order ord_res_Filesystem_r0_res_VirtualDomain_debian10-lcmctest inf: res_Filesystem_r0 res_VirtualDomain_debian10-lcmctest
property cib-bootstrap-options: \
        stonith-enabled=false \
        dc-version=2.0.1-9e909a5bdd \
        have-watchdog=false \
        cluster-infrastructure=corosync
rsc_defaults rsc-options: \
        target-role=started

And the output of /var/log/corosync/corosync.log:

Code:

root@CHUCK: ~ # tail -f /var/log/corosync/corosync.log
Jan 07 10:10:05 [1063] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 10:10:05 [1063] CHUCK corosync notice  [QUORUM] Members[1]: 1
Jan 07 10:10:05 [1063] CHUCK corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 07 10:10:10 [1063] CHUCK corosync notice  [TOTEM ] A new membership (1:128) was formed. Members joined: 2
Jan 07 10:10:10 [1063] CHUCK corosync warning [CPG   ] downlist left_list: 0 received
Jan 07 10:10:10 [1063] CHUCK corosync warning [CPG   ] downlist left_list: 0 received
Jan 07 10:10:10 [1063] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 10:10:10 [1063] CHUCK corosync notice  [QUORUM] This node is within the primary component and will provide service.
Jan 07 10:10:10 [1063] CHUCK corosync notice  [QUORUM] Members[2]: 1 2
Jan 07 10:10:10 [1063] CHUCK corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Nothing actually happened in the log. I use LCMC to configure the whole thing. I also just noticed that when I put Node1 into standby, the DRBD no longer comes up on Node2.

I hope I have provided all the necessary information and that you can help me!

Regards, Roonix
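
The excerpt above contains only corosync membership messages, so it is not surprising that it shows nothing about the failed start. With Pacemaker 2.0.x (the posted dc-version is 2.0.1), resource agent messages usually end up in Pacemaker's own detail log rather than in corosync.log. A sketch, assuming the default path /var/log/pacemaker/pacemaker.log (it may differ if PCMK_logfile is set in /etc/default/pacemaker); `ra_messages` is a hypothetical helper:

```shell
#!/bin/sh
# ra_messages: hypothetical helper. Prints log lines that mention the given
# resource and look like errors, warnings or failures.
# The default log path is an assumption for Pacemaker 2.0.x; pass a
# different file as the second argument if the setup logs elsewhere.
ra_messages() {
    rsc="$1"
    log="${2:-/var/log/pacemaker/pacemaker.log}"
    # First narrow to the resource (fixed string), then to problem lines.
    grep -F -- "$rsc" "$log" | grep -E -i 'error|warn|fail'
}
```

Calling `ra_messages res_VirtualDomain_debian10-lcmctest` right after a `crm resource cleanup` would show what the resource agent reported during the retried start. On systemd hosts, `journalctl -u pacemaker` is another place to look.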

Colttt · Post by **Colttt** » 07.01.2020 11:23:30

Ooh, LCMC, I didn't know that one at all. I always prefer to break things on the console.

OK, the config etc. all looks fine so far. You only need to show me the config from one node; it should be the same on both.

OK, empty the log and clean up the resource; the cluster should then restart it. Then post the log to a nopaste service and we will see what is going wrong, hopefully.

Code:

> /var/log/corosync/corosync.log
crm resource cleanup

Roonix · Post by **Roonix** » 07.01.2020 17:30:01

It is the standard here at the company and gives a good overview. In my opinion the disadvantages outweigh that, though. Java alone is already enough.

Well, whatever, suddenly everything works again. This morning during the standby test not even the DRBD came up; now the DRBD and all VMs come up. Before my vacation, not all VMs came up after a standby, which is what prompted this post in the first place. I am confused. And like this it certainly cannot be used in production.

Here is the log again:

Code:

root@CHUCK: ~ # cat /var/log/corosync/corosync.log
Jan 07 17:05:02 [1063] CHUCK corosync notice  [TOTEM ] A new membership (1:132) was formed. Members left: 2
Jan 07 17:05:02 [1063] CHUCK corosync warning [CPG   ] downlist left_list: 1 received
Jan 07 17:05:02 [1063] CHUCK corosync notice  [QUORUM] Members[1]: 1
Jan 07 17:05:02 [1063] CHUCK corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 07 17:05:03 [1063] CHUCK corosync notice  [MAIN  ] Node was shut down by a signal
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Unloading all Corosync service engines.
Jan 07 17:05:03 [1063] CHUCK corosync info    [QB    ] withdrawing server sockets
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Jan 07 17:05:03 [1063] CHUCK corosync info    [QB    ] withdrawing server sockets
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Jan 07 17:05:03 [1063] CHUCK corosync info    [QB    ] withdrawing server sockets
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Jan 07 17:05:03 [1063] CHUCK corosync info    [QB    ] withdrawing server sockets
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 07 17:05:03 [1063] CHUCK corosync info    [QB    ] withdrawing server sockets
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Jan 07 17:05:03 [1063] CHUCK corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
Jan 07 17:05:03 [1063] CHUCK corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Jan 07 17:08:52 [999] CHUCK corosync notice  [MAIN  ] Corosync Cluster Engine 3.0.1 starting up
Jan 07 17:08:52 [999] CHUCK corosync info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf snmp pie relro bindnow
Jan 07 17:08:52 [999] CHUCK corosync warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jan 07 17:08:52 [999] CHUCK corosync warning [MAIN  ] Please migrate config file to nodelist.
Jan 07 17:08:52 [999] CHUCK corosync notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
Jan 07 17:08:52 [999] CHUCK corosync notice  [TOTEM ] The network interface [192.168.255.212] is now up.
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jan 07 17:08:52 [999] CHUCK corosync info    [QB    ] server name: cmap
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jan 07 17:08:52 [999] CHUCK corosync info    [QB    ] server name: cfg
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 07 17:08:52 [999] CHUCK corosync info    [QB    ] server name: cpg
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jan 07 17:08:52 [999] CHUCK corosync warning [WD    ] Watchdog not enabled by configuration
Jan 07 17:08:52 [999] CHUCK corosync warning [WD    ] resource load_15min missing a recovery key.
Jan 07 17:08:52 [999] CHUCK corosync warning [WD    ] resource memory_used missing a recovery key.
Jan 07 17:08:52 [999] CHUCK corosync info    [WD    ] no resources configured.
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 07 17:08:52 [999] CHUCK corosync notice  [QUORUM] Using quorum provider corosync_votequorum
Jan 07 17:08:52 [999] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 07 17:08:52 [999] CHUCK corosync info    [QB    ] server name: votequorum
Jan 07 17:08:52 [999] CHUCK corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 07 17:08:52 [999] CHUCK corosync info    [QB    ] server name: quorum
Jan 07 17:08:52 [999] CHUCK corosync notice  [TOTEM ] adding new UDPU member {192.168.255.212}
Jan 07 17:08:52 [999] CHUCK corosync notice  [TOTEM ] adding new UDPU member {192.168.255.222}
Jan 07 17:08:52 [999] CHUCK corosync notice  [TOTEM ] A new membership (1:136) was formed. Members joined: 1
Jan 07 17:08:52 [999] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 17:08:52 [999] CHUCK corosync warning [CPG   ] downlist left_list: 0 received
Jan 07 17:08:52 [999] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 17:08:52 [999] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 17:08:52 [999] CHUCK corosync notice  [QUORUM] Members[1]: 1
Jan 07 17:08:52 [999] CHUCK corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 07 17:08:56 [999] CHUCK corosync notice  [TOTEM ] A new membership (1:140) was formed. Members joined: 2
Jan 07 17:08:56 [999] CHUCK corosync warning [CPG   ] downlist left_list: 0 received
Jan 07 17:08:56 [999] CHUCK corosync warning [CPG   ] downlist left_list: 0 received
Jan 07 17:08:56 [999] CHUCK corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jan 07 17:08:56 [999] CHUCK corosync notice  [QUORUM] This node is within the primary component and will provide service.
Jan 07 17:08:56 [999] CHUCK corosync notice  [QUORUM] Members[2]: 1 2
Jan 07 17:08:56 [999] CHUCK corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Colttt · Post by **Colttt** » 08.01.2020 12:04:03

All right, since it now magically works again, test the whole thing: move the resources, put one host into standby, then the other, and then hard power off the other one.

If all of that works, add a new VM and check whether everything still works. If it does, write down how to add the VM.

PS: regarding the standard: the tool is nice and pretty, but at first glance it looks as if it is only being developed half-heartedly these days. Maybe have a look at hawk (from SUSE, with crm as backend) or pcs (from Red Hat, with a GUI).

Roonix · Post by **Roonix** » 09.01.2020 10:21:57

Colttt wrote:
08.01.2020 12:04:03
All right, since it now magically works again, test the whole thing: move the resources, put one host into standby, then the other, and then hard power off the other one.

If all of that works, add a new VM and check whether everything still works. If it does, write down how to add the VM.

PS: regarding the standard: the tool is nice and pretty, but at first glance it looks as if it is only being developed half-heartedly these days. Maybe have a look at hawk (from SUSE, with crm as backend) or pcs (from Red Hat, with a GUI).

That all works wonderfully. Only WHY it suddenly works is a big mystery to me. I add the VM with virt-install. The image lives on the DRBD resource, and the XML file is copied to the other host via SCP. One could of course also point to the XML on the resource so that it no longer has to be copied -> TODO.
Are there any other "stress tests" one could consider for the cluster?

Yes, the thing with GUIs. Honestly, I find LCMC catastrophic (many crashes, slow response), but switching to another system, SUSE or RH, will definitely not be considered.
Once again, many thanks for your help, Colttt.
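
The TODO above (keeping the domain XML on the DRBD-backed filesystem instead of copying it by SCP) could look roughly like the following. This is only a sketch: `stage_domain_xml` is a hypothetical helper, and any target directory below /drbd/r0 is an assumption derived from the Filesystem resource in the posted config.

```shell
#!/bin/sh
# stage_domain_xml: hypothetical helper. Copies a libvirt domain XML into a
# directory on the replicated filesystem and prints the new path, which can
# then be used as the `config` parameter of the VirtualDomain resource.
stage_domain_xml() {
    src="$1"
    shared_dir="$2"
    # Refuse to continue if the source XML is not readable on this node.
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    mkdir -p "$shared_dir" || return 1
    dest="$shared_dir/$(basename "$src")"
    cp -- "$src" "$dest" || return 1
    echo "$dest"
}
```

For example, `stage_domain_xml /etc/libvirt/qemu/debian10-lcmctest.xml /drbd/r0/libvirt` (directory name assumed), run on the node where the DRBD filesystem is mounted, followed by `crm resource param res_VirtualDomain_debian10-lcmctest set config <printed path>`, would point the resource at the shared copy. Because that file is only visible where the filesystem is mounted, the existing order and colocation constraints against res_Filesystem_r0 remain essential.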

debianforum.de
