We keep seeing errors of type:
ERROR ArtifactReplicator - Failed to tar into file="C:\Program Files\Splunk\var\run\splunk\dispatch\_splunktemps\send\s2s\scheduler_<...>.tar"
on our search heads. We have four search heads in the cluster and four indexers, clustered evenly across two sites.
This doesn't seem to be affecting Splunk searches or alerting, but the members of the search head cluster keep flipping between "Up" and "Pending" status, which I think might be related.
I've searched extensively, but haven't really been able to find anything. Does anyone know what might be going on?
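For what it's worth, this is what I've been checking from a member's bin directory while the tar errors occur, to watch the Up/Pending flipping (the same commands work on Windows via splunk.exe):
splunk show shcluster-status
splunk list shcluster-members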
Thanks in advance!
↧
Search Head Cluster: Failed to tar into file...
↧
Is it safe to delete the files and directories under splunk/var/run/ for search heads?
I am having issues with search head members not pushing changes to the captain. I read in one of the posts that I should delete the files and directories under splunk/var/run/ and then do a restart.
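Before I run anything, this is roughly what I had in mind, assuming a default /opt/splunk install (I'd rather move the dispatch directory aside than delete it outright):
/opt/splunk/bin/splunk stop
# move the old dispatch directory aside instead of deleting it, so it can be restored if needed
mv /opt/splunk/var/run/splunk/dispatch /opt/splunk/var/run/splunk/dispatch.old
mkdir -p /opt/splunk/var/run/splunk/dispatch
/opt/splunk/bin/splunk start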
↧
↧
How to set up search head clustering?
I've done all the steps (several times) in the docs to set up an SH cluster from scratch: configuring the deployer, initializing each SH member, and then setting a captain. But each time I run the command to set the captain (splunk bootstrap shcluster-captain ....) the command just hangs forever, times out, or throws the error "In handler 'shclustermemberconsensus': CONFIGURATION ID MISMATCH". But if I then run the command (splunk show shcluster-status) it does show a captain set and members. In other words, it returns a result that looks correct to me. How can there be a captain set?
**My question is: how do I really know whether the captain has been set, since I never got the bootstrap command to return a good result?**
I tried many approaches to get the command to run, but had no luck on any of my attempts:
* Changed the pass4SymmKey in BOTH the [general] and [shclustering] stanzas so they match across the deployer and the 2 SH members, with restarts each time.
* I also tried the same text string, and then 2 different text strings with no special characters, between the two stanzas. The bootstrap command still fails.
* I did notice that the hashed result string on the 2 SH members always matches, but the hash on the deployer does not match the other 2 members (even though I started with the same text string to be encrypted). I don't know whether this really means anything, but having a different hash value between the deployer and the members seems normal.
* I did full teardowns and rebuilds of the deployer and the 2 SH members, plus all the steps again. The bootstrap command still fails.
* I even hit the individual mgmt_uri values (mgmt_uri = https://mdcsueve.fer.com:8089) in a browser and they all came back with data. The bootstrap command still fails.
* These are brand new vanilla VMs working fine. I installed fresh Splunk copies (v6.4.1) with no issues.
* I also tried different replication ports, just to make sure they were not already in use. Still bad results.
I am out of ideas, but need the SH cluster setup. Any help is appreciated.
DOC used: https://docs.splunk.com/Documentation/Splunk/6.4.1/DistSearch/SHCdeploymentoverview
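For reference, the sequence of commands I'm running from that doc is essentially the one below; the hostnames, ports, and credentials are placeholders, not my real values:
# on each SH member, after configuring the deployer
splunk init shcluster-config -auth admin:placeholder -mgmt_uri https://sh1.example.com:8089 -replication_port 34567 -replication_factor 2 -conf_deploy_fetch_url https://deployer.example.com:8089 -secret placeholderkey
splunk restart
# then, on one member only
splunk bootstrap shcluster-captain -servers_list "https://sh1.example.com:8089,https://sh2.example.com:8089" -auth admin:placeholder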
↧
OpsGenie for Splunk app on a Search Head Cluster
We're having some issues getting the OpsGenie for Splunk app working on a Search Head cluster.
We've been able to get it to work on a test instance of Splunk with a single search head, but it doesn't work in the cluster. There seem to be a few issues. I can get the API key to save successfully in the OpsGenie app, but none of the Splunk alerts are sent. Looking at the logs, we can see the errors below:
ERROR sendmodalert - action=opsgenie STDERR - Unexpected error: Could not get opsgenie credentials from splunk. Error: [HTTP 403] Client is not authorized to perform requested action; /servicesNS/nobody/opsgenie/admin/passwords
WARN sendmodalert - action=opsgenie - Alert action script returned error code=3
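One check that might narrow it down is hitting the same endpoint from the error directly, as the account the alert action runs under (credentials below are placeholders); my guess is that the account is missing something like the list_storage_passwords capability on the cluster members, but I haven't confirmed that:
curl -k -u admin:placeholder https://localhost:8089/servicesNS/nobody/opsgenie/admin/passwords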
Has anyone been able to get this app to work in a clustered environment? Is there something additional that needs to be done?
↧
How to upgrade Search head pooling when upgrading Splunk from 6.0.1 to 7.2.3?
Hi,
I need urgent assistance on upgrading Search head pooling. Mine is a distributed environment (6.0.1) with the following details:
Two indexers(Clustered)
Two search heads(SHP)
One Cluster master
As per the Splunk documentation, I need to upgrade in the following sequence:
License Master -> Search head -> Cluster master -> Indexer
For Search head pooling I have the following doubt, based on this passage from the Splunk documentation:
**Test apps prior to the upgrade**
*Before you upgrade a distributed environment, confirm that Splunk apps work on the version of Splunk Enterprise that you want to upgrade to. You must test apps if you want to upgrade a distributed environment with a search head pool, because search head pools use shared storage space for apps and configurations.
When you upgrade, the migration utility warns of apps that need to be copied to shared storage for pooled search heads when you upgrade them. It does not copy them for you. **You must manually copy updated apps, including apps that ship with Splunk Enterprise (such as the Search app), to shared storage during the upgrade process.** Failure to do so can cause problems with the user interface after you complete the upgrade.
On a reference machine, install the full version of Splunk Enterprise that you currently run.
Install the apps on this instance.
Access the apps to confirm that they work as you expect.
Upgrade the instance.
Access the apps again to confirm that they still work.
If the apps work as you expect, move them to the appropriate location during the upgrade of your distributed environment:
If you use non-pooled search heads, move the apps to $SPLUNK_HOME/etc/apps on each search head during the search head upgrade process.
If you use pooled search heads, move the apps to the shared storage location where the pooled search heads expect to find the apps.*
**My question is**
1) I already have the apps placed on the NAS. How can I copy them over from the search head again? Does this make sense?
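In case it clarifies what I'm asking, my understanding of the copy step is roughly the following; the NAS mount point is a placeholder and I'm assuming the pooled shared storage uses the usual apps subdirectory:
# on the upgraded reference machine, copy each updated app to the shared storage apps directory
cp -R /opt/splunk/etc/apps/search /mnt/nas/splunk_shp/apps/
cp -R /opt/splunk/etc/apps/launcher /mnt/nas/splunk_shp/apps/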
PS: I know Search head pooling is a deprecated feature. We will upgrade to Search head clustering later as a separate project.
↧
↧
Search Head Cluster 7.1 presents status_line="Error connecting: Connect Timeout" socket_error="Connect Timeout" frequently
The cluster has been running on 7.1.6 for months, but in the last few weeks we are seeing more and more errors like:
08-11-2019 20:48:39.441 +0000 ERROR SHCSlave - event=SHPSlave::handleHeartbeatDone heartbeat failure (reason: failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout")
08-11-2019 20:48:39.441 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout"
08-11-2019 20:45:09.149 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"
08-11-2019 20:45:09.063 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"
This is happening regardless of the captain.
The bin/splunk show shcluster-status command is also failing from time to time, either showing all members as down, or showing:
"Failed to proxy call to member https://xxxxxxx:8089.
Encountered some errors while trying to obtain shcluster status."
We have an 11-node search head cluster, with 28 threads and 128 GB of RAM each, running on Oracle Linux 7.6, all patched, all physical servers.
We have high limits for the splunk user:
splunk@xxxxxxxx:~$ cat /proc/194162/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             200000        200000        processes
Max open files            200000        200000        files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       514519        514519        signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
Other relevant info:
splunk@xxxxxxxxx:~$ bin/splunk btool limits list --debug | egrep -v /opt/splunk/etc/system/default/limits.conf
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [scheduler]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc = 75
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1 = 95
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1.when = * 01-10 * * *
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [search]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf base_max_searches = 6
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf fetch_remote_search_log = disabled
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_chunk_queue_size = 100000000
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_per_cpu = 2
And:
splunk@xxxxxx:~$ cat /opt/splunk/etc/apps/xxxxxxxx/default/server.conf
[clustering]
mode = searchhead
master_uri = clustermaster:ha_primary
multisite = true
[clustermaster:ha_primary]
master_uri = https://us-splunk-cm.xxx.com:8089
pass4SymmKey = $1$G2VD3sbFfwIQaBo=
multisite = true
[shclustering]
scheduling_heuristic = round_robin
captain_is_adhoc_searchhead = true
executor_workers = 50
conf_replication_period = 5
conf_replication_max_pull_count = 1000
conf_replication_max_push_count = 100
[httpServer]
maxSockets = -1
maxThreads = -1
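One thing we are considering, but have not yet applied, is raising the SHC connection and heartbeat timeouts on each member and restarting; my reading of server.conf is that the relevant settings look roughly like the below (the values are guesses to be tuned, so please correct me if these are the wrong knobs):
[shclustering]
cxn_timeout = 120
send_timeout = 120
rcv_timeout = 120
heartbeat_timeout = 120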
I'd appreciate any help or tips. A Splunk case has been opened, but progress is too slow; we've had the issue for weeks now.
↧
Palo Alto Networks App for Splunk: ERROR - Could not load lookup=LOOKUP-minemeldfeeds_dest_lookup
I recently upgraded our Splunk Enterprise deployment to 7.2.7, the Palo Alto Networks App to 6.1.1, and the Add-on to 6.1.1. I now see these messages on the search head cluster members:
Could not load lookup=LOOKUP-minemeldfeeds_dest_lookup
Could not load lookup=LOOKUP-minemeldfeeds_src_lookup
I've looked through all the answers from previous threads and none of them fix the issue.
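In case it helps, this is the sort of check I've been running to see whether the lookup CSVs actually exist on the cluster members; the app directory names below are my guesses for the PAN app and add-on folders, so adjust them to whatever is actually installed:
ls -l /opt/splunk/etc/apps/SplunkforPaloAltoNetworks/lookups/ | grep -i minemeld
ls -l /opt/splunk/etc/apps/Splunk_TA_paloalto/lookups/ | grep -i minemeld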
Thanks.
↧
SHCluster Rolling restart stops frequently
After we pushed a new shcluster bundle, the SHC self-initiated a rolling restart. We have 15 servers in this cluster. Several members completed the restart, and at one point the "splunk rolling-restart shcluster-members -status 1" command showed the 10th server in "restarting" status, but the very next run of the same command responded that there was no rolling restart in progress.
We also found that the captain had moved to another search head.
Running the status command from there also showed that the rolling restart was not running.
Now, the problem is that 10 of the servers have the new configuration bundle, but the other 5 that had not restarted are still showing messages about having received new configs that require a restart, and they have not been restarted to pick them up.
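If it helps frame the question: what we're tempted to do from the current captain is simply confirm the state and kick off a fresh rolling restart with the same commands as above, but we're not sure whether that is safe part-way through a bundle push:
# confirm whether a rolling restart is still in progress
splunk rolling-restart shcluster-members -status 1
# if it has genuinely stopped, start a new one so the remaining 5 members pick up the bundle
splunk rolling-restart shcluster-members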
↧
American search head cluster app needs en-US URL
Hello guys,
we installed an app by deploying it to our SHC, but when we navigate to it we get this error message:
![alt text][1]
How do you update the URL?
Thanks.
![alt text][2]
[1]: /storage/temp/275579-capture.png
[2]: /storage/temp/275578-sans-titre.png
↧
↧
I can access every search head web page except for the captain
I've got a search head cluster running, and there is a host that I've set as the cluster captain. Other than the configuration settings for being the cluster captain, it is set up like all the other hosts in the cluster. But I cannot load its web page, even though I can load the web page of every other search head.
Instead I get a "404" error, despite the boot output showing that the web interface has come up just fine.
> Splunk> Winning the War on Error
>
> Checking prerequisites...
> Checking http port [4443]: open
> Checking mgmt port [5500]: open
> Checking appserver port [127.0.0.1:8065]: open
> Checking kvstore port [8191]: open
> Checking configuration... Done.
> Checking critical directories... Done
> Checking indexes...
> Validated: _audit _internal _introspection _telemetry _thefishbucket history main summary
> Done
> Checking filesystem compatibility... Done
> Checking conf files for problems...
> Done
> Checking default conf files for edits...
> Validating installed files against hashes from '/opt/splunk/splunk-7.3.1.1-7651b7244cf2-linux-2.6-x86_64-manifest'
> All installed files intact.
> Done
> Checking replication_port port [8080]: open
> All preliminary checks passed.
>
> Starting splunk server daemon (splunkd)...
> Done
> [ OK ]
>
> Waiting for web server at https://127.0.0.1:4443 to be available............. Done
>
> If you get stuck, we're here to help.
> Look for answers here: http://docs.splunk.com
>
> The Splunk web interface is at https://splunk-search-lead:4443
Any idea why this would be happening? I've deleted the installation twice and reinstalled from scratch, and the same outcome happens each time. This is the first time we're using the latest version of Splunk; our previous installations ran 7.1.2 and we never had this problem using the same deployment steps. These are also new hosts, so there's no previous installation on them either.
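In case anyone wants to suggest specific checks, these are the kinds of requests I can run from the captain host itself; the ports are the ones from the boot output above and the credentials are placeholders:
# check that splunkd's management port answers
curl -k -u admin:placeholder https://127.0.0.1:5500/services/server/info
# check what status code splunkweb returns locally for the login page
curl -k -o /dev/null -w "%{http_code}\n" https://127.0.0.1:4443/en-US/account/login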
↧
Is it possible to replicate bash script generated lookup across search head cluster?
Hi folks,
I am using a bash script to download data to populate a CSV that I'd like to use as a lookup in Splunk.
So far I have created the empty lookup on our deployer, which has successfully pushed it out to the search head cluster members.
I have a script running on the cluster master that populates the empty lookup; the changes, however, are not replicating across the cluster. I expected this to work, as the lookup location is whitelisted for replication in server.conf.
Since the changes to the CSV are failing to replicate across the cluster, do changes to a lookup need to happen within Splunk for lookup replication to work?
An alternative solution would be to generate the lookup on the deployer and script a bundle push every 12 hours, but I'm reluctant to have automated bundle pushes occurring outwith office hours due to issues we've experienced previously.
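One other idea I'm toying with, not yet tested: run the download script on a single cluster member to refresh a staging CSV in an app's lookups directory, then have the script finish by copying it into the real lookup with outputlookup, so the change happens inside Splunk and should replicate. Invoked from the script it would look roughly like this; the app name, lookup names, and credentials are placeholders:
/opt/splunk/bin/splunk search '| inputlookup staging_feed.csv | outputlookup my_feed_lookup.csv' -app my_lookup_app -auth admin:placeholder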
Has anyone ever attempted to do something similar, and is able to offer any guidance?
Many thanks,
Miles
↧
SHC - failed on handle async replicate request
I have noticed something odd in an SHC deployment. I'm consistently seeing "SHCMasterArtifactHandler - failed on handle async replicate request" errors, and they report being caused by the reason "active replication count >= max_peer_rep_load".
While these errors don't appear to have any actual impact on executing scheduled searches or anything else, I would like to get to the bottom of what is causing them. They don't seem to be tied to a particular node, search, or user.
See the end of the post for the relevant error messages.
There are 4 nodes in the multi-site search head cluster (2 in each site).
The [shclustering] stanza of server.conf has replication_factor = 2 configured.
Where I suspect the problem is occurring is that the search heads are not honoring the replication_factor setting: when I run the command *splunk list shcluster-members*, I see the replication_count numbers vary between 3 and 5 - I should expect to see 2 here, shouldn't I?
If the default max_peer_rep_load = 5, and the replication_count of at least one of those search heads shows 5 when I check, then I assume this is what is causing the excess replication to occur?
When I run the command *splunk list shcluster-config* against all the nodes, they correctly show max_peer_rep_load = 5 and replication_factor = 2.
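For reference, this is roughly how I have been pulling those values from all four members in one go; the hostnames and credentials are placeholders:
for sh in sh1:8089 sh2:8089 sh3:8089 sh4:8089; do
  echo "== $sh =="
  splunk list shcluster-config -uri https://$sh -auth admin:placeholder | egrep 'replication_factor|max_peer_rep_load'
  splunk list shcluster-members -uri https://$sh -auth admin:placeholder | grep replication_count
done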
Has anyone seen this in a search head cluster before? I have read https://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html and looked into the settings from the last comment from SplunkIT, but I want to verify before I go tweaking configs without finding the root cause of the issue.
Thanks in advance for anyone's insight into what may be causing this and how to further troubleshoot and resolve it.
10-15-2019 09:50:15.153 +0000 ERROR SHCMasterArtifactHandler - failed on handle async replicate request sid=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"'
10-15-2019 09:50:15.159 +0000 ERROR SHCRepJob - job=SHPAsyncReplicationJob aid=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B srcGuid=4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B tgtGuid=7A3F5991-2373-40BA-998A-79193A40CF27 failed. reason failed method=POST path=/services/shcluster/captain/artifacts/scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="\n \n failed on handle async replicate request sid=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"' \n \n \n"
10-15-2019 09:50:15.159 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/artifacts/scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B/async_replicate captain=:8089 rc=0 actual_response_code=500 expected_response_code=200 status_line="Internal Server Error" transaction_error="\n \n failed on handle async replicate request sid=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B err='srcPeer="", srcGuid="4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B" cannot be valid source for artifactId=scheduler___aGFpZ3NfZ2VuZXJhbA__RMD5f433e0e54b7570e0_at_1571133000_9209_4B87E9AA-3400-4FC0-ADFC-9C98728A2D1B targetPeer="", targetGuid="7A3F5991-2373-40BA-998A-79193A40CF27" reason="active replication count >= max_peer_rep_load"' \n \n \n"
↧
How to prevent duplicate alerts with multiple search heads
Hi Experts,
In my Splunk distributed environment, I have one load balancer, two search heads, and one deployment server (no search head pooling configured).
I configured an alert email on both search heads, and now I get a duplicate alert mail from each search head.
Because we need high availability for the alert mails, I can't enable the alert mail on only one search head, nor can I enable the alert only on the deployment server.
Please suggest how I can get only a single alert.
Thanks in advance.
↧
↧
How can I connect two Splunk silos together?
hi,
I have two teams, each running their own Splunk deployment, and I need a centralized way to access the data and run my ML algorithms on sub-datasets from both Splunk deployments.
How is this possible?
How can I connect to two Splunk deployments at the same time?
thanks
-Bill
↧
Pass username and password during a search head cluster restart
We are using a Search Head cluster on Splunk Enterprise 7.2.6.
When the CPU and memory usage of the servers goes high, we restart the cluster and it comes back to normal.
I wrote a shell script with 2 lines:
#!/bin/bash
sudo -u splunk /opt/splunk/bin/splunk rolling-restart shcluster-members
When the alert in the SH is triggered, the script fails because the Splunk username and password need to be passed to it manually.
Is there any way I can pass the Splunk username and password in the script so that it can be triggered by the alert?
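A sketch of what I've been thinking of, with the credentials read from a file that only the splunk user can read; the file location and format are my own convention, and admin:placeholder stands in for the real account:
#!/bin/bash
# credentials file contains a single line of the form user:password
CREDS=$(cat /opt/splunk/.restart_creds)
sudo -u splunk /opt/splunk/bin/splunk rolling-restart shcluster-members -auth "$CREDS"
Is relying on the global -auth flag like this reasonable, or is there a better way (for example, tokens)?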
↧
Steps to Clean Up a search head in a search head cluster
Hi Guys,
It would be helpful if anyone could share knowledge or provide the steps for cleaning up a search head in a search head cluster environment. I want to know what gets cleaned up and what the process is.
Thanks in Advance!!
Sarah
↧
↧
How to migrate KV store data from a standalone search head to a search head cluster?
Hello,
I have a standalone search head with KV stores.
I want to migrate the KV stores to a search head cluster without, if possible, exporting all the data (to CSV or another format) and importing it again, as it represents a large amount of data (2-3 GB) and many collections.
What I tried:
- Backed up the KV stores from the standalone server using
./splunk backup kvstore
- Set the replication factor to 1 on one search head of the new cluster
- Cleaned the KV store db on this search head:
./splunk clean kvstore --local
./splunk clean kvstore --cluster
- Restored the backed-up KV store archive on the clustered SH:
./splunk restore kvstore archiveName
This step took a very long time (maybe that's normal).
- I monitored this using
./splunk show shcluster-status
- The backupRestoreStatus finally moved to Ready:
This member:
backupRestoreStatus : Ready
date : Fri Nov 29 13:34:12 2019
dateSec : 1575034452.206
disabled : 0
guid : 0C76D3C2-F11A-47FB-A705-3ECBC0CCE929
oplogEndTimestamp : Fri Nov 29 13:34:05 2019
oplogEndTimestampSec : 1575034445
oplogStartTimestamp : Fri Nov 29 10:11:49 2019
oplogStartTimestampSec : 1575022309
port : 8191
replicaSet : splunkrs
replicationStatus : KV store captain
standalone : 0
status : ready
Enabled KV store members:
spplsh01:8191
guid : 0C76D3C2-F11A-47FB-A705-3ECBC0CCE929
hostAndPort : sh01:8191
KV store members:
spplsh01:8191
configVersion : 1
electionDate : Fri Nov 29 13:24:26 2019
electionDateSec : 1575033866
hostAndPort : spplsh01:8191
optimeDate : Fri Nov 29 13:34:05 2019
optimeDateSec : 1575034445
replicationStatus : KV store captain
uptime : 608
But even though the KV store status is all OK, when I search for data in the KV stores they are empty (even though there are lots of files in the mongo directory).
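To double-check, the collections can also be queried directly over REST rather than through a lookup; the app name, collection name, and credentials below are placeholders:
curl -k -u admin:placeholder "https://localhost:8089/servicesNS/nobody/myapp/storage/collections/data/mycollection?limit=5"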
As this step is not OK, of course, I cannot go further and try to sync with another search head.
Has anyone already tried to do this? Maybe using another method? For the next steps, do I need to do the same on all SHs of the cluster, or will the KV stores replicate automatically?
Thanks in advance.
The Splunk version in use is 7.3.2.
↧
Deploy indexes.conf in a Search Head Cluster? How to avoid (and recover in case of) misconfiguration?
We have a Search Head Cluster connected to an Indexer Cluster. All indexes are on the clustered Indexers, and the Search Head Cluster members forward their local internal indexes to the Indexers. Is it best practice to still deploy a copy of the "master" indexes.conf (that gets distributed to the Indexers through the Cluster Master) to the Search Head Cluster members? If so, how?
And much more importantly: How do we recover from misconfigurations that stop the Search Head Cluster members from restarting correctly?
Scenario: we use the Deployer to deploy a version of indexes.conf that contains a reference to a volume (e.g. in homePath) that is not defined on the Search Head Cluster members. The Search Head Cluster members will initiate a rolling restart but not come back online, because splunkd notices that there are incorrectly defined indexes on the instance. How can we a) avoid this happening and b) if it does happen, quickly revert?
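For the "quickly revert" part, the rough procedure we have in mind is the following (untested, and the app and host names are placeholders), but I'd like to hear whether there is a cleaner rollback path:
# 1. fix or remove the offending app under $SPLUNK_HOME/etc/shcluster/apps/ on the deployer
# 2. on any member that will not start, remove the copy the deployer pushed into etc/apps, then start it again
sudo -u splunk rm -rf /opt/splunk/etc/apps/bad_indexes_app
sudo -u splunk /opt/splunk/bin/splunk start
# 3. re-push the corrected bundle from the deployer
splunk apply shcluster-bundle -target https://member1.example.com:8089 -auth admin:placeholder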
↧
Docker image search cluster configuration fails in splunk-ansible: 'FAILED - RETRYING: Destructive sync search head'
We're using the docker images at https://hub.docker.com/r/splunk/splunk to install Splunk in Kubernetes. We're currently using 7.2.4 and are preparing to upgrade to 7.2.9.1.
The configuration stage (using splunk-ansible) of the search cluster fails for at least the following versions:
- 7.2.9
- 7.3.3
The log for each of the search cluster members shows:
FAILED - RETRYING: Destructive sync search head
We have tested the following versions and found that they do not exhibit this behaviour, and deploy a working search cluster:
- 7.2.4
- 7.2.5
- 7.2.6
- 7.2.7
(7.2.8 is broken in a totally different way; all the containers die almost immediately with 'ERROR: Couldn't read "/opt/splunk/etc/splunk-launch.conf" ')
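For context, this is roughly how our members are being launched, heavily simplified from the Kubernetes manifests; the image tag, hostnames, and password are placeholders, and the env var names are from my reading of the docker-splunk docs, so please correct me if any of them are off:
docker run -d --name sh1 \
  -e SPLUNK_START_ARGS='--accept-license' \
  -e SPLUNK_PASSWORD='placeholder' \
  -e SPLUNK_ROLE='splunk_search_head' \
  -e SPLUNK_SEARCH_HEAD_URL='sh1,sh2,sh3' \
  -e SPLUNK_SEARCH_HEAD_CAPTAIN_URL='captain1' \
  -e SPLUNK_DEPLOYER_URL='deployer1' \
  splunk/splunk:7.2.9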
My question is: does anyone here have 7.2.9 or 7.3.3 working with the docker containers and a search cluster, and if so can they please share the secret?
Thanks,
Rich
↧