Monday, July 11, 2011

PAN Silent Packet Drops in 3.1.8

This is more of an FYI. I want to share what my company is going through so we can all learn from each other.

We monitor our network by sending out pings every 500ms. We have multiple ping sources going to scores of endpoints. Then, we correlate and report on the data. We've been doing this for years. We've got a good understanding of what "normal" looks like on the network.
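For anyone who wants to try something similar, here is a rough sketch of the idea in Python. It is not our actual tooling. The names (`Outage`, `find_outages`, `corroborated`) and the `min_lost` threshold are made up for illustration, and it assumes you already have each source's ping results as time-ordered `(timestamp, replied)` pairs. It looks for runs of consecutive lost pings, then keeps only the runs that more than one source saw at the same time.

```python
from dataclasses import dataclass

PING_INTERVAL_S = 0.5  # one ping every 500 ms, as described above


@dataclass
class Outage:
    start: float  # timestamp of the first lost ping in the run (seconds)
    end: float    # timestamp of the last lost ping in the run (seconds)

    @property
    def duration(self) -> float:
        # Each lost ping accounts for one ping interval of downtime.
        return self.end - self.start + PING_INTERVAL_S


def find_outages(samples, min_lost=4):
    """Return runs of at least `min_lost` consecutive lost pings.

    `samples` is an iterable of (timestamp, replied) pairs sorted by time.
    `min_lost` is an assumed threshold to ignore isolated drops.
    """
    outages = []
    run = []
    for ts, replied in samples:
        if not replied:
            run.append(ts)
            continue
        if len(run) >= min_lost:
            outages.append(Outage(run[0], run[-1]))
        run = []
    if len(run) >= min_lost:  # the data ended in the middle of a run
        outages.append(Outage(run[0], run[-1]))
    return outages


def overlaps(a, b):
    """True if two outage windows share any point in time."""
    return a.start <= b.end and b.start <= a.end


def corroborated(outages_by_source, min_sources=2):
    """Keep outages that at least `min_sources` ping sources saw at once.

    `outages_by_source` maps a source name to its list of Outage objects.
    Returns (source, outage, sources_that_saw_it) tuples.
    """
    confirmed = []
    for source, outages in outages_by_source.items():
        for outage in outages:
            seen_by = {source}
            for other, other_outages in outages_by_source.items():
                if any(overlaps(outage, o) for o in other_outages):
                    seen_by.add(other)
            if len(seen_by) >= min_sources:
                confirmed.append((source, outage, sorted(seen_by)))
    return confirmed
```

Requiring agreement between sources is what lets you blame a device in the path rather than one flaky ping host or endpoint. At a 500ms interval, a 15-second gap shows up as a run of about 30 lost pings.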

We upgraded a Palo Alto Networks PA-4020 (Threat Prevention & URL Filtering, two vwires) from 3.1.4 to 3.1.8. Within hours, that firewall started experiencing "incidents". In each incident, the device would stop passing traffic for up to 15 seconds. Of course, the logs and counters didn't show anything abnormal. The incidents recurred every few hours: sometimes at 03:00, but usually during business hours. They did seem to be somewhat load related. (High load on this box is a few hundred Mbit/sec.)

Support didn't seem to believe that this was a problem. After a week to ten days, we gave up on getting support to engage with the problem and rolled back to 3.1.4. Everything has been fine since then.

Since then, we've gotten support engaged and looking at the problem. They say that 3.1.9 doesn't contain any fixes for issues like this. In other words, they recommend we avoid 3.1.9, as we'd likely have the same problem.

Is anyone else running these versions of code? Do you have good monitoring like this? If I gave you some scripts, would you let me know how it goes?

We do have 3.1.8 on over a dozen other 4020s, and it is working fine. Those devices carry very different traffic loads and have no URL filtering.

We played with 4.0 for a bit... and then went back to 3.1 for stability. Don't even get me started on 4060s: the solution to one of my tickets is "Just keep rebooting until it works".

If you're seeing silent packet drops in 3.1.8 or 3.1.9, you're not the only one.

This is cross-posted to PAN's support forums.
