Data Koncepts
Apache's mod_rewrite

Last updated: July 14 2010.

Apache's low cost and powerful set of features make it the server of choice around the world. One of its real treasures is the mod_rewrite module, whose purpose is to redirect a visitor's request in the manner specified by a set of rules. This article will lead you through the Why, Installation and Test, Regex, RewriteCond(itions), Flags, Comments, Linking, Introduced Problems and Examples, and will summarize with the best references I've discovered.

Why Redirect a URL?

The simple answer is to make URLs human-readable (commonly called "user friendly" or "Search Engine Optimized"). URLs with query strings (the URL's text after a question mark) confuse most visitors and are difficult for them to type correctly. By changing the URL, you can make your site more "user-friendly." For example:

http://www.example.com/display.php?country=USA&state=California&city=San_Diego

could be changed to

http://www.example.com/USA/California/San_Diego

Other possible reasons might include:
mod_rewrite has other uses, too, but let's get on to the basics first.

Server Setup

Some hosts do not have mod_rewrite enabled (it is not enabled by default). You can find out whether your server has mod_rewrite enabled by using a script with the simple PHP code:

phpinfo();

Look in the Apache2Handler section and, if mod_rewrite is not listed, you will have to ask your host to enable it, or find a "good host" (most hosts will have it enabled).

The following describes how to enable and test mod_rewrite on your test server. First, you will need to change the default Apache configuration (this is in Apache's httpd.conf file) by removing the "#" at the beginning of the line

# LoadModule rewrite_module modules/mod_rewrite.so

While you're in the httpd.conf file, be sure that you have

<Directory />

You will need to RESTART Apache for these changes to take effect. Apache will now be running with mod_rewrite, as you will see with another look at the phpinfo() output.

Test

To be sure that you have mod_rewrite installed and working properly, here is a simple test for you. Create three files: test.html, test.php and .htaccess.

test.html:

<h2>This is the HTML file.</h2>

and test.php:

<h2>This is the PHP file.</h2>

Create the third file, .htaccess, with the following:

RewriteEngine on

If you are using Notepad, you may have to save it as htaccess.txt, upload it and change the name to .htaccess on the server. Upload all three files (in ASCII mode) to your server and then type

http://www.example.com/test.html

in the location box, using your own domain, of course! If the page shows "This is the HTML file.", you have got to start over. If it shows "This is the PHP file.", it is working properly! Note, please, that the test.html URL has remained in the browser's location box.
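For reference, a minimal sketch of a .htaccess that would produce the behaviour this test describes. The exact rule is an assumption (only the RewriteEngine line is given in the text); it relies solely on the standard RewriteEngine and RewriteRule directives and on test.html and test.php sitting in the same directory as the .htaccess file.

```apache
# Hypothetical test rule: serve test.php whenever test.html is requested.
RewriteEngine on

# ^/?  makes the leading slash optional
# \.   matches a literal dot rather than "any character"
# [L]  marks this as the last rule to apply for a matching request
# No R flag, so the rewrite is internal and the browser keeps showing test.html.
RewriteRule ^/?test\.html$ test.php [L]
```

Because the target (test.php) cannot be matched by the pattern (test\.html), this rule cannot loop.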
Specificity (your specification)

Whether you're changing from one URI to another or creating a whole new file structure (e.g., renaming all files from .html to .php or eliminating the file extension), you must create a specification for what your redirection will accomplish (and what it must NOT accomplish). To amplify, your specification needs to tell you unambiguously what you want to change, so that mod_rewrite can match ONLY that URI, and the redirection must NOT create a loop.

Matching: Do you want to match/redirect EVERYTHING (or NOTHING)? If not, eliminate (.*) NOW! If you want to remain at the same depth in your directory structure (highly recommended), eliminate /'s from your regex's character set. Do you need uppercase letters? If not, you're pretty much left with lowercase letters, digits and the dot, dash and underscore as allowed characters (ref: Uniform Resource Identifiers (URI): Generic Syntax by Tim Berners-Lee et al).

Redirection: Will your redirection loop? That's the primary problem with (.*), although it will also pass unexpected garbage (or nothing at all). ALWAYS check that the redirection cannot be matched by the regex and, if it can, specify an exclusion. (WordPress users will know that WP redirects EVERYTHING to index.php with the exclusion that it will not redirect existing directories or files.)

mod_rewrite Regex

Now we can begin rewriting your URIs! Note: YOU must create the URI links in your new format; mod_rewrite's job is to rewrite them to something Apache can serve to your visitors!

If you are not familiar with regular expressions (regex), there are many sites which provide excellent tutorials. At the end of this article, I have listed the best pages I have found: a tutorial, a "cheat sheet," a very nice text editor with regex capabilities and a test tool for your regex. If you are not able to follow my explanations, review the first two of those links.
Problem: Display city information based on the country, state and city requested.

To change

http://www.example.com/USA/California/San_Diego

to

http://www.example.com/display.php?country=USA&state=California&city=San_Diego

so your display script can read and parse the query string, you will need to use regex to tell mod_rewrite what to attempt to match. Too many people just use (.*) to select (NOTHING OR) EVERYTHING in an "atom" (an Apache variable you can create and use within mod_rewrite) and try to pass that along to the redirection string. In this case, you would need three of these atoms separated by the subdirectory slashes ("/") so the regex would become:

(.*)/(.*)/(.*)

Note #1: (.*) combines two metacharacters, the dot character (which means ANY character) and the * character (which specifies ZERO or MORE of the preceding character). Thus, (.*) matches EVERYTHING in the {REQUEST_URI} string ({REQUEST_URI} is that part of the URL which follows the domain up to but not including the ? of a query string and is the ONLY Apache variable that a RewriteRule can attempt to match).

With the above regex, the regex engine will learn that you have required two slashes (anywhere) in the string. For our purposes, though, we need to capture the three values in the {REQUEST_URI}, so I've used the slashes to separate them. To tell mod_rewrite that the URI should begin and end with this string, we add the start anchor (^) and end anchor ($) so the regex becomes:

^(.*)/(.*)/(.*)$

This allows TOO MUCH to be sent to your query string – often a security hazard – and, when used inappropriately, WILL cause mod_rewrite to loop! To avoid unnecessary problems, I'll change the EVERYTHING atoms to specify exactly the characters I will allow.
Thus, the first atom (USA) can be matched by ([A-Z]+), which ONLY allows one or more uppercase letters (the "+" metacharacter specifies one or more of the preceding character while the "*" metacharacter specifies zero or more – I want to ensure at least one character in the range from A to Z). California contains both uppercase and lowercase letters so this atom becomes ([a-zA-Z]+). San_Diego also contains an underscore (replacing the space which would display as the "ugly" %20 in the URI) so this atom becomes ([a-zA-Z_]+) and, with the {REQUEST_URI}'s starting /, we have:

^/?([A-Z]+)/([a-zA-Z]+)/([a-zA-Z_]+)$

Note #2: Apache changed regex engines when it changed versions so that Apache 1.x requires the leading slash while Apache 2.x forbids it! I satisfy both versions by making the leading slash optional, i.e., ^/? (? is the metacharacter for zero or one of the preceding character).

All that would be well and good if the only country was USA, but we'll need to expand the regex for other countries and allow an underline to replace the spaces in the "North," "South," "West" and "New" states, so the regex expands once again to:

^/?([a-zA-Z_]+)/([a-zA-Z_]+)/([a-zA-Z_]+)$

Note #3: If you have a short list of allowable countries, it would be best to avoid database problems by specifying the acceptable values with regex:

^/?(USA|Canada|Mexico)/([a-zA-Z_]+)/([a-zA-Z_]+)$

Note #4: If you are concerned about people typing in CAPS when your database is strictly lowercase, have regex ignore the case by adding the No Case flag ("[NC]") after the redirection. Just don't forget to convert to lowercase in your script after obtaining the $_GET array! More on flags later.

Note #5: Since URLs can't have spaces (except as %20), use underlines or hyphens to replace them. If you ABSOLUTELY have to use spaces (%20) in your URIs, you can include them in your regex within a range definition as \{space}, i.e., ([a-zA-Z\ ]+). However, this is NOT advised.
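To see where these notes are heading, here is a sketch of the final regex placed in a complete rule. The target is the display.php URL from the problem statement; the choice of the [L] and [NC] flags and the commented-out country-list variant are assumptions made for illustration, not the article's original listing.

```apache
RewriteEngine on

# $1 = country, $2 = state, $3 = city
# Slashes are excluded from every character set, so each atom stays within
# one "directory" level, and the target (display.php, which contains a dot)
# cannot be matched by the pattern again.
RewriteRule ^/?([a-zA-Z_]+)/([a-zA-Z_]+)/([a-zA-Z_]+)$ display.php?country=$1&state=$2&city=$3 [L]

# Variant (Notes #3 and #4): accept only a short list of countries, ignoring case.
# Use this INSTEAD of the rule above, not in addition to it.
# RewriteRule ^/?(USA|Canada|Mexico)/([a-zA-Z_]+)/([a-zA-Z_]+)$ display.php?country=$1&state=$2&city=$3 [NC,L]
```

A request for /USA/California/San_Diego would then reach display.php with country=USA, state=California and city=San_Diego in the query string.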
Note #6: If you are converting to/from a database field which does contain spaces, you should convert the spaces to some other character. Using PHP, you can use

$state = str_replace ( ' ', '_', $state );

before placing $state in the link and reverse the process with

$state = str_replace ( '_', ' ', $state );

before matching $state to the database field. Using _'s is better than -'s because text can often include the hyphen character, which would be converted to a space by this code, and is better than %20 in the URI as spaces require special treatment in the regex and redirection.

With the regex in hand, you can now map the atoms to the query string:

display.php?country=$1&state=$2&city=$3

where display.php is the name of the script, $1 is the first (country) atom, $2 is the second (state) atom and $3 is the third (city) atom. Note that there can only be nine atoms created, $1 … $9.

Almost there! Open a new document with EditPad (or your text editor) and type:

RewriteEngine on

Note #7: The RewriteRule must go on ONE line with one space between the RewriteRule, the regex and the redirection (and before any optional flags). NotePad indiscriminately inserts line returns in long lines so you're far better off using a good text editor (see references at the end).

Note #8: If you won't always have the city or the state and city, then you can easily make them optional by replacing the above with:

RewriteEngine on

where
If the optional atoms confused you, use three separate statements. Optional atoms are NOT mandatory, just an easy way to combine several statements into one.

Save this as .htaccess in the directory where display.php resides.

If you want to use digits (0, 1, ... 9) for, say, Congressional Districts, then you'll need to change an atom's specification from ([a-zA-Z_]+) to ([0-9]) to signify a single digit, ([0-9]{1,2}) for one or two digits (0 through 99) or ([0-9]+) for one or more digits (0 through ...; useful for database id's).

The RewriteCond(ition) Statement

Now that you have learned to match mod_rewrite's basic RewriteRule(s) with the {REQUEST_URI} string, it's time to learn to use conditionals to access other variables with the RewriteCond(ition). RewriteCond is similar in format to the RewriteRule in that you have the command name, RewriteCond, a variable to be matched, the regex and flags (the logical OR flag, [OR], is a useful flag to keep in mind because consecutive RewriteCond statements are otherwise ANDed together, and they apply only to the RewriteRule which immediately follows them). The best list of Server Variables I've found is located here.

For an example, let me assume that you want to force the www in your domain name (and you don't have subdomains to be concerned with). To do this, you will need to test the Apache {HTTP_HOST} variable to see if the www. is already there and, if not, redirect.

RewriteEngine on

Here, to denote that {HTTP_HOST} is an Apache variable, we must prepend a %. Then, the regex says to match the logical negation of (i.e., NOT) the following: the start anchor (to match the start of the {HTTP_HOST} string), www, an escaped dot (meaning that it ONLY matches the dot character), the domain name example, another escaped dot, com and the end anchor (to match the end of the {HTTP_HOST} string). The No Case flag ([NC]) is necessary because a domain name is not case sensitive. AND … The RewriteRule says to match zero or one of anything then redirect to http://www.example.com with the original {REQUEST_URI}.
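The condition and rule just described can be sketched as follows. This is a reconstruction from the prose, not the original listing; example.com stands in for your own domain.

```apache
RewriteEngine on

# Condition: the requested host is NOT exactly www.example.com.
# ! negates the match, \. matches a literal dot, [NC] ignores case.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]

# Rule: .? matches zero or one of any character, i.e. every request.
# The redirect re-attaches the original path via %{REQUEST_URI}.
RewriteRule .? http://www.example.com%{REQUEST_URI} [R=301,L]
```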
The R=301 tells the browser (and search engines) that this is a permanent redirection and the Last flag tells mod_rewrite to stop processing any further rules on this pass.

RewriteCond statements can also create atoms via their regex but these are denoted by %1 … %9 in the same way that RewriteRule atoms are $1 … $9. You'll see these in operation in the Examples.

Flags

mod_rewrite uses "flags" to give your mod_rewrite code additional power. I've used the Last, Redirect and No Case flags above but the main ones you'll need to be familiar with are:
There are other flags but you can get their definitions from Apache.org's mod_rewrite documentation.

mod_rewrite Comments

While the RewriteEngine on statement tells Apache to "start your engines," it can also serve to comment out mod_rewrite code. As a good programmer, you know how important comments are in your code. mod_rewrite allows comments after a # at the beginning of a line but it also allows you to comment out an entire block of mod_rewrite code by wrapping the code in RewriteEngine off and RewriteEngine on statements:

RewriteEngine off

RewriteEngine statements can be very helpful when developing new mod_rewrite code – just use them as you would the /* … */ wrapper for PHP comments.

WARNING: Do not use RewriteEngine statements to hide your mod_rewrite code if you don't have mod_rewrite enabled. RewriteEngine is itself a mod_rewrite directive, so you will get the same "500" error as if you used the "foo directive" (merely placing foo on a line in your .htaccess file).

Note: You only need ONE RewriteEngine on statement per .htaccess file (unless you also include RewriteEngine off statement(s) for commenting out blocks of code).

mod_rewrite Links

As a webmaster, it is for YOU to determine how your pages will be identified to visitors as well as how to rewrite those URIs so Apache can serve the appropriate content. Since nobody yet knows that you have made your links "user-friendly" (nor how you have formatted them), YOU have to create the links in your site's pages. You can use an editor (like Dreamweaver) which will perform multiple find and replace actions across your website (because you did not know about user-friendly URLs when you built it).

In the example in the section above, I used countries, states and cities – items that would be unique in a database. As I build websites for clients to update themselves, it is not reasonable for me to insist that they provide unique names for all their articles, so database articles are typically identified by an auto-incremented ID.
That's all that's required to pick a single article out of that database! So long as the key is unique, you will be able to use any key in your query string.

There have been many questions about how to use a database to redirect from a title (or other field) to an ID. Unless you have access to your httpd.conf (in order to create a RewriteMap application), forget about using a database for your redirections. Instead, make the field of choice unique and use that field to create your links. The only thing to remember is that spaces appear as %20 in URLs, so convert them before creating the link and back after obtaining the string in the $_GET array – the str_replace() code I offered above is perfect for this.

WARNING: There are other characters which are "reserved," "unreserved" or must be "escaped." There is a rather technical article which identifies them, Uniform Resource Identifiers (URI): Generic Syntax. Obviously, you'll need to remove or escape these characters as appropriate.

Relative Links Are Missing!

Sorry, you are not ready yet: when you test your user-friendly URLs, they work the same as the original links except that all your CSS, javascript files and images have disappeared! You can blame mod_rewrite if you like but it is your fault, as you have used URLs that tell the browser that the script is in another directory (in my example, you are considered to be in the USA/California/ subdirectory – San_Diego would be the script's name), which is two subdirectories deeper into the website than display.php!

To get around this seeming "bad feature" of mod_rewrite, you can use absolute links throughout your site instead of relative links OR use HTML's <base> tag to identify the real location:

<head>

Note that an absolute link (with either a leading / to denote DocumentRoot OR the full URL) is required as you are trying to "fix" the problem with relative links.
Examples

Let's get on to examples which combine these basic structures to do some useful work!

Replace A Character

Once you've discovered that the hyphens (dashes) in your URLs are causing problems (with your regex as well as converting to and from your database fields), you'll want to change them to underscores (the underline character). The problem is that you don't know how many hyphens you have in your URLs so you'll use regex to repetitively replace the hyphen:

RewriteEngine on

The Next flag tells Apache to restart the mod_rewrite rules (upon successful match and redirection). Unfortunately, you'll need to do further processing to be able to use an R=301 on the resultant redirection so that others will know you've changed your URL format, so do this first.

Unlimited key/value pairs

If you followed the Regex section above to its conclusion, you might guess that there is a limit to the number of key/value pairs. There is: as already explained, the number of Apache variables that can be created is nine. If you need more, however, don't despair! Using the Next flag, I've just demonstrated how to change an unlimited number of -'s to _'s. We'll now extend that to unlimited key/value pairs.

RewriteEngine on

This will capture a new key/value pair with the first two atoms ($1 and $2) and anything "leftover" with $3 (which includes the trailing /), then redirect to the "leftover" with the redirect.php script remaining as the target. The key/value pair is ADDED to any existing query string by the Query String Append flag before the process is restarted by the Next flag; the Last flag ensures that the mod_rewrite statement is terminated (not ANDed with any following statements).

If you don't want to show the redirect script in the URL, you'll need to account for the final redirection another way.

RewriteEngine on

Here, I've captured the key and value pairs with the first two atoms again and used the third to capture anything else.
Assuming that the atoms are properly paired, the result will be a query string in the DocumentRoot. Assuming that the DirectoryIndex (normally index.php or index.html) is not the target of your redirection (and does not receive a query string), the existence of a query string (as denoted by finding an = within the query string) is used as a marker to effect a redirection to the script which will handle the redirect.

WARNING: Do NOT exceed 255 characters in your URI. (I recall 255 as the limit but I can't find the source to confirm.)

Force www for a Domain

[repeated from above] If you want to force a browser to use the full domain with the www. prefix, you will need to test the Apache {HTTP_HOST} variable to see if the prefix already exists and, if not, redirect.

RewriteEngine on

If you have subdomains, however, preserve the subdomain like this:

RewriteEngine on

Capture the optional subdomain and, if it does not start with www., redirect with www. prepended to the subdomain and domain with the original {REQUEST_URI}.

Eliminate www for a Domain

Going the other way (getting rid of the www prefix)?

RewriteEngine on

Get rid of the www but preserve a subdomain with:

RewriteEngine on

Here, the subdomain is captured in %2 (the inner atom) but, since it's optional and already captured in the %1 Apache variable, all you need is the %1 for the subdomain and domain without the leading www.

Prevent Image Hotlinking

If some unscrupulous webmaster is stealing your bandwidth (leeching) by linking to images on your site to post on his:

RewriteEngine on

This example uses the optional list to select just GIF and JPG images – do not allow a space in that list and remember, example.com is your site!

If you are upset enough at these pirates, you could change the image and feed something to let his visitors know he's hotlinking:

RewriteEngine on

Of course, these both require the visitor to have his HTTP_REFERER enabled (most browsers do by default).
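A minimal sketch of the first hotlinking block described above, under stated assumptions: example.com is your own site, only GIF and JPG files are protected, and requests with an empty referer are let through (an illustrative choice, since some visitors send no referer at all). It uses only the standard RewriteCond and RewriteRule directives and the F (Forbidden) and NC (No Case) flags.

```apache
RewriteEngine on

# Let requests with no referer through (direct visits, privacy settings).
RewriteCond %{HTTP_REFERER} !^$

# Let requests referred by your own site through, with or without www.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]

# Anything else asking for an image in the list is refused with 403 Forbidden.
# No spaces are allowed inside the (gif|jpg) list; "-" means "no substitution".
RewriteRule \.(gif|jpg)$ - [F,NC]
```

The two conditions are ANDed, so the rule only fires when a referer is present and it is not your own site.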
Block specific hotlinkers with:

RewriteEngine on

This blocks visitors coming from the leecher's site from viewing GIF and JPG files.

Would you rather allow (or forbid) visitors from specific IP addresses? Use {REMOTE_ADDR} instead, like:

RewriteEngine on

Redirect to a 404 Page

If your host doesn't provide for a "file not found" redirection, create it yourself!

# you SHOULD be using

This script checks to see that the requested filename does not exist and then that it is not really a directory before it redirects to the DocumentRoot's 404.php script. Extend this just a bit by including the URI in a query string by adding ?url=$1 immediately after the /404.php:

RewriteEngine on

Rename Your Directories

You've shifted files around on your site, changing directory name(s):

# mod_alias can do this faster without the regex engine

Note that I've included the dot character (not the "any character" metacharacter) inside the range to allow file extensions but the a-z will accept only lowercase characters. If you need uppercase, you know from above how to modify this code.

Convert .html Links to .php Links

Updating your website but need to be sure that bookmarked links will still work?

RewriteEngine on

This is not a permanent redirection so it will be invisible to your visitors. To make it permanent (and visible), change the flag to [R=301,L]. Obviously, this will also work for changing any file extension from one to another by changing the html and php above.

Extensionless Links

Need to make your links easier to remember or just want to hide your file types? Typically you're only using either .html or .php files so:

RewriteEngine on

Someone has asked about using extensionless URIs for both .html and .php files. Requiring that both php and html extensions be considered requires that you use RewriteCond statements to check whether the filename with either extension exists as a file:

RewriteEngine on

As in the 404 example, the -f checks for the existence of a file.
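One way the two-extension check might look, as a sketch rather than the original listing. It assumes the .htaccess sits in the same directory as the scripts and that file names use only letters, digits and underscores (both assumptions made for illustration); %{REQUEST_FILENAME} and the -f test are standard mod_rewrite features.

```apache
RewriteEngine on

# If "<requested name>.php" exists as a file, serve it.
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^/?([a-zA-Z0-9_]+)$ $1.php [L]

# Otherwise, if "<requested name>.html" exists as a file, serve that.
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^/?([a-zA-Z0-9_]+)$ $1.html [L]
```

The character set excludes both the dot and the slash, so the rewritten names (which contain a dot) cannot match again and the rules cannot loop. When both files exist, the .php version wins because its rule comes first.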
Check for Key in Query String

If you need to have a specific key's value in your query string, you can check for its existence with RewriteCond:

RewriteCond %{QUERY_STRING} !uniquekey=

... will check the {QUERY_STRING} variable for lack of the key "uniquekey" and, if the {REQUEST_URI} is the script_that_requires_uniquekey, it will redirect. If you are looking for a unique value, remove the = in the RewriteCond statement.

RewriteCond %{QUERY_STRING} !uniquevalue

Delete the Query String

Apache's mod_rewrite automatically passes through a query string UNLESS you
Redirect TO New Format

I have fielded questions where someone wanted to redirect their real URIs to extensionless URIs so search engines would update to their new, extensionless format. Okay, Apache can do that but it cannot serve scripts in the new format (they have to be redirected back to the real link!). Have I got your head spinning? I do NOT recommend this (unless you're on a dedicated server with low volume) as it requires additional processing by Apache.

The key to this is to add a "marker" in the query string that will NOT be seen by visitors, i.e., redirect from the "real link" to the "extensionless format" ONLY if the "marker" is NOT present in a query string THEN redirect from the "extensionless" format to the "usable link" AND add a "marker" to the query string (be sure not to eliminate any existing query string by using the QSA flag!).

# Assumes "usable link" is index.php?id=alphanumeric

Here, the original http://www.example.com/index.php?id=something does not contain the marker so it is redirected to http://www.example.com/something. Then, the second RewriteRule finds the something and redirects it back to index.php, adding marker AND id=something in a new query string, and the mod_rewrite process is started over. On the second iteration, the marker is matched so the first RewriteRule is ignored and, since there is a dot character in index.php?marker&id=something, the second RewriteRule is also ignored … finis!

Enforce Secure Server

Apache can determine whether you're using a secure server in two ways: using the {HTTPS} variable and using {SERVER_PORT} (which is 443 for a secure server). So, these two bits will redirect to a secure server if you're not already there:

RewriteEngine on

OR

RewriteEngine on

Since I don't believe that there is an {HTTPS} variable when you've not requested a secure server, I use the {SERVER_PORT} option.
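A sketch of the {SERVER_PORT} option, reconstructed from the description rather than copied from the original listing. It assumes the secure server listens on the standard port 443 and that www.example.com is your secure host name.

```apache
RewriteEngine on

# Not on the secure port? Then the request did not arrive over HTTPS.
RewriteCond %{SERVER_PORT} !^443$

# Redirect every such request to the same path on the secure server.
RewriteRule .? https://www.example.com%{REQUEST_URI} [R=301,L]
```

Once the browser follows the redirect, the request arrives on port 443, the condition fails and the rule is skipped, so there is no loop.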
Selective Enforce Secure Server

(where the secure and unsecure domains share the DocumentRoot)

This requires a RewriteCond statement to check whether the secure server port is being used and, if not AND the requested script is one in the list requiring a secure server, redirect.

RewriteEngine on

And, to redirect pages not requiring a secure server,

RewriteEngine on

will force the http ({SERVER_PORT}=80) mode.

WARNING: Mixing these two (force HTTPS and HTTP at the same time) will force non-script files to be served in HTTP protocol, i.e., not encrypted. The result WILL BE a warning that some content has NOT been authenticated. To avoid the "mixing" problem of forcing non-secure pages, target the scripts rather than all files:

RewriteEngine on

Another method utilizes a regex "trick" (this comes from extras at the SitePoint Apache forum):

RewriteEngine on

Explanation: I needed an explanation so here it is:
In short, this is a trick to replace on with s, and similar techniques can be used in other situations, too. Since the first version is far simpler to understand, I recommend that one.

Summary

mod_rewrite is primarily used to allow "Search Engine Optimization" / "User-Friendly" URLs but it is an extremely flexible webmaster tool for other important redirection tasks.

Reference Links